OK, let's elaborate a bit more.

a) use this template in your XSL script

<xsl:template match="text()">
  <xsl:choose>
    <xsl:when test="contains(.,'&lt;')">
      <xsl:result-document omit-xml-declaration="yes"
                           href="BLOGXTRACT-{generate-id()}.html">
        <body>
          <xsl:value-of select="." disable-output-escaping="yes"/>
        </body>
      </xsl:result-document>
      <ptr target="BLOGXTRACT-{generate-id()}.html" rend="transclude"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="."/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

which writes out any bit of text that contains a < to a
separate file, and adds a pointer to that file in its place.

b) loop over every BLOGXTRACT*.html file and do
a standard HTML cleanup on it using tidy or equivalent;
that should get you well-formed XML. I use a PHP script
on the command line that looks like this:

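<?php
// takes the HTML file to clean as its first argument
// and writes the cleaned-up XML to standard output
$dom = new DOMDocument;
$dom->formatOutput = true;
// @ suppresses the warnings libxml raises for malformed HTML
@$dom->loadHTMLFile($argv[1]);
$dom->encoding = "utf-8";
echo $dom->saveXML();
?>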
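
The loop itself is not shown above, so here is a minimal shell sketch of
it. It assumes the PHP script has been saved as tidy.php (a hypothetical
name) and that the fragments sit in the current directory; the cleanup
command can be swapped through the CLEANER variable, which is also an
invention of this sketch.

```shell
#!/bin/sh
# Sketch: clean every extracted fragment in place.
# ASSUMPTION: "tidy.php" is a hypothetical file name for the PHP script above.
CLEANER=${CLEANER:-"php tidy.php"}

clean_extracts() {
  for f in BLOGXTRACT-*.html; do
    [ -e "$f" ] || continue      # the glob matched nothing
    # Write to a temporary file first, so that a failed run
    # never truncates the original fragment.
    if $CLEANER "$f" > "$f.tmp"; then
      mv "$f.tmp" "$f"
    else
      rm -f "$f.tmp"
      echo "cleanup failed: $f" >&2
    fi
  done
}

clean_extracts
```

The files keep their names, so the pointers written in stage a) still
resolve once the contents have been replaced by well-formed XML.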


c) process the file which resulted from stage a) and follow
the <ptr rend="transclude"> elements with

  <xsl:template match="ptr[@rend='transclude']">
    <xsl:copy-of select="document(@target)//body/*"/>
  </xsl:template>

d) now do your proper transformation into target XML.
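
For what it is worth, the three XSLT stages could be driven from the
shell roughly as below. This is only a sketch: xsl:result-document needs
an XSLT 2.0 processor, and I assume Saxon here, with its -s:, -xsl: and
-o: options. The jar location and every file name (blog.xml,
extract.xsl, transclude.xsl, final.xsl and the stage-*.xml
intermediates) are placeholders, not names from the steps above.

```shell
#!/bin/sh
# ASSUMPTION: Saxon is the XSLT 2.0 processor; the jar path is a placeholder.
SAXON_JAR=${SAXON_JAR:-saxon-he.jar}

# usage: xslt SOURCE STYLESHEET OUTPUT
xslt() {
  java -jar "$SAXON_JAR" "-s:$1" "-xsl:$2" "-o:$3"
}

# a) extract escaped markup; the BLOGXTRACT-*.html files should land
#    next to stage-a.xml, since relative href values are resolved
#    against the principal output
stage_a() { xslt "$1" extract.xsl stage-a.xml; }

# c) pull the cleaned fragments back in; document(@target) is resolved
#    relative to stage-a.xml, so the fragments must stay beside it
stage_c() { xslt stage-a.xml transclude.xsl stage-c.xml; }

# d) the real transformation into the target XML
stage_d() { xslt stage-c.xml final.xsl "$1"; }
```

Run stage_a blog.xml first, do the cleanup from step b) on the
fragments, then run stage_c followed by stage_d result.xml.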
-- 
Sebastian Rahtz
Information, Oxford University Computing Services
Sólo le pido a Dios
que el futuro no me sea indiferente