TEI-L@LISTSERV.BROWN.EDU


Subject: Re: Valid HTML Considered Harmful?
From: Michael Beddow <[log in to unmask]>
Date: Sat, 29 Jan 2005 21:31:51 -0000

I appreciate the points made in this thread about the merits of html
validity in general and valid xhtml in particular, but I for one would want
to draw the line at misrepresenting the document's structure in my XML
simply to get a better mapping to valid html. Which is what I think is
happening in this example from earlier on in the thread:

=============================
<p>As someone said at great length:</p>
<quote>
<p>One paragraph of interesting citation</p>
<p>Two paragraphs of interesting citation</p>
</quote>
<p>And so on and so forth</p>
=============================

Here, one paragraph enclosing a quotation which itself has two internal
paragraphs is misrepresented as four separate paragraphs. The unity of the
enclosing paragraph is denied, and the containment of the quoted paras is
translated into sequential contiguity.

That seems to me too high a price to pay for being able to pass the XML
straightforwardly through XSLT and get valid html. Nor am I convinced that
the effort involved in re-expressing the inherent structure of the XML
document via linking and aggregation techniques is worth undertaking for the
sake of the desired html, especially if (as in my case) the html is
generated on demand simply as the delivery vehicle best suited to giving as
many people as possible a reasonably accurate view of what the XML
contains -- which is pretty much all that can be hoped for in the present
state of technology outside environments where document producers have
complete control over the client software.

To stick with the case that launched this thread (on the understanding that
we're focussing on the blockquote issue because of the generic problems it
raises): two ways of getting an XML <p> to be split into self-contained html
<p>s which are siblings of <blockquote>s rather than their parent, without
distorting the XML at source, have been mentioned by earlier contributors.
Let me say a little bit more about them in the hope that it may help those
who want to generate valid html via XSLT from TEI markup see ways forward.

Actually, I'm going to say as little as possible about the first method,
mentioned with appropriate distaste by a couple of people already, namely
abusing disable-output-escaping to pass off markup as text. As others have
said, XSLT is for inputting, transforming, and outputting trees. And
outputting trees is more like wrapping presents than making sandwiches.

You can start making a sandwich by choosing one slice of bread, putting it
on a plate, smearing on some spread, then you can wander at leisure around
the kitchen adding ingredients until you decide your sandwich is finished.
Whereupon you top it with another slice of bread (needn't even be from the
same loaf as long as the size is roughly right) or even opt to leave it as
an open sandwich. And if you decide that slice of pickled gherkin probably
wasn't such a great idea after all, you can take it out again before you
take your first bite. This is what using a serial processor is like; but an
XSLT engine isn't a serial processor.

Wrapping presents is different. You have to decide who is getting what,
then you group together the gifts intended for a given recipient and only
then reach for the wrapping paper and wrap them all up in a single
operation. In fact, you may have been doing the grouping long before you
even thought about buying the wrapping paper, reaching for the carrier bag
where you are accumulating Uncle Jim's presents whenever you encounter
something in your sock drawer that you realise you are never ever going to
wear yourself. But once you've wrapped the presents, you can't change your
mind without starting all over again, so you'd better be sure you really
have settled what the parcel is to contain before you start the wrapping.
Which is the approach you need to take when using a tree-processing system.

In many cases where XSLT output needs to be grouped in some way, the
required grouping is already explicit in the source markup, so you can grab
the contents of your intended parcels in a single operation and wrap them
straight away, without explicitly being aware of the need to group. In other
words, it's as if your task is simply re-wrapping presents someone else
already assembled and assigned to their future recipients; all you need to
do is strip off the old wrapping and put on a new one using different paper.

Which is fine as long as you encounter only such cases. But then along comes
a document where some of the things that need to be grouped together before
you can do the wrapping operation haven't already been conveniently
pre-grouped for you. Your first reaction may be to think XSLT can't cope.
But it can, provided you learn how to use it to group material according to
specifications that you provide. XSLT 2.0 provides commands to do just that;
but XSLT 1.0 can do it too. Although the techniques needed to group in XSLT
1.0 are harder to assimilate and deploy than those in XSLT 2.0, the core of
what you need to grasp, namely the crafting of XPath expressions to select
precisely those things you want to group together, is common to both
versions of the language.
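
For comparison, the XSLT 2.0 instruction in question is xsl:for-each-group. What follows is only a minimal sketch, assuming a 2.0 processor and the same markup used in the sample below (a <p> whose block quotations are <q rend="block"> children); it is not the stylesheet developed in this message. Adjacent children are grouped by whether or not they are a block quotation, and each run of non-quotation nodes is wrapped in its own html <p>.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Only paragraphs that contain at least one block quotation -->
  <xsl:template match="p[q/@rend='block']">
    <!-- Adjacent children fall into the same group as long as they give
         the same answer to "am I a block quotation?" -->
    <xsl:for-each-group select="node()"
        group-adjacent="boolean(self::q[@rend='block'])">
      <xsl:choose>
        <!-- A run of one or more block quotations: no <p> wrapper -->
        <xsl:when test="current-grouping-key()">
          <xsl:apply-templates select="current-group()"/>
        </xsl:when>
        <!-- A run of ordinary content: wrap it in its own <p> -->
        <xsl:otherwise>
          <p><xsl:apply-templates select="current-group()"/></p>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Each block quotation is rewritten as a <blockquote> -->
  <xsl:template match="q[@rend='block']">
    <blockquote><xsl:apply-templates/></blockquote>
  </xsl:template>

</xsl:stylesheet>
```

Whitespace-only text between two adjacent block quotations would come out as an empty-looking <p> in this sketch, so a real stylesheet would want to filter that. The XPath inside group-adjacent is the kind of expression that carries over to XSLT 1.0.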

I'll try to explain this with reference to the following sample document,
consisting of an XML <p> which contains <q> children which we wish to be set
out as html <blockquote>s, meaning that to achieve valid html we need to
split the single XML parent <p> into a number of sibling html <p>s at the
same level as the <blockquote>s we shall be generating.

======================================
<?xml version="1.0" encoding="UTF-8"?>
<text>
<p>This is the start of a <hi rend="bold">paragraph</hi>, and as it has been
well said <q rend="block"> You never know what will be
in a paragraph.</q> (Meaning among other things that the
possibility that <q rend="block">One good quote deserves
another </q>will <hi rend="italic">always</hi> have to be borne in mind)</p>
</text>
======================================

Let's indicate what the groups are that we intend to wrap in output
elements. I'll do that using start and end tags in square brackets, thus
[G1]Contents of group 1 [/G1] etc

<text>
<p>[G1]This is the start of a <hi rend="bold">paragraph</hi>, and as it has been
well said[/G1] <q rend="block"> [G2]You never know what will be
in a paragraph.[/G2]</q>[G3] (Meaning among other things that the
possibility that[/G3] <q rend="block">[G4]One good quote deserves
another [/G4]</q>[G5]will <hi rend="italic">always</hi> have to be borne in mind)[/G5]</p>
</text>

Comparing the placing of my [Gn] tags to the location of true XML element
boundary markers, it will be evident that two of the groups we want to
"wrap" -- G2 and G4 -- are already "wrapped" in the source xml. So they are
already grouped for us. We could get at them by matching the elements
concerned and writing out their contents wrapped in the new "blockquote"
element of our choice.

So our first thought might be to put these templates into our XSLT.

======================================

<xsl:template match="p">
<p><xsl:apply-templates/></p>
</xsl:template>

<xsl:template match="q[@rend='block']">
<blockquote><xsl:apply-templates/></blockquote>
</xsl:template>

=======================================

Which sadly doesn't do everything we want. Sure enough, the <q>s with
@rend='block' are successfully transformed into <blockquote>s, but those
<blockquote>s are children of the <p> we have created, and so will not
validate as html.

So now it's time to check our ropes, adjust our crampons, and tackle a steep
bit of the learning curve. The effort could be worth it.

Our first attempt may mean we climb ourselves into trouble. If we think in
serial processing terms, we may assume that the problem needs to be solved
within the template that handles the block-rendered <q>s. So we attempt to
modify that template, reach for the treacherous tool of
disable-output-escaping and try to progress by writing malformed markup,
passed off as <xsl:text>, into the output: a </p> before our <blockquote>
and a <p> after it. It will all end badly -- literally so, since you'll
finish up, among various other problems, with one <p> too many after your
last <blockquote>.

Now there's a useful rule of thumb here that's often been formulated by
Wendell (though in a somewhat more rigorous way): if you think you need to
use d.o.e in this sort of way, it's a sign that you are trying to operate
too low down the tree (sorry for the directional confusion, real world trees
are vertically directed like real-world mountains, but CompSci trees alas
grow downwards from a root at the top...) So you need to step back up the
tree a level and re-examine the problem from that higher, "elevated",
standpoint.

In other words, we need to tackle the problem at the level where we process
the enclosing XML <p>, instead of trying to firefight only when a template
for a <q rend='block'> has been triggered. So let's first refine our <p>
template match so that it identifies for special treatment only those <p>s
where the problem arises, viz those that have at least one <q> child with
@rend='block':

<xsl:template match="p[q/@rend='block']">
[more stuff to come here]
</xsl:template>

Now, when this template fires we want to do the following:
1. Output all the nodes before the first <q> child, [G1] in my example,
wrapping them in a <p>
2. Output the first <q> child, with its contents [G2] wrapped in a
<blockquote>
3. Output all the nodes before the next <q> child, [G3] in my example,
wrapping them in a <p>
4. Output the next <q> child, with its contents [G4] wrapped in a
<blockquote>
And so on, until we have output the last <q> child, duly re-wrapped as a
<blockquote>
5. Then we have finally to output all the nodes after that last <q> child,
i.e. [G5], again wrapped in a <p>.

So how do we do that? Well, we've already done the necessary for steps 2 and
4. The problem is: how to precede and follow  steps 2 and 4 with the output
of the grouped items of our choice.
Let's pseudo-code that:

<xsl:template match="p[q/@rend='block']">

FOR EACH (<q> CHILD HAVING @rend='block')
   BEGIN
       <p>
         PROCESS NODES PRECEDING THAT CHILD
       </p>
        PROCESS THE CHILD ITSELF (by rewriting as <blockquote>)
   END
        <p>
        PROCESS NODES FOLLOWING LAST <q> CHILD
        </p>
</xsl:template>

All we have to do is replace the pseudo-code by real XSLT and the job's
done, (or at least the apparently hard parts of it are -- there's more to be
done to cope with more realistic documents, but most of it is relatively
straightforward stuff that I won't go into).

What we need now is what Wendell alludes to earlier in this thread as "key
magic".  Like most prestidigitation, such tricks take a bit of practice to
perform, but in principle it's pretty simple. Alas, the nomenclature and
syntax required to put that principle into effect are somewhat rebarbative
in XSLT 1.0. And since the discussions of using keys for grouping found in
XSLT primers tend to concentrate on "datacentric" XML, it may not be
immediately clear how and why keys can also be useful for the grouping tasks
we TEI-ish docucentrics also encounter.

To group things prior to wrapping them we need (a) a way of specifying which
items are candidates for inclusion in one or the other of our groups, and (b) a
way of singling out from among those candidates just the items that belong
in each specific group.

In the example case, what qualifies a document node as a potential member of
one of the groups we want to assemble and then wrap in an html <p> is that it's
a child node of an XML <p> which also has <q> children where @rend='block'.
We need to build an XPath expression that will select just those qualifying
nodes. Let's cast our net fairly widely at first then narrow the mesh. We
can begin with an expression that will select only those <p>s that have <q>
children with a @rend value set to 'block', which we do by adding
a simple predicate expression in square brackets after the "p":
match="p[q/@rend='block']"
We're making progress, but what we really want to group are not the
aggregated contents of the <p>s we have so far selected: we want as the
candidate constituents of our groups the text and element node children of
those <p>s. So, leaving our predicate in place to filter the <p>s as
before, we extend the XPath to select children of those filtered <p>s,
whether they are text nodes or element children (along with any text
descendants those children may have). Because "|" binds more loosely than
"/", the filtered <p> has to be spelled out on both sides of the union (a
bare "|*" at the end would match every element in the document), giving:
match="p[q/@rend='block']/text() | p[q/@rend='block']/*"

And that's it. All our potential candidates for grouped output will be
selected by that expression.

But how do we now do the actual grouping of those selected candidates into
the required html <p> chunks? Looking back to our pseudo-code, we see that
we want to be able to say to the XSLT processor, "Hand me just those nodes
which (a) match my general criteria for group inclusion AND (b) for which it
is the case that the next <q> node in document order is the next <q> node we
intend to output as a <blockquote>, so that I can wrap those nodes in a <p>
and output them first before that blockquote is processed".

To do that, we need to have a unique identifier for each <q> node with
@rend='block'. Which the XSLT processor is happy to give us. Pass an
expression selecting any node as a parameter to its generate-id() function
and the processor will return an id unique to that node (unique for this run
of the processor, that is, but that's all we need here). Now, each of the
candidate nodes we selected by the XPath above will have, sharing its common
<p> ancestry, one or zero immediate successor <q> nodes with @rend='block'
(why "or zero"? : look at [G5] above for the answer). When one of those
candidate nodes is the current node, that immediate successor <q> node will
be selected by the expression "following::q[@rend='block'][1]", which says
"look from the current node along the "following" axis and select the
first <q> node with @rend='block' along that axis". (Purists might say the
[1] is superfluous
for what we are about to do, but it helps some processors optimise the
instruction). And to get a unique id for that selected successor node, so we
can easily retrieve the nodes we want to associate with it, we pass the
expression selecting it to the generate-id() function, like so:

generate-id(following::q[@rend='block'][1])
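
To see these ids at work before committing to a key, a throwaway diagnostic stylesheet along the following lines can help. It is a sketch only, assuming the sample document above; it lists each child node of a qualifying <p> next to the generated id of its next block <q>. The id strings themselves are processor-specific, so none are shown here.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Visit every child node of a <p> that has a block quotation -->
    <xsl:for-each select="//p[q/@rend='block']/node()">
      <!-- Elements are labelled by name; text nodes (and any comments)
           have no name, so they get a fixed label instead -->
      <xsl:choose>
        <xsl:when test="self::*"><xsl:value-of select="name()"/></xsl:when>
        <xsl:otherwise>#text</xsl:otherwise>
      </xsl:choose>
      <xsl:text> : </xsl:text>
      <!-- The id of the next block <q>, or nothing at all if none follows -->
      <xsl:value-of select="generate-id(following::q[@rend='block'][1])"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Nodes that print the same id belong to the same group. The nodes of [G5] print an empty id, because generate-id() returns the empty string when handed an empty node-set.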

Time to face the grim syntactic realities of creating the actual XSLT key.

At the top level of our stylesheet (or more precisely, outside any templates
or RTF constructors), we need to put a declaration like
<xsl:key match="XPATH THAT IDENTIFIES OUR CANDIDATE NODES FOR GROUPING"
name="SOMETHING SUITABLY MNEMONIC" use="EXPRESSION WHICH GROUPS THE
CANDIDATE NODES IN OUR CHOSEN WAY" />

We've already worked out the expression that will become our match value,
and the id we just generated turns out to be the criterion we need to form
our groups, because we will always want as members of our group for output
as an html <p> those nodes which have the same immediate successor <q> node.
So we can supply the necessary attribute values to create our key:
<xsl:key match="p[q/@rend='block']/text() | p[q/@rend='block']/*"
name="prequoteNodes"
use="generate-id(following::q[@rend='block'][1])" />

Now if we pass as parameters to the key() function the name of this key,
"prequoteNodes", plus the generated id of any <q> node that has
@rend='block', the processor will return to us the nodes whose immediate
<q> successor is the one whose id we passed in. Which is almost exactly what
we want in order to assemble our groups.

"Almost" because there is a subtle problem with our selection expression.
Consider the second or later <q> node with @rend='block' within any given
<p>, such as the one labelled [G4] above. If you look backwards (i.e. in
reverse document order) from that node, you will see that the last node
which has node [G4] as its first successor <q> with @rend='block' is none
other than node [G2]. That is to say: if we pass in the identifier of node
[G4] to the key we just created, it will return all the nodes we want, plus
one we emphatically don't want, namely [G2]. Unless we do something about
that, we will process node [G2] twice, once where we want it, within its own
<blockquote>, and then again as spurious content of the following html <p>.
Rather than risk severe brain-sprain by trying to refine our key-creation
expressions still further, we can set up our <p> handling template so that
as soon as the unwanted node tries to emerge, we pass over it. More of that
in a moment.
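
For anyone inclined to risk the brain-sprain after all, one possible refinement is sketched here as an alternative, not as the approach followed in the rest of this message. The key name prequoteStrict is made up for illustration. It drops the block <q> elements from the candidates and looks along following-sibling rather than following, so a lookup never returns a <q rend="block"> and never strays outside the current <p>.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical variant of the grouping key.
       match: every child node of a qualifying <p>, except the block
              <q> elements themselves
       use:   the id of the next block <q> among the node's own siblings -->
  <xsl:key name="prequoteStrict"
      match="p[q/@rend='block']/node()[not(self::q[@rend='block'])]"
      use="generate-id(following-sibling::q[@rend='block'][1])"/>

</xsl:stylesheet>
```

With such a key, key('prequoteStrict', generate-id(.)) evaluated on a block <q> would return only the nodes between it and the previous block <q>, so no pass-over test would be needed.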

Beyond that, we need also to cater for the last group within any such <p>,
those that may have no <q> successor in that <p> and so not be returned by
the key we just created ([G5] above). We need another key that selects
exactly the same candidate nodes, but groups them this time by the
immediately *prior* <q> node with @rend='block'. All we have to do to create
this other key is alter the axis in the "use" value (and, of course, supply
a different name for the key) giving

<xsl:key match="p[q/@rend='block']/text() | p[q/@rend='block']/*"
name="postquoteNodes"
use="generate-id(preceding::q[@rend='block'][1])" />

Since we shall only ever use this key to retrieve the nodes after the last
<q> child of any given <p> there is here no problem about capturing a later
<q> node within that <p> since there will never be one in the circumstances
when we shall use this key.

Whoopee! Time for the real code for the <p> handler...

<!-- Handle a <p> with at least one <q> child with @rend='block' -->
<xsl:template match="p[q/@rend='block']">
  <!-- Get the set of child <q> nodes with @rend='block' and process each
       in turn -->
  <xsl:for-each select="q[@rend='block']">
    <!-- Get the id of the q node currently being processed into a
         variable -->
    <xsl:variable name="thisID" select="generate-id(.)"/>
    <!-- Now get from our key the nodes preceding the current <q> node,
         back to either the start of the <p> or the previous <q>, so we
         can process them, wrapping the result in a <p>. BUT if the nodes
         returned by our key include the preceding <q> node, we must pass
         over it without processing it (see discussion in main text). -->
    <!-- So we can check for that preceding <q> node, get its ID into a
         variable. The @rend='block' test keeps an inline <q> from being
         mistaken for it. -->
    <xsl:variable name="prevqNode"
      select="generate-id(preceding::q[@rend='block'][1])"/>
    <p>
      <!-- Now use our key to get each node in turn -->
      <xsl:for-each select="key('prequoteNodes',$thisID)">
        <!-- Process the node unless its ID matches that of the previous
             <q> -->
        <xsl:if test="generate-id(.) != $prevqNode">
          <xsl:apply-templates select="."/>
        </xsl:if>
      </xsl:for-each>
    </p>

    <!-- Now process the actual <q> node -->
    <xsl:apply-templates select="."/>

  </xsl:for-each>

  <!-- By this point we have processed everything in the source <p> apart
       from any nodes following the last <q> element with @rend='block'.
       So we finish by retrieving and processing those remaining nodes, if
       they exist. -->
  <!-- Get the id of the last such q child into a variable -->
  <xsl:variable name="lastQuote"
    select="generate-id(q[@rend='block'][last()])"/>
  <!-- Use our key to retrieve any nodes after that last q node and process
       them -->
  <p><xsl:apply-templates select="key('postquoteNodes',$lastQuote)"/></p>

</xsl:template>

That's it.

Add handlers for q and hi elements, and make the root handler do some html
wrapup, and you will get as output something like:

========================
<html>
<body>
<p>This is the start of a <strong>paragraph</strong>, and as it has been
well said </p>
<blockquote>You never know what will be in a paragraph</blockquote>
<p> (Meaning among other things that the possibility that </p>
<blockquote>One good quote deserves another </blockquote>
<p>will <strong>always</strong> have to be borne in mind)</p>
</body>
</html>
========================

To borrow what Ralph Vaughan Williams said about one of his symphonies: "I'm
not sure I like it, but I think it's what I meant".
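
The extra handlers just mentioned might look something like the following. This is a sketch under assumptions, not a copy of any finished stylesheet: in particular, mapping every <hi> to <strong> regardless of its rend value is a choice made only to agree with the output shown above, where both the bold and the italic <hi> come out as <strong>.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- The two xsl:key declarations and the p[q/@rend='block'] template
       discussed above belong in this same stylesheet. -->

  <!-- Root handler: wrap everything in a minimal html shell -->
  <xsl:template match="/">
    <html>
      <body><xsl:apply-templates/></body>
    </html>
  </xsl:template>

  <!-- Ordinary paragraphs, with no block quotation inside, pass through -->
  <xsl:template match="p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- Block quotations become <blockquote> -->
  <xsl:template match="q[@rend='block']">
    <blockquote><xsl:apply-templates/></blockquote>
  </xsl:template>

  <!-- Assumption: every <hi> is rendered as <strong>, whatever its rend -->
  <xsl:template match="hi">
    <strong><xsl:apply-templates/></strong>
  </xsl:template>

</xsl:stylesheet>
```

The plain "p" template does not compete with the special one: a pattern carrying a predicate has a higher default priority than a bare element name, so p[q/@rend='block'] wins wherever both match.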

Since anyone who wants to explore these techniques is probably going to want
to play around with this material, I have zipped up the sample document, the
"complete" stylesheet [scare quotes because of course in any real
application the sheet would have to be greatly extended] and sample output
and made them available for download at
http://www.anglo-norman.net/sitedocs/workshop/keymagic.zip
from whence anyone interested can retrieve them. I'll keep them there for a
few weeks, but this is not intended to be a polished permanent offering but
just a quick adjunct to this thread.

Michael Beddow
