On Mon, 2005-09-26 at 11:45, Michael Beddow wrote:
> Martin Mueller wrote:
> 
> > Is there a set of rules or best practices about what to do with space
> > between words when you encode down to the word level? We have in one
> > project used space between <w> elements, and the parser respects
> > this, although it seems theoretically wrong.
> 
> This is, for some types of project, an important issue on which the
> Guidelines currently have little to say, but I'm puzzled by some of the
> terms in which it's expressed here, and some of the responses seem to me to
> be mixing up things that need to be kept conceptually distinct, even though
> there are overlaps in practice.
> 
> What is meant by "the parser respects this", and why does such "respect"
> seem "wrong"? Presumably it doesn't mean "the parser doesn't raise a
> validation error", because of course in unmodified TEI it would never do so:
> the likely parents of <w> elements allow for mixed content, and space
> characters are to the parser just #PCDATA like any other characters.

The parsers are behaving as prescribed in the XML Spec. The difference
is in the attitude to white-space.

In SGML, a DTD was compulsory, so all parsers would always know in
advance which element types were declared with element content, which
were declared with mixed content, which were declared with PCDATA
content, and which were declared EMPTY.

In XML, a DTD is optional, so a parser might have no way of knowing any
of this. To prevent parsers looking for end-tags for EMPTY elements, the
empty-element tag was introduced (the /> termination, related to SGML's
Null End Tag mechanism), but the problem remained for the distinction
between element content, mixed content, and PCDATA declared content.

Knowing this in advance is what lets SGML applications ignore all 
white-space in element content and preserve it all in mixed or PCDATA
content.

The rule is therefore that all XML parsers shall pass *all* white-space
through to the application (editor, browser, transformation engine, etc)
signalled with the context in which it was found, and let the 
application decide what to do with it. 
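
To make this concrete, here is a minimal sketch in Python using the
standard library's xml.dom.minidom (a non-validating parser, with no
DTD in play). The sample line and the child_summary helper are my own
illustration, not taken from any project. It just lists what the parser
hands to the application.

```python
from xml.dom import minidom

# Hypothetical sample: two <w> elements separated by a single space,
# inside an <l> element that is indented with newlines.
SAMPLE = "<l>\n  <w>arma</w> <w>virum</w>\n</l>"


def child_summary(xml_text):
    """Return (kind, value) pairs for each child node of the root element."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            summary.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("element", node.tagName))
    return summary


for kind, value in child_summary(SAMPLE):
    print(kind, repr(value))
```

Every white-space-only run turns up as a text node of its own: the
indentation before the first <w>, the space between the two <w>
elements, and the newline before the end-tag. Nothing in what the
application receives marks one as layout and another as significant.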

So an application like XSLT has a switch (xsl:strip-space, with
xsl:preserve-space as its opposite) to allow the preservation or
removal of white-space. Unfortunately the distinction is binary: you
can either keep it or lose it. No distinction is made between
white-space found in element content and that found in mixed or PCDATA
content because *at the time of parsing it*, the parser may not know
what further type of content awaits it within the current element.

The result is that text nodes containing only white-space tokens are
removed entirely when the strip-space switch is ON. My argument is that
if at least one subelement has already been encountered in the current
element, then white-space-only nodes should no longer be suppressed in
this element, but collapsed to a single space token. This would still
permit the suppression of leading white space nodes, which is almost
always what you want, but it would defeat the suppression of trailing
white-space nodes (because in mixed content a preceding element would
have been encountered).
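
A rough sketch of the rule I am arguing for, set beside plain
stripping, again in Python with xml.dom.minidom. The function names
(strip_all, strip_proposed, text_of) and the sample are hypothetical,
and this only models the behaviour on a DOM tree; it is not how any
XSLT processor actually works.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four XML white-space characters


def is_ws_only(node):
    """True for a text node consisting solely of XML white-space."""
    return node.nodeType == node.TEXT_NODE and node.data.strip(XML_WS) == ""


def strip_all(elem):
    """Model of strip-space ON: remove every white-space-only text node."""
    for node in list(elem.childNodes):
        if is_ws_only(node):
            elem.removeChild(node)
        elif node.nodeType == node.ELEMENT_NODE:
            strip_all(node)


def strip_proposed(elem):
    """Proposed rule: drop white-space-only nodes until a subelement has
    been seen in this element, then collapse them to a single space."""
    seen_element = False
    for node in list(elem.childNodes):
        if node.nodeType == node.ELEMENT_NODE:
            seen_element = True
            strip_proposed(node)
        elif is_ws_only(node):
            if seen_element:
                node.data = " "
            else:
                elem.removeChild(node)


def text_of(elem):
    """Concatenate all text content below elem, in document order."""
    return "".join(
        n.data if n.nodeType == n.TEXT_NODE else text_of(n)
        for n in elem.childNodes
        if n.nodeType in (n.TEXT_NODE, n.ELEMENT_NODE)
    )


def render(xml_text, stripper):
    root = minidom.parseString(xml_text).documentElement
    stripper(root)
    return text_of(root)


SAMPLE = "<l>\n  <w>arma</w>\n  <w>virum</w>\n</l>"
print(repr(render(SAMPLE, strip_all)))       # 'armavirum'
print(repr(render(SAMPLE, strip_proposed)))  # 'arma virum '
```

The second result keeps the word boundary, at the price of a single
trailing space after the last <w>, which is the trade-off described
above.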

However, all attempts to persuade people that this is needed have fallen
on deaf ears so far, and I have not yet seen a workable alternative. In
many cases it is irrelevant, as browsers perform white-space collapsing
anyway, and so does an application like TeX. The problem only arises
when strip-space is ON (because it makes life a little neater to get rid
of all the white-space between elements in element content), and you
have adjacent elements in mixed content separated by white-space. In
this case the space disappears and the output concatenates the text
content of the two elements, which is grossly suboptimal.

I can well see the problem when working without a DTD, but when using
one I see no reason why a parser should not pass the information about
the environment to the application, so that the application would know
that the white-space was found in element content or mixed content.
It is inadequate to argue that this behaviour can be modelled by listing
in the "elements" attribute of the strip-space element of XSL[T] all the
element type names for which you require white-space to be stripped,
firstly because some element types can occur both in element content
and in mixed content, and secondly because the required behaviour
applies to white-space nodes in mixed content only. I would happily
forgo the suppression of leading white-space nodes if the correct
behaviour were
possible. It *is* possible to model the behaviour in a template for
text() with strip-space turned off, but only by specifying the entire
range of possible circumstances (for example as an xsl:choose list).
This is not a viable alternative when dealing with very large DTDs
like the TEI.
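
To illustrate why a list of element names falls short, here is a small
Python model of name-based stripping, assuming a made-up fragment in
which <item> is used both ways. The strip_by_name and serialise helpers
are hypothetical; they only imitate the effect of naming elements in
strip-space and are not an XSLT implementation.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four XML white-space characters


def strip_by_name(elem, strip_names):
    """Remove white-space-only text children, but only inside elements
    whose name is listed in strip_names."""
    for node in list(elem.childNodes):
        if node.nodeType == node.ELEMENT_NODE:
            strip_by_name(node, strip_names)
        elif (node.nodeType == node.TEXT_NODE
              and elem.tagName in strip_names
              and node.data.strip(XML_WS) == ""):
            elem.removeChild(node)


def serialise(xml_text, strip_names):
    root = minidom.parseString(xml_text).documentElement
    strip_by_name(root, strip_names)
    return root.toxml()


# The first <item> holds only a subelement plus layout white-space;
# the second holds two adjacent elements separated by a real space.
SAMPLE = (
    "<list>\n"
    "  <item>\n"
    "    <p>first</p>\n"
    "  </item>\n"
    "  <item><hi>very</hi> <hi>good</hi></item>\n"
    "</list>"
)

# Listing "item" loses the space between the two <hi> elements.
print(serialise(SAMPLE, {"list", "item"}))
# Leaving "item" out keeps that space, and the layout white-space too.
print(serialise(SAMPLE, {"list"}))
```

Neither choice gives the right result for both items, because the
decision is made per element type name rather than per context.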

My conclusion is that the needs of text-document users are not being 
met with the current specification and behaviour. Unfortunately I have
been unable to persuade anyone that this is a real problem. I would be
delighted to be proved wrong, of course.

///Peter