Martin Mueller wrote:

> Is there a set of rules or best practices about what to do with space
> between words when you encode down to the word level? We have in one
> project used space between <w> elements, and the parser respects
> this, although it seems theoretically wrong.

This is, for some types of project, an important issue on which the
Guidelines currently have little to say, but I'm puzzled by some of the
terms in which it's expressed here, and some of the responses seem to me to
be mixing up things that need to be kept conceptually distinct, even though
there are overlaps in practice.

What is meant by "the parser respects this", and why does such "respect"
seem "wrong"? Presumably it doesn't mean "the parser doesn't raise a
validation error", because of course in unmodified TEI it would never do so:
the likely parents of <w> elements allow for mixed content, and space
characters are to the parser just #PCDATA like any other characters.

To alter the parser's behaviour would require defining a new class of
segment that disallowed mixed content. An SGML parser would then recognise
inter-element white space in the content of such a segment as "ignorable",
allowing any application serviced by that parser to treat it differently in
any desired way. Under such a content model, white space intended by the
encoder to be "non-ignorable" in the SGML sense would need to be placed in a
<c> element. XML parsers are of course obliged by the spec to report *all*
white space to the application they service. Many modern XML parsers, if in
validating mode, will report what an SGML parser would treat as "ignorable"
white space via a separate callback for "lexical" character data, allowing
emulation of SGML ignorable space, but they aren't obliged to do so. If
"respects" means simply "white space between <w> elements in the source
instance is reported as such by the parser", then that is anything but
wrong, because it's precisely the behaviour that the XML spec inexorably
demands.
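
To make the reporting behaviour concrete, here is a minimal Python sketch
using the standard library's SAX interface (which, as far as I know, wraps a
non-validating parser). The <s> wrapper and the Recorder and report names
are assumptions made up for the illustration, not anything from the TEI or
from the project under discussion.

```python
import xml.sax


class Recorder(xml.sax.ContentHandler):
    """Record every event the parser reports, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Called for all character data, whitespace-only or not.
        self.events.append(("chars", content))


def report(xml_text):
    """Parse xml_text and return the list of recorded events."""
    handler = Recorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


# Hypothetical fragment: <s> stands in for any mixed-content parent of <w>.
events = report("<s><w>the</w> <w>cat</w></s>")
for event in events:
    print(event)
```

The single space between the two <w> elements should turn up as a
character-data event of its own, alongside the events for "the" and "cat":
the parser hands it over, and what happens to it next is the application's
business.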

> One could also deal with
> the space at the processing level and have a rule to
> the effect that  a word element is followed by a space
> unless the content of the next
> element begins with a punctuation mark, etc.

There are two problems with the "assume that <w>s are bounded by white space
and render accordingly" idea. First, as Martin Holmes indirectly warns in
his response, <w>s can nest, and they can also legitimately contain white
space themselves, if the encoder decides that a run of characters
containing white space should count as a single <w>. <w> belongs to the
tagset for encoding linguistic analysis or description, and the needs of
such encoding mean that the boundaries of a <w> do not always coincide with a
common-or-garden lexical token or "word". As the Guidelines put it: "<w>
represents a grammatical (not necessarily orthographic) word". Secondly, it
leaves unresolved the related problem of what to do with punctuation marks.
Presumably the "next element" referred to here as having content "beginning
with a punctuation mark" would be another <w>. But putting punctuation marks
within <w> elements, at either "end", certainly does seem wrong -- if they
can legitimately be associated with a segment, then in most cases that
segment would need to be at least a phrase, not an individual <w>; but if no
such phrasal segments are being marked up, the punctuation marks seem to
have no home. If it really matters (and in the absence of a fuller context,
it's hard to judge whether it does or should), one method is to use <c>s
between <w>s to enclose "word"-separating white space and punctuation marks.
For maximum explicitness, since <c> inherits the attributes (though not the
mixed content model) of <seg>, it would be possible to type such <c>s as
containing either separator or punctuation characters, or whatever other
typology seemed useful. Once that's done, any white space between <w> or <c>
elements can be disregarded when processing. The parser itself can't be told
to ignore it, but any application fed by the parser can be made to do so,
since it is readily identifiable.
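
A rough Python sketch of what "disregarded when processing" might look like
under the <c> approach follows. The <s> wrapper, the type values "sep" and
"punct", and the render function are all hypothetical choices for the
illustration; a real TEI document would also carry a namespace, which is
left out here for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: separators and punctuation live in typed <c>
# elements, and the line breaks and indentation between elements are
# just layout.
SAMPLE = """<s>
  <w>in spite of</w>
  <c type="sep"> </c>
  <w>rain</w>
  <c type="punct">.</c>
</s>"""


def render(seg_xml):
    """Concatenate <w> and <c> content, ignoring text between them."""
    seg = ET.fromstring(seg_xml)
    parts = []
    for child in seg:
        if child.tag in ("w", "c"):
            # itertext() also picks up nested <w>s and any white space
            # that sits inside a <w>, so both survive.
            parts.append("".join(child.itertext()))
        # seg.text and child.tail (the text between elements) are never
        # consulted; under this encoding they hold only layout white space.
    return "".join(parts)


print(render(SAMPLE))
```

The white space inside the multi-word <w> is kept because it is content of
that element, while the indentation between elements never reaches the
output. The sketch trusts the encoding: stray non-white-space text between
elements would be dropped silently, so a real application would probably
want to check for it.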

Of course, the problem of transcribing standard interword spaces raised by
<w> level segmentation of Western scripts is a different question from the
encoding of sources in scriptio continua, where the indication (or not) of
word boundaries in the markup is a matter of editorial policy and judgement
rather than simply of encoding practice; and it's different, too, from the
recording of spaces in a source that have some kind of specific semantic or
structural import beyond being standard pervasive token boundary markers.

So this is a basic markup question about the appropriate encoding of white
space (and punctuation) in a document instance that segments down to <w>
level, no matter how it is to be rendered or otherwise processed. There is
no necessary connection with XSLT, although in practical terms XSLT might
well be the tool called upon to render such markup with whatever white
spacing is desired. But that doesn't mean that the question "how should I
encode whitespace of type X in documents of type Y?" has anything to do with
questions like "How do I tell XSLT to preserve white space between child
elements of a mixed-content node?", and trying to answer the first sort of
question by raising the second doesn't really help resolve either.

Whitespace processing in XSLT is notoriously tricky, mainly because of the
implications of ground rules for whitespace specified for XML parsers, which
make flexibly intelligent treatment of spaces necessarily complex. I don't
think I'd agree that current mainstream XSLT processors are "flaky" in their
whitespace handling. Divergent, yes, but I know of nothing about that
divergence that contravenes the relevant specs or that is undefined or
unpredictable, and which therefore can't be managed given appropriate
understanding of how and why the different processors behave as they do,
combined with a grasp of the principles of whitespace in XSLT. Nor, given
the treatment of whitespace mandated by the XML spec, can I see that the
authors of XSLT had any alternative to the way strip-space works. Since the
XML spec provides no way of reliably identifying SGML-type "ignorable"
whitespace (not even via access to a DTD), there is no way of making
strip-space or preserve-space recognise which space should be dropped and
which should be kept. Hence the behaviours: within the elements it names,
strip-space removes every whitespace-only text node and preserve-space
keeps every one. There are some ways of mitigating the consequences: string
values can be taken and processed via normalize-space(), and both
strip-space and preserve-space take a list of element names (or wildcards),
so that specific mixed-content elements can be singled out for
space preservation while other elements have inter-element space stripped.
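
For readers who find the behaviour easier to follow in code, here is a
Python emulation of it. It is emphatically not XSLT and not any processor's
implementation: strip_space, normalize_space and the sample document are
names invented for this sketch, and the sketch assumes the common setup of
stripping everywhere except in a few named elements.

```python
import re
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"  # the four XML white space characters


def normalize_space(s):
    """Rough stand-in for XPath normalize-space(): collapse and trim."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


def strip_space(elem, preserve=()):
    """Drop whitespace-only text nodes, except directly inside elements
    whose names are listed in `preserve`."""
    keep = elem.tag in preserve
    if not keep and elem.text is not None and not elem.text.strip(XML_WS):
        elem.text = None
    for child in elem:
        strip_space(child, preserve)
        # A child's tail is a text node of *this* element, so this
        # element's setting decides whether it stays.
        if not keep and child.tail is not None and not child.tail.strip(XML_WS):
            child.tail = None


# Hypothetical sample: <p> is the mixed-content element to single out.
DOC = "<div>\n  <p><w>the</w> <w>cat</w></p>\n</div>"


def text_after(preserve):
    root = ET.fromstring(DOC)
    strip_space(root, preserve)
    return "".join(root.itertext())


print(repr(text_after(())))
print(repr(text_after(("p",))))
```

With nothing preserved, the space between the two <w> elements goes along
with the indentation; naming p keeps the interword space while the layout
white space directly inside <div> is still dropped. No finer distinction
than "per element name" is available, which is the limitation described
above.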

Once that is taken care of, in my experience most of the unexpected
whitespace people find in their XSLT output leaks in from their XSLT sheet
itself rather than stemming from their input document or being an
intentional result of the actual transforms they have coded in their
templates. And here there is a relevant connection to the first set of
issues. Under a "spaces-and-punctuation-as-<c>-content" approach,
outputting the white space or punctuation marks contained in such <c>s by
means of <xsl:text> elements enclosing the same content would have a benign
side-effect: because of XSLT's rules about contiguous whitespace-only
nodes, it would inhibit leakage of whitespace from the XSLT sheet itself
into the output tree. That leakage (over which the xsl:strip-space and
xsl:preserve-space settings have no influence whatever) is in my experience
a major reason why people prematurely abandon the struggle to get control of
whitespace in their transformed documents.
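
The leakage can be simulated without running a stylesheet. The Python
sketch below applies the stylesheet-side stripping rule as I read it in
XSLT 1.0 section 3.4 (a whitespace-only text node in the sheet is dropped
unless its parent is xsl:text) to two hypothetical template bodies. It
ignores xml:space and is in no sense an XSLT processor; the function name
and both templates are invented for the illustration.

```python
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"
XML_WS = " \t\r\n"


def surviving_text_nodes(template_xml):
    """List the text nodes of a template that survive stylesheet
    whitespace stripping (simplified: xml:space is ignored)."""
    kept = []

    def visit(elem):
        in_xsl_text = elem.tag == "{%s}text" % XSL
        # Text inside xsl:text always survives; elsewhere it survives
        # only if it contains some non-white-space character.
        if elem.text and (in_xsl_text or elem.text.strip(XML_WS)):
            kept.append(elem.text)
        for child in elem:
            visit(child)
            if child.tail and child.tail.strip(XML_WS):
                kept.append(child.tail)

    visit(ET.fromstring(template_xml))
    return kept


# Hypothetical templates for a <c> holding a comma.
LEAKY = (
    '<xsl:template xmlns:xsl="%s" match="c">\n'
    "  ,\n"
    "</xsl:template>" % XSL
)
SAFE = (
    '<xsl:template xmlns:xsl="%s" match="c">\n'
    "  <xsl:text>,</xsl:text>\n"
    "</xsl:template>" % XSL
)

print(surviving_text_nodes(LEAKY))
print(surviving_text_nodes(SAFE))
```

In the first template the comma shares a text node with the surrounding
line breaks and indentation, so the whole node, layout included, would be
copied to the output. In the second, the layout white space forms
whitespace-only nodes of its own on either side of <xsl:text> and is
stripped, leaving just the comma.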

Michael Beddow