John Carlson wrote:
> Perhaps I'm being dull about this, but the following (simplified) snippet
> doesn't appear to be valid TEI P5 encoding:
> but this is:
Well, for better or worse, the <w> element is defined that way! The idea
is that <w> will be used for applications in which the presence of the
full range of possible phrase level elements would confuse things. It's
a specialised form of <seg>, intended for tokenization only. So it
permits just text, <g>, global elements, and other <seg>s. As you say,
the inclusion of the last of these re-opens the door which the use of
<w> itself just shut, which seems a bit strange. Maybe <seg> shouldn't
be permitted there at all.
> To me, the <seg> elements in the second snippet seem like a hack that
> shouldn't be necessary (an abbreviated form of "with" is still just a word,
> correct?). Can someone explain the rationale behind disallowing the first
> example? Is it perhaps an oversight? Am I somehow using the <w> element
These segmentation elements are tricky, and I would be the last to claim
that we currently have them right, but fwiw, I think the intended use of
<w> is simply to tokenize strings of text with minimal markup. If you
need to use more complex markup within your tokens, I'd suggest using
<seg type="w"> or <seg type="token"> instead.
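
As a rough sketch of what I mean: the abbreviation "wt" and the use of
<choice>, <abbr> and <expan> below are my own illustrative assumptions,
not taken from your snippets, and the surrounding <l> is only there to
give the tokens a home. An abbreviated "with" might then be tokenized
something like this:

```xml
<!-- Illustrative only: "wt" and the <choice> markup are assumed,
     not copied from the original post. -->
<l>
  <seg type="w">come</seg>
  <seg type="w"><choice><abbr>wt</abbr><expan>with</expan></choice></seg>
  <seg type="w">me</seg>
</l>
```

Since <seg> permits the full range of phrase level elements, this should
validate where the equivalent using <w> does not. The value of the type
attribute is only a convention, so you would want to document it for
your project.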
But clearly this needs more thought!