> These segmentation elements are tricky, and I would be the last to
> claim that we currently have them right, but fwiw, I think the
> intended use of <w> is simply to tokenize strings of text with
> minimal markup. If you need to use more complex markup within your
> tokens, I'd suggest using <seg type="w"> or <seg type="token">.
> But clearly this needs more thought!

Indeed, it does. As Syd Bauman knows well, Brian Pytlik Zillig and
Steve Ramsay at Nebraska have been wrestling with this in the context
of providing tokenization and linguistic annotation for texts in the
TCP corpus. There you have tens of thousands of instances like
Clearly in these instances you have a single "token space", but you
cannot at the moment do
I am glad you're implicitly blessing kludges like <seg type="token">.
I have thought of convenience elements like <tok> with a content model
identical to <seg>. Actually, it doesn't take too radical an
expansion of the current content model of the <w> element to
accommodate the great majority of oddities that occur in actually
encoded TEI texts of one kind or another.
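
To make the <seg type="token"> kludge concrete, here is a made-up
illustration. It is not one of the TCP instances, and the word and the
rend value are my own assumptions. It shows a single token that carries
a highlighted initial and is broken across a line:

    <seg type="token"><hi rend="italic">K</hi>ing<lb/>dom</seg>

Because <seg> has a very permissive content model, the <hi> and <lb/>
can sit inside the token without complaint. Whether <w> would accept
the same content depends on the content model in force.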