Brett Zamir wrote:
> If <w> cannot have whitespace (despite the definition indicating that a
> word is "not necessarily orthographic"), what about cases like "à la"?
> By requiring whitespace, one also cannot take advantage of the @lemma
Where does it say that <w> cannot contain whitespace? The <w> element
delimits "words", which might be defined in a number of different ways,
and might not necessarily correspond with spelling convention, so the
content of a <w> might well include whitespace. Equally, you cannot
assume that the end of a <w> necessarily implies following whitespace. For
example you might segment "of course it isn't" as follows:

  <w>of course</w> <w>it</w> <w>is</w><w>n't</w>

on the grounds that "of course" is a single lexical item with a single
linguistic function, and "isn't" is really two. Stranger things have
been known to happen. Yes, this has implications in that you have to be
careful where you put (or don't put) the whitespace between your <w>s if
you don't want it to be eaten by the XML parser.
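
To make the whitespace point concrete, here is a rough sketch in Python using the standard library's xml.etree.ElementTree. The <s> wrapper, the particular split of "isn't" into "is" and "n't", and the function name are my own assumptions for illustration, not anything the Guidelines prescribe. It only shows that whitespace inside a <w> belongs to the word, while whitespace between <w>s is separate character data that a careless processing step could drop or add.

```python
import xml.etree.ElementTree as ET

# Hypothetical segmentation; <s> is only here to provide a single root element.
SAMPLE = "<s><w>of course</w> <w>it</w> <w>is</w><w>n't</w></s>"

# Same words, but with a line break put between the last two <w>s.
RELAID = "<s><w>of course</w> <w>it</w> <w>is</w>\n<w>n't</w></s>"


def words_and_text(xml_string):
    """Return the content of each <w> and the full running text."""
    root = ET.fromstring(xml_string)
    # Content of each <w>, including any whitespace inside it.
    words = [w.text or "" for w in root.iter("w")]
    # Running text: word content plus whatever sits between the <w>s.
    text = "".join(root.itertext())
    return words, text


words, text = words_and_text(SAMPLE)
print(words)
print(repr(text))

# Re-laying out the markup changes the running text, not the words.
relaid_words, relaid_text = words_and_text(RELAID)
print(repr(relaid_text))
```

The word lists are the same for both inputs; only the running text differs, which is why the placement of whitespace between <w>s needs care.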
As currently defined, the @lemma attribute on <w> cannot contain
whitespace, but that's a different question. A lemma is in any case an
artificial construction, so retaining whitespace within it can't arise
as an issue.
> Also, could names (e.g., given and surname) be considered as words?
Err, of course. What else might they be?