I would be grateful for some advice on how to handle tokenization in cases where a sentence ends with an abbreviated word. Take a sentence that ends as follows:
"with Elegies , written by Francis Beaumont Gent."
In this case, the dot after Gent does double duty. It marks an abbreviation and as such is part of the token that it designates as abbreviated. It also terminates a sentence, but the sentence-ending dot is omitted because the human reader does not need, or would be confused by,
"with Elegies , written by Francis Beaumont Gent.."
We have a linguistic annotation tool that has an "end of sentence" or 'eos' attribute for every word token. Its output of
<w eos="0">Beaumont</w> <w eos="1">Gent.</w>
is unambiguous, but there are several reasons why this is not very TEI-friendly. I would like to find a way of encoding this information in such a way that it would parse and make sense within out-of-the-box TEI P5.
One way to do so would be to recognize the double function of the dot by representing each of its functions separately. The result would be
<w>Beaumont</w> <w>Gent.</w><pc type="s"></pc>
or its abbreviated representation
<w>Beaumont</w> <w>Gent.</w><pc type="s"/>
In such a scheme, sentence-terminal punctuation would be marked by a <pc type="s"> element that either contains an explicit punctuation mark or is empty but nonetheless marks the location where a sentence ends.
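To make the proposal concrete, here is a rough sketch of how the tool's output might be converted into this scheme. It is only an illustration under several assumptions: the function name eos_to_pc and the wrapping <p> element are hypothetical, the input is namespace-free (real TEI would carry the TEI namespace), and every token with eos="1" simply receives an empty <pc type="s"/> after it. The case where the <pc> contains an explicit punctuation mark is not handled.

```python
import xml.etree.ElementTree as ET


def eos_to_pc(xml_text):
    """Drop the tool-specific eos attribute from <w> tokens and insert an
    empty <pc type="s"/> after each token that had eos="1"."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first, since the tree is modified while looping.
    for parent in list(root.iter()):
        # Walk the children backwards so insertions do not shift
        # the positions of children that are still to be visited.
        for i, child in reversed(list(enumerate(parent))):
            if child.tag == "w" and child.attrib.pop("eos", "0") == "1":
                pc = ET.Element("pc", {"type": "s"})
                # Any whitespace that followed the word now follows the marker.
                pc.tail, child.tail = child.tail, None
                parent.insert(i + 1, pc)
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: the two tokens from the example, wrapped in a <p>.
sample = '<p><w eos="0">Beaumont</w> <w eos="1">Gent.</w></p>'
print(eos_to_pc(sample))
```

The dot stays inside the <w> as part of the abbreviation, and the sentence boundary is carried only by the empty <pc>.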
This encoding parses and is reasonably economical. Does it also make sense, or are there better ways of achieving an equally TEI-friendly result?