Hi Martin! My first thoughts are the following.

marking dot as <pc>
------- --- -- ----
   <w>Beaumont</w> <w>Gent</w><pc type="stop.abse">.</pc>

Where the value "stop.abse" comes from P3 6.2:
 "stop.abbr"   a stop used to end an abbreviation
 "stop.sent"   a stop used to end a sentence
 "stop.abse"   a stop used both to end an abbreviation and to end a sentence
 "stop.dec"    a stop used as a decimal point
 "comma.dec"   a comma used as a decimal point
 "middot.dec"  a midline dot used as a decimal point
 "stop.space"  a stop used as a numeric space character
 "comma.space" a comma used as a numeric space character

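If it helps, here is a rough Python sketch of how your tool's eos
output might be turned into that form. The function name eos_to_pc is
made up for illustration, and the rule that any trailing dot on a word
token marks an abbreviation is my assumption, not something your tool
guarantees. It uses only the standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def eos_to_pc(fragment: str) -> str:
    """Rewrite <w eos="..."> tokens so that a trailing dot becomes a <pc>.

    Assumption: a trailing dot on a word token marks an abbreviation;
    a token consisting of a lone dot is a plain sentence-final stop.
    """
    # Wrap the fragment so that it parses as a single XML element.
    root = ET.fromstring(f"<s>{fragment}</s>")
    out = []
    for w in root.iter("w"):
        text = (w.text or "").strip()
        eos = w.get("eos") == "1"
        if text == ".":
            out.append('<pc type="stop.sent">.</pc>')
        elif text.endswith("."):
            kind = "stop.abse" if eos else "stop.abbr"
            out.append(f'<w>{escape(text[:-1])}</w><pc type="{kind}">.</pc>')
        else:
            out.append(f"<w>{escape(text)}</w>")
    return " ".join(out)


print(eos_to_pc('<w eos="0">Beaumont</w> <w eos="1">Gent.</w>'))
# prints: <w>Beaumont</w> <w>Gent</w><pc type="stop.abse">.</pc>
```
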
not encoding the dot
--- -------- --- ---
As you point out, the period is presentational markup for "word
before me is an abbreviation" and (simultaneously) for "end of
sentence". If you mark these features explicitly, why bother with the
period at all? How about

        <s>
          <w>Written</w>
          <w>by</w>
          <persName>
            <w>Francis</w>
            <w>Beaumont</w>
          </persName>
          <w><abbr>Gent</abbr></w>
        </s>

If you really want to record that there was a period in the source,
you could consider it a renditional feature of the source used to
indicate an abbreviation or the end of a sentence (just as italics
are used to indicate emphasis or foreignness):

        <s rend="post(.)">
          <w rend="case(capitalized)">written</w>
          <w>by</w>
          <persName>
            <w rend="case(capitalized)">francis</w>
            <w rend="case(capitalized)">beaumont</w>
          </persName>
          <w rend="case(capitalized)">
            <abbr rend="post(.)">gent</abbr>
          </w>
        </s>

(The above attributes the period to both the sentence and the
abbreviation. If you encode like this, your software will have to
magically know that there was only one in the source for purposes of
facsimile-like output or counting dots. Alternatively, the markup
could attribute the dot to just the sentence or just the
abbreviation, but I'll bet you find that unsatisfactory.)
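
To make the "magically know" part concrete, here is a minimal Python
sketch of a dot counter for this encoding. The collapsing rule (a
sentence dot is not printed separately when the last text-bearing
element already carries post(.)) is my assumption about the source,
and the names has_post_dot and count_source_dots are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<s rend="post(.)">
  <w rend="case(capitalized)">written</w>
  <w>by</w>
  <persName>
    <w rend="case(capitalized)">francis</w>
    <w rend="case(capitalized)">beaumont</w>
  </persName>
  <w rend="case(capitalized)">
    <abbr rend="post(.)">gent</abbr>
  </w>
</s>
"""


def has_post_dot(el):
    """True if the element's rend attribute records a following dot."""
    return "post(.)" in el.get("rend", "")


def count_source_dots(s):
    """Count the dots actually printed in the source for one <s>."""
    inner = [el for el in s.iter() if el is not s]
    # Every abbreviation (or other inner element) with post(.) prints a dot.
    dots = sum(has_post_dot(el) for el in inner)
    if has_post_dot(s):
        # Last element, in document order, that directly contains text.
        text_bearing = [el for el in inner if (el.text or "").strip()]
        last = text_bearing[-1] if text_bearing else None
        # Assumed rule: the sentence dot merges with a final abbreviation dot.
        if last is None or not has_post_dot(last):
            dots += 1
    return dots


print(count_source_dots(ET.fromstring(SAMPLE)))  # prints 1
```

A naive count of every rend="post(.)" in the sample would give 2,
which is exactly the double attribution described above.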

> I'll be grateful for some advice on how to handle tokenization in
> cases where a sentence ends with an abbreviated word. Take a
> sentence that ends as follows
> 
>  "with Elegies , written by Francis Beaumont Gent."
> 
> In this case, the dot after Gent does double duty. It marks an
> abbreviation and as such is part of the token that it designates as
> abbreviated. It also terminates a sentence, but the sentence-ending
> dot is omitted because the human reader does not need, or would be
> confused by,
> 
>  "with Elegies , written by Francis Beaumont Gent.."
> 
> We have a linguistic annotation tool that has an "end of sentence"
> or 'eos' attribute for every word token. Its output of
> 
> <w eos="0">Beaumont</w> <w eos="1">Gent.</w>
> 
> is unambiguous, but there are several reasons why this is not very
> TEI-friendly. I would like to find a way of encoding this
> information in such a way that it would parse and make sense within
> out-of-the-box TEI P5.
> 
> One way to do so would be to recognize the double function of the
> dot by representing each of its functions separately. The result
> would be
> 
> <w>Beaumont</w> <w>Gent.</w><pc type="s"></pc>
> 
> or its abbreviated representation 
> 
> <w>Beaumont</w> <w>Gent.</w><pc type="s"/>
> 
> In such a scheme, sentence-terminal punctuation would be marked
> by a <pc type="s"> element that either contains an explicit
> punctuation mark or is empty but nonetheless marks a particular
> location where a sentence ends.
> 
> It parses and is reasonably economical. Does it make sense as well,
> or are there better ways of achieving an equally TEI-friendly result?