On 02/16/2014 03:06 AM, Syd Bauman wrote:
> And I'll just note (as Kevin knew I would) that this method is a bad
> idea. It presumes a specific processing model (the one Paul
> described -- "retrieve text [only] by the <div>"). Even if I liked
> that processing model, I would suggest that to curtail other
> processing models for such a small gain is a bad idea. Besides, I
> think it makes even that processing model harder in some cases.
Particularly in cases where a document collection contains instances of
widely varying depth and complexity, where "the <div>" may not be
appropriate for all of them.
> Let's presume the main question you want to ask of any given element
> is "what page am I on?". Let's also presume [...] then the answer is
> always the same, and quite easy:
This is what we have done in the CELT documents (except for a small
number where the source document is problematic).
> But if you move the <pb> to *inside* a division that starts on that
> page, you have two problems.
> Note that the <pb> element lies where it is in the book -- *between*
> the two chapters. This encoding asserts that chapter 3 ends on page
> 51 and chapter 4 starts on page 52. I think that matches most
> people's intuition.
> <div type="chapter" n="4">
> <pb n="52"/>
> This clearly asserts that chapter 4 starts on page 51. Which is why,
> when we ask "what page am I on?" using preceding::pb/@n from the
> <div> we get the wrong answer ("51").
In my experience, the problem you describe arises when people treat
<pb/> as conflating the printed page number with the fact that
Chapter 4 starts on p. 52.
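
To make the difference concrete, here is a minimal sketch in Python
(standard library only) that emulates the preceding::pb lookup by
walking the tree in document order. The two sample fragments, the
function names page_of and chapter_pages, and the absence of the TEI
namespace are all assumptions made for illustration; they are not taken
from any real CELT or TEI document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <pb> sits between the two chapters, as in the book.
BETWEEN = """<body>
  <pb n="51"/>
  <div type="chapter" n="3"><p>End of chapter three.</p></div>
  <pb n="52"/>
  <div type="chapter" n="4"><p>Start of chapter four.</p></div>
</body>"""

# Same text, but with the <pb> moved inside the division that starts on that page.
INSIDE = """<body>
  <pb n="51"/>
  <div type="chapter" n="3"><p>End of chapter three.</p></div>
  <div type="chapter" n="4"><pb n="52"/><p>Start of chapter four.</p></div>
</body>"""


def page_of(root, target):
    """Return @n of the last <pb> seen before `target` in document order.

    Roughly what the XPath preceding::pb[1]/@n yields from `target`.
    Returns None if no <pb> precedes it.
    """
    page = None
    for elem in root.iter():  # iter() visits elements in document order
        if elem is target:
            return page
        if elem.tag == "pb":
            page = elem.get("n")
    return None


def chapter_pages(xml_text):
    """Map each <div>'s @n to the page reported for the start of that <div>."""
    root = ET.fromstring(xml_text)
    return {div.get("n"): page_of(root, div) for div in root.iter("div")}
```

With the <pb> between the chapters, chapter 4 is reported as starting on
page 52; with the <pb> inside the <div>, the same lookup from the <div>
reports 51, although a lookup from the <p> inside it still reports 52.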
> So fidelity in encoding the page break position does make it a little
> harder to code "what page am I on?" from an extracted <div>. But it's
> no big deal, and it makes coding it for the general case a lot
No matter which way you do it, the essential thing is to document the
policy, and stick to it, even if that means processors require a little
additional code to cope with the more flagrant examples.
> (I really like this example because it also demonstrates the
> short-coming in TEI encoding of end-of-line hyphens: is the word
> after "punish" supposed to be re-constituted as "evil-doers" or
> "evildoers"? Many encoders or projects may not wish to commit to one
> or the other, but if you do, TEI does not give you a standard way to
> differentiate. But I'm getting side-tracked ...)
This is avoided if you use a soft hyphen (U+00AD, &shy;) for hyphenation
introduced as an artifact of the typesetting process, and a normal hyphen
for those written by the author. But that may add a dependence on the
ability of retrieval engines to elide soft hyphens.
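
Where the retrieval engine cannot do this itself, the text can be
normalised before indexing. The following is only a sketch, assuming
that a soft hyphen (U+00AD) always marks a typesetting break and that
any whitespace directly after it is the line break and indentation;
the function name elide_soft_hyphens is hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Assumption: whitespace directly after a soft hyphen is only the line
# break (and indentation) that the hyphenation was introduced for.
_SOFT_BREAK = re.compile(SOFT_HYPHEN + r"\s*")


def elide_soft_hyphens(text):
    """Rejoin words split by a soft hyphen; leave normal hyphens alone."""
    return _SOFT_BREAK.sub("", text)
```

Under this policy "evil" + soft hyphen + line break + "doers" is indexed
as "evildoers", while an authorial "evil-doers" keeps its hyphen.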