On 02/17/2014 12:47 AM, Torsten Schaßan wrote:
> Again, especially from the point of processing shouldn't we think of
> processing *text* (or more general: content) instead of structural
> entities? Which would mean we wouldn't ask "At which page starts
> this div?" but we would ask "At which page starts *the content* of
> this div?"
This is a very valid argument. It is especially relevant where retrieval
is done by search rather than by structure: once the span of text is
found, preceding::lb will answer the question.
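To make the difference between the two questions concrete, here is a
minimal sketch in Python (standard library only). It uses pb rather than
lb, since the quoted question is about pages. The sample document, the
function names and the omission of the TEI namespace are all my own
assumptions for illustration, not anything taken from a real project.
ElementTree has no preceding:: axis, so the sketch walks the tree in
document order and remembers the last pb seen, which is roughly what
preceding::pb[1] would give you.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second div opens on page 2, but its first
# words only appear after the page break to page 3.
SAMPLE = """<body>
  <pb n="1"/>
  <div n="1"><head>One</head><p>alpha</p>
    <pb n="2"/>
    <p>beta</p>
  </div>
  <div n="2"><pb n="3"/><p>gamma</p></div>
</body>"""


def walk(el):
    """Yield (kind, element, text) events in true document order."""
    yield ("start", el, el.text or "")
    for child in el:
        yield from walk(child)
        # A child's tail text comes after all of the child's content.
        yield ("tail", child, child.tail or "")


def page_at_start_tag(root, target):
    """Page in force where the *element* starts (structural view)."""
    page = None
    for kind, el, _ in walk(root):
        if kind == "start":
            if el is target:
                return page
            if el.tag == "pb":
                page = el.get("n")
    return None


def page_at_content(root, target):
    """Page in force where the element's *text* starts (content view)."""
    page, inside = None, False
    for kind, el, text in walk(root):
        if kind == "tail" and el is target:
            return None  # left the element without finding any text
        if kind == "start":
            if el.tag == "pb":
                page = el.get("n")
            if el is target:
                inside = True
        if inside and text.strip():
            return page
    return None


root = ET.fromstring(SAMPLE)
second_div = root.findall("div")[1]
structural_page = page_at_start_tag(root, second_div)
content_page = page_at_content(root, second_div)
```

For the second div the two functions disagree: the start tag still sits
on page 2, while the content begins on page 3.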
> Let's presume the main question you want to ask of any given element
> is "what page am I on?".
I think this is partly mistaken where practical application is
concerned. IMHE the most requested extraction of metadata is the
generation of references. The researcher, having found the text she
wants, now wants to be able to refer others to it (possibly in the form
of a traditional citation). The page number is a small part of this; the
much bigger part is the structural information, chapter/section for
example. (There are of course documents with only page/folio numbers to
go by, documents whose text is one single undifferentiated stream.)
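
As a rough illustration of what I mean by generating a reference, the
following Python sketch (standard library only) collects the enclosing
div levels and the current page for a piece of found text. The sample
markup, the use of @type and @n on div, the output format and the
function names are assumptions made up for this example; real documents
will need namespace handling and their own citation conventions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two structural levels and two page breaks.
SAMPLE = """<body>
  <pb n="10"/>
  <div type="chapter" n="3">
    <div type="section" n="1"><p>first words</p></div>
    <pb n="11"/>
    <div type="section" n="2"><p>the passage we found</p></div>
  </div>
</body>"""


def find_reference(root, needle):
    """Return (enclosing divs, page) for the first text containing needle."""
    state = {"page": None}

    def visit(el, divs):
        if el.tag == "pb":
            state["page"] = el.get("n")
        if el.tag == "div":
            divs = divs + [(el.get("type"), el.get("n"))]
        if needle in (el.text or ""):
            return divs, state["page"]
        for child in el:
            hit = visit(child, divs)
            if hit is not None:
                return hit
            # Tail text belongs to the parent's structural context.
            if needle in (child.tail or ""):
                return divs, state["page"]
        return None

    return visit(root, [])


def format_reference(hit):
    """Render structure first, page last, e.g. 'chapter 3, section 2, p. 11'."""
    divs, page = hit
    parts = [f"{kind} {n}" for kind, n in divs]
    if page is not None:
        parts.append(f"p. {page}")
    return ", ".join(parts)


root = ET.fromstring(SAMPLE)
reference = format_reference(find_reference(root, "passage"))
```

For a document that is one undifferentiated stream the div list is
simply empty, and the reference falls back to the page number alone.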