I've been mulling over how to conduct queries that work around in-line
markup (such as that used in much of the TEI), and I'd be grateful for
comments from anyone else who may be interested in this question. What I
mean by "working around markup" is searching for a string of text that is
contiguous in an original print source, but where the beginning of the text
is contained in one element and the end in another, so that the end tag of
the first element and beginning tag of the second break the contiguity (or,
similarly, where part of the text is enclosed in an element and the other
part is adjacent PCDATA).
Here's a sticky practical example. Someone preparing a textual critical
edition according to the parallel segmentation method will have divided the
text into PCDATA inside some element or another (where all witnesses agree)
and "app" elements containing "rdg" elements (perhaps inside "rdgGrp"
elements), where they don't. If it is then necessary to search for a string
from a particular witness that happens to cross the boundaries of an "app"
element, it isn't clear how to tell the system to work around the markup. A
simpler example involves ignoring page breaks, line breaks, or other empty
milestone tags that might intrude in a search string.
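To make the problem concrete, here is a minimal Python sketch of what "working around" an "app" element might involve. The sample line, the witness sigla "#A" and "#B", and the function name witness_text are all invented for illustration, and the sample omits the TEI namespace for brevity (a real document would need namespace-qualified tag names). It rebuilds the running text of one witness so that an ordinary substring search can cross the "app" and "lb" boundaries.

```python
import xml.etree.ElementTree as ET

# Hypothetical parallel-segmentation fragment; sigla and wording are made up.
SAMPLE = (
    '<l>The quick <app>'
    '<rdg wit="#A">brown fox</rdg>'
    '<rdg wit="#B">red dog</rdg>'
    '</app> jumps<lb/> over</l>'
)

# Assumption: these elements hold only child elements, so any text or
# whitespace directly inside them is not part of the transcription.
CONTAINER_ONLY = {"app", "rdgGrp"}


def witness_text(elem, wit):
    """Return the running text of one witness, skipping other witnesses' readings."""
    parts = []

    def walk(node):
        # Skip a reading that is not attested by the requested witness.
        if node.tag == "rdg" and wit not in node.get("wit", "").split():
            return
        keep_text = node.tag not in CONTAINER_ONLY
        if keep_text and node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            # Text following a child (including an empty milestone such as
            # lb or pb) belongs to the parent, so it is kept here.
            if keep_text and child.tail:
                parts.append(child.tail)

    walk(elem)
    return "".join(parts)


root = ET.fromstring(SAMPLE)
text_a = witness_text(root, "#A")
text_b = witness_text(root, "#B")
```

Searching text_a for "fox jumps" then succeeds even though that string is nowhere contiguous in the marked-up source. The sketch says nothing about how to get from a hit back to the markup, which is the harder half of the question.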
Two solutions come to mind:
1) Use traditional TEI markup and then generate any and all plain-text,
markup-free views that I might wish to search for strings, with stand-off
markup that associates locations in the plain-text versions with their
counterparts in the marked-up one. A search engine could then search the
unmarked version, whereupon a system could traverse the link and render the
2) Use only stand-off markup, so that text that is contiguous in the
original is contiguous in the transcription. This isn't a TEI approach.
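As a rough illustration of the first option, the following Python sketch derives a markup-free view from a marked-up string and keeps, for every character of the plain view, its offset in the source, so that a hit in the plain text can be traced back. It is only a sketch under strong assumptions: tags are stripped with a simple regular expression, so it ignores entity references, comments, CDATA sections, and any ">" inside attribute values, and it keeps every reading of every witness rather than one witness's text. The names plain_view and find_in_source are invented here.

```python
import re

# Assumption: every tag is a "<...>" run with no ">" inside attribute values.
TAG = re.compile(r"<[^>]*>")


def plain_view(xml):
    """Return (plain_text, offsets), where offsets[i] is the index in xml of plain_text[i]."""
    chars, offsets = [], []
    pos = 0
    for match in TAG.finditer(xml):
        # Copy the character data that precedes this tag.
        for i in range(pos, match.start()):
            chars.append(xml[i])
            offsets.append(i)
        pos = match.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(xml)):
        chars.append(xml[i])
        offsets.append(i)
    return "".join(chars), offsets


def find_in_source(xml, query):
    """Search the plain view; return the (start, end) span in xml of the first hit, or None."""
    if not query:
        return None
    plain, offsets = plain_view(xml)
    hit = plain.find(query)
    if hit == -1:
        return None
    # The end offset is exclusive, so the span can be used as a slice.
    return offsets[hit], offsets[hit + len(query) - 1] + 1


source = 'one <hi>two</hi> th<pb n="2"/>ree'
span = find_in_source(source, "two three")
```

Here span delimits the stretch of the source that contains the hit together with the intervening tags. A real stand-off scheme would presumably store the offsets, or pointers to nodes, separately instead of recomputing them for each search.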
How have others dealt with this question?
Professor David J. Birnbaum
Department of Slavic Languages and Literatures
1417 Cathedral of Learning
University of Pittsburgh
Pittsburgh, PA 15260 USA
Voice: 1 412 624 5712
Fax: 1 412 624 9714