My comments about the wisdom of encoding page breaks and other
accidentals were not intended as a criticism in the personal sense. I
hope no offence was taken. As an Anglo-Saxonist whose dissertation
centred on minute differences in spacing, spelling, and
word-division, moreover, I'm usually not one to cavil at too much
information. And finally, not being a lexicographer, I'm not 100% up on
any current debates about hyphenation--though I understand the issue, I
My point about the tagging of the texts for the lexical database,
however, was and is that the TEI forces one to make a decision about how
one is looking at a text. No matter how interesting capitalisation,
punctuation, page and line breaks are, they remain graphic features.
Like italics, they tell you something about the structure and
linguistics of the object you are looking at, but they need to be
interpreted in a way that a straight "word" does not. In my own work I
have found it increasingly necessary to work at different levels, each
with a different focus:
Facsimile level (images) --> Transcription Level (words as graphic
units) --> Edition Level (words as linguistic units)
Converting text from one level to another is not a particularly difficult
thing to do. I did my initial conversions using stylesheets that
dropped all segment-internal information and converted <seg>s to <w>s.
But I have found from a coding perspective that it is essential to keep
the levels separate. It is a version, perhaps, of the multiple
If graphic details like page breaks or hyphenation are essential to a
project that is otherwise lexical in orientation, I would therefore
suggest keeping essentially two copies of everything: one coded (and
proof-read) for graphic information, and a second with all graphic
information stripped out for lexical analysis. The second, linguistic,
text could be either constructed on the fly or automatically
batch-converted from the graphic-oriented transcriptions. I'm not sure
what you would do if you also needed detailed morphological coding.
You'd probably need to maintain two versions of the corpus rather than
convert from the graphic on an ad hoc basis.
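As a rough illustration of the batch conversion suggested above, here is a
minimal Python sketch. It is not the stylesheets mentioned earlier, only an
approximation of the same idea under some assumptions of my own: the
transcription has no XML namespace, graphic words are wrapped in <seg>,
page and line breaks are the empty elements <pb/> and <lb/>, and a break
inside a <seg> simply splits one word. The function name to_lexical and
the GRAPHIC_MILESTONES set are hypothetical. Hyphens are deliberately left
alone, since how to treat them is exactly the open question.

```python
import xml.etree.ElementTree as ET

# Assumption: these are the purely graphic milestone elements to discard.
GRAPHIC_MILESTONES = {"pb", "lb"}


def to_lexical(elem):
    """Return a new element tree with graphic information stripped.

    Each <seg> becomes a <w> holding only its concatenated text;
    <pb/> and <lb/> outside a <seg> are dropped, but any text that
    followed them is kept.
    """
    if elem.tag == "seg":
        # Drop all segment-internal markup and whitespace, keep the letters.
        w = ET.Element("w")
        w.text = "".join("".join(elem.itertext()).split())
        return w

    new = ET.Element(elem.tag, dict(elem.attrib))
    new.text = elem.text
    last = None  # most recently appended child of `new`
    for child in elem:
        if child.tag in GRAPHIC_MILESTONES:
            # Discard the milestone, but re-attach its trailing text.
            tail = child.tail or ""
            if last is None:
                new.text = (new.text or "") + tail
            else:
                last.tail = (last.tail or "") + tail
            continue
        converted = to_lexical(child)
        converted.tail = child.tail
        new.append(converted)
        last = converted
    return new


if __name__ == "__main__":
    graphic = ET.fromstring(
        '<l><seg><hi rend="cap">H</hi>waet</seg> '
        '<pb n="2"/><seg>hlaf<lb/>ord</seg></l>'
    )
    lexical = to_lexical(graphic)
    print([w.text for w in lexical.iter("w")])  # ['Hwaet', 'hlaford']
```

Because the function builds a new tree rather than editing the old one, the
graphic transcription stays untouched and the lexical copy can be
regenerated whenever the transcription is corrected.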
What an interesting problem!
Daniel Paul O'Donnell
Department of English
University of Lethbridge
Lethbridge AB T1K 3M4
Tel (403) 329-2377
Fax (403) 382-7191