I would appreciate some advice on a pilot project to create
TEI-Lite-like transcriptions from OCR texts of novels from the
eighteenth to the early twentieth century. The basic idea is to
create manipulable and
interoperable transcriptions that lend themselves to linguistic
annotation but maintain a connection to the layout of the printed
source. The resultant procedures should be simple enough for
individual users to create editions of their own with appropriate
guidelines. In some ways it's a kind of "Project Gutenberg plus."
There was some talk of TEI versions of Gutenberg texts, but nothing
seems to have come of it so far.
The procedures start from an algorithmically produced TEI version
that is derived from an algorithmically produced 'white space XML'.
The source text in every case is the equivalent of what an
old-fashioned gardening book called 'shrubs of merit': first editions or
other public domain editions that for one reason or another have
standing as 'good enough' texts. There is no ambition to capture all
minutiae of typography or layout. But I assume that users will find it
helpful to be able to align a line of transcription with a line of
printed text. On the other hand, soft hyphens can be ignored, and the
second part of a hyphenated word will be added to the line where the
word began.
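
To make the hyphenation rule concrete, here is a minimal sketch in
Python. It is an illustration under stated assumptions, not the
project's actual procedure: the function name join_hyphenated is
hypothetical, and the sketch treats every line-final hyphen as a soft
hyphen, so a compound that happens to break at its own hard hyphen
would be joined wrongly and would need a dictionary check or manual
review.

```python
def join_hyphenated(lines):
    """Move the tail of a word broken at a line end up to the line where it began.

    Assumption: every line-final hyphen that follows a letter is a soft hyphen.
    The number of lines is preserved, so line N of the output still
    corresponds to line N of the printed page.
    """
    out = [line.rstrip() for line in lines]
    for i in range(len(out) - 1):
        ends_in_broken_word = (
            len(out[i]) > 1 and out[i].endswith("-") and out[i][-2].isalpha()
        )
        next_starts_with_letter = out[i + 1][:1].isalpha()
        if ends_in_broken_word and next_starts_with_letter:
            head = out[i][:-1]                          # drop the hyphen
            tail, _, rest = out[i + 1].partition(" ")   # first token of next line
            out[i] = head + tail
            out[i + 1] = rest
    return out


page = ["The gar-", "dener planted shrubs"]
print(join_hyphenated(page))
```

Because the line count does not change, alignment with the facsimile
survives; only the position of the word's second half moves.
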
Lines with running heads will be ignored, as will lines that carry
information that has more to do with where the paper came from than
with the text, such as signature marks ("B2" and the like).
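
A rough sketch of how such lines might be filtered, again in Python and
again only as an assumption-laden illustration. The names
strip_paratext, normalize, and SIGNATURE are hypothetical, as is the
heuristic itself: a first line counts as a running head if, with digits
removed, it recurs at the top of at least two pages, and a last line
counts as a signature mark if it looks like "B", "B2", or "Aa3". Real
OCR output will need looser matching.

```python
import re
from collections import Counter

# Hypothetical pattern for signature marks such as "B", "B2", "Aa3".
SIGNATURE = re.compile(r"^[A-Z][a-z]?\s?\d{0,2}$")


def normalize(line):
    """Strip page numbers and case so that recurring heads compare equal."""
    return re.sub(r"\d+", "", line).strip().upper()


def strip_paratext(pages, min_repeats=2):
    """Drop running heads and signature marks from pages (lists of lines)."""
    heads = Counter(normalize(page[0]) for page in pages if page)
    cleaned = []
    for page in pages:
        body = list(page)
        # First line recurs across pages: treat it as a running head.
        if body and heads[normalize(body[0])] >= min_repeats:
            body = body[1:]
        # Last line looks like a signature mark: drop it.
        if body and SIGNATURE.match(body[-1].strip()):
            body = body[:-1]
        cleaned.append(body)
    return cleaned


pages = [
    ["12 EVELINA.", "text of the first page", "B2"],
    ["EVELINA. 13", "text of the second page"],
]
print(strip_paratext(pages))
```
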
Are these reasonable principles? The trickiest business involves
hyphenation at line or page breaks. Some encoding projects ignore line
breaks. Others observe hyphenation when it occurs at the end of the
page but ignore it at the end of a line. Hyphenation is not a problem
if you think of a text as something to be displayed for readers. If
you tokenize or add linguistic annotation, hyphenated words create
problems of a solvable but pesky kind. It is certainly simpler to
deal with word tokens if you can assume that a line will always
consist of whole words only, and you do not really interfere with the
reader's ability to align the transcription with the facsimile, which
is a non-trivial benefit. But what analytical or critical affordances
are sacrificed by ignoring hyphenation?