This is also our problem with the Epistemon TEI 
database (16th-century printed books). We first 
considered a new TEI element that would skip line 
and page breaks in order to join the two parts of 
a word (which is sometimes split without any soft 
hyphen). We then decided that a more economical 
solution would be to generate a subtext without 
hyphenation, which the linguistic software can 
process, while the visible text keeps the same 
features as the facsimile it faces (hyphens, 
running heads, signatures, catchwords).
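
A minimal sketch of how such a subtext might be derived, offered only as an illustration and not as the Epistemon implementation. It assumes that breaks are encoded as <lb/> and <pb/>, that running heads, signatures and catchwords sit in <fw>, and that a word split without a hyphen is flagged with break="no". The sample fragment, the function name subtext and the omission of the TEI namespace are all hypothetical simplifications.

```python
import re
import xml.etree.ElementTree as ET

BREAK = "\x00"    # placeholder for an ordinary line or page break
NOBREAK = "\x01"  # placeholder for a break flagged as falling inside a word

# Made-up sample; element names follow TEI, the namespace is left out.
SAMPLE = (
    '<p>Les bons compa-<lb/>gnons beuvoient'
    '<fw type="catch">sans</fw><pb n="2"/><fw type="header">PROLOGUE</fw>'
    'sans fin ny me<lb break="no"/>sure.</p>'
)


def subtext(root):
    """Return running text with forme work dropped and split words rejoined."""
    parts = []

    def walk(node):
        if node.tag in ("lb", "pb"):
            parts.append(NOBREAK if node.get("break") == "no" else BREAK)
        elif node.tag != "fw":          # skip the content of <fw>, keep its tail
            if node.text:
                parts.append(node.text)
            for child in node:
                walk(child)
        if node.tail:
            parts.append(node.tail)

    walk(root)
    text = "".join(parts)
    # A hyphen directly before a break is treated as a soft hyphen and removed.
    text = re.sub(r"(\w)-\s*(?:[\x00\x01]\s*)+(?=\w)", r"\1", text)
    # A break flagged break="no" joins the two halves even without a hyphen.
    text = re.sub(r"\s*\x01\s*", "", text)
    # Every remaining break becomes ordinary white space.
    text = text.replace(BREAK, " ")
    return " ".join(text.split())


print(subtext(ET.fromstring(SAMPLE)))
```

The visible transcription is left untouched; only the derived string is passed to the linguistic software. Note that this sketch also joins genuine hyphenated compounds that happen to break at a line end, so a real workflow would need a dictionary check or manual review at that point.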
Marie-Luce Demonet
Head of the Bibliothèques Virtuelles Humanistes
http://www.bvh.univ-tours.fr
Centre d'Etudes Supérieures de la Renaissance, UMR-CNRS
Université François-Rabelais, Tours

At 19:19 -0600 26/01/09, Martin Mueller wrote:
>I would appreciate some advice on a pilot 
>project to create TEI-Lite like transcriptions 
>from OCR texts of novels from the 18th to the 
>early twentieth century. The basic idea is to 
>create manipulable and interoperable 
>transcriptions that lend themselves to 
>linguistic annotation but maintain a connection 
>to the layout of the printed source. The 
>resultant procedures should be simple enough for 
>individual users to create editions of their own 
>with appropriate guidelines. In some ways it's 
>a kind of "Project Gutenberg plus." 
>There was some talk of TEI versions of Gutenberg 
>texts, but nothing seems to have come of it so 
>far.
>
>The procedures start from an algorithmically 
>produced TEI version that is derived from an 
>algorithmically produced 'white space XML'.
>The source text in every case is the equivalent 
>of what an old-fashioned gardening book called 
>'shrubs of merit', first editions or other 
>public domain editions that for one reason or 
>another have standing as 'good enough' texts. 
>There is no ambition to capture all minutiae of 
>typography or layout. But I assume that users 
>will find it helpful to be able to align a line 
>of transcription with a line of printed text. On 
>the other hand, soft hyphens can be ignored, and 
>the second part of a hyphenated word will be 
>added to the line where the word began.
>
>Lines with running heads will be ignored, as 
>will lines that carry information that has 
>more to do with where the paper came from than 
>with the text, such as "B2" and the like.
>
>Are these reasonable principles? The trickiest 
>business involves hyphenation at line or page 
>breaks. Some encoding projects ignore line 
>breaks. Others observe hyphenation when it 
>occurs at the end of the page but ignore it at 
>the end of a line. Hyphenation is not a problem 
>if you think of a text as something to be 
>displayed for readers. If you tokenize or add 
>linguistic annotation, hyphenated words create 
>problems of a solvable but pesky kind. It is 
>certainly simpler to deal with word tokens if 
>you can assume that a line will always consist 
>of whole words only, and you do not really 
>interfere with the reader's ability to align the 
>transcription with the facsimile, which is a 
>non-trivial benefit. But what analytical or 
>critical affordances are sacrificed by ignoring 
>hyphenation?
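
The line-level rules proposed in the quoted message (drop running heads and signature lines, move the second half of a hyphenated word up to the line where the word began) could be sketched roughly as follows. This is an illustration under stated assumptions, not part of the pilot project: the function name normalize_page, the signature pattern, the idea of passing known running heads in by hand, and the sample page are all hypothetical.

```python
import re

# Assumed shape of a signature mark such as "B2"; real books vary.
SIGNATURE = re.compile(r"^[A-Z]{1,2}\d{1,2}$")


def normalize_page(lines, running_heads=()):
    """Return (source_line_number, text) pairs made of whole words only."""
    kept = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text or text in running_heads or SIGNATURE.match(text):
            continue                      # running head, signature, or blank
        kept.append([number, text])

    for current, following in zip(kept, kept[1:]):
        if current[1].endswith("-"):
            # Move the first token of the next line up to complete the word.
            head, _, rest = following[1].partition(" ")
            current[1] = current[1][:-1] + head
            following[1] = rest

    return [(number, text) for number, text in kept]


# Made-up page layout for illustration.
page = [
    "PRIDE AND PREJUDICE.",
    "It is a truth univer-",
    "sally acknowledged, that a",
    "B2",
]
print(normalize_page(page, running_heads={"PRIDE AND PREJUDICE."}))
```

Keeping the original line number with each line is what preserves the alignment with the facsimile. What the sketch discards is the position of the break inside the word, and it cannot tell a soft hyphen from a genuine hyphen that happens to fall at the end of a line.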


--