I may be over-problematizing this, but the main use case for this approach
is not corrections in the XML structure but corrections of words. In my
ideal, and perhaps Utopian, environment curators "see" a page, but behind
that page is a tokenized and linguistically annotated text. If you click on
a word or phrase it opens a pop-up window with metadata and some form that
lets you enter a textual correction, which goes through review stages
before being integrated into the text. This is a page-oriented version of
the much more primitive AnnoLex tool (http://annolex.at.northwestern.edu).
Some common encoding errors in the TCP texts involve just renaming an
element. They may be discovered by readers who don't know or care about
TEI but know that this line is or is not verse. I assume or hope, perhaps
vainly, that a lot of textual correction will take the form of "curation en passant"
where a reader comes across something and suggests a fix right then and
there. Doing that kind of work should be easier than ordering a book from
Amazon. I'm not sure we'll ever have enough resources to build the
sophisticated and robust infrastructure needed for that to happen, so that
many hands will indeed make light work.
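The simple renaming case mentioned above is mechanically trivial. A minimal
sketch in Python (the function name and the sample fragment are my own,
namespace-free inventions; real TCP texts carry the TEI namespace):

```python
import xml.etree.ElementTree as ET

def rename_elements(root, old_tag, new_tag):
    """Rename every old_tag element in place; text, attributes, and
    children are untouched."""
    for el in root.iter(old_tag):
        el.tag = new_tag

# A reader flags a prose-encoded line as verse: <p> should be <l>.
frag = ET.fromstring("<sp><p>Shall I compare thee to a summer's day?</p></sp>")
rename_elements(frag, 'p', 'l')
print(ET.tostring(frag, encoding='unicode'))
# <sp><l>Shall I compare thee to a summer's day?</l></sp>
```

The reader never needs to see any of this; the system only needs to map
"this is verse, not prose" onto one tag rename.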
I note your concern that the re-integration of page fragments may be
difficult. That may be where this project collapses. But the eXist
function mentioned by Jens Petersen in his response looks interesting.
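Without knowing the exact eXist function in question, the extraction half
can at least be sketched. The sketch below handles only the flat case in
which all <pb/> milestones are direct siblings; in real TCP texts they
occur at arbitrary depths in the hierarchy, which is precisely where the
re-integration difficulty lies. Element and attribute names are illustrative:

```python
import copy
import xml.etree.ElementTree as ET

def page_fragment(parent, n):
    """Wrap everything between <pb n="..."/> and the next <pb/> in a
    well-formed <PAGE> element. Flat case only: assumes all <pb/>
    milestones are direct children of `parent`."""
    page = ET.Element('PAGE', n=n)
    copying = False
    for child in parent:
        if child.tag == 'pb':
            if copying:
                break               # reached the next page break: stop
            copying = (child.get('n') == n)
            continue
        if copying:
            page.append(copy.deepcopy(child))  # deep copy leaves the original intact
    return page

body = ET.fromstring('<body><pb n="1"/><p>first page</p>'
                     '<pb n="2"/><p>second page</p></body>')
frag = page_fragment(body, '2')
print(ET.tostring(frag, encoding='unicode'))
# <PAGE n="2"><p>second page</p></PAGE>
```

Putting an edited fragment back, and verifying that nothing outside the
page was disturbed, is the genuinely hard part that this sketch does not
attempt.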
That said, I must admit that while I think I have a pretty good
understanding of how one might design an environment that lets users spot
errors in words and suggest corrections, I have a much shakier grasp of
how to deal with encoding problems. If it's just the name of an element,
it's simple: it just means changing the spelling of something. But what if a
reader correctly notes that "this 'Con.' in italics is not part of the
line. It is a verse medial change of speakers". Fixing that is a
multi-step procedure. Perhaps one can't do better than have a really
simple "report an error" procedure.
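To make the multi-step nature of that fix concrete, here is one way it
could be scripted (the speakers, the line, and the 'Con.' label are
invented; the part="I"/part="F" values for a metrical line split across
speakers follow TEI convention):

```python
import xml.etree.ElementTree as ET

def fix_medial_speaker(sp):
    """Multi-step fix for a verse-medial change of speakers: the italic
    label inside the line becomes a <speaker> in a new <sp>, and the
    metrical line is split into an initial part (part="I") and a final
    part (part="F"). Returns the new <sp>, to be inserted after `sp`."""
    l = sp.find('l')
    hi = l.find('hi')            # the mislabelled speaker, e.g. <hi rend="italic">Con.</hi>
    l.remove(hi)
    l.text = (l.text or '').strip()
    l.set('part', 'I')           # first half of the shared metrical line
    new_sp = ET.Element('sp')
    ET.SubElement(new_sp, 'speaker').text = hi.text
    second = ET.SubElement(new_sp, 'l', part='F')
    second.text = (hi.tail or '').strip()   # rest of the line follows the label
    return new_sp

sp = ET.fromstring('<sp><speaker>Ham.</speaker>'
                   '<l>To be, or not to be. <hi rend="italic">Con.</hi> '
                   'That is the question.</l></sp>')
new_sp = fix_medial_speaker(sp)
print(ET.tostring(sp, encoding='unicode'))
# <sp><speaker>Ham.</speaker><l part="I">To be, or not to be.</l></sp>
print(ET.tostring(new_sp, encoding='unicode'))
# <sp><speaker>Con.</speaker><l part="F">That is the question.</l></sp>
```

Even in this toy form the fix touches three elements and two attributes,
which is exactly why it is beyond a simple word-correction form and may
have to fall back on "report an error."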
Errors of that type are quite common in the TCP texts, and their discovery is
well within the competence of many readers. They will be discovered
hundreds and thousands of times. But instead of being annoyed
about the lamentable quality of the corpus, readers should feel motivated to
do something about it, and the doing something about it should be very easy.
Professor emeritus of English and Classics
On 2/22/15 11:46 AM, "Sebastian Rahtz" <[log in to unmask]> wrote:
>I am a little worried about this. Do you have practical evidence, Martin,
>that people who are willing to correct the XML files will only do so if they
>can work on fragments of the form <PAGE>..</PAGE>? I am sure a system
>_could_ be set up
>to extract the fragments into well-formed expanded XML, and then put back
>the originals, but checking that the put-back hasn't corrupted the non-page parts
>seems quite problematic.
>My feeling is that anyone capable of editing a TEI XML file at all is also
>capable of finding the right <pb/> for the facsimile image they are viewing,
>and editing the right XML.
>May I, with respect, suggest that you are over-problematizing this?
>Chief Data Architect
>University of Oxford IT Services
>13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431