We've got quite a bit of recently acquired text in Maori and there are a 
number of easily detectable errors (because we know more about the 
language than the encoders). The texts often include both English and 
Maori, and project work has the effect of making Maori text 'valuable' 
(i.e. it's used for linguistic analysis). Currently we mark apparent 
errors with:

<foreign xml:lang="en">...</foreign>

This has the effect of removing the apparently erroneous fragment from 
linguistic analysis, because the analysis is only run on xml:lang="mi" 
fragments. However, these are usually OCR errors in Maori rather than 
words or sentences in English.

Is there a better tag for this?

Properties that would be great are (a) the ability to keep track of 
automatically inserted tags (so, for example, they could be removed prior 
to processing by an updated version of the script without interfering with 
manually inserted tags); (b) not privileging one language over another; 
(c) the ability to be added in a single pass over the text without the 
need to store the entire document in memory (i.e. no requirement for a 
list of tags in the header); and (d) preferably not using the 'n=""' 
attribute, which we're already using for too many different things in 
too many places.
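
To make property (a) concrete, here is a minimal sketch of how a script 
might strip its own earlier tags in a streaming pass while leaving manually 
inserted ones alone. It assumes the automatic tags carry some distinguishing 
attribute; the resp="#auto" marker, the constant names and the 
strip_auto_tags function are all hypothetical placeholders, not a 
suggestion that this is the right markup.

```python
import io
from xml.sax import parseString
from xml.sax.saxutils import XMLGenerator

# Hypothetical marker for machine-inserted tags. The element name and the
# attribute/value pair are placeholders chosen only for illustration.
AUTO_ELEMENT = "foreign"
AUTO_ATTR = "resp"
AUTO_VALUE = "#auto"


class StripAutoTags(XMLGenerator):
    """Copy XML through, unwrapping elements that carry the auto marker."""

    def __init__(self, out):
        super().__init__(out, encoding="utf-8", short_empty_elements=True)
        # One boolean per open element: True if its tags are being dropped.
        self._dropped = []

    def startElement(self, name, attrs):
        is_auto = name == AUTO_ELEMENT and attrs.get(AUTO_ATTR) == AUTO_VALUE
        self._dropped.append(is_auto)
        if not is_auto:
            super().startElement(name, attrs)

    def endElement(self, name):
        # Emit the end tag only if the matching start tag was emitted.
        if not self._dropped.pop():
            super().endElement(name)


def strip_auto_tags(xml_text):
    """Return xml_text with automatically inserted wrapper tags removed."""
    out = io.StringIO()
    parseString(xml_text.encode("utf-8"), StripAutoTags(out))
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        '<p xml:lang="mi">kia '
        '<foreign xml:lang="en" resp="#auto">0ra</foreign> '
        '<foreign xml:lang="en">hello</foreign></p>'
    )
    print(strip_auto_tags(sample))
```

The text content of an unwrapped element is kept; only its start and end 
tags are dropped. For real files, xml.sax.parse() with a file object and 
an output file in place of the StringIO would avoid holding the whole 
document in memory.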

Once we have time to look at errors, we can either correct the actual 
text (for OCR errors) or use:

<foreign xml:lang="en">...</foreign> (for English words)

or

<choice><sic>...</sic><reg>...</reg></choice> (for apparent mistakes in 
the original)

cheers
stuart

-- 
Stuart Yeates
http://www.nzetc.org/       New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/     Institutional Repository