We've got quite a bit of recently acquired text in Maori and there are a
number of easily detectable errors (because we know more about the
language than the encoders). The texts often include both English and
Maori, and project work has the effect of making Maori text 'valuable'
(i.e. it's used for linguistic analysis). Currently we mark apparent
Which has the effect of removing the apparently erroneous fragment from
linguistic analysis, because this is only done on xml:lang="mi"
fragments. However, usually these are OCR errors in Maori rather than
words/sentences in English.
Is there a better tag for this?
Desirable properties would be (a) the ability to keep track of
automatically inserted tags (so that, for example, they could be removed
prior to processing by an updated version of the script without interfering with
manually inserted tags); (b) not privileging one language over another;
(c) the ability to be added in a single pass over the text without the
need to store the entire document in memory (i.e. no requirement for a
list of tags in the header); and (d) preferably not using the 'n=""'
attribute, which we're already using for too many different things in
too many places.
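
To make properties (a) and (c) concrete, here is a rough Python sketch of
how automatically inserted tags could be stripped again in a single
streaming pass. It assumes, purely for illustration, that the script wraps
each suspect fragment in <seg type="auto-suspect">...</seg>; that element
and attribute value are hypothetical, not something we use today, and
strip_auto_tags is a made-up name. The wrapper carries no xml:lang of its
own, so the fragment keeps the language of its parent.

```python
import io
from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

# Hypothetical marker: the script stamps every tag it inserts with this
# attribute/value pair, so its tags can be told apart from manual ones.
AUTO_ATTR, AUTO_VALUE = "type", "auto-suspect"


class StripAutoTags(XMLFilterBase):
    """SAX filter that drops marked wrapper elements but keeps their content."""

    def __init__(self, parent=None):
        super().__init__(parent)
        # One flag per currently open element: was its start tag dropped?
        # Memory use grows with nesting depth, not with document size.
        self._dropped = []

    def startElement(self, name, attrs):
        drop = attrs.get(AUTO_ATTR) == AUTO_VALUE
        self._dropped.append(drop)
        if not drop:
            super().startElement(name, attrs)

    def endElement(self, name):
        # Only emit the end tag if the matching start tag was emitted.
        if not self._dropped.pop():
            super().endElement(name)


def strip_auto_tags(xml_text):
    """Return xml_text with the automatically inserted wrappers removed."""
    out = io.StringIO()
    reader = StripAutoTags(make_parser())
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


if __name__ == "__main__":
    sample = (
        '<p xml:lang="mi">word1 <seg type="auto-suspect">w0rd2</seg> '
        '<foreign xml:lang="en">word3</foreign></p>'
    )
    print(strip_auto_tags(sample))
```

Manually inserted markup such as <foreign> passes through untouched,
because only elements carrying the marker value are dropped. For real
files the StringIO/BytesIO objects would be replaced by file handles so
that nothing larger than the open-element stack is held in memory.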
Once we have time to look at errors, we can either correct the actual
text (for OCR errors) or use:
<foreign xml:lang="en">...</foreign> (for English words)
<choice><sic>...</sic><reg>...</reg></choice> (for apparent mistakes in
http://www.nzetc.org/ New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository