Thanks, James and Dot.
My next question is how to identify the tags as automatically inserted (as opposed to manually inserted and thus more valuable). My first idea was to use:
<unclear reason="unknown" resp="#error-detection-script-version-023">...</unclear>
But @resp can only point to a person and not a bot. Another alternative is to use:
<unclear reason="unknown" evidence="error-detection-script-version-023">...</unclear>
<unclear reason="unknown" evidence="script">...</unclear>
Does that make more sense?
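For what it's worth, here is a rough Python sketch of how a later run of the script might strip its own tags again while leaving hand-inserted ones alone. It assumes the evidence="script" convention floated above (nothing here is settled), and the names StripAutoUnclear and strip_auto_tags are made up for illustration. It uses the standard library's SAX interface, so the document is handled as a stream of events rather than as a tree.

```python
import io
from xml.sax import parseString
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

# Assumed marker value: the thread proposes evidence="script" but does not settle on it.
AUTO_EVIDENCE = "script"


class StripAutoUnclear(XMLFilterBase):
    """Drop script-inserted <unclear> wrappers but keep the text inside them."""

    def __init__(self, downstream):
        super().__init__()
        self.setContentHandler(downstream)
        self._open = []  # one flag per open <unclear>: True if script-inserted

    def startElement(self, name, attrs):
        if name == "unclear":
            auto = attrs.get("evidence") == AUTO_EVIDENCE
            self._open.append(auto)
            if auto:
                return  # swallow the start tag only; content still passes through
        super().startElement(name, attrs)

    def endElement(self, name):
        if name == "unclear" and self._open.pop():
            return  # swallow the matching end tag
        super().endElement(name)


def strip_auto_tags(xml_text):
    """Return xml_text with script-inserted <unclear> tags unwrapped."""
    out = io.StringIO()
    handler = StripAutoUnclear(XMLGenerator(out, encoding="utf-8"))
    parseString(xml_text.encode("utf-8"), handler)
    return out.getvalue()


sample = (
    '<p>he <unclear reason="unknown" evidence="script">tang4ta</unclear> '
    '<unclear reason="illegible">pai</unclear></p>'
)
result = strip_auto_tags(sample)
```

For real files, xml.sax.parse() takes a filename or stream in place of the string, which keeps the whole thing to a single pass without loading the document into memory.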
I think I'd agree with Dot. The text was unclear (to the OCR) and thus
it has made its best guess so <unclear> would make sense. Another
option, I suppose, since it is an apparent error would be <sic>, but
that does strike me as a bit of tag abuse. <sic> is intended for when
you've consciously made the editorial decision to include a segment of
text which appears in the original as such, but which you know (and thus
wish to mark) as being erroneous. I'd have used either <unclear> myself,
or some neutral markup like <seg> with @type and @subtype to categorise it.
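As a rough illustration of the <seg> option, a script could wrap each suspect word as it streams through the text, one line at a time. This is only a sketch: the function name mark_suspect_words, the @type/@subtype values "apparent-error" and "ocr", and the digit-based check in the usage line are all hypothetical placeholders, and the input is assumed to be plain text with no markup in it yet.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def mark_suspect_words(line, is_suspect, seg_type="apparent-error", subtype="ocr"):
    """Wrap every word flagged by is_suspect in <seg type=... subtype=...>.

    The default attribute values are illustrative only, not a TEI recommendation.
    """

    def wrap(match):
        word = match.group(0)
        if is_suspect(word):
            return "<seg type=%s subtype=%s>%s</seg>" % (
                quoteattr(seg_type),
                quoteattr(subtype),
                escape(word),
            )
        return escape(word)

    # \S+ matches each whitespace-delimited word; the whitespace itself is kept as-is.
    return re.sub(r"\S+", wrap, line)


# Hypothetical check: treat any word containing a digit as a likely OCR error.
marked = mark_suspect_words("he tang4ta pai", lambda w: any(c.isdigit() for c in w))
```

Because the marker lives in the attribute values, a later pass can find and unwrap exactly these elements without touching anything an editor added by hand.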
Dot Porter wrote:
> What about using <unclear>, perhaps with some sensible value for
> @reason? <unclear reason="ocr">?
> On Mon, Jun 8, 2009 at 1:17 PM, stuart yeates wrote:
>> We've got quite a bit of recently acquired text in Maori and there are a
>> number of easily detectable errors (because we know more about the language
>> than the encoders). The texts often include both English and Maori, and
>> project work has the effect of making Maori text 'valuable' (i.e. it's used
>> for linguistic analysis). Currently we mark apparent errors with:
>> <foreign xml:lang="en">...</foreign>
>> Which has the effect of removing the apparently erroneous fragment from
>> linguistic analysis, because this is only done on xml:lang="mi" fragments.
>> However, usually these are OCR errors in Maori rather than words/sentences
>> in English.
>> Is there a better tag for this?
>> Properties that would be great would be (a) the ability to keep track of
>> automatically inserted tags (so for example they could be removed prior to
>> processing by an updated version of the script without interfering with
>> manually inserted tags); (b) not privileging one language over another; (c)
>> the ability to be added in a single pass over the text without the need to
>> store the entire document in memory (i.e. no requirement for a list of tags
>> in the header); and (d) preferably not using the 'n=""' attribute, which
>> we're already using for too many different things in too many places.
>> Once we have time to look at errors, we can either correct the actual text
>> (for OCR errors) or use:
>> <foreign xml:lang="en">...</foreign> (for English words)
>> <choice><sic>...</sic><reg>...</reg></choice> (for apparent mistakes in the
>> Stuart Yeates
>> http://www.nzetc.org/ New Zealand Electronic Text Centre
>> http://researcharchive.vuw.ac.nz/ Institutional Repository
Dr James Cummings, Research Technologies Service, University of Oxford
James dot Cummings at oucs dot ox dot ac dot uk