A little late to the party, but I'm with Martin: assuming that your
bounding boxes are defined as quadrilaterals, you should be able to
translate the OCR output into the existing facsimile/surface encoding,
though I've not heard of any automated way of doing so. Couple that with a
resp attribute pointing to a responsibility definition of your OCR engine
and a cert attribute reflecting the recognition confidence and you're
in business. If the OCR output is defined in polygons... no such luck.
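
To make that concrete, here is a rough Python sketch of the translation,
not a tested converter. The input record layout ("text", "bbox", "conf"),
the "#ocr-engine" pointer, the zone ids, and the confidence cut-offs are
all assumptions of mine; only the TEI names (facsimile, surface, zone, w,
facs, resp, cert) come from the Guidelines.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def tei(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def cert_level(confidence):
    # Assumption: arbitrary cut-offs mapping a 0..1 engine confidence
    # onto the TEI cert values high / medium / low.
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.6:
        return "medium"
    return "low"


def ocr_to_tei(words, page_size, resp="#ocr-engine"):
    """Return (facsimile, ab): one <zone> per word box, one <w> per word.

    words: list of dicts with the hypothetical keys "text",
    "bbox" (ulx, uly, lrx, lry) and "conf" (0..1).
    resp: hypothetical pointer to a description of the OCR engine.
    """
    width, height = page_size
    facsimile = ET.Element(tei("facsimile"))
    surface = ET.SubElement(
        facsimile, tei("surface"),
        {"ulx": "0", "uly": "0", "lrx": str(width), "lry": str(height)},
    )
    ab = ET.Element(tei("ab"))
    for n, word in enumerate(words, start=1):
        ulx, uly, lrx, lry = word["bbox"]
        zone_id = f"zone-w{n}"
        # The rectangle goes into the facsimile as a zone.
        ET.SubElement(
            surface, tei("zone"),
            {XML_ID: zone_id, "type": "word",
             "ulx": str(ulx), "uly": str(uly),
             "lrx": str(lrx), "lry": str(lry)},
        )
        # The reading goes into the text, linked back to its zone via facs,
        # with resp and cert recording who read it and how confidently.
        w = ET.SubElement(
            ab, tei("w"),
            {"facs": f"#{zone_id}", "resp": resp,
             "cert": cert_level(word["conf"])},
        )
        w.text = word["text"]
    return facsimile, ab


if __name__ == "__main__":
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    sample = [
        {"text": "arma", "bbox": (10, 20, 110, 60), "conf": 0.95},
        {"text": "virumque", "bbox": (120, 20, 300, 60), "conf": 0.40},
    ]
    facsimile, ab = ocr_to_tei(sample, (1000, 1500))
    print(ET.tostring(facsimile, encoding="unicode"))
    print(ET.tostring(ab, encoding="unicode"))
```

I've put resp and cert on the w elements rather than on the zones, on the
theory that the engine's confidence is about the reading and not the
rectangle; you could argue it the other way.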

Encoding the output of multiple OCR engines, meanwhile, is an entirely
different animal. Even with the resp and cert attributes, I don't think TEI
really lends itself to the encoding of disagreement between sources.

Misha

On Fri, Jun 19, 2015 at 9:23 PM, Benjamin Kiessling <[log in to unmask]> wrote:

> Hi,
>
> I'm working on OCR of Latin and Greek texts and looking for a more
> flexible alternative to the common hOCR format. As our results get
> converted to TEI/EpiDoc eventually anyway (and OCR itself could be
> described as an epigraphic process) it would be somewhat fortuitous
> if information like bounding boxes for lines, words, and graphemes,
> recognition confidences, and script detection could be adequately
> represented using already defined TEI primitives. In addition,
> representing the output of multiple OCR engines including different
> segmentations (word boundaries, columns, ...) would be desirable.
>
> I've had a look at the P5 guidelines but couldn't find any
> elements/attributes that could be utilized for these purposes without
> some extremely creative coercion. So I'm looking for input on how to
> achieve a non-contrived encoding of these features.
>
> All Best,
> Ben
>