A little late to the party, but I'm with Martin: assuming that your bounding boxes are defined as quadrilaterals, you should be able to translate the OCR output into the existing facsimile/surface encoding, though I've not heard of any automated way of doing so. Couple that with a resp attribute pointing to a responsibility definition of your OCR engine and a cert attribute reflecting the recognition confidence and you're in business. If the OCR output is defined in polygons... no such luck.
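
For what it's worth, a rough Python sketch of that translation might look like the following. The input format (dicts with `id`, `bbox`, and `conf` keys), the function name `ocr_to_facsimile`, and the `#ocr-engine` pointer are all invented for illustration; a real hOCR or engine-specific parser would have to feed it, and the pointer assumes a matching responsibility statement with that xml:id exists in your header. It also assumes axis-aligned boxes and that your schema accepts a numeric value on cert.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def ocr_to_facsimile(words, engine_ref="#ocr-engine"):
    """Build a TEI <facsimile> with one <zone> per recognized word.

    `words` is a hypothetical input format: a list of dicts with
    'id' (str), 'bbox' (x0, y0, x1, y1 in pixels) and 'conf' (0.0-1.0).
    `engine_ref` is an assumed pointer to a responsibility statement
    describing the OCR engine elsewhere in the document.
    """
    facsimile = ET.Element(f"{{{TEI_NS}}}facsimile")
    surface = ET.SubElement(facsimile, f"{{{TEI_NS}}}surface")
    for word in words:
        x0, y0, x1, y1 = word["bbox"]
        ET.SubElement(
            surface,
            f"{{{TEI_NS}}}zone",
            {
                f"{{{XML_NS}}}id": word["id"],
                # Upper-left and lower-right corners of the bounding box.
                "ulx": str(x0),
                "uly": str(y0),
                "lrx": str(x1),
                "lry": str(y1),
                # Who produced the reading, and how confident it was.
                "resp": engine_ref,
                "cert": f"{word['conf']:.2f}",
            },
        )
    return facsimile


if __name__ == "__main__":
    # Serialize with TEI as the default namespace rather than an ns0: prefix.
    ET.register_namespace("", TEI_NS)
    sample = [
        {"id": "w1", "bbox": (10, 20, 110, 45), "conf": 0.934},
        {"id": "w2", "bbox": (120, 20, 200, 45), "conf": 0.61},
    ]
    print(ET.tostring(ocr_to_facsimile(sample), encoding="unicode"))
```

The transcribed words themselves would then point back at these zones (for example via facs), which this sketch leaves out.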

Encoding the output of multiple OCR engines, meanwhile, is an entirely different animal. Even with the resp and cert attributes, I don't think TEI really lends itself to the encoding of disagreement between sources.

On Fri, Jun 19, 2015 at 9:23 PM, Benjamin Kiessling wrote:

I'm working on OCR of Latin and Greek texts and looking for a more
flexible alternative to the common hOCR format. As our results get
converted to TEI/EpiDoc in the end anyway (and OCR itself could be
described as an epigraphic process) it would be somewhat fortuitous
if information like bounding boxes for lines, words, and graphemes,
recognition confidences, and script detection could be adequately
represented using already defined TEI primitives. In addition,
representing the output of multiple OCR engines including different
segmentations (word boundaries, columns, ...) would be desirable.

I've had a look at the P5 guidelines but couldn't find any
elements/attributes that could be utilized for these purposes without
some extremely creative coercion. So I'm looking for input on how to
achieve a non-contrived encoding of these features.

All Best,