Hi,

I'm working on OCR of Latin and Greek texts and am looking for a more
flexible alternative to the common hOCR format. Since our results are
eventually converted to TEI/EpiDoc anyway (and OCR itself could be
described as an epigraphic process), it would be convenient if
information such as bounding boxes for lines, words, and graphemes,
recognition confidences, and script detection could be adequately
represented using already defined TEI primitives. In addition,
representing the output of multiple OCR engines, including differing
segmentations (word boundaries, columns, ...), would be desirable.

I've had a look at the P5 guidelines but couldn't find any
elements/attributes that could be used for these purposes without
some extremely creative coercion. So I'm looking for input on how to
achieve a non-contrived encoding of these features.

All Best,
Ben