Hi,
On 06/29, Misha Broughton wrote:
> A little late to the party, but I'm with Martin: assuming that your
> bounding boxes are defined as quadrilaterals, you should be able to
> translate the OCR output into the existing facsimile/surface encoding,
> though I've not heard of any automated way of doing so.
I'm currently drafting a short document describing the encoding of OCR
results using the embedded transcription scheme inside a sourceDoc
element. Since we are developing our own OCR pipeline [0], incorporating
all free OCR engines and a fork of ocropus [1], there is no need to push
this format upstream, although widespread adoption of an alternative to
hOCR would be desirable.
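To make the idea concrete, here is a minimal sketch in Python of how
line-level OCR output with rectangular bounding boxes could be written
into a sourceDoc/surface/zone/line structure. It is not the format from
the draft document: the function name ocr_to_sourcedoc, the input tuple
layout and the "#ocr_engine" pointer are assumptions made for the
example, and it assumes that a numeric confidence is acceptable as a
cert value in the schema being used.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def ocr_to_sourcedoc(page_size, lines, resp="#ocr_engine"):
    """Build a sourceDoc element from line-level OCR results.

    page_size: (width, height) of the page image in pixels.
    lines: iterable of ((x0, y0, x1, y1), text, confidence) tuples,
           a hypothetical layout chosen for this example.
    resp: pointer to a responsibility statement for the OCR engine
          (hypothetical identifier).
    """
    width, height = page_size
    source_doc = ET.Element(tei("sourceDoc"))
    surface = ET.SubElement(
        source_doc, tei("surface"),
        {"ulx": "0", "uly": "0", "lrx": str(width), "lry": str(height)},
    )
    for (x0, y0, x1, y1), text, confidence in lines:
        # One zone per recognized line, carrying its bounding box.
        zone = ET.SubElement(
            surface, tei("zone"),
            {"ulx": str(x0), "uly": str(y0), "lrx": str(x1), "lry": str(y1)},
        )
        # Assumption: the recognition confidence goes into cert as a number.
        line = ET.SubElement(
            zone, tei("line"), {"resp": resp, "cert": f"{confidence:.2f}"}
        )
        line.text = text
    return source_doc


if __name__ == "__main__":
    doc = ocr_to_sourcedoc((1000, 1500), [((10, 20, 300, 60), "Lorem ipsum", 0.934)])
    print(ET.tostring(doc, encoding="unicode"))
```

Polygonal regions are deliberately left out; the sketch only covers the
rectangular case discussed in this thread.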
> Couple that with a resp attribute pointing to a responsibility
> definition of your OCR engine and a certainty attribute reflecting the
> recognition confidence and you're in business. If the OCR output is
> defined in polygons... no such luck.
> Encoding the output of multiple OCR engines, meanwhile, is an entirely
> different animal. Even with the resp and cert attributes, I don't think TEI
> really lends itself to the encoding of disconcensus.
I've come to the conclusion that choosing from multiple segmentations is
probably too complex a problem, and that it is better to create the OCR
output on the basis of a single page segmentation; with our pipeline it
is possible, for example, to run ocropus/kraken on a page segmentation
from tesseract. Combining these already aligned outputs seems more
reasonable. I was planning to use resp attributes in combination with
alt elements to encode divergences in recognition.
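As a rough illustration of the resp/alt idea, the following Python
sketch takes the readings that several engines produced for the same
zone, merges identical readings, and links differing ones with an alt
element in exclusive mode. The function name encode_divergence, the
xml:id scheme and the engine pointers are hypothetical, and the exact
placement of alt would still need to be checked against the schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def encode_divergence(zone_id, readings):
    """Encode aligned readings of one zone from several OCR engines.

    readings: list of (resp_pointer, text) pairs, all produced on the
              same page segmentation (hypothetical input layout).
    """
    zone = ET.Element(tei("zone"), {XML_ID: zone_id})

    # Group engines by the text they produced, keeping first-seen order.
    grouped = {}
    for resp, text in readings:
        grouped.setdefault(text, []).append(resp)

    targets = []
    for n, (text, resps) in enumerate(grouped.items(), start=1):
        line_id = f"{zone_id}_r{n}"
        line = ET.SubElement(
            zone, tei("line"), {XML_ID: line_id, "resp": " ".join(resps)}
        )
        line.text = text
        targets.append(f"#{line_id}")

    # Only emit alt when the engines actually disagree.
    if len(targets) > 1:
        ET.SubElement(
            zone, tei("alt"), {"target": " ".join(targets), "mode": "excl"}
        )
    return zone


if __name__ == "__main__":
    zone = encode_divergence(
        "z1",
        [("#tesseract", "modern"), ("#kraken", "modem"), ("#ocropus", "modern")],
    )
    print(ET.tostring(zone, encoding="unicode"))
```

Grouping identical readings keeps the output small when the engines
agree, which should be the common case on clean pages.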
All Best,
Ben
[0] http://openphilology.github.io/nidaba
[1] http://kraken.re