On 2015-06-30 01:49 AM, Misha Broughton wrote:
> A little late to the party, but I'm with Martin: assuming that your
> bounding boxes are defined as quadrilaterals, you should be able to
> translate the OCR output into the existing facsimile/surface encoding,
> though I've not heard of any automated way of doing so.
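For rectangular boxes that translation is essentially mechanical. A minimal sketch in Python (assuming hOCR's usual `title="bbox x0 y0 x1 y1"` convention, and mapping onto the `ulx`/`uly`/`lrx`/`lry` attributes of a TEI zone; the function name is just illustrative):

```python
import re

def hocr_bbox_to_zone_attrs(title):
    """Extract an hOCR bounding box ("bbox x0 y0 x1 y1") from a title
    attribute and map it to TEI zone coordinate attributes."""
    m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if m is None:
        raise ValueError("no bbox in title attribute: %r" % title)
    x0, y0, x1, y1 = map(int, m.groups())
    # Both hOCR and TEI zone use upper-left / lower-right corners.
    return {"ulx": x0, "uly": y0, "lrx": x1, "lry": y1}

print(hocr_bbox_to_zone_attrs("bbox 153 210 1021 265; baseline 0 -8"))
```

The coordinate systems line up directly, so no geometry beyond string parsing is needed for the axis-aligned case.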
Yesterday at the TEI Hackathon in Sydney, we managed to create a working
process that starts from pre-prepared page-images (with filenames that
sort in the correct sequence) and does this:
1. Runs Tesseract to create OCR files from them in HOCR format.
2. Runs XSLT that merges those files into a single TEI file, with
surfaces and zones, and with the OCRed text tagged in div, p and lb
elements linked to the zones.
3. Runs XSLT on that result directly in the browser to create a simple
HTML view with each page image arranged alongside its OCRed
transcription for basic proofing and correction.
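The core of step 2 can be sketched in a few lines of Python (our actual Hackathon code is XSLT; this is just an illustrative stand-in, and the element names follow the TEI facsimile module):

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

def hocr_lines_to_surface(hocr, image):
    """Turn the ocr_line boxes of one hOCR page into a TEI surface
    with a graphic and one zone per recognized line."""
    surface = ET.Element("{%s}surface" % TEI_NS)
    ET.SubElement(surface, "{%s}graphic" % TEI_NS, url=image)
    root = ET.fromstring(hocr)
    for span in root.iter():
        if span.get("class") == "ocr_line":
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)",
                          span.get("title", ""))
            if not m:
                continue
            x0, y0, x1, y1 = m.groups()
            ET.SubElement(surface, "{%s}zone" % TEI_NS,
                          ulx=x0, uly=y0, lrx=x1, lry=y1)
    return surface

sample = ('<div class="ocr_page">'
          '<span class="ocr_line" title="bbox 10 20 400 60">Lorem ipsum</span>'
          '</div>')
print(ET.tostring(hocr_lines_to_surface(sample, "page-001.png"),
                  encoding="unicode"))
```

The XSLT version additionally copies the line text into lb-delimited transcription linked back to each zone by @facs, but the surface/zone scaffolding is the same.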
It's very quick and dirty, but the code, along with our test files, is here:
It currently only works on *nix with ant, Saxon and Tesseract installed. It
might be a good starting point for someone serious about setting up such
a workflow for a large document collection.
> Couple that with
> a resp attribute pointing to a responsibility definition of your OCR
> engine and a certainty attribute reflecting the recognition confidence
> and you're in business. If the OCR output is defined in polygons... no
> such luck.
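That combination is easy to generate mechanically. A small sketch (assuming @cert carries the normalized confidence and @resp points at a respStmt for the engine; the "#tesseract" pointer is illustrative, and note that some TEI customizations constrain @cert to keywords like "high"/"low" rather than a probability):

```python
import xml.etree.ElementTree as ET

def zone_with_confidence(bbox, conf, resp="#tesseract"):
    """Build a TEI zone carrying an OCR confidence in @cert and an
    engine pointer in @resp. hOCR confidences (x_wconf) run 0-100,
    so we normalize to a 0-1 probability."""
    ulx, uly, lrx, lry = bbox
    return ET.Element("zone",
                      ulx=str(ulx), uly=str(uly),
                      lrx=str(lrx), lry=str(lry),
                      resp=resp, cert="%.2f" % (conf / 100.0))

z = zone_with_confidence((153, 210, 1021, 265), 91)
print(ET.tostring(z, encoding="unicode"))
```

Since @resp and @cert are globally available in P5, this validates without any customization; the respStmt itself would sit in the teiHeader.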
> Encoding the output of multiple OCR engines, meanwhile, is an entirely
> different animal. Even with the resp and cert attributes, I don't think
> TEI really lends itself to encoding disagreement between engines.
> On Fri, Jun 19, 2015 at 9:23 PM, Benjamin Kiessling wrote:
> I'm working on OCR of Latin and Greek texts and looking for a more
> flexible alternative to the common hOCR format. As our results
> eventually get converted to TEI/EpiDoc anyway (and OCR itself could
> be described as an epigraphic process), it would be convenient if
> information like bounding boxes for lines, words, and graphemes,
> recognition confidences, and script detection could be adequately
> represented using already-defined TEI primitives. In addition,
> representing the output of multiple OCR engines, including different
> segmentations (word boundaries, columns, ...), would be desirable.
> I've had a look at the P5 guidelines but couldn't find any
> elements/attributes that could be utilized for these purposes without
> some extremely creative coercion. So I'm looking for input on how to
> achieve a non-contrived encoding of these features.
> All Best,