Hi Syd,


In case it is useful, about 8 years ago I converted some OmniPage OCR XML output to TEI for someone. See the zip file at http://james.blushingbunny.net/omnipage2TEI/. I probably wouldn't write it the same way nowadays (both the XSLT and the form of the TEI output). I don't know whether anyone still uses that format.


Many thanks,

James 


--

Dr James Cummings

School of English Literature, Language, and Linguistics, Newcastle University


From: TEI (Text Encoding Initiative) public discussion list on behalf of Syd Bauman
Sent: 17 August 2018 14:23:32
Subject: OCRed XML samples
 
Hey, all. I'm looking to get my hands on sample XML output of a few
OCRed pages in various formats. The formats I know of are:
 * ALTO [1]
 * ABBYY FineReader [2]
 * PAGE [3]
 * hOCR [4]
but since I never do any significant OCR myself, I don't know which
ones are good vs bad or common vs rare, and thus if these are even
the ones I should be asking for.

So, if you have relatively easy access to OCR software that produces
XML output, and would not mind sending me a sample, please get in
touch off list. Thank you!

P.S. Why? To work on providing crosswalks from the OCR XML formats
     to the TEI in Libraries version 4.0 level 1 encoding. Thanks
     again.

Notes
-----
 [1] https://www.loc.gov/standards/alto/
 [2] https://abbyy.technology/en:features:ocr:xml
 [3] Had trouble finding an informal definition, but the schemas are at
     http://www.primaresearch.org/schema/PAGE/gts/pagecontent/2017-07-15/pagecontent.xsd
 [4] https://en.wikipedia.org/wiki/HOCR