On 2 Apr 2013, at 20:08, Dana Dorman wrote:
> Is there a simple way to retain MS Word page breaks in transformations from .doc files to TEI P5 XML?
I see no problem in converting Word page breaks to <pb/> in general; getting them out of ODT should
be easy too, if needed.
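
To illustrate the general idea (this is not how OxGarage does it internally), here is a minimal sketch that reads word/document.xml out of a .docx and emits <pb/> wherever Word stored an explicit page break (w:br with w:type="page"). The function names and the flat list of <p> strings are assumptions made purely for illustration. Soft page breaks (w:lastRenderedPageBreak) and the "page break before" paragraph property are deliberately ignored here, as are footnotes and all other markup.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def paragraphs_with_page_breaks(document_xml):
    """Turn each w:p into a '<p>...</p>' string, with '<pb/>' at hard page breaks.

    Hypothetical helper: only explicit breaks (w:br w:type="page") are handled.
    """
    root = ET.fromstring(document_xml)
    fragments = []
    for para in root.iter(W + "p"):
        parts = []
        for node in para.iter():
            if node.tag == W + "br" and node.get(W + "type") == "page":
                parts.append("<pb/>")
            elif node.tag == W + "t" and node.text:
                parts.append(escape(node.text))
        fragments.append("<p>" + "".join(parts) + "</p>")
    return fragments


def docx_to_tei_fragments(docx_bytes):
    """Open a .docx (a zip archive) held in memory and convert its main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return paragraphs_with_page_breaks(archive.read("word/document.xml"))
```

Calling docx_to_tei_fragments(open("input.docx", "rb").read()) on a real file (the file name is a placeholder) returns one string per Word paragraph; a real conversion would of course also need to number the <pb/> elements and carry over footnotes.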
I would very much recommend that you read the .doc files into Word, save as .docx, and then try
the OxGarage transformations on the result. Feeding a .doc file to OxGarage forces it to use OpenOffice
to convert .doc to .odt, and then .odt to TEI XML. That has two problems: a) you can't rely on OpenOffice
to convert .doc 100% reliably, and b) OxGarage's conversion from .odt to TEI XML is much less
mature than its conversion from .docx to TEI XML.
You say "But when I try that option, I also lose the @n attributes showing 300+ footnote numbers in the document, so I'm not keen on going that route."
which puzzles me. If you send me a test file, I will endeavour to find out why it doesn't work as you expect.
PS don't believe anyone who tells you OpenOffice reads and writes .docx OOXML format reliably :-}
Director (Research) of Academic IT
University of Oxford IT Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431