Hi,

I've been struggling with such issues for the last couple of weeks. We're 
a partner in a correspondence edition project which has engaged a limited 
number of (elderly) volunteers who in a first phase will be checking 
existing transcriptions (in DOCX format), and possibly enriching them with 
more information. In order to stay as close as possible to both the 
volunteers' comfort zone and the existing Word files, we've started a 
first editing round in Word for which we've defined a limited number of 
Word styles, mainly to identify implicit text structures. Such styles 
will aid in a subsequent conversion to XML to get decent minimally 
structured TEI out of it. For later editing rounds, I /hope/ to get 
(some of) the volunteers comfortable with a tightly configured graphical 
XML editing environment. Since Oxygen Web Author was out of reach 
budget-wise, I'm exploring XMLMind (http://www.xmlmind.com/xmleditor/), 
which has a number of pluses for this project (projects like ours are 
covered by the free license, it's highly configurable, can work with 
remote files on Google Drive, can be offered fully (and hence centrally) 
configured as a downloadable program that doesn't need installation).

Introducing the volunteer group to a more structured way of "styling" 
Word documents proved challenging enough, but I hope this way of 
approaching the transcriptions can function as an introduction to 
working in a more structured graphical XML editing environment, with the 
added trigger that such an environment could unleash their full 
potential for enriching the transcriptions with further information 
(annotations, named entities, additions / deletions / unclear readings, 
...) they'll be craving to add but don't have the means for in Word (see 
below). Additionally, since restructuring existing XML structures is IMO 
one of the hardest parts for novices (who in this case aren't interested 
in these encoding aspects anyway), the "structuring" phase in Word 
should allow us to derive properly structured XML that will let them 
to concentrate on further enrichment with more information. With proper 
configuration of the editor and if all goes well, this could then take 
the form of selecting text and applying the right action ("mark as 
deletion"), much like applying styles in Word.

Of course, the resulting TEI texts will be proofed and edited further by 
project staff who will be working with the actual XML code, most 
probably in Oxygen.

On 27/04/2017 15:42, Sewell, David R. (drs2n) wrote:
> On Thu, 27 Apr 2017, Martin Holmes wrote:
>
> [...]
>> XML is not hard. Word, by contrast, is a concoction of frustrations, 
>> and getting DOCX into decent TEI when you're done is horribly difficult.
>
> If you create a Word template with styles (whether built-in or custom) 
> that are sufficient to define each block-level and inline element that 
> needs to be expressed in XML, then it's certainly feasible to set up a 
> workflow involving a translation tool like oXgarage. It mainly 
> requires careful analysis of the result of oXgarage conversion in 
> order to create a further XSLT transform to produce your desired final 
> output. 

I agree, as long as there's no overlap in the text structures or 
phenomena you want to express with Word styles, since Word styles don't 
nest within other styles of the same level (paragraph or character). For 
example, if you have defined two paragraph-level styles for indicating 
verse lines and block quotations, you can't combine them for marking a 
verse line inside a block quotation, since only one paragraph-level style 
can be applied to a paragraph at a time. Equally, if you have defined two 
character-level styles for additions and deletions, and try to mark a 
deletion inside an added text fragment, Word will instead fragment this 
into [first bit of text with style for addition] [text with style for 
deletion] [rest of text with style for addition]. You'll end up with a 
flat sequence of separate styles, which after conversion will translate 
into:

   <add>first bit of text with style for addition</add>
   <del>text with style for deletion</del>
   <add>rest of text with style for addition</add>

...where instead what you really want to express is:

   <add>first bit of text with style for addition
     <del>text with style for deletion</del>
   rest of text with style for addition</add>
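
To make the flattening concrete, here is a minimal Python sketch of a 
naive style-to-element conversion. It assumes each run of text carries 
at most one character style, as described above. The style names and 
the runs_to_tei helper are made up for illustration; this is not how 
any particular conversion tool works internally.

```python
from itertools import groupby
from xml.sax.saxutils import escape

# Hypothetical mapping from Word character-style names to TEI elements.
STYLE_TO_TEI = {"Addition": "add", "Deletion": "del"}


def runs_to_tei(runs):
    """Convert a flat list of (style, text) runs into a flat TEI string.

    Adjacent runs with the same style are merged into one element.
    Nothing is ever nested, because a run has at most one style.
    """
    parts = []
    for style, group in groupby(runs, key=lambda run: run[0]):
        text = escape("".join(t for _, t in group))
        tag = STYLE_TO_TEI.get(style)
        # Unstyled or unknown styles are emitted as plain text.
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return "".join(parts)


# A deletion made inside an addition arrives as three sibling runs.
runs = [
    ("Addition", "first bit "),
    ("Deletion", "struck out"),
    ("Addition", " rest"),
]
flat = runs_to_tei(runs)
print(flat)
```

The input holds no trace of the fact that the deletion sits inside the 
addition, so the converter can only emit three sibling elements.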

I don't think it's possible to up-convert such fragments (in all their 
possible combinations) automatically in a meaningful way. The resulting 
TEI text would always need thorough checking and restructuring, which 
would cause extra work instead of saving time. Even creating separate 
"synthetic" styles for such combinations (e.g. deletion-within-addition, 
verseLine-within-quotation, ...) would quickly become unwieldy (of 
course, all kinds of structures can nest in all kinds of combinations 
and levels) and merely complicate the Word step without substantially 
improving the resulting XML.

If my hopes come true, this could provide a pragmatic workflow for this 
project, where a Word step could both be useful and function as a 
didactic means towards a more structured editing environment.

Best,

Ron