Örn, Sebastian and TEI-L,

Well I'm glad to be accused of over-egging the cake, if that's what it
takes. (A world of meaning is implied in that criticism -- which I
admit might be just -- even if it says nothing specifically. :-)

What Sebastian's idea has in common with mine is that the problem is
conceived in terms of transformations rather than as a hand-conversion
in which mapping policies can be (and probably will be, at least to a
great extent) specified only implicitly. Because they are effected
through automatic means, transformations will not only require you to
make your tagging decisions consciously, but will also expose any
inconsistencies and help document your decisions.
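To make that concrete, here is a minimal sketch (Python, purely illustrative; the Word style names and the mapping table are my assumptions, not drawn from any real project) of why an explicit transformation exposes inconsistencies where hand-conversion would silently absorb them:

```python
# Hypothetical mapping from Word paragraph styles to TEI elements.
# Writing the table down at all is the "conscious decision" step.
STYLE_MAP = {
    "Heading 1": "head",
    "Body Text": "p",
    "Quotation": "quote",
}

def map_styles(paragraphs):
    """Given (style, text) pairs, return tagged output plus a report
    of every style the mapping does not cover."""
    tagged, unmapped = [], []
    for style, text in paragraphs:
        if style in STYLE_MAP:
            el = STYLE_MAP[style]
            tagged.append(f"<{el}>{text}</{el}>")
        else:
            # An automatic pass cannot quietly improvise here: the
            # inconsistency is surfaced instead of being lost.
            unmapped.append(style)
    return tagged, sorted(set(unmapped))

tagged, unmapped = map_styles([
    ("Heading 1", "Introduction"),
    ("Body Text", "Some prose."),
    ("Body Txt",  "A typo in the style name."),  # gets reported, not guessed at
])
print(unmapped)  # ['Body Txt']
```

A hand-converter would likely "fix" the stray "Body Txt" without ever recording that a decision was made; the transformation forces the policy into the open.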

Where the two ideas differ is in my assumption that your intermediate
format should be clean, semantically rich, and as simple as possible
-- i.e. the farthest thing from WordML conceivable.

Now it may be that there wouldn't be that much difference when it came
down to it. This would be the case if there isn't enough consistency
of format and structure (and implicit representation of semantics
therein) in the original Word document for such an ideal format as
I've envisioned to be interpolated into it. I admit that even a little
inconsistency -- at least if it is regarded as information to be
captured and not errors to correct -- might be enough to make my cake
fall.

If that's the case, then Sebastian's idea of taking the material
through a loose (and semantically weak) TEI on the way to a stronger
one is attractive. It would have the advantage of exposing the issues
in a way more tractable to analysis and processing, even while it
allowed you to postpone hard choices.

However, if on the contrary the original Word is very consistent and
semantically strong, there are fewer hard choices to make (at least
with respect to what the stuff "is", if not always how to express it
in TEI), and you'd only postpone exposing the "actual" semantics
(latent, to be sure, in a presentational format such as Word) and
indeed risk losing them, on the way in or the way out.

You might take a chunk of your text -- a representative sample or
representative selection of samples (representing, that is, the range
of variation) -- and try both approaches to see which one works
better. The "fun" test should not be forgotten. In part, this is
because such a decision is best made in view of local contingencies,
such as who is doing the work, how they conceive of it, and what
skills they bring.
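For Sebastian's first two problems below -- deciding what a bit of info "is" and separating it out, "likely by some complex regular expression" -- a first pass might look like this minimal sketch (Python; the citation patterns are assumptions about one hypothetical compiler's style, not a general solution):

```python
import re

def tag_fields(entry):
    """First pass: wrap recognizable bibliographic pieces in TEI-ish
    inline markup, leaving anything unrecognized untouched so that it
    surfaces for human review in the second pass."""
    # "pp. 34-56" or "p. 34" -> page scope (pattern is an assumption)
    entry = re.sub(r"\bpp?\.\s*(\d+(?:-\d+)?)",
                   r'<biblScope unit="page">\1</biblScope>', entry)
    # "vol. 12" -> volume scope (pattern is an assumption)
    entry = re.sub(r"\bvol\.\s*(\d+)",
                   r'<biblScope unit="volume">\1</biblScope>', entry)
    return entry

print(tag_fields("Smith, J., vol. 12, pp. 34-56"))
```

The value of the approach is less in the patterns themselves than in the fact that whatever the patterns fail to match is exactly where the compiler was inconsistent -- problem 3 falls out of problems 1 and 2.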


On Thu, Dec 20, 2012 at 5:59 AM, Sebastian Rahtz
<[log in to unmask]> wrote:
> I wonder if Wendell isn't over-egging the cake a bit here (appropriate, I suppose, for the
> time of year). Saving the bibliography file in  Word's OOXML format (docx)
> gives you an intermediate XML file in a rational form which is amenable
> to transformation.
> The problems Örn has are clear:
>  1. what _is_ this bit of info:  a page number, an author, a volume number
>  2. how do I separate it out reliably (likely by some complex regular expression)
>  3. what do I do when the compiler was inconsistent
> one might as well use the TEI vocabulary for 1., and in practice it's sensible to do this
> in two passes: one to get the markup in place, the second to get it into the required structure.
> Staring at the example again, I am wondering heretically whether you might not
> abandon the attempt to turn it into a bibliography, and just make it a paragraph
> with lots of rich inline markup and cross-referencing. Maybe do that and then
> extract a subset as a formal bibliography? But perhaps that's effectively what you
> have in mind with the extensive <note>.
> I blanch at the sight of an <lb/> in the <title> of a <monogr> :-}
> --
> Sebastian Rahtz
> Director (Research Support) of Academic IT Services
> University of Oxford IT Services
> 13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431

Wendell Piez |
XML | XSLT | electronic publishing
Eat Your Vegetables