Agreeing with everything Michael has said below, I would add a note.
We are building a system to perform this task and testing it on some
multi-lingual dictionaries, Basque-English and English-Basque. We are using
version 1 (in C++) to do the testing while we build version 2 in Java. The aim
of the system is to allow a user to use a typical entry to specify the entry's
structure via a GUI working directly on the entry. We preserve that
description of the structure through an FSA and then apply it to other
records. The user continues by seeing how the current structure describes each
entry and then corrects that structure for non-conformant entries, working
successively through each entry in the RTF file. The system thus learns the
rules of structure by example. It is not a method that guarantees all structure can be
discriminated but it should give a good amount of economy of effort compared
to hand markup.
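As a rough illustration of the idea (not our actual implementation), the
sketch below treats an entry as a sequence of formatting runs and applies a
small FSA to it. The style names, state names and output tags are all
hypothetical, chosen only for this example. An entry the FSA cannot follow
comes back as None, which is the point where the user would correct the
structure.

```python
from xml.sax.saxutils import escape

# Hypothetical FSA, as it might look after the user has described one
# typical entry: (current state, run style) -> (next state, output tag).
TRANSITIONS = {
    ("start", "bold"): ("headword", "form"),
    ("headword", "italic"): ("pos", "pos"),
    ("pos", "plain"): ("sense", "def"),
    ("sense", "plain"): ("sense", "def"),
}
ACCEPTING = {"sense"}


def mark_up(runs):
    """Tag an entry given as (style, text) runs.

    Returns a list of (tag, text) pairs, or None if the entry does not
    conform to the current description of the structure.
    """
    state = "start"
    tagged = []
    for style, text in runs:
        step = TRANSITIONS.get((state, style))
        if step is None:
            return None  # non-conformant: hand back to the user
        state, tag = step
        tagged.append((tag, text))
    return tagged if state in ACCEPTING else None


def to_xml(tagged):
    """Serialise tagged runs as a flat <entry> element."""
    body = "".join(f"<{tag}>{escape(text)}</{tag}>" for tag, text in tagged)
    return f"<entry>{body}</entry>"
```

For example, the runs [("bold", "etxe"), ("italic", "n."), ("plain", "house")]
are accepted and serialise as
<entry><form>etxe</form><pos>n.</pos><def>house</def></entry>, whereas an
entry that starts with an italic run is rejected.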
The concept is of course applicable to any document, not just dictionaries. We
expect to have version 2 running in a month.
We have been searching for articles on this topic. If anyone has a
bibliography or even some references, we would be grateful for them.
Date: Wed, 17 Oct 2001 23:19:53 +0100
From: Michael Beddow
On Wednesday, October 17, 2001 9:09 AM, "florentmom" wrote:
> Hello, i have a little question :
> - do you know a software to convert from word, rtf and/or xpress
> to xml using the teixlite dtd?
Actually it's a very big question. The simple answer is that there is no
such piece of software, and never could be, because the meta-information
in a Word or RTF file is about the presentation of the document, whereas
the teilite dtd models document structure, so conversion can never be
mechanical. There are various tools that will have a go at wrapping Word
paras in <p>s, enclosing italicised characters in <i>'s etc, but that's
a long way from TEI even at its "litest".
Having said that, there are text tool collections (TUSTEP) and
specialised re-writing languages (Omnimark) that will, with some
patience and effort, allow you to glean clues about structure from
presentation and do some elementary XML markup for you which can then be
further refined. If the Word files concerned haven't been created yet,
and you can define a carefully-graded set of character and paragraph
styles, and (the really hard part) ensure that the authors/transcribers
use those styles with absolute consistency, this can get you off to a
reasonable start. If the documents already exist and were never styled
with a view to upconversion into structural markup, you have a much
You might like to look at my piece on Digitising the Anglo-Norman
which describes one approach to this problem (the source documents are
in Word, though the target tagset is not teilite but XML markup based on
P4 section 12) based on OpenOffice, Perl and XSLT.
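To make the style-based route described earlier a little more concrete, here
is a minimal sketch in Python. It is not the method used in the piece just
mentioned. It assumes the paragraphs have already been extracted from the
word-processor file as (style name, text) pairs, and the style names and the
mapping to elements are hypothetical. Paragraphs in an unmapped style fall
back to <p> and are reported, since inconsistent styling is exactly where
manual work starts.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from paragraph style names to target elements.
STYLE_TO_ELEMENT = {
    "Heading 1": "head",
    "Body Text": "p",
    "Block Quote": "q",
}


def upconvert(paragraphs):
    """Wrap each (style, text) paragraph in the element mapped to its style.

    Returns (lines, unknown): the marked-up lines, plus the style names
    that had no mapping and were wrapped in <p> as a fallback.
    """
    lines, unknown = [], []
    for style, text in paragraphs:
        element = STYLE_TO_ELEMENT.get(style)
        if element is None:
            unknown.append(style)
            element = "p"
        lines.append(f"<{element}>{escape(text)}</{element}>")
    return lines, unknown
```

The list of unknown styles gives a quick measure of how consistently the
styles were actually applied.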
Michael Beddow http://www.mbeddow.net/
XML and the Humanities page: http://xml.lexilog.org.uk/
The Anglo-Norman Dictionary http://anglo-norman.net/
The meaning of your communication is the response you get
------- End of Forwarded Message