Let me argue both sides of this case:
 
Firstly, I have an action from a recent TEI workshop to come up with
an example of a minimal TEI-conformant text.  I'll certainly take
heed of the discussion taking place in this thread while I'm
doing the drafting.
 
But secondly, take a look at the way most people manipulate texts with
computers.  They use more-or-less WYSIWYG word-processors, with the
result that what they see on the screen and on the printed page is a
representation of the (generally layout-orientated) semantics
represented by the data in a file, rather than a dumb dump of each byte
in the file.  Certainly, it is possible to manipulate text using dumb
dump methodologies: those of us who use TeX or troff do it as a matter
of course.  But we have to be pretty motivated (and/or set in our ways)
to use TeX or troff these days: most computer users, if these tools
were all that were available, just wouldn't bother to process words at
all.  Similarly, desk-top publishing would never have caught on if the
only way to achieve it had been to use a dumb editor to create
typesetting tapes or PostScript.
 
So it is, I contend, with any mark-up, SGML-based or not,
TEI-conformant or not, which adds any significant amount of information
to that carried by the words of a text alone.  Certainly, as a newcomer
to this field of study, I find that most of the ad hoc and mutually-
incompatible mark-up schemes developed over the years by researchers in
the humanities are pretty efficient at submerging the words in the
underlying text -- although TEI-conformant mark-up can excel in this
task by allowing so many classes of features to be described in such
detail, giving greater scope for obfuscation than mark-up schemes
addressing more limited domains of interest.
 
Let me emphasise: the problem is not the TEI qua TEI; it is the desire
for a portable mark-up scheme which has to provide for the needs of all
users of marked-up texts, rather than simply catering for the specific
needs of the researcher capturing the text.  Switching from, say,
WordPerfect to Word because you have moved to an employer who has
decided to standardize on the latter for document interchange is
painful.  Similarly, portability of electronic texts has a price which
must be paid by those who capture them -- particularly those who are
used to using their own favourite, non-portable, format.  But
portability also has a pay back: researchers should have to capture
fewer texts themselves because they can more easily re-use texts
captured by others.  If you don't believe the pain is worth the gain,
then you can ignore the recommendations of the TEI.  But it is part of
the TEI's job to try to convince you that the pain level is not that
great -- hence the need for examples.
 
There remains the problem of not being able to see the wordy wood for
the mark-up trees.  We must learn from the word-processor experience
and show the user a comprehensible representation of the contents of
the file, rather than one glyph for each character code in the file.
Of course, this is possible for any mark-up scheme, and has been done
for some -- ICE, for example.  The advantage of using an SGML-based
mark-up is that the tools that one needs to get started on the job are
available commercially -- have been for several years now -- and the
TEI has been able to negotiate attractive academic licence rates for
some of them.  (As an aside, with the introduction of WordPerfect
MarkUp, an SGML-aware add-on for the world's best-selling
word-processing package, those of us who live to make strange marks in
text are in danger of being sucked into the world of WYSIWYG anyway.)
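
As a rough illustration of showing the words rather than one glyph per
character code, here is a minimal Python sketch.  It is not any of the
tools mentioned above: it leans on the standard library's lenient
html.parser rather than a real SGML parser, and the function name, the
sample sentence and its tags are all made up for the example.

```python
from html.parser import HTMLParser


class WordsOnly(HTMLParser):
    """Collect the character data of a marked-up text, ignoring the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped.
        self.parts.append(data)


def reading_view(marked_up):
    """Return the words of the text with mark-up hidden and spacing tidied."""
    parser = WordsOnly()
    parser.feed(marked_up)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical sample; the element names are illustrative only.
sample = ('<p>The <hi rend="italic">wordy</hi> wood, seen without '
          'the <term>mark-up</term> trees.</p>')
print(reading_view(sample))
```

A real viewer would of course render the mark-up (italics, headings and
so on) rather than simply discard it; the sketch only shows that the
file and its presentation need not be the same thing.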
 
A growing number of tools -- to which early TEI-conformant projects
such as the British National Corpus may be expected to add -- are also
freely available.  These tools should be easily adapted to any
TEI-conformant mark-up, so fostering the reuse of electronic texts
that is the motivation behind the TEI.
 
You may have noticed me sidling into the future conditional.  I admit
that not all of these things are with us yet -- although a lot can
already be achieved with current SGML-aware editors.  There is also the
problem of the conversion of existing mark-up schemes into a form which
can take advantage of such tools.  On the British National Corpus
project, we have had a lot of trouble converting from unverifiable
mark-ups (that is, mark-ups for which no syntax checker exists), to a
mark-up which may trivially be checked for syntactic correctness by an
SGML parser.  The process brings to light errors in electronic source
texts which had hitherto gone undetected, and which must be corrected
by hand, an expensive process which often requires reference to the
original printed text.  (Apart from anything else, it's rather
disheartening to realise that a text that one has painstakingly
captured suffers from syntactic errors.)  We have not been able to use
SGML-aware editors for this task, as, in our experience, they can only
be used to create new documents which conform to a particular document
type definition; they cannot be used to bring initially
syntactically-incorrect texts into line with a DTD.  I suspect that
anybody with a body of electronic texts would experience similar
problems, and this is a disincentive to conversion.
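
To make "checked for syntactic correctness" concrete, here is a minimal
Python sketch of the simplest such check: that tags are properly
nested.  It is an assumption-laden toy, not an SGML parser and not the
procedure used on the British National Corpus: it knows nothing of a
DTD, of omitted end-tags or of empty elements (which it would report as
never closed), and the function name is hypothetical.

```python
import re

# Matches a start-tag or end-tag and captures its element name.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*>")


def nesting_errors(text):
    """Return a list of messages about badly nested or unclosed tags."""
    errors = []
    stack = []  # (element name, line number) for each element still open
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TAG.finditer(line):
            closing = match.group(1) == "/"
            name = match.group(2).lower()
            if not closing:
                stack.append((name, lineno))
            elif stack and stack[-1][0] == name:
                stack.pop()
            else:
                errors.append(f"line {lineno}: unexpected </{name}>")
    # Anything left on the stack was opened but never closed.
    errors.extend(f"line {n}: <{name}> never closed" for name, n in stack)
    return errors


for message in nesting_errors("<p>A <hi>captured text</p>"):
    print(message)
```

Even a check this crude points at the line where a text goes wrong,
which is the kind of report needed when errors have to be corrected by
hand against the printed original.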
 
A related issue is that of the conversion of existing tools, such as
taggers and indexers, to be aware of -- and eventually to make
intelligent decisions based upon -- SGML mark-up.  This has yet to be
done.  Once it has been done, the resulting tools will be more powerful
than those we have today, and much more widely applicable.  But it
hasn't been done yet -- unless, in the case of an indexer, you have
deep enough pockets to afford Basis' rates.  (I'm sure other vendors
will jump into the discussion if they have applicable products.)
 
These and other issues will figure in a session, "Problems in producing
a large text corpus", at the forthcoming AHC/ALLC conference.  I hope,
though, that the benefits of conversion for old texts, and capture in
TEI-conformant format for new texts, will become apparent as time
passes.
---
Dominic