On Tuesday, April 08, 2003 4:35 PM
Tim Seid wrote:
> At the outset of a project creating a digital library of old Quaker texts
> for the 17th c on, is it necessary or advisable to have an expert in text
> encoding and literary texts develop a DTD to encode elements that might
> be available in TEI-Lite?
Ah well, it's another sign that TEI-Lite should now be considered harmful if
you've gained the impression that the choice is between TEI-Lite prêt-à-porter
and the expert and expensive (re-?)invention of specialist extensions.
Far from it. TEI-Lite will probably not suffice for a project of this kind.
But equally probably, the full tagset repertoire will offer almost
everything you need. You don't need an expert to bake your own TEI pizza.
Where you would indeed benefit from an expert is in deciding what to pick
from the chef's menu.
But be wary of outsourcing all your expertise. Invest some of your funding
in giving at least some of your in-house team a sufficient grasp of encoding
in general, and of the TEI in particular, to engage in intelligent and
critical dialogue with any bought-in expert. There will
be all sorts of decisions, tactical as well as strategic, where the right
choice can only be made by someone who is steeped in the scholarly issues
specific to your project and yet sufficiently versed in the technicalities
of encoding to know, at least in broad terms, what is possible and
desirable. Such a person can't be bought in, but has to be home-grown.
> If the texts
> are further encoded later on, what does that do to the creation of a
> search engine? Does the search engine then have to be updated every time a new
> element is made available within the database?
No search engine worthy of the name would be fazed by a change in document
structure. By one worthy of the name, I mean an indexing and retrieval system
that understands the structure of your documents as defined in the DTD, and so
can adapt to any alterations that are duly reflected in that DTD. There are
some search systems (often dedicated to specific projects, and sometimes
created by internal enthusiasts) that are not truly structure-aware, but treat
SGML and XML documents as plain text streams with funny angle brackets and
equals signs stuck in at odd places. These have to resort to smoke and mirrors
to look as though they understand the structure of the documents, and so can
indeed be broken quite easily if that structure changes and their bluff is
called. But provided you avoid that kind of "solution", you should have no
worries on that score. Which doesn't mean it isn't wise to think long and
hard, fairly early on, about what sorts of searches you envisage your users
wanting to perform, because that may help you make markup decisions that
help rather than hinder them.
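To make "structure-aware" concrete, here is a toy sketch in Python (the
element names and the fragments are my own invention, purely for
illustration, not a recommendation): an indexer that discovers the document
structure as it goes, and so survives the later addition of a new element
without any change to its code.

```python
# A toy sketch of a structure-aware indexer: it walks whatever element
# structure the document actually has, so adding a new element later
# requires no change to the indexing code. Element names are illustrative.
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(xml_text):
    """Map each element name to the set of words occurring anywhere inside it."""
    index = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter():
        for word in " ".join(elem.itertext()).split():
            index[elem.tag].add(word.lower())
    return index

# A fragment using TEI-Lite-style elements:
doc_v1 = "<div><p>Friends met at <name>Swarthmoor</name> Hall</p></div>"
# The same passage, later re-encoded with an additional element:
doc_v2 = ("<div><p>Friends met at <name>Swarthmoor</name> "
          "<placeName>Hall</placeName></p></div>")

idx1 = build_index(doc_v1)
idx2 = build_index(doc_v2)
print("swarthmoor" in idx1["name"])   # True
# The unmodified indexer copes with the new <placeName> element:
print("hall" in idx2["placeName"])    # True
```

A smoke-and-mirrors system, by contrast, would have hard-wired the element
names it expects, and the second document would have broken it.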
> One more question: some of the texts I'm talking about may have Greek,
> Hebrew, and Latin. Should we expect these to be done in Unicode and
> appropriately? Some companies say they will use images for Greek and
> Hebrew rather than text for display.
There's an essential distinction to be made here between representing text
and rendering it. Yes indeed, you should use Unicode representations of your
Greek and Hebrew characters in your documents, otherwise you will have
unnecessary difficulties processing them (and that includes indexation and
retrieval), as well as laying down a legacy of interchange headaches for
those who inherit your data. But when it comes to rendering Greek and Hebrew
on the screen (especially shortish strings embedded in largely Latin-script
documents), it may well be sensible to use a delivery system that sends
graphic images instead of character codes from the underlying documents. It
depends on your intended user base and their equipment levels.
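To illustrate the representation side of that distinction, here is a minimal
Python sketch (the strings are invented examples): once the Greek is stored
as Unicode characters, indexing and retrieval become ordinary string
operations, which would be impossible if the "text" existed only as graphic
images.

```python
# A minimal sketch of why Unicode representation matters for processing,
# whatever you later decide about rendering. The strings are illustrative.
import unicodedata

text = "ἐν ἀρχῇ ἦν ὁ λόγος"   # Greek stored as characters, not pictures

def norm(s):
    # The "same" accented character can arrive precomposed or as a base
    # letter plus a combining accent; normalizing (here to NFC) before
    # comparison makes the two encodings match.
    return unicodedata.normalize("NFC", s)

# A query that arrives in decomposed (NFD) form still matches the text:
query = unicodedata.normalize("NFD", "λόγος")
print(norm(query) in norm(text))                  # True
# Case-insensitive Greek matching also comes for free:
print("ΛΌΓΟΣ".casefold() == "λόγος".casefold())   # True
```

None of this is available to a system whose Greek lives only in GIFs.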
The online OED is an interesting example. The OED entries require a couple
of hundred different glyphs to render non-Latin characters (or Latin
characters with diacritics). OUP decided to commission a set of graphic
images for all of these glyphs and to send them out to Web clients embedded
in the text stream. So every accented Latin letter, and every non-Latin
character, that you see in an on-line OED entry is a graphic. As a result,
no special fonts are required to view the entries
exactly as intended on any machine. This has been done so skillfully that
many users are unaware that they are reading a mixture of text and graphics,
and the mixture in fact only becomes visible if users change their default
text size (perhaps to compensate for impaired vision): the true text is
resized, but the graphics retain their fixed dimensions, so the page can
look slightly odd to anyone who doesn't understand what is happening.
On the other hand, things have moved on since the on-line OED was designed,
and you may be confident that your users will all already have, or be able
easily to obtain, the necessary fonts to render all your text as text,
without any substituted graphics being needed. But in any case, to
substitute or not to substitute graphics is an issue that applies only to
the delivery system, and will not impinge on your actual encoding or editing
practices (though it may arise in a different form when you are considering
how your encoders and editors will input and review non-Latin text).