Perl is probably the best freely available tool for doing such
transformations, subject to the limitation that if you use it without a
real SGML parser as a front end you need to have very clean SGML (not
just valid, but eschewing many optional features).
For example, it helps immensely to stick to the TEI subset documented in
P3. Otherwise, your string matches will produce unexpected results,
because "<...>" is not always a start-tag, nor do all start-tags look
like "<...>". Among the things to watch for are:

CDATA/RCDATA marked sections:
    <![ CDATA [ tags are written with pointy brackets, like "<P>". ]]>
    (I think TEI doesn't have any CDATA *elements*, so those won't be a
    problem.)

ignored marked sections:
    <![ IGNORE [ probably this should be omitted from the generated HTML ]]>

entity references with imbalanced markup:
    <P>some text &foo;
    where the entity foo contains the corresponding "</P>"

minimization of all kinds:
    a string match won't notice omitted tags, and will have an
    interesting time identifying abbreviated ones like <>, </>,
    <p<q<r/, and other forms.

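
To make the marked-section pitfalls concrete, here is a small Python sketch (not Perl, and not taken from any real converter). The pattern names, the helper functions, and the sample text are all made up for illustration. It shows a naive start-tag match counting tags that sit inside CDATA and IGNORE marked sections, and a crude workaround that blanks those sections out first. The workaround is itself only a string match: it assumes the keyword is written literally (not via a parameter entity) and that marked sections are not nested.

```python
import re

# Naive idea of a start-tag: "<", a name, anything up to the next ">".
NAIVE_START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")

# Crude marked-section matcher (assumption: literal keyword, no nesting).
MARKED_SECTION = re.compile(
    r"<!\[\s*(CDATA|RCDATA|IGNORE)\s*\[.*?\]\]>",
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical sample: one real paragraph, plus text that only looks like markup.
sample = (
    '<P>Real paragraph.</P>\n'
    '<![ CDATA [ tags are written with pointy brackets, like "<P>". ]]>\n'
    '<![ IGNORE [ <NOTE>draft only</NOTE> ]]>\n'
)

def naive_start_tags(text):
    """Return the names of everything the naive pattern takes for a start-tag."""
    return [m.group(1) for m in NAIVE_START_TAG.finditer(text)]

def masked_start_tags(text):
    """Same match, but with CDATA/RCDATA/IGNORE marked sections removed first."""
    return naive_start_tags(MARKED_SECTION.sub("", text))

print(naive_start_tags(sample))   # also reports the "<P>" and "<NOTE>" that are not markup
print(masked_start_tags(sample))  # reports only the real start-tag
```

Note that simply deleting a CDATA section also throws away text that should appear in the output; a usable converter would have to keep that text and escape it, which is one more step toward writing a parser.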
To do a conversion that's not confused by these and many similar pitfalls,
the converter must contain an SGML parser. But if your data is very
well-behaved you may do just fine with regex matching, etc.
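
For the well-behaved case, the regex approach can be sketched as follows, again in Python rather than Perl. The tag mapping is purely an illustrative assumption, not a recommended TEI-to-HTML mapping, and the function name is hypothetical. The sketch assumes fully normalized markup: every tag written out in full, no marked sections, no markup hidden in entities, and no ">" inside attribute values.

```python
import re

# Hypothetical mapping from TEI element names to HTML ones (illustration only).
TAG_MAP = {"p": "p", "head": "h2", "hi": "em", "list": "ul", "item": "li"}

# A start- or end-tag with an explicit name and optional attributes.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)(?:\s[^<>]*)?>")

def tei_to_html(text):
    """Rename mapped tags, drop attributes, and delete unmapped tags.

    The content of unmapped elements is kept; only their tags vanish.
    """
    def convert(match):
        slash = match.group(1)
        name = match.group(2).lower()  # SGML names are normally case-insensitive
        html_name = TAG_MAP.get(name)
        if html_name is None:
            return ""
        return "<%s%s>" % (slash, html_name)
    return TAG.sub(convert, text)

print(tei_to_html('<div1><head>Intro</head><p>Some <hi rend="it">text</hi>.</p></div1>'))
```

Anything outside those assumptions (an omitted end-tag, an empty tag like <>, a marked section) passes through this untouched or mangled, which is exactly why the parser is needed for less tidy data.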
That said, it's time to take my TEI hat off and put on my commercial one,
and mention that EBT's DynaWeb product does all this stuff. It keeps a fully
indexed representation of the actual SGML structure, and can serve out whole
documents, outlines, subtrees, or other useful parts on demand, converting
each from SGML to HTML of one or more flavors on the fly. Because the server
has full access to the SGML, users can search it with full generality even
though all their client can receive is HTML. DynaWeb is available under
our Academic Grant Program for free to qualified non-profit organizations
who want to do interesting stuff with SGML.
You can see it running in a number of places; pointers to more can be found
from our home page at www.ebt.com, but one good example is www.novell.com,
which put up over 100,000 pages of SGML documentation on the Web in about a
week (most of that time was spent converting the graphics to GIF from
whatever they'd been using; the documents themselves didn't change). For a
more TEI-specific case take a look at the Oxford Text Archive texts
available at www.ebt.com (thanks Lou B!) -- I'll apologize in advance for
the unpolished front-matter layout; I spent only a very little while on
mapping the TEI header down to HTML.
Sr. System Architect, EBT
(but also long active in TEI)