David Megginson suggests using sed or awk or ports thereof to translate
into and out of TEI. While this has its commendable sides, there are a
few trouble spots that are quite annoying even outside of TEI needs:
(1) The length of lines is restricted in both sed and awk.
(2) Both sed and awk operate on lines, which makes some parts of SGML
very difficult to describe and to handle efficiently and correctly.
(3) Neither handles 8-bit data very cleanly, be it binary or 8-bit text.
(4) Neither handles arbitrary binary data with context-sensitive meaning,
such as is found in many proprietary text representation systems.
(5) Both sed and awk are easy to use for simple tasks, but complex
problems get exponentially more complex to solve with sed, less so
with awk.
(6) Both sed and awk are regular expression based. Regexps are powerful
but get complex once you leave the character orientation they have.
SGML is not character-oriented, but token-oriented, and uses regular
expressions on tokens in its syntax. This can get arbitrarily
complex to represent in a character-based regular expression engine.
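To make points (2) and (6) concrete, here is a minimal sketch in Python
rather than sed or awk. The sample text, the function names, and the
simplified tag pattern are all assumptions made for illustration; this
is not a real SGML tokenizer and it ignores comments, marked sections,
entity references, and '>' inside attribute values.

```python
import re

# Hypothetical sample: the start tag is broken across two lines,
# which SGML permits between the element name and its attributes.
SAMPLE = '<p>See <ref\n   target="ch2">chapter two</ref>.</p>\n'

# A naive "anything between angle brackets" pattern.
TAG = re.compile(r"<[^<>]+>")


def tags_by_line(text):
    """Match tags one line at a time, roughly as a one-line sed or awk rule would."""
    found = []
    for line in text.splitlines():
        found.extend(TAG.findall(line))
    return found


# One alternative per token kind: end tag, start tag, character data.
# Simplifying assumption: attribute values never contain '<' or '>'.
TOKEN = re.compile(
    r"</(?P<end>[A-Za-z][\w.-]*)\s*>"
    r"|<(?P<start>[A-Za-z][\w.-]*)(?P<attrs>[^<>]*)>"
    r"|(?P<data>[^<]+)"
)


def tokens(text):
    """Split the whole text into (kind, value) tokens, ignoring line boundaries."""
    out = []
    for m in TOKEN.finditer(text):
        if m.group("end"):
            out.append(("END", m.group("end")))
        elif m.group("start"):
            out.append(("START", m.group("start")))
        else:
            out.append(("DATA", m.group("data")))
    return out


if __name__ == "__main__":
    print(tags_by_line(SAMPLE))  # the <ref ...> start tag is missing here
    print(tokens(SAMPLE))        # the start tag shows up as one token
```

The line-at-a-time version never sees the start tag for ref, because no
single line contains both its '<' and its '>'. The token version finds
it only because it looks at the text as a stream, and even then it
needs a separate alternative for every kind of token, which is where
the complexity starts to pile up.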
This is not to deride the value of awk or sed. I use awk to process
(not validate) simple SGML documents such as invoices and business
letters. I even used awk and sed to format and print an SGML document,
from SGML input to laser printer driving code output. It can be done,
but it usually requires multiple steps of sed and awk, and care must
be taken to "layer" the operations correctly so you handle everything.
Intermediate steps have to be designed. It's often easier to write up
something which builds on an SGML parser. There are a few SGML parsers
in the public domain, as well. NIST comes to mind.
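As a rough sketch of what "building on a parser" looks like, the
following Python uses the standard library's html.parser module as a
stand-in. This is an assumption made for illustration only:
html.parser is an HTML tokenizer, not an SGML parser, and it neither
reads a DTD nor validates. The element names in the sample and the
Outline class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical business-letter markup; note the start tag split across lines.
SAMPLE_DOC = (
    "<letter><date>today</date>\n"
    '<body\n  lang="en"><p>Dear Sir,</p></body></letter>'
)


class Outline(HTMLParser):
    """Collect an indented outline of element names from parser events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # The parser has already dealt with line breaks and attributes.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def outline(text):
    """Feed the text to the parser and return the collected outline lines."""
    parser = Outline()
    parser.feed(text)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    print("\n".join(outline(SAMPLE_DOC)))
```

The application code only reacts to start-tag and end-tag events, so
there are no intermediate files and no layering of passes to design.
A real SGML parser would deliver comparable events while also handling
the DTD, omitted tags, and entities.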
Apropos of computer representations of text, I got a
chance to air my frustration with Macs today when talking to a graphic
designer and a typographer. They were so happy someone in the computer
business knew about typography and knew it was an art form you must
learn to master, not something which could spring out of a computer as
if it were instant knowledge. I got to plug SGML, telling them that
computer people could work with information, as they know, and the
typographers could work with the presentation, as they know, stressing
that each requires special knowledge, and that they could meet in a
language designed to separate the two. I think I got two new friends.
Naggum Software, Oslo, Norway