> >if this is ... the proposed solution, i.e., 7-bit encodings, then SGML
> >will fail miserably: it must unequivocally demand 8-bit encodings
> >AKA ISO2022 at a minimum.
( ... )
> ... if SGML/TEI limits one to 7-bit encodings, then I'm going to unsubscribe
>from this list and forget about TEI and SGML.
A couple of postings have raised the issue of character encodings in TEI-SGML.
At the risk of putting words in the developers' mouths, I conjecture that it
originates in an unfortunate presentation causing a misunderstanding of the
TEI standards. In a posting of 1 October, Michael Sperberg-McQueen writes:
> The TEI recommendations for interchange of texts require conforming
> texts to contain only a subset of the characters included in ASCII.
>
> And therefore a TEI text is indeed ASCII-only, ( ... )
Since "ASCII" is commonly understood as a certain mapping between glyphs and
7-bit patterns, it is not surprising that this or similar statements appear
to raise the character representation issue. I understand the use of "ASCII"
for TEI purposes to be a little different, however: it is shorthand for a
character set made up of abstract glyphs, chosen so that, among other properties:
+ All have recognized graphic representations, and graphics for all are
present on all commonly used computer printers and terminals. (I gather
that a special case is the line-end, which has no graphic but is marked
by the change to a new printer or terminal line.)
+ All have recognized representations in 7-bit ASCII, and all commonly used
computer printers and terminals denoted as "ASCII devices" produce the
standard graphic from the 7-bit ASCII representation of each glyph
(this is the main sense in which they are "ASCII" characters).
+ All have recognized representations in EBCDIC.
The standard, I take it, requires that the text be entirely represented in
these abstract glyphs, NOT that it be in the coding convention "ASCII" --
otherwise, translation of a TEI-conforming text to EBCDIC would make it
non-conforming, which is elsewhere stated not to be the case. (Actually, a
printed representation would then be technically non-conforming, as the
glyphs would be represented by graphics rather than by ASCII bit patterns.)
The issue of 7-, 8-, 16-bit or other representation in TEI is then, I think,
no issue at all. I understand the TEI to be specifying NO representation,
but only the use of a certain set of abstract glyphs.
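To make the distinction concrete, here is a rough Python sketch of what I
take the requirement to be. The glyph list below is a hypothetical stand-in
(not the actual TEI list), and "cp037" is just one EBCDIC code page chosen
for illustration. The point is only that the same abstract text passes the
check whichever coding it arrives in.

```python
import string

# Hypothetical stand-in for the restricted glyph set; the actual TEI
# list is not reproduced here.
ALLOWED_GLYPHS = set(string.ascii_letters + string.digits + " \n.,;:!?'\"()-&<>/=")

def uses_only_allowed_glyphs(text):
    """True if every abstract character in text is in the allowed set."""
    return all(ch in ALLOWED_GLYPHS for ch in text)

def decode_and_check(raw, encoding):
    """Decode a byte string from the given coding, then check its glyphs."""
    return uses_only_allowed_glyphs(raw.decode(encoding))

sample = "<p>A TEI text, in any coding.</p>\n"
as_ascii = sample.encode("ascii")    # 7-bit ASCII bit patterns
as_ebcdic = sample.encode("cp037")   # one EBCDIC code page (an assumption)

# The bit patterns differ, but the abstract glyphs are the same.
print(as_ascii != as_ebcdic)
print(decode_and_check(as_ascii, "ascii"), decode_and_check(as_ebcdic, "cp037"))
```

The check is applied after decoding, so it says nothing about 7 or 8 bits;
it only asks whether each character belongs to the agreed glyph set.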
However, while I disagree that the issue of character representations
exists, there are clearly substantive issues behind it. I see two major
ones, but am now risking putting words into the mouths of TEI's questioners
as well as its developers, and welcome clarification by both.
First, the set of graphics in the TEI standard set is inadequate for even
the major European languages based on the Roman alphabet, let alone the
rest of the world's languages. I take it that the TEI developers have
recognized this from the start, and that SGML conventions are included
in the TEI standard so that expanded glyph sets as well as markup may be
represented in TEI-conforming files.
Second, even if the TEI glyph set and SGML conventions allow all text to
be represented, the small TEI glyph set makes such representation needlessly
clumsy compared to representations using larger glyph sets ("8-bit" and
"16-bit"). This is a real issue, and won't go away. I would personally
support the TEI standard using a restricted glyph set, since conforming
files may be displayed and used (though not conveniently nor as originally
intended) on any computer equipment likely to be available. However, clearly
the restricted representation is often clumsy, and the software to convert
it to a fuller representation on equipment with expanded character sets
may be so as well. My expectation is the development of a set of TEI
standards using a variety of glyph sets, with the properties that:
+ All markup, including representation of glyphs not in the sets, uses
a standard set of glyphs that are part of ALL the glyph sets.
+ All glyphs in ANY of the sets have representations determined by the
TEI standard, using only the above shared set of glyphs. (Therefore,
text represented in any TEI standard is representable in any other
standard, and automatic translation from any standard to any other
is straightforward.)
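The two properties above can be sketched in Python as follows. This is an
illustration only: the table is a small hypothetical sample, the
"&name;" notation follows the SGML entity-reference convention, and the
function names to_shared and from_shared are my own, not anything defined
by the TEI. The shared glyph set is assumed here to be the 7-bit ASCII
repertoire.

```python
import re

# Illustrative table only: glyphs outside the shared set, mapped to names
# spelled entirely with shared glyphs. "&" itself is included so that a
# literal ampersand survives the round trip.
ENTITY_NAMES = {
    "&": "amp",
    "\u00e9": "eacute",   # e with acute accent
    "\u00fc": "uuml",     # u with diaeresis
    "\u00e7": "ccedil",   # c with cedilla
    "\u00f1": "ntilde",   # n with tilde
}
GLYPH_FOR_NAME = {name: glyph for glyph, name in ENTITY_NAMES.items()}

def to_shared(text):
    """Rewrite text using only glyphs from the shared (ASCII) set."""
    out = []
    for ch in text:
        if ch in ENTITY_NAMES:
            out.append("&" + ENTITY_NAMES[ch] + ";")
        elif ord(ch) < 128:
            out.append(ch)
        else:
            raise ValueError("no shared-set representation for %r" % ch)
    return "".join(out)

def from_shared(text):
    """Expand known entity references back into the larger glyph set."""
    def expand(match):
        # Unknown names are left untouched rather than guessed at.
        return GLYPH_FOR_NAME.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z0-9]+);", expand, text)

original = "caf\u00e9 & gar\u00e7on"
shared = to_shared(original)
print(shared)
print(from_shared(shared) == original)
```

Translation between any two of the expanded standards then amounts to
to_shared under one table followed by from_shared under the other, provided
both tables agree on the names.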
Standards will then be set up using the glyph sets of such 8-bit and
16-bit coding schemes as become sufficiently established to make the
effort worthwhile. The broad utility of such expanded-set standards,
however, will depend on there being graphic and machine-readable
representations of all glyphs in the standard that are as widely
recognized as the "7-bit ASCII" standard is today.