> From: "David G. Durand (David G. Durand)" <[log in to unmask]>
> Message-Id: <"alfie.uib..359:16.01.95.20.56.37"@uib.no>
> In general this is impossible. *start SGML gripe* Sgmls gives only
> information in the ESIS, which hides certain critical pieces of
> information: non-system entity references are expanded out. Tags with
> an EMPTY content model are represented as empty start/end pairs, without
> annotation that they were empty. Including the end-tag would, of course,
> not be legal SGML.
> You'd also lose entity declarations in the DTD subset.
> It's a shame that ESIS, which has no official status, is so canonical in
> practice. The DSSSL document model is much better, but -- no public
> implementations as far as I am aware....
ESIS was developed for, and is part of, the Conformance Testing
standard which has been an ANSI standard for several years and became
ISO 13673 last year. Its purpose in life is *not* to provide a
normalized form of SGML or any sort of parser output that would be
necessarily useful for further SGML-based applications; it is to
provide a form of parser output intended to allow conformance testing,
and all conforming parsers are required to be able to emit this
specific form of output so that they can be tested for conformance.
So ESIS does have formal status, and its use is "canonical" precisely
insofar as parsers must emit it, but no one but a conformance testing
application should expect to make productive use of it.
DSSSL is very new--the final IS doesn't yet exist--and relatively
complex even considering only simpler subsets. There are no implementations
yet, public or otherwise--how could one expect otherwise? The DSSSL
model for an SGML document was developed over *years* of DSSSL committee
work (in which I took part). I'm glad you think the model is good, though
in some ways it had to make arbitrary decisions too. We defined what we
called ESIS+ to include lots of things SGML applications like editors
might want, but we still omitted parts of MSIS like most markup minimization,
ignored characters, etc. I'm sure someday someone will gripe that they
want to be able to capture parts of MSIS that are not included in DSSSL's
definition of ESIS+.
> *end gripe, start semi-productive workarounds*
> On the bright side:
> You can special-case EMPTY elements at the CoST of a special script for
> each DTD (or set of TEI modifications) involving empty elements.
> You can map all your text-entities foo to the string "&foo;" with
> special entity declarations. (Or let SGML pass you the entity name as
> implied, and slap delimiters around it. I don't remember the details on
> this, so you'd have to check Goldfarb. I'm not even sure that "implied
> entity value" is the correct term).
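One possible reading of the first suggestion above, offered only as a rough sketch: declare each text entity as a CDATA entity whose text is its own reference, so that the parser hands back "&foo;" as data instead of the expansion. The helper name and the entity names below are made up for illustration, and whether a given parser reports CDATA entity text as plain data is an assumption to check against your tool.

```python
def passthrough_entity_decls(names):
    """Build SGML declarations that map each entity name to its own reference.

    Assumption: the entity text of a CDATA entity is not re-parsed, so the
    literal "&name;" survives as data. The names passed in are hypothetical.
    """
    return "\n".join('<!ENTITY %s CDATA "&%s;">' % (n, n) for n in names)


# Example with two made-up entity names.
print(passthrough_entity_decls(["eacute", "mdash"]))
```

The generated declarations would have to take precedence over the DTD's own definitions, for instance by placing them in the document's declaration subset.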
It is interesting to note that SGML-aware editors (such as the tools
available from SoftQuad, ArborText, and others) must solve exactly the
problems you are considering. For example, entity
structure, marked section structure, and probably comments should not
be lost and attribute default values should not be inserted. It is not
a trivial problem to combine valid 8879 parsing with capturing such
important non-ESIS information. It is also not well-defined just what
it means to normalize an SGML file in this way. If you like what a given
tool (e.g., SGML Editor) does by way of "normalization," I suggest you
make use of it. Otherwise, if you have a relatively simple definition
of what normalization means to you and you like to program and you have
an SGML parser, you can try to write something; but you should realize
that it isn't necessarily a well-defined, simple task.
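To give a feel for what "write something" might look like, here is a minimal sketch that turns sgmls-style ESIS lines back into tagged text. It assumes the usual line codes ("(" start-tag, ")" end-tag, "A" attribute, "-" data), handles only a few of the data escapes, and relies on a hand-maintained EMPTY_ELEMENTS set whose contents are invented for the example. It is one person's simple definition of normalization, not a general answer.

```python
# Hypothetical list of elements declared EMPTY in the DTD in use.
# ESIS does not flag them, so the list must be kept by hand per DTD.
EMPTY_ELEMENTS = {"PB", "LB"}


def unescape(data):
    """Decode some backslash escapes from sgmls data lines (octal ones are left alone)."""
    out = []
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if pair == "\\n":      # record end
            out.append("\n")
            i += 2
        elif pair == "\\\\":   # literal backslash
            out.append("\\")
            i += 2
        elif pair == "\\|":    # SDATA bracket, dropped here
            i += 2
        else:
            out.append(data[i])
            i += 1
    return "".join(out)


def escape(text):
    """Replace markup-significant characters with character references."""
    return text.replace("&", "&#38;").replace("<", "&#60;")


def esis_to_sgml(lines):
    """Rebuild tagged text from ESIS lines; unknown line codes are ignored."""
    out = []
    pending = []  # attributes seen since the last start-tag
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            name, _, tail = rest.partition(" ")
            atype, _, value = tail.partition(" ")
            if atype != "IMPLIED":
                pending.append((name, unescape(value)))
        elif code == "(":
            attrs = "".join(
                ' %s="%s"' % (n, escape(v).replace('"', "&#34;"))
                for n, v in pending
            )
            out.append("<%s%s>" % (rest, attrs))
            pending = []
        elif code == ")":
            # Suppress the end-tag for elements known to be EMPTY.
            if rest not in EMPTY_ELEMENTS:
                out.append("</%s>" % rest)
        elif code == "-":
            out.append(escape(unescape(rest)))
    return "".join(out)


sample = ["AN TOKEN 1", "(DIV", "(P", "-a<b\\nc", "(PB", ")PB", ")P", ")DIV", "C"]
print(esis_to_sgml(sample))
```

Note what the sketch cannot do: every attribute value the parser reports is written out, so defaulted values end up explicit, and entity structure, marked sections, and comments are already gone by the time the ESIS lines arrive.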
VP Research, ArborText, Inc.
Chief Technical Officer, SGML Open