----------------------------Original message----------------------------
Glenn,
 
I need to clarify a point.  You write:
 
|   On the other hand, if one were to use the full names of 10646, a
|   file may be quite unwieldy in its size due to the enormous expansion
|   required to convert non-ISO646 character references to entity names.
 
This is not what I suggested, and Keld has argued against this, from his
point of view, too.
 
What I suggest is that we _describe_ a character by means of its unique
name, as in the character set declaration I posted.  Here, character
number 248 when G1 is invoked into the right half is "LATIN SMALL LETTER
O WITH STROKE".  If we don't have ISO Latin 1 available to an SGML
application, we can use an entity reference, and the name is immaterial
if we define it in terms of the full name:
 
	<!ENTITY oe SDATA "LATIN SMALL LETTER O WITH STROKE">
 
After this declaration, we can use the entity reference "&oe;" to access
the character.  Using Keld's example, we wouldn't write:
 
	Keld J&LATIN-SMALL-LETTER-O-WITH-STROKE;rn Simonsen
but
	Keld J&oe;rn Simonsen
 
The SGML parser will resolve this reference for us, and if the
application has defined a display version of the entity set, for
instance as in
 
	<!ENTITY oe SDATA "&#248">
 
and has Latin 1 capability in its display engine, the character will
come out right.
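
The resolution step described above can be sketched in Python.  This is only an illustration, not how an SGML parser is actually built: the function name `resolve_entities`, the simplified name pattern, and the dictionary standing in for a display entity set are all assumptions made for the example.

```python
import re

# Simplified pattern for a general entity reference such as "&oe;".
# (Assumption: names are letters, digits, "." and "-", and every
# reference is closed with ";".)
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def resolve_entities(text: str, display_set: dict) -> str:
    """Replace each known entity reference with its display text."""
    def replace(match):
        name = match.group(1)
        # Unknown references are left untouched rather than guessed at.
        return display_set.get(name, match.group(0))
    return ENTITY_REF.sub(replace, text)


# Hypothetical display version of the entity set for a Latin 1 display:
# character number 248 is LATIN SMALL LETTER O WITH STROKE.
latin1_display = {"oe": chr(248)}
resolved = resolve_entities("Keld J&oe;rn Simonsen", latin1_display)
```

The document itself only ever contains the short reference "&oe;"; swapping in a different display set changes the output without touching the text.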
 
It's crucial to understand the difference between the _definitional_ and
the _display_ version of character entity sets.  I'm addressing the
problem of using unique names to bind characters to entity definitions.
 
Having done this, and having an application character set, or code
sequences to accomplish a given glyph on the display device, it's
trivial to produce a mapping by name lookup.  E.g., if  TeX is used as
the processing back-end:
 
Definition:
	<!ENTITY oe SDATA "LATIN SMALL LETTER O WITH STROKE">
 
Local mapping:
	"LATIN SMALL LETTER O WITH STROKE" = "{\o}"
 
Produces a display version:
	<!ENTITY oe SDATA "{\o}">
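
As a rough sketch of the name lookup, the following Python derives a display version from a definitional declaration.  It assumes that Python's `unicodedata` module, whose character names follow Unicode and are aligned with ISO 10646, can stand in for the local mapping table, and that the display device wants numeric character references.  The function name `display_version` and the regular expression are hypothetical helpers for the example.

```python
import re
import unicodedata

# Matches declarations of the form used in this message:
#   <!ENTITY name SDATA "CHARACTER NAME">
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+SDATA\s+"([^"]*)"\s*>')


def display_version(definition: str) -> str:
    """Rewrite definitional declarations into display declarations."""
    def to_display(match):
        entity_name, char_name = match.group(1), match.group(2)
        try:
            # Name lookup: unique character name -> character.
            char = unicodedata.lookup(char_name)
        except KeyError:
            # No mapping for this name: keep the definitional form.
            return match.group(0)
        # Emitted with an explicit ";" reference close.
        return f'<!ENTITY {entity_name} SDATA "&#{ord(char)};">'
    return ENTITY_DECL.sub(to_display, definition)


definition = '<!ENTITY oe SDATA "LATIN SMALL LETTER O WITH STROKE">'
display = display_version(definition)
```

A TeX back-end would use a table of code sequences in place of `unicodedata.lookup`; the lookup key, the unique character name, stays the same.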
 
We can also use Keld's mnemonic encoding (as long as we stick to ISO
Latin 1, it's a good idea, and well done):
 
	<!ENTITY oe SDATA "<o/>">
 
|   Personally, I think folks should be thinking about concrete syntaxes
|   whose baseset is ISO10646, rather building systems based on the
|   reference concrete syntax.  Of course these two concrete syntaxes
|   are isomorphic by means of entity referencing.  But we should really
|   be building full 10646 syntaxes.  Document transfer can easily be
|   accomplished by means of appropriate transformation methods.
 
This is what I'm doing, already.  However, believing that text entry
systems will be ISO 10646 compliant within the next few billion dollars
of software sales is a pipe dream.  Therefore, we need a "character set
manager" which can read any character data stream, compliant with ISO
2022 or IBM CDRA, or whatever, and let the parser see it as pure and
undiluted ISO 10646.  Passing to the application, we need to invoke the
character set manager once again to convert the internal representation
(ISO 10646) to whatever the application will understand.
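
A minimal sketch of such a character set manager, in Python, might look like the following.  It rests on several assumptions: Python's `str` type stands in for the internal ISO 10646 representation, the standard codecs stand in for the stream decoders (`cp037` as one IBM code page, `latin-1` for ISO 8859-1), and the function names `to_internal` and `to_application` are invented for the example.

```python
def to_internal(data: bytes, source_encoding: str) -> str:
    """Decode an incoming byte stream into the internal representation."""
    return data.decode(source_encoding)


def to_application(text: str, target_encoding: str) -> bytes:
    """Encode the internal representation for the application.

    Characters the target cannot represent fall back to numeric
    character references instead of being dropped.
    """
    return text.encode(target_encoding, errors="xmlcharrefreplace")


name = "Keld J" + chr(248) + "rn Simonsen"

# The same text arriving in two different encodings...
from_ibm = to_internal(name.encode("cp037"), "cp037")
from_latin1 = to_internal(name.encode("latin-1"), "latin-1")

# ...and leaving for an application that understands only ASCII.
for_ascii_app = to_application(from_ibm, "ascii")
```

The parser sits between the two calls and never sees anything but the internal form, whatever encoding the data arrived in.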
 
SGML already supports understanding a document character set based on a
syntax reference character set, but it's not powerful enough to describe
a data stream encoding.  (That's why the TEI needs a "writing system
declaration", for instance.)  I think this should be handled not by the
application, but by a general utility placed between the entity manager
(itself a general utility) and the parser.  Thus, the parser will use
ISO 10646 as its document _and_ syntax reference character set.
 
SGML cannot, however, communicate very well with the application on the
application's terms when it comes to character sets, and the ESIS as
defined does not include _any_ information about character sets (boo!
hiss!), so something needs to be done in this area, too.
 
It's my intention to bring this kind of change about in SGML II, the
sequel.
 
I hope I have managed to clarify my design.
 
Best regards,
</Erik>
--
Erik Naggum             |  ISO  8879 SGML     |      +47 295 0313
                        |  ISO 10744 HyTime   |
<[log in to unmask]>        |  ISO 10646 UCS      |      Memento, terrigena.
<[log in to unmask]>       |  ISO  9899 C        |      Memento, vita brevis.