----------------------------Original message----------------------------
 
   Date: Sat, 17 Oct 1992 04:08:33 +0100
   From: Keld Jørn Simonsen <[log in to unmask]>
 
   in level 3 you are allowed to code the LATIN CAPITAL LETTER A
   and a COMBINING ACUTE, but they do not constitute the letter
						     ^^^^^^^^^^
   LATIN CAPITAL LETTER A WITH ACUTE, this is also explicitly stated
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   in the standard.
 
And this is where you are incorrect.  The standard *absolutely does not*
state this.  This was never agreed to in Seoul.  I ask you to show me
where what you claim was agreed to.  Perhaps you are operating under the
mistaken assumption that the Danish comments asking for such a restriction
were adopted.  They were not.
 
   Thus there is no way to encode the character
   LATIN CAPITAL LETTER A WITH ACUTE as two characters according to
   the approved 10646 standard.
 
You are correct, two characters (code elements) of 10646 do not form
a character (code element) of 10646; however, two characters (code elements)
*may* encode a letter (text element) of any writing system that desires
to encode it thus.
 
You continue to conflate the term "character" with "letter."  In 10646
terms, a character is *merely* an element of a character set, and has
no necessary relation to letterhood.
 
You are correct that Levels 1 & 2 do not allow composing "letters" or
"natural characters" in this fashion; however, level 3 allows it at will.
The latter is what Unicode sanctions.
 
   And there are good reasons why ISO chose to specify it in this way.
   Allowing more encodings for the same character would have introduced
   a very complex and costly need for programming, for example when
   testing for equality of two strings, a big database specifying
   all the equalities has to be available, instead of just the
   byte for byte equality needed with the present standard.
   And this big specification of equality has not been specified
   precisely anywhere, not even in previous UNICODE standards.
 
Full Unicode systems *will* have to address this issue, as will any
level 3 implementation of 10646.  Such databases cannot be avoided in
level 3 (you imply they can be).
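
To make the equality problem concrete, here is a minimal sketch in Python.
It assumes the normalization tables in Python's standard `unicodedata`
module stand in for the "database of equalities"; those tables come from
later Unicode work and are not something 10646 itself specifies.  The
helper name `same_text` is hypothetical.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings by text element rather than by code element."""
    # NFC folds a base letter plus combining mark into the precomposed
    # code element wherever one exists, so both spellings compare equal.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00C1"   # LATIN CAPITAL LETTER A WITH ACUTE
combining = "A\u0301"    # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

print(precomposed == combining)            # False: code elements differ
print(same_text(precomposed, combining))   # True: same text element
```

The plain `==` is the "byte for byte" comparison; the second comparison is
the one a level 3 system needs, and it cannot be done without some table
of equivalences.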
 
   >    So you only have to have getc() look at more than one code element,
   >    and you only have to test for one value when you look for this
   >    character, namely the accented character coded as one code element,
   >    and not for the combined two-code entity.
   >
   > Most programs operate on text elements, which, in pre-10646 days
   > corresponded to code elements.
 
   I find this statement a bit out of reality. Most programs today
   operate on characters, I would only assume a few UNICODE programs to
   work on text elements.
 
No, most programs today work on code elements that just happen to (mostly)
correspond to text elements; full Unicode and level 3 10646 systems will
have to deal with the full generality of the code element != text element
equation.  Given that Microsoft NT, Apple Quickdraw GX, and other systems
are being built on Unicode (10646 level 3), I disagree that "few UNICODE
programs will work on text elements"; rather, I expect that the most popular
systems will over the course of the next few years attain a much higher
state of sophistication regarding text, one in which the abstraction of
text element over code element is as easy as current (limited) character
and text abstractions.  If older systems wish to limit themselves to level 1
or 2, then they can safely ignore this issue.  [Indeed, if you look closely
at the next paragraph which was in my previous message, you will see a
technique that even allows such systems to operate with level 3 10646
data.  Erik picked up on this quickly, I might add.]
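
The gap between code elements and text elements can be illustrated with a
short Python sketch.  It assumes a deliberately simplified rule (a text
element is one base code element plus any combining marks that follow it)
and uses the standard `unicodedata.combining()` lookup; real writing
systems need richer rules, and the function name `text_elements` is
hypothetical.

```python
import unicodedata

def text_elements(s: str) -> list:
    """Group code elements into text elements (simplified rule)."""
    elements = []
    for ch in s:
        # A nonzero combining class marks a combining code element,
        # which attaches to the text element already being built.
        if elements and unicodedata.combining(ch) != 0:
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements

word = "A\u0301rbol"              # A + COMBINING ACUTE ACCENT, then r b o l
print(len(word))                  # 6 code elements
print(len(text_elements(word)))   # 5 text elements
```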
 
   > getc() was designed in a context where
   > a code element could be equated with a text element.  With 10646
   > this situation has changed.  If an implementation desires to impose
   > text element/code element equivalence, then it must be prepared to
   > translate text elements which are spelled out by multiple code elements
   > into single code elements to be returned by getc().
 
   I believe this is some specifications coming from the previous version
   of UNICODE and not in line with the 10646 standard.
 
No, this is not coming from a previous version of Unicode.  It is coming from
the current version of Unicode and 10646.  I would suggest that you talk with
the 10646 editor (or WG2 convenor) if you aren't currently up on where things
stand.
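
The translation step described above, where a getc()-style reader folds a
multi-element spelling into a single code element, might look like the
following Python sketch.  The class `ComposingReader` is hypothetical, and
it assumes NFC composition from the standard `unicodedata` module as the
folding table.  When no precomposed code element exists, the result is
still several code elements long, and a level 1 or 2 system would have to
decide what to do with it.

```python
import io
import unicodedata

class ComposingReader:
    """Return one text element per getc() call, composed where possible."""

    def __init__(self, stream):
        self._stream = stream
        self._pending = ""   # one code element of lookahead

    def getc(self) -> str:
        base = self._pending or self._stream.read(1)
        self._pending = ""
        if base == "":
            return ""        # end of input
        cluster = base
        while True:
            nxt = self._stream.read(1)
            if nxt != "" and unicodedata.combining(nxt) != 0:
                cluster += nxt        # combining mark joins the cluster
            else:
                self._pending = nxt   # keep it for the next call
                break
        # Fold the cluster into a precomposed code element if one exists.
        return unicodedata.normalize("NFC", cluster)

reader = ComposingReader(io.StringIO("A\u0301b"))
first = reader.getc()    # "\u00C1", a single code element
second = reader.getc()   # "b"
third = reader.getc()    # "" at end of input
```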
 
   The 10646 standard has removed the very messy definitions of equivalences
   of combining characters with precomposed characters, which for one purpose
   could be equivalent and for another not equivalent in the previous
   UNICODE standard.
 
I too argued for rather serious revision of the language used in 10646
to describe the notion of code element combination.  Thankfully, it has
been cleaned up.  However, the cleanup does not remove the problem, as you
seem to think.  The issue of equivalence between different spellings of
a letter, text element, or "character" in the naive sense has not
disappeared; it is still present in Level 3 and in Unicode, and that is
what makes it level 3.
 
   The definition there did not live up to normal requirements of unique
   assignments of codes to characters, which is normally the case for
   character sets.
 
Yes, and that is why I argued against it.  It gave the incorrect impression
that a combination of code elements formed a "character" in the sense of
an element of a character set (i.e., a code element), which was patently not
true.  However, and this seems to be a problem for you to understand, the
principle of combining code elements to form "letters," "orthographic units,"
"text elements," or "characters" (in the naive sense), is still present
in Level 3 and Unicode.  These "natural characters," if I might call them
that, won't necessarily satisfy the criteria of character-set-characterhood,
since they won't have a unique codepoint, nor a name of their own.
However, they will still be used by level 3 implementations in order to
encode orthographic units - natural characters - of writing systems, whose
units aren't already present in 10646.  And, I might add, level 3 explicitly
permits any combination of combining marks to be used to create orthographic
units - natural characters - in this fashion.  The issue of "natural
character equivalence" or "orthographic unit equivalence" will not be
specified by 10646; this doesn't mean that other standards can't be created
that do specify specific uses of 10646 under level 3 usage, and define
equivalence in those terms.
 
   I expect this to be changed in the new UNICODE standard which is to
   be aligned with the approved 10646 standard.
 
No, it will not change, as the approved 10646 does not require such a
change.  If you would like to discuss all of this in more detail, I'd
certainly be glad to see you at the upcoming Unicode/10646 Implementor's
Workshop in Sulzbach, Germany.  I will be talking about the distinction
of code elements and text elements in quite a bit of detail in the
introductory tutorial.  Others, particularly Mark Davis, will be discussing
in a fair amount of detail the issues surrounding full implementations of
combining marks in level 3 systems.
 
Regards,
 
Glenn Adams