----------------------------Original message----------------------------
>    The question is answered in the forthcoming 10646 standard: if you want
>    the character LATIN CAPITAL LETTER A WITH ACUTE you can only code it as
>    one code element. The two code elements LATIN CAPITAL LETTER A and
>    COMBINING ACUTE ACCENT do not together constitute a character
>    LATIN CAPITAL LETTER A WITH ACUTE.
>
> I can't figure from this whether Keld is being intentionally misleading
> or whether he isn't aware of the details of 10646.  If by "character,"
> Keld means an element of 10646, then he is correct:  a combination of
> characters in 10646 does not create another "character," that is,
> if by "character" one means an element of 10646.  However, if by
> "character" one means an element of a writing system (or of an alphabet),
> then Keld is quite wrong, since, indeed, one can arbitrarily form a
> "character" in the sense of an element of a writing system by combining
> code elements in 10646.
 
Well, I cannot figure out if Glenn is being intentionally misleading
about the facts of the new 10646. ISO 10646 is a character set
standard. It defines characters, and in this respect 'character' is
a well-defined concept. It does not mean 'an element of a writing
system', and 10646 does not define elements of a writing system.
The notion of 'element of a writing system' is not well defined in ISO
standards, and should thus be used with care as a concept.
 
> So, if I have an alphabet which has the element
> LATIN CAPITAL LETTER A WITH ACUTE, I am completely free to encode this
> as either one or two code elements.  In this sense, LATIN CAPITAL LETTER
> A WITH ACUTE constitutes a text element in the context of some text
> process and writing system.  A user of 10646 is quite free to encode
> such a text element with more than one code element or with alternative
> code element spellings.
 
Well, this is, as far as I know, a misrepresentation of the forthcoming
10646 standard. You are not 'completely free' to encode LATIN CAPITAL
LETTER A WITH ACUTE in ISO 10646, and you cannot code this
letter as two characters according to the standard. At levels 1 and 2
of the standard it is explicitly forbidden to use combining
characters for this purpose. At level 3 you are allowed to code
LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT, but the two
do not constitute the letter LATIN CAPITAL LETTER A WITH ACUTE;
this is also explicitly stated in the standard. Thus there is no
way to encode the character LATIN CAPITAL LETTER A WITH ACUTE as
two characters according to the approved 10646 standard.
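
As a rough illustration of the distinction drawn above, the following
sketch tests whether a string contains any combining characters at all.
It is only an approximation under stated assumptions: it uses Python's
unicodedata module and treats "nonzero canonical combining class" as
the definition of a combining character, which is not necessarily the
list used by the 10646 levels. The function name is hypothetical.

```python
import unicodedata


def uses_combining_characters(text: str) -> bool:
    """Return True if any code point has a nonzero canonical combining class.

    Assumption: this stands in for "contains a combining character";
    the 10646 level definitions may draw the line differently.
    """
    return any(unicodedata.combining(ch) != 0 for ch in text)


precomposed = "\u00C1"    # LATIN CAPITAL LETTER A WITH ACUTE, one code element
two_elements = "A\u0301"  # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

print(uses_combining_characters(precomposed))   # False
print(uses_combining_characters(two_elements))  # True
```

A string for which this returns False contains no combining sequences,
so the accented letter in it can only be the single precomposed code
element.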
 
And there are good reasons why ISO chose to specify it in this way.
Allowing several encodings for the same character would have introduced
a need for very complex and costly programming. For example, when
testing two strings for equality, a big database specifying
all the equivalences would have to be available, instead of just the
byte-for-byte comparison needed with the present standard.
And this big specification of equality has not been specified
precisely anywhere, not even in previous UNICODE standards.
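
The cost difference between the two kinds of equality test can be
sketched as follows. This is an illustration only: it relies on the
normalization tables shipped with Python's unicodedata module (form
NFC), a later Unicode mechanism that is not part of the 10646 text
discussed here. Both function names are hypothetical.

```python
import unicodedata


def code_equal(a: str, b: str) -> bool:
    """Plain code-element-by-code-element comparison; needs no tables."""
    return a == b


def equivalent(a: str, b: str) -> bool:
    """Comparison that consults an equivalence table (here: NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00C1"    # LATIN CAPITAL LETTER A WITH ACUTE
two_elements = "A\u0301"  # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

print(code_equal(precomposed, two_elements))   # False
print(equivalent(precomposed, two_elements))   # True
```

The second function only works because a table of equivalences is
available to it; the first needs nothing beyond the two strings.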
 
>    So you only have to have getc() look at more than one code element,
>    and you only have to test for one value when you look for this
>    character, namely the accented character coded as one code element,
>    and not for the combined two-code entity.
>
> Most programs operate on text elements, which, in pre-10646 days
> corresponded to code elements.
 
I find this statement somewhat removed from reality. Most programs today
operate on characters; I would expect only a few UNICODE programs to
work on text elements.
 
> getc() was designed in a context where
> a code element could be equated with a text element.  With 10646
> this situation has changed.  If an implementation desires to impose
> text element/code element equivalence, then it must be prepared to
> translate text elements which are spelled out by multiple code elements
> into single code elements to be returned by getc().
 
I believe this is a specification coming from the previous version
of UNICODE, and it is not in line with the 10646 standard. The 10646
standard has removed the very messy definitions of equivalence between
combining characters and precomposed characters, which in the previous
UNICODE standard could be equivalent for one purpose and not equivalent
for another. The definition there did not live up to the normal
requirement of unique assignment of codes to characters, which
character sets normally meet. I expect this to be changed in
the new UNICODE standard, which is to be aligned with the approved 10646
standard.
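
For reference, the translation step that the quoted text asks of
getc() (folding a multi-code-element spelling into a single code
element before handing it to the program) might look roughly like the
sketch below. It assumes Python's unicodedata module, uses NFC
composition as a stand-in for whatever equivalence table an
implementation would adopt, and groups a base character with the
combining marks that follow it. The name text_elements is hypothetical.

```python
import unicodedata
from typing import Iterator


def text_elements(code_elements: str) -> Iterator[str]:
    """Yield one text element at a time, composed where NFC allows.

    A new element starts at every code point whose canonical combining
    class is zero; following combining marks are attached to it.
    """
    cluster = ""
    for ch in code_elements:
        if cluster and unicodedata.combining(ch) == 0:
            # A new base character begins: emit the finished cluster.
            yield unicodedata.normalize("NFC", cluster)
            cluster = ""
        cluster += ch
    if cluster:
        yield unicodedata.normalize("NFC", cluster)


# "A" + COMBINING ACUTE ACCENT + "B" comes out as two elements,
# the first being the single code element U+00C1.
print([hex(ord(e)) for e in text_elements("A\u0301B")])
```

Sequences with no precomposed form stay longer than one code element
after composition, so a reader of this kind cannot always return a
single code element.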
 
Keld Simonsen