----------------------------Original message----------------------------
 
   Date: Fri, 16 Oct 1992 01:27:35 +0100
   From: Keld Jørn Simonsen <[log in to unmask]>
   X-Charset: ASCII
   X-Char-Esc: 38
 
   >    > What should the system software do when a program expects A WITH
   >    > ACUTE to be one code element, and not two?  Will uses of getc()
   >    > have to be replaced by gets() in order to return more than one
   >    > code element?  Or will getc() have to compose sequences into
   >    > precomposed code elements (but then getc() would have to look
   >    > at all the character code elements, etc.)?
   >
   >    Those well known questions should have been answered before 10646 was
   >    standardized through which 10646 could have been modified to be a much
   >    more usable standard.
 
   The question is answered in the forthcoming 10646 standard: if you want
   the character LATIN CAPITAL LETTER A WITH ACUTE you can only code it as
   one code element. The two code elements LATIN CAPITAL LETTER A and
   COMBINING ACUTE ACCENT do not together constitute a character
   LATIN CAPITAL LETTER A WITH ACUTE.
 
I can't tell from this whether Keld is being intentionally misleading
or whether he isn't aware of the details of 10646.  If by "character"
Keld means an element of 10646, then he is correct:  a combination of
10646 elements does not create another element of 10646.  However, if by
"character" one means an element of a writing system (or of an alphabet),
then Keld is quite wrong, since one can indeed form a "character" in that
sense by combining code elements in 10646.  So, if I have an alphabet
which has the element LATIN CAPITAL LETTER A WITH ACUTE, I am completely
free to encode it as either one or two code elements.  In this sense,
LATIN CAPITAL LETTER A WITH ACUTE constitutes a text element in the
context of some text process and writing system.  A user of 10646 is
quite free to encode such a text element with more than one code element,
that is, with alternative code element spellings.
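
To make the "alternative spellings" point concrete, here is a minimal
sketch in C, not anything taken from 10646 itself.  It treats two
sequences of code elements as the same text if they agree once every
listed base + combining-mark pair is replaced by its precomposed form.
The names compose() and same_text(), the three-entry table, and the
64-element limit are my own assumptions for illustration; the values
are those of the published code tables (0041 A, 0301 COMBINING ACUTE
ACCENT, 00C1 A WITH ACUTE).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical composition table: (base, combining mark) -> precomposed.
   Only a few illustrative entries; a real table would be far larger. */
struct composition { uint32_t base, mark, precomposed; };

static const struct composition compositions[] = {
    { 0x0041, 0x0301, 0x00C1 },  /* A + COMBINING ACUTE -> A WITH ACUTE */
    { 0x0045, 0x0301, 0x00C9 },  /* E + COMBINING ACUTE -> E WITH ACUTE */
    { 0x0061, 0x0301, 0x00E1 },  /* a + COMBINING ACUTE -> a WITH ACUTE */
};

/* Return the precomposed code element for base+mark, or 0 if not listed. */
uint32_t lookup_composition(uint32_t base, uint32_t mark)
{
    size_t n = sizeof compositions / sizeof compositions[0];
    for (size_t i = 0; i < n; i++)
        if (compositions[i].base == base && compositions[i].mark == mark)
            return compositions[i].precomposed;
    return 0;
}

/* Copy src to dst, replacing each listed base+mark pair by its
   precomposed form.  dst must have room for n code elements.
   Returns the number of code elements written. */
size_t compose(const uint32_t *src, size_t n, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        if (i + 1 < n) {
            uint32_t pre = lookup_composition(c, src[i + 1]);
            if (pre != 0) {
                c = pre;
                i++;            /* the combining mark has been consumed */
            }
        }
        dst[out++] = c;
    }
    return out;
}

/* Two spellings denote the same text if their composed forms match.
   Inputs are limited to 64 code elements to keep the sketch simple. */
int same_text(const uint32_t *a, size_t na, const uint32_t *b, size_t nb)
{
    uint32_t ca[64], cb[64];
    assert(na <= 64 && nb <= 64);
    size_t la = compose(a, na, ca);
    size_t lb = compose(b, nb, cb);
    if (la != lb)
        return 0;
    for (size_t i = 0; i < la; i++)
        if (ca[i] != cb[i])
            return 0;
    return 1;
}
```

With this, same_text() reports the one-element spelling { 0x00C1 } and
the two-element spelling { 0x0041, 0x0301 } as the same text, while a
test for the single value 0x00C1 alone would miss the second spelling.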
 
   So you only have to have getc() look at more than one code element,
   and you only have to test for one value when you look for this
   character, namely the accented character coded as one code element,
   and not for the combined two-code entity.
 
Most programs operate on text elements, which, in pre-10646 days,
corresponded to code elements.  getc() was designed in a context where
a code element could be equated with a text element.  With 10646
this situation has changed.  If an implementation desires to impose
text element/code element equivalence, then it must be prepared to
translate text elements which are spelled out by multiple code elements
into single code elements to be returned by getc().  Since 10646 doesn't
provide a single (precomposed) code element for every possible
combination of code elements, such an implementation must be
prepared to dynamically assign elements from the Private Use Zone in
order to represent unencoded composite text elements.  Such a system
must also be able to convert such private encodings into public encodings
when interchanging text.
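
A rough sketch of what such an implementation would have to do, in C.
Everything named here is hypothetical: struct text_reader,
get_text_element(), to_public(), the two-entry stand-in for the
precomposed repertoire, and the limit of 256 private assignments.  I
assume the private-use range starts at E000 and that combining marks
lie in 0300..036F, and I handle only one combining mark per base.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PUA_FIRST   0xE000u      /* assumed start of the private-use range */
#define MAX_PRIVATE 256          /* arbitrary limit for this sketch */
#define END_OF_TEXT 0xFFFFFFFFu  /* sentinel, not a code element */

struct pair { uint32_t base, mark; };

/* Per-session state: the input text and the private assignments so far. */
struct text_reader {
    const uint32_t *src;
    size_t len, pos;
    struct pair assigned[MAX_PRIVATE];
    size_t n_assigned;
};

/* Tiny stand-in for the standard's precomposed repertoire. */
uint32_t precomposed_for(uint32_t base, uint32_t mark)
{
    if (base == 0x0041 && mark == 0x0301) return 0x00C1;  /* A WITH ACUTE */
    if (base == 0x0061 && mark == 0x0301) return 0x00E1;  /* a WITH ACUTE */
    return 0;
}

int is_combining(uint32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

/* Find the private code element already assigned to base+mark, or
   assign the next free one. */
uint32_t private_for(struct text_reader *r, uint32_t base, uint32_t mark)
{
    for (size_t i = 0; i < r->n_assigned; i++)
        if (r->assigned[i].base == base && r->assigned[i].mark == mark)
            return PUA_FIRST + (uint32_t)i;
    assert(r->n_assigned < MAX_PRIVATE);
    r->assigned[r->n_assigned].base = base;
    r->assigned[r->n_assigned].mark = mark;
    return PUA_FIRST + (uint32_t)r->n_assigned++;
}

/* getc()-like call that returns exactly one code element per text
   element, composing or privately assigning as needed. */
uint32_t get_text_element(struct text_reader *r)
{
    if (r->pos >= r->len)
        return END_OF_TEXT;
    uint32_t base = r->src[r->pos++];
    if (r->pos < r->len && is_combining(r->src[r->pos])) {
        uint32_t mark = r->src[r->pos++];
        uint32_t pre = precomposed_for(base, mark);
        return pre ? pre : private_for(r, base, mark);
    }
    return base;
}

/* For interchange: expand a private element back into its public
   spelling.  Writes 1 or 2 code elements to out and returns the count. */
size_t to_public(const struct text_reader *r, uint32_t c, uint32_t out[2])
{
    if (c >= PUA_FIRST && c < PUA_FIRST + r->n_assigned) {
        out[0] = r->assigned[c - PUA_FIRST].base;
        out[1] = r->assigned[c - PUA_FIRST].mark;
        return 2;
    }
    out[0] = c;
    return 1;
}
```

Note that the private assignments are meaningful only within the one
reader that made them, which is exactly why the conversion back to a
public spelling is needed before any text leaves the system.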
 
Glenn Adams