On 15-02-19 01:07 PM, Jens Østergaard Petersen wrote:
> Hi Martin,
> Well, I'm no computer scientist, but I think that a text editor should
> accept and store all valid Unicode input, that is, input using
> precomposed characters as well as input using base characters followed
> by combining characters. Needless to say, this also implies that it should
> allow you to edit what you have input, as it was input. So I don't agree
> with your SIL quotation.
I don't agree with it either; that was my point.
> However, if the text editor does something with
> your input, it should give you the option of basing itself on a
> normalization of your input, as well as on the raw input. Now, what
> belongs in this category? Well, finding identical strings surely does,
> but so does sorting.
Sorting depends on collations, and collation begins with normalization.
Note that this normalization is to Form D, not C.
However, just to make things more interesting, "Conformant
implementations may skip this step in certain circumstances: see Section
6.5, Avoiding Normalization for more information."
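To make the Form D point concrete, here's a quick Python sketch using the standard-library unicodedata module (just an illustration of the first step, not the collation algorithm itself):

```python
import unicodedata

decomposed = "e\u0301"   # "é" typed as e + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # "é" as the single precomposed code point

# The two strings differ code point for code point...
print(decomposed == precomposed)        # False

# ...but they are identical once both are normalized to Form D,
# which is where collation starts.
nfd_a = unicodedata.normalize("NFD", decomposed)
nfd_b = unicodedata.normalize("NFD", precomposed)
print(nfd_a == nfd_b)                   # True
```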
> Here Unicode appears to prescribe
> normalization only, but one should, I think, have the option of working
> both with the raw input and with normalized input,
I think we have to be careful not to use "normalized" to mean one
specific normalization form, though, don't we? NFD, NFC, NFKD and NFKC
are all normalization forms.
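And the four forms can all give different results for the same input. A contrived Python example (the "fi" ligature followed by a decomposed "é"):

```python
import unicodedata

# LATIN SMALL LIGATURE FI, then e + COMBINING ACUTE ACCENT
s = "\ufb01e\u0301"

# NFC/NFD touch only canonical equivalents; NFKC/NFKD also
# replace the ligature with plain "fi" (a compatibility mapping).
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    n = unicodedata.normalize(form, s)
    print(form, [hex(ord(c)) for c in n])
```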
> finding either all versions of
> that Swedish name or only the identical one. Of course, the text
> editor should also give you the option of normalizing a whole text, for
> in some cases (regex involving accents), NFD might be more convenient.
> For me, precomposed characters are one of those unhappy compromises
> that Unicode agreed to in order to enable round tripping in the early
> days and so boost acceptance. After the initial lot, they were not
> introduced any more, since they go against Unicode principles.
I think that's a different argument, although it's an interesting one.
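On the regex point, though, you're right that NFD is handy: once accents are separate combining code points, one character class covers them all. A small Python sketch (the sample text is mine):

```python
import re
import unicodedata

text = "Ren\u00e9e r\u00e9sum\u00e9"   # "Renée résumé", precomposed

# In NFD every accent becomes its own combining code point, so a
# single regex over the combining diacritics block can strip them.
decomposed = unicodedata.normalize("NFD", text)
stripped = re.sub(r"[\u0300-\u036f]", "", decomposed)
print(stripped)   # Renee resume
```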
> On Thursday, February 19, 2015, Martin Holmes <[log in to unmask]
> <mailto:[log in to unmask]>> wrote:
> Hi Jens,
> Thanks for raising this interesting topic. I was struck by this:
> it appears that Unicode actually requires an application to
> treat canonically equivalent sequences as identical
> In the case of (for instance) a search engine, I can imagine what
> this means; but in the case of an editor, I can't imagine what it
> could mean. If I type an e followed by an acute accent into my
> application, then press the backspace key, what should it do? Should
> it delete only the accent (the last codepoint entered), allowing me
> to enter a grave accent to correct an error; or should it somehow,
> in the background, normalize the combination to an e+acute, so that
> the backspace deletes the composite character? In either case, it
> would be privileging one normalization form over the other, and so
> not "treat[ing] canonically equivalent sequences as identical".
> The discussion you point to raises this very issue, but (to my mind)
> doesn't resolve it at all:
> "Whether a particular language group think of a combining mark as a
> separate letter or not may have an impact on appropriate user
> interface behavior, e.g., what happens when the backspace key is
> pressed, but that behavior needs to operate whether or not the data
> is stored as composed or decomposed."
> This seems to mean that the software should attempt to guess what I
> expect to happen when the backspace key is pressed, and do that.
> Fiddling with my iPad ...
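For what it's worth, the ambiguity is easy to see at the code-point level. A Python sketch of the backspace scenario (the "delete the last code point" behaviour is my assumption about a naive editor, not any particular application's):

```python
import unicodedata

decomposed = "e\u0301"     # typed as e + combining acute
precomposed = "\u00e9"     # the single composed character

# The two are canonically equivalent: they normalize identically.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# But a naive "delete the last code point" backspace treats them
# differently, privileging one form over the other:
print(repr(decomposed[:-1]))    # 'e' -- only the accent is removed
print(repr(precomposed[:-1]))   # ''  -- the whole character is gone
```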