Paul Bennett wrote:

> On Fri, 30 Apr 2004 17:38:02 -0700, Garth Wallace <[log in to unmask]>
> wrote:
>
>> Paul Bennett wrote:
>>
>>> It's only a small subset of Unicode that gets mangled, rather than every
>>> character (we've seen it on the Georgian alphabet, notably), at least
>>> with UTF-8. UTF-8 is not merely raw Unicode, but rather a set of
>>> multi-byte codes, only some of which lie within the deadly 128-150 range.
>>
>>
>> Ah, so it's only the Unicode characters that contain bytes matching
>> ASCII control characters with the 8th bit set that get mangled. Okay.
>
>
> Not Unicode characters. UTF-8 strings, which are not the same thing. A
> UTF-8 string can be one or more bytes long (bytes which are kinda supposed
> to be "safe" bytes to pass), and resolves mathematically to a single
> Unicode character. See my example below.

I was using "character" to mean a sequence of bytes corresponding to an
abstract Unicode character. I do understand how Unicode works.
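
To make the byte-level point concrete, here is a rough Python sketch. It is only an illustration, not anything posted in the thread. It assumes the "deadly" range under discussion is the C1 control block, bytes 0x80-0x9F (128-159), which is slightly wider than the 128-150 quoted above. The helper name risky_bytes and the choice of sample letters are mine.

```python
# Assumption: the problematic bytes are the C1 control range 0x80-0x9F.
C1_RANGE = range(0x80, 0xA0)

def risky_bytes(ch):
    """Return the UTF-8 bytes of ch that fall inside C1_RANGE."""
    return [b for b in ch.encode("utf-8") if b in C1_RANGE]

georgian_an = "\u10d0"  # GEORGIAN LETTER AN
encoded = georgian_an.encode("utf-8")

# One abstract character, three bytes on the wire.
print(" ".join(f"{b:02x}" for b in encoded))         # e1 83 90

# Two of the three bytes land in the C1 range.
print([hex(b) for b in risky_bytes(georgian_an)])    # ['0x83', '0x90']

# U+00E9 encodes as c3 a9; neither byte is in the range.
print(risky_bytes("\u00e9"))                         # []
```

So whether a given character gets mangled depends on which continuation bytes its UTF-8 sequence happens to contain, not on the character being non-ASCII as such.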

>>> Should anyone post in pure UTF-16, I imagine the problem might manifest
>>> itself more often, especially if they use the right (or wrong?) Unicode
>>> pages.
>>
>>
>> Yeah, UTF-16 interpreted as ASCII would be chock-full of nulls.
>
> Nulls that any sensible software[*] would simply either skip or print as a
> non-spacing space.

*Really* sensible software wouldn't be treating UTF-16 as ASCII. :P But
yeah.
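
For what it's worth, the nulls are easy to see in a quick Python sketch. This is only an illustration under my own assumptions: the sample string is plain ASCII text, it is encoded as little-endian UTF-16 without a byte-order mark, and "skipping" the nulls is modelled as simply deleting them.

```python
text = "Hello"

# UTF-16 (little-endian, no BOM): every ASCII letter gets a trailing 0x00 byte.
raw = text.encode("utf-16-le")
print(raw)           # b'H\x00e\x00l\x00l\x00o\x00'

# Half of the bytes are nulls.
null_count = raw.count(0)
print(null_count)    # 5

# Misreading the bytes as ASCII succeeds, because 0x00 is a valid ASCII code.
as_ascii = raw.decode("ascii")

# Software that skips the nulls recovers the text, but only for this
# ASCII-only case; other characters would not survive this treatment.
stripped = as_ascii.replace("\x00", "")
print(stripped)      # Hello
```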