Paul Bennett wrote:
> On Fri, 30 Apr 2004 17:38:02 -0700, Garth Wallace wrote:
>
>> Paul Bennett wrote:
>>
>>> It's only a small subset of Unicode that gets mangled, rather than every
>>> character (we've seen it on the Georgian alphabet, notably), at least with
>>> UTF-8. UTF-8 is not merely raw Unicode, but rather a set of multi-byte
>>> codes, only some of which lie within the deadly 128-150 range.
>>
>> Ah, so it's only the Unicode characters that contain bytes matching
>> ASCII control characters with the 8th bit set that get mangled. Okay.
>
> Not Unicode characters. UTF-8 strings, which are not the same thing. A
> UTF-8 string can be one or more bytes long (bytes which are kinda supposed
> to be "safe" bytes to pass), and resolves mathematically to a single
> Unicode character. See my example below.

I was using "character" to mean a sequence of bytes corresponding to an
abstract Unicode character. I do understand how Unicode works.

>>> Should anyone post in pure UTF-16, I imagine the problem might manifest
>>> itself more often, especially if they use the right (or wrong?) Unicode
>>> pages.
>>
>> Yeah, UTF-16 interpreted as ASCII would be chock-full of nulls.
>
> Nulls that any sensible software[*] would simply either skip or print as a
> non-spacing space.

*Really* sensible software wouldn't be treating UTF-16 as ASCII. :P But yeah.
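
For anyone following along, here is a minimal Python sketch of the two effects discussed above. It is an illustration only, not something from the thread: the helper names `c1_bytes` and `null_count` are made up, and it assumes the "deadly" range means the C1 control block, bytes 0x80-0x9F (128-159), i.e. the ASCII control codes with the 8th bit set.

```python
# Assumption: the troublesome range is the C1 control block, 0x80-0x9F.
C1_RANGE = range(0x80, 0xA0)


def c1_bytes(text, encoding="utf-8"):
    """Return the encoded bytes of `text` that fall inside the C1 range."""
    return [b for b in text.encode(encoding) if b in C1_RANGE]


def null_count(text, encoding="utf-16-le"):
    """Count the zero bytes produced when `text` is encoded."""
    return text.encode(encoding).count(0)


georgian_an = "\u10d0"  # GEORGIAN LETTER AN, UTF-8 bytes E1 83 90
e_acute = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, UTF-8 bytes C3 A9

# Two of the three UTF-8 bytes for the Georgian letter land in the C1 range.
print([hex(b) for b in c1_bytes(georgian_an)])  # ['0x83', '0x90']

# Neither UTF-8 byte for e-acute does, so it would pass through untouched.
print(c1_bytes(e_acute))  # []

# Plain ASCII text in UTF-16 carries one null byte per character.
print(null_count("abc"))  # 3
```

Under that assumption, whether a character survives depends on its encoded bytes rather than on the code point itself, which fits the Georgian alphabet being hit while many accented Latin letters are not.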