For what it’s worth, here’s a Schematron assertion to sniff out non-NFC data:
<assert test=". = normalize-unicode(.)">All text needs to be normalized (NFC)</assert>
It doesn’t solve the interesting issues that have been raised, but it at least keeps my own house clean.
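For anyone who wants to run the same check outside Schematron, here is a minimal Python sketch of the comparison the assertion makes. The function name `is_nfc` is made up for illustration; the only library call is the standard `unicodedata.normalize`.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True when text is unchanged by NFC normalization.

    This mirrors the XPath test `. = normalize-unicode(.)`, where
    normalize-unicode() defaults to NFC. The name is hypothetical.
    """
    return text == unicodedata.normalize("NFC", text)


composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(is_nfc(composed))    # the precomposed form is already NFC
print(is_nfc(decomposed))  # the decomposed form is not
```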
Editor in Byzantine Studies
202 339 6435
From: Jens Østergaard Petersen
Reply-To: Jens Østergaard Petersen
Date: Thu, 19 Feb 2015 11:41:26 +0100
Subject: Re: Unicode Normalization
Thank you. I am indeed sorry that I did not get to the bottom of this and misrepresented oXygen's capabilities.
As you write, this happens at the system level. On Mac OS X, it appears that NFC normalization takes place when one copies text into a Java-based application; this does not happen with native Mac apps. I have no idea what happens on other operating systems.
This shows how dangerous it is to make assumptions in this area. Indeed, it appears that Unicode actually requires an application to treat canonically equivalent sequences as identical, which would make oXygen (and every text editor I know) non-conformant. See the discussion at <http://scripts.sil.org/cms/scripts/page.php?item_id=NFC_vs_NFD>. If applications may convert between NFC and NFD at will, provided they honour this requirement, it really makes no sense to require a TEI document to be in any specific normalization form (or to be normalized at all). The burden then falls on the application, which must find canonically equivalent strings, or at least make it possible to find them. For oXygen, I think this would mean that the Find/Replace menu should have an option to search in a normalization-insensitive manner. This, I think, is also Syd's argument: that search engines ought to be able to handle this. As far as I know, no application offers this option, so this is in no way a criticism of oXygen!
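To make the idea of a normalization-insensitive search concrete, here is a rough Python sketch. It is not how oXygen or any other tool works; the function name `find_normalized` is invented for illustration, and it simply normalizes both strings to the same form before searching.

```python
import unicodedata


def find_normalized(haystack: str, needle: str, form: str = "NFC") -> int:
    """Search after bringing both strings to the same normalization form.

    Hypothetical helper. The returned index refers to the *normalized*
    haystack, not to the original string; -1 means no match.
    """
    norm_haystack = unicodedata.normalize(form, haystack)
    norm_needle = unicodedata.normalize(form, needle)
    return norm_haystack.find(norm_needle)


text = "xx re\u0301sume\u0301"   # decomposed: e + COMBINING ACUTE ACCENT
query = "r\u00e9sum\u00e9"       # composed: precomposed é

print(text.find(query))              # plain search misses the match
print(find_normalized(text, query))  # normalized search finds it
```

Note that the choice of form matters for partial matches: a search for a bare "e" would match inside a decomposed "é" under NFD but not under NFC, so a real implementation would have to decide which behaviour users expect.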
I would have thought that XML canonicalization implied Unicode normalization, but it does not: <http://www.w3.org/TR/xml-c14n#NoCharModelNorm>.
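This can be checked with a small Python sketch. One caveat: Python's `xml.etree.ElementTree.canonicalize` (available from Python 3.8) implements C14N 2.0 rather than the 1.0 version linked above, so this only illustrates the point under that assumption. The sample `<p>` documents are invented.

```python
import unicodedata
from xml.etree.ElementTree import canonicalize

# Two canonically equivalent documents (invented sample data).
composed_xml = "<p>caf\u00e9</p>"      # precomposed é
decomposed_xml = "<p>cafe\u0301</p>"   # e + COMBINING ACUTE ACCENT

canon_composed = canonicalize(composed_xml)
canon_decomposed = canonicalize(decomposed_xml)

# Canonicalization leaves the code point sequences as they were,
# so the two canonical forms still differ.
print(canon_composed == canon_decomposed)

# Only an explicit Unicode normalization step makes them equal.
print(unicodedata.normalize("NFC", canon_decomposed) == canon_composed)
```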