Initially I was very keen on Jens's idea; I have XSLT which performs
Unicode normalization on some of my project files.
Then, when I thought about it again, it struck me that such a feature
would be one of those devilish things which may be completely unhelpful,
because the declaration may not be true.
We've all come across the problem of <revisionDesc> having lots of
helpful <change> elements, but many other changes having been made by
folks who neglected to create a <change> element, meaning that
<revisionDesc> is an incomplete and misleading record of what actually
happened -- which is worse than no record at all.
I can easily imagine an element in the teiHeader where I dutifully record that
the text and attribute node content in my document is NFC; then an
encoder copy/pastes a piece of non-NFC text into the document; and now
the declaration is worse than useless, because it's actually misleading.
Since you can easily check algorithmically whether a document is NFC,
and convert it to NFC if necessary, I don't believe an explicit
declaration in the header is helpful for this (and in fact it may be
misleading).
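
As a rough illustration of that check, here is a minimal sketch in Python
(not the XSLT I actually use). The function names are made up for this
example, and it only looks at text nodes and attribute values, ignoring
comments and processing instructions.

```python
import unicodedata
import xml.etree.ElementTree as ET


def non_nfc_strings(xml_text):
    """Return every text node and attribute value that is not in NFC.

    Hypothetical helper for illustration only; comments and processing
    instructions are not examined.
    """
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        # elem.text is the text before the first child, elem.tail the
        # text following the element's end tag; either may be None.
        candidates = [elem.text, elem.tail, *elem.attrib.values()]
        for s in candidates:
            # A string is in NFC exactly when normalizing leaves it unchanged.
            if s and unicodedata.normalize("NFC", s) != s:
                found.append(s)
    return found


def to_nfc(s):
    """Convert a string to NFC (hypothetical helper name)."""
    return unicodedata.normalize("NFC", s)


# "e" followed by U+0301 COMBINING ACUTE ACCENT is not NFC;
# the precomposed U+00E9 is.
sample = '<p n="e\u0301">cafe\u0301 <hi>ok</hi> tail</p>'
problems = non_nfc_strings(sample)
fixed = [to_nfc(s) for s in problems]
```

Run against each file at validation time, a check like this reports the
actual state of the document, which is exactly what a static declaration
in the header cannot guarantee.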
On 15-02-18 07:26 AM, Syd Bauman wrote:
> This is a fascinating, if gnarly problem, that has been around since
> the dawn of XML. And some hold that this is really an XML problem,
> not a TEI problem, since (obviously) DocBook and MODS and all sorts
> of other XML vocabularies have the same problem.
> That said, even though the issue was addressed in early drafts of
> Canonicalization, the W3C (IIRC) explicitly dodged this bullet when
> it set up the C14N format. (Hang on ... there it is: see
> I just looked at that paragraph, and it does point out that the XPath
> data model requires NFC *when the input is not UCS-based*. I bet 90+%
> of TEI documents are UCS-based (e.g., UTF-8), though. Not sure what
> this means for those documents.
> Part of me thinks that search engines simply should know how to
> handle this. They have an option for case-folding ("A" vs "a"), e.g.,
> why not for pre-composition?
> In any case, it is a worthy enough idea (IMHO) that it should
> certainly be addressed, so I think you should put in a feature
> request ticket for this. (If you don't want to fight with the
> Sourceforge interface to do that, just say so, and I'll be happy to
> put the ticket in for you.)