Oxygen does not have an internal conversion table between these formats,
so whatever you paste into it or copy from it is kept exactly as is.
If you paste some content in Oxygen and you find NFC text, then that's
what Oxygen found in the system clipboard.
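One way to verify this claim is to check which form a pasted string is actually in. A minimal sketch using Python's standard unicodedata module (is_normalized requires Python 3.8+); the sample strings stand in for clipboard content:

```python
import unicodedata

# Two visually identical strings, as they might arrive from the clipboard.
nfc_text = "\u00c5"      # "Å" as one precomposed code point (U+00C5)
nfd_text = "A\u030a"     # "A" + COMBINING RING ABOVE (U+030A)

# is_normalized reports the form without converting anything.
print(unicodedata.is_normalized("NFC", nfc_text))  # True
print(unicodedata.is_normalized("NFC", nfd_text))  # False
print(unicodedata.is_normalized("NFD", nfd_text))  # True
```

If the check returns False for NFC, the text reached the editor already decomposed; the editor did not convert it.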
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
On 2/18/2015 4:06 PM, Jens Østergaard Petersen wrote:
> Hi Radu,
> Sorry that I got this wrong, but when you paste in text input outside of
> oXygen, surely an automatic conversion is performed? I get NFC text only
> when I paste in NFD text.
> The search problem you mention is my prime concern. I want to know if
> something is "there", so I need to know which normalization form the
> TEI document is in.
> On 18 Feb 2015 at 14:22:36, Radu Coravu ([log in to unmask]
> <mailto:[log in to unmask]>) wrote:
>> Hi Jens,
>> One remark about this:
>> > Well, one reason might be that oXygen (like e.g. JEdit) surreptitiously converts all your input to NFC.
>> I do not know about JEdit but Oxygen preserves the text exactly in the
>> way you type it. If you type it using the equivalent Unicode character
>> for "Å", it will preserve it as a single code point. If you somehow
>> manage to type it as two code points, it will keep it as two code points.
>> So it depends on what the Swedish keyboard layout sends to the editor.
>> I'm attaching a sample file with both forms. I entered the composed form
>> using the Character Map utility from Oxygen.
>> One problem is that when you search for such words in Oxygen, you need
>> to search exactly using the form in which they are present in the
>> document, so Oxygen will not consider the single code point "Å"
>> equal to the code point "A" followed by the combining character "̊".
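The search problem described above can be reproduced outside of any editor. A minimal sketch in Python, using the standard unicodedata module:

```python
import unicodedata

composed = "\u00c5str\u00f6m"      # "Åström" with precomposed "Å" and "ö" (NFC)
decomposed = "A\u030astro\u0308m"  # the same word, decomposed (NFD)

# A literal string comparison -- what a plain editor search does -- fails:
print(composed == decomposed)  # False

# Normalizing both sides to the same form restores equality:
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

This is why a search term typed in one form cannot find the same word stored in the other form unless the application normalizes both before comparing.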
>> Radu Coravu
>> <oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
>> On 2/18/2015 2:44 PM, Jens Østergaard Petersen wrote:
>> > TEI is concerned with text and its markup. What if there was something
>> > woolly in what we regard as text? Not with the interpretation of the
>> > text, but with its identity as a string of characters? What if I could
>> > input something and you could not find it, even though the abstract
>> > characters of your search term agreed one-to-one with my input?
>> > This is the problem of Unicode normalization
>> > <http://unicode.org/reports/tr15/#Norm_Forms>.
>> > Let us say you had input the string "Åström" (a Swedish name) in a TEI
>> > document – would I be able to find it with "Åström"? The first string
>> > uses precomposed characters for "Å" and "ö", so these only occupy one
>> > code point each. The second looks the same as the first – indeed, it is
>> > canonically identical – but "Åström" uses decomposed characters, with
>> > "A" followed by "̊" and "o" followed by "̈" – which characters nicely
>> > coalesce on the screen into "Å" and "ö" – and I cannot use it to find
>> > "Åström". The two approaches to encoding the word follow two different
>> > Unicode Normalization Forms, NFC ("C" for "composed") and NFD ("D" for
>> > "decomposed"). There are also two forms named NFKC and NFKD (with "K"
>> > for "compatibility), which e.g. splits apart ligatures such as "ﬁ" into
>> > "fi". These of course also influence findability.
>> > In the Guidelines, vi. Languages and Character Sets, it says:
>> > "It is important that every Unicode-based project should agree on,
>> > consistently implement and fully document a comprehensive and coherent
>> > normalization practice. As well as ensuring data integrity within a
>> > given project, a consistently implemented and properly documented
>> > normalization policy is essential for successful document interchange."
>> > I think this calls for an obligatory element in encodingDesc (perhaps
>> > "NFDecl") to register which normalization practice has been followed in
>> > a TEI document. Otherwise, no one will know what's in it ….
>> > How does one go about finding out which normalization practice has been
>> > followed in a stretch of text? That's only possible by analyzing it code
>> > point for code point, but there are applications that make it possible
>> > to convert a whole text to one of the four normalization forms. On the
>> > Mac, there is the great UnicodeChecker
>> > <http://earthlingsoft.net/UnicodeChecker/> which can analyze and compare
>> > text and allows you to convert text in a text editor through a service.
>> > On Windows, there is BabelPad
>> > <http://www.babelstone.co.uk/Software/BabelPad.html>.
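The conversion those tools perform can also be scripted. A minimal sketch in Python with a hypothetical normalize_file helper (not part of either tool); canonical normalization only touches character data, so the XML markup itself is unaffected:

```python
import sys
import unicodedata

def normalize_file(path: str, form: str = "NFC") -> None:
    """Rewrite a UTF-8 text file in the given Unicode normalization
    form: "NFC", "NFD", "NFKC" or "NFKD"."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(unicodedata.normalize(form, text))

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g.  python normalize.py document.xml NFD
    normalize_file(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else "NFC")
```

Note that NFKC/NFKD can change the content (ligatures, compatibility characters), so for archival TEI documents the canonical forms are the safer choice.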
>> > Why hasn't this issue surfaced before? Well, one reason might be that
>> > oXygen (like e.g. JEdit) surreptitiously converts all your input to NFC.
>> > Though I think all text people resent such automatic conversions,
>> > Unicode permits them, mainly with the (dated) motivation of achieving
>> > backwards compatibility with software that is not Unicode-aware. But
>> > this is not necessarily what you want, and oXygen supplies no way to
>> > handle this basic text property. In many TEI projects, using NF(K)D
>> > would be a sensible approach (string comparisons and regex searches are
>> > easier and faster), though the Guidelines (with some reserve) recommend
>> > otherwise:
>> > "the Normalization Form C (NFC) seems to be most appropriate for text
>> > encoding projects".
>> > Jens