LISTSERV mailing list manager LISTSERV 16.5

TEI-L Archives

TEI-L@LISTSERV.BROWN.EDU

TEI-L February 2015

Subject: Re: Unicode Normalization
From: Radu Coravu <[log in to unmask]>
Reply-To: Radu Coravu <[log in to unmask]>
Date: Fri, 20 Feb 2015 09:46:40 +0200
Content-Type: text/plain
Parts/Attachments: text/plain (187 lines)

Hi Jens,

I tested and you are right: when you paste into Oxygen on Mac, you always
obtain the content with one code point per character, even if the copied
content originally contained two code points per character. This is not
something we can control, though; we use the standard API to get content
from the clipboard, and this is what it returns.
One workaround would be to use the Oxygen plugin for the Eclipse
workbench. Eclipse accesses the clipboard in a different (probably more
native) way than Swing applications do, so in Eclipse this behavior can
no longer be reproduced.
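The difference the clipboard makes is easy to see by counting code points. Here is a minimal sketch using Python's standard unicodedata module, purely to illustrate the two forms (this is not Oxygen's actual clipboard code):

```python
import unicodedata

# The same visible word in the two canonical normalization forms.
nfc = unicodedata.normalize("NFC", "Åström")  # precomposed characters
nfd = unicodedata.normalize("NFD", "Åström")  # base letters + combining marks

print(len(nfc))    # 6 code points
print(len(nfd))    # 8 code points: "Å" and "ö" each become two
print(nfc == nfd)  # False, although the strings are canonically equivalent
```

Pasting into a Java application on Mac apparently hands you the first form even when the second was copied.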

About this remark:

> For oXygen, I think this would mean that the Find/Replace menu should have an option to search in a normalized manner.

I would add an improvement request for this and see if we can do 
something about it.
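Such a normalized search could, in principle, work by normalizing both the document text and the search term to the same form before comparing. A minimal sketch of the idea in Python follows; the helper name is hypothetical and not part of any Oxygen API:

```python
import unicodedata

def normalized_find(haystack: str, needle: str) -> int:
    """Locate needle in haystack regardless of which canonical
    normalization form either string uses. Returns the index in the
    NFC-normalized haystack, or -1 if absent."""
    return unicodedata.normalize("NFC", haystack).find(
        unicodedata.normalize("NFC", needle))

# A document stored in NFD is still found with an NFC search term.
doc = unicodedata.normalize("NFD", "Herr Åström skriver")
print(normalized_find(doc, unicodedata.normalize("NFC", "Åström")))  # 5
```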

Regards,
Radu

Radu Coravu
<oXygen/>  XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com

On 2/19/2015 12:41 PM, Jens Østergaard Petersen wrote:
> Hi Radu,
>
> Thank you. I indeed am sorry I did not get to the bottom of this and
> misrepresented oXygen's capabilities.
>
> As you write, this happens on the system level. On Mac OS X, it appears
> to be the case that when one copies into a Java-based application, NFC
> normalization takes place. This does not happen with native Mac apps.
> What goes on in other operating systems I have no clue.
>
> This shows how dangerous it is to make assumptions in this area. Indeed,
> it appears that Unicode actually requires an application to
> treat canonically equivalent sequences as identical – which would make
> oXygen (and all text editors I know) non-conformant. See the discussion
> at <http://scripts.sil.org/cms/scripts/page.php?item_id=NFC_vs_NFD>. If
> applications may NFC/NFD-transform at will as long as they honour this
> requirement, it really makes no sense to require a TEI document to be in
> any specific normalization (or to be normalized at all). It puts the
> burden on the application, since it is then required to (at least make
> it possible to) find canonically equivalent strings. For oXygen, I think
> this would mean that the Find/Replace menu should have an option to
> search in a normalized manner. This I think is also Syd's argument –
> that search engines ought to be able to handle this. As far as I know,
> no application offers this option, so this is in no way a criticism of
> oXygen!
>
> I would have thought that XML canonicalization implied Unicode
> normalization, but it does not:
> <http://www.w3.org/TR/xml-c14n#NoCharModelNorm>.
>
> Best,
>
> Jens
>
> On 19 Feb 2015 at 09:12:21, Radu Coravu ([log in to unmask]
> <mailto:[log in to unmask]>) wrote:
>
>> Hi Jens,
>>
>> Oxygen does not have an internal conversion table between these formats
>> so whatever you paste in it or copy from it is interpreted as it is.
>> If you paste some content in Oxygen and you find NFC text, then that's
>> what Oxygen found in the system clipboard.
>>
>> Regards,
>> Radu
>>
>> Radu Coravu
>> <oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
>> http://www.oxygenxml.com
>>
>> On 2/18/2015 4:06 PM, Jens Østergaard Petersen wrote:
>> > Hi Radu,
>> >
>> > Sorry that I got this wrong, but when you paste in text input outside of
>> > oXygen, surely an automatic conversion is performed? I get NFC text only
>> > when I paste in NFD text.
>> >
>> > The search problem you mention is my prime concern. I want to know if
>> > something is "there", so I need to know which normalization form the
>> > TEI document is in.
>> >
>> > Best,
>> >
>> > Jens
>> >
>> > On 18 Feb 2015 at 14:22:36, Radu Coravu ([log in to unmask]
>> > <mailto:[log in to unmask]>) wrote:
>> >
>> >> Hi Jens,
>> >>
>> >> One remark about this:
>> >>
>> >> > Well, one reason might be that oXygen (like e.g. JEdit) surreptitiously converts all your input to NFC.
>> >>
>> >> I do not know about JEdit, but Oxygen preserves the text exactly as
>> >> you type it. If you type it using the precomposed Unicode character
>> >> "Å", it will preserve it as a single code point. If you somehow
>> >> manage to type it as two code points, it will keep it as two code points.
>> >> So it depends on what the Swedish keyboard layout sends to the
>> >> application.
>> >> I'm attaching a sample file with both forms. I entered the composed form
>> >> using the Character Map utility from Oxygen.
>> >>
>> >> One problem is that when you search for such words in Oxygen, you need
>> >> to search using exactly the form in which they are present in the
>> >> document; Oxygen will not consider the single code point "Å" equal
>> >> to the code point "A" followed by the combining code point "̊".
>> >>
>> >> Regards,
>> >> Radu
>> >>
>> >> Radu Coravu
>> >> <oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
>> >> http://www.oxygenxml.com
>> >>
>> >> On 2/18/2015 2:44 PM, Jens Østergaard Petersen wrote:
>> >> > TEI is concerned with text and its markup. What if there was something
>> >> > woolly in what we regard as text? Not with the interpretation of the
>> >> > text, but with its identity as a string of characters? What if I could
>> >> > input something and you could not find it, even though the abstract
>> >> > characters of your search term agreed one-to-one with my input?
>> >> >
>> >> > This is the problem of Unicode normalization
>> >> > <http://unicode.org/reports/tr15/#Norm_Forms>.
>> >> >
>> >> > Let us say you had input the string "Åström" (a Swedish name) in a TEI
>> >> > document – would I be able to find it with "Åström"? The first string
>> >> > uses precomposed characters for "Å" and "ö", so these only occupy one
>> >> > code point each. The second looks the same as the first – indeed, it is
>> >> > canonically identical – but "Åström" uses decomposed characters, with
>> >> > "A" followed by "̊" and "o" followed by "̈" – which characters nicely
>> >> > coalesce on the screen into "Å" and "ö" – and I cannot use it to find
>> >> > "Åström". The two approaches to encoding the word follow two different
>> >> > Unicode Normalization Forms, NFC ("C" for "composed") and NFD ("D" for
>> >> > "decomposed"). There are also two forms named NFKC and NFKD (with "K"
>> >> > for "compatibility"), which e.g. split apart ligatures such as "ﬁ"
>> >> > into "fi". These of course also influence findability.
>> >> >
>> >> > In the Guidelines, vi. Languages and Character Sets, it says:
>> >> >
>> >> > "It is important that every Unicode-based project should agree on,
>> >> > consistently implement and fully document a comprehensive and coherent
>> >> > normalization practice. As well as ensuring data integrity within a
>> >> > given project, a consistently implemented and properly documented
>> >> > normalization policy is essential for successful document interchange."
>> >> >
>> >> > I think this calls for an obligatory element in encodingDesc (perhaps
>> >> > "NFDecl") to register which normalization practice has been followed in
>> >> > a TEI document. Otherwise, no one will know what's in it ….
>> >> >
>> >> > How does one go about finding out which normalization practice has been
>> >> > followed in a stretch of text? That's only possible by analyzing it code
>> >> > point by code point, but there are applications that make it possible
>> >> > to convert a whole text to one of the four normalization forms. On the
>> >> > Mac, there is the great UnicodeChecker
>> >> > <http://earthlingsoft.net/UnicodeChecker/> which can analyze and compare
>> >> > text and allows you to convert text in a text editor through a service.
>> >> > On Windows, there is BabelPad
>> >> > <http://www.babelstone.co.uk/Software/BabelPad.html>.
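For scripted checks, the same analysis takes only a few lines in any language with Unicode support. Here is a sketch using Python's standard library (unicodedata.is_normalized, available since Python 3.8), doing programmatically what UnicodeChecker and BabelPad do interactively:

```python
import unicodedata

def forms_of(text: str) -> list:
    """Report which of the four normalization forms the text is already in."""
    return [form for form in ("NFC", "NFD", "NFKC", "NFKD")
            if unicodedata.is_normalized(form, text)]

word = unicodedata.normalize("NFC", "Åström")
print(forms_of(word))                                # ['NFC', 'NFKC']
print(forms_of(unicodedata.normalize("NFD", word)))  # ['NFD', 'NFKD']
```

A text can satisfy more than one form at once, so a project policy should name the form it targets rather than rely on inspection alone.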
>> >> >
>> >> > Why hasn't this issue surfaced before? Well, one reason might be that
>> >> > oXygen (like e.g. JEdit) surreptitiously converts all your input to NFC.
>> >> > Though I think all text people resent such automatic conversions,
>> >> > Unicode allows them, mainly with the (dated) motivation of achieving
>> >> > backwards compatibility with software that is not Unicode-aware; it
>> >> > is not necessarily what you want, and oXygen supplies no way to
>> >> > handle this basic text property. In many TEI projects, using NF(K)D
>> >> > would be a sensible approach (string comparisons and regex searches are
>> >> > easier and faster), though the Guidelines (with some reserve) recommend
>> >> > otherwise:
>> >> >
>> >> > "the Normalization Form C (NFC) seems to be most appropriate for text
>> >> > encoding projects".
>> >> >
>> >> > Jens
>> >> >
>> >> ------------------------------------------------------------------------
>>
