Jens, my working code is below. Hope this saves you (and others) some time
hunting through the Unicode charts.
jk

=======
Schematron rule (with companion xsl function) to locate and identify all
non-normalized Unicode characters, and to offer a quick fix to normalize
them. Code must be part of a valid Schematron file that uses an XSLT 2.0
query binding (queryBinding="xslt2"). The prefix sqf must be bound to the
namespace http://www.schematron-quickfix.com/validator/process, and the
prefix xs to http://www.w3.org/2001/XMLSchema. The prefix func can be
bound to any namespace.
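
As a rough sketch of the enclosing file this describes, the skeleton below shows one way the namespace bindings might be declared. The func namespace URI is a made-up placeholder, and the XSLT 2.0 query binding is an assumption based on the XPath 2.0 functions the rule uses (normalize-unicode(), tokenize(), replace()). The comments mark where the rule and function that follow would go.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: http://example.org/ns/func is a placeholder namespace. -->
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
   xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:func="http://example.org/ns/func"
   queryBinding="xslt2">
   <!-- Makes the func prefix visible to XPath expressions inside rules -->
   <ns prefix="func" uri="http://example.org/ns/func"/>
   <pattern>
      <!-- the <rule context="text()"> ... </rule> goes here -->
   </pattern>
   <!-- the <xsl:function name="func:dec-to-hex"> ... </xsl:function> goes here -->
</schema>
```

The func prefix is declared twice on purpose: the xmlns:func attribute serves the name of the xsl:function, while the ns element serves the XPath expressions in the rule.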

   <rule context="text()">
      <let name="this-raw-char-seq"
         value="tokenize(replace(.,'(.)','$1 '),' ')"/>
      <let name="this-nfc-char-seq"
         value="tokenize(replace(normalize-unicode(.),'(.)','$1 '),' ')"/>
      <let name="this-non-nfc-seq"
         value="distinct-values($this-raw-char-seq[not(.=$this-nfc-char-seq)])"/>
      <assert test=". = normalize-unicode(.)"
         sqf:fix="normalize-unicode">All text needs to be normalized (NFC).
         Errors: <value-of
            select="for $i in $this-non-nfc-seq return concat($i,' (U+',
            func:dec-to-hex(string-to-codepoints($i)),') at ',
            string-join(for $j in index-of($this-raw-char-seq,$i)
            return string($j),' ')),' '"
         /></assert>
      <sqf:fix id="normalize-unicode">
         <sqf:description>
            <sqf:title>Convert to normalized (NFC) Unicode</sqf:title>
         </sqf:description>
         <sqf:stringReplace match="." regex=".+"><value-of
               select="normalize-unicode(.)"/></sqf:stringReplace>
      </sqf:fix>
   </rule>

   <xsl:function name="func:dec-to-hex" as="xs:string"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Input: Non-negative integer. Output: Hexadecimal equivalent string. -->
      <xsl:param name="in" as="xs:integer"/>
      <!-- Recurse on the higher-order digits, then append the lowest digit -->
      <xsl:sequence
         select="if ($in eq 0)
         then '0'
         else
         concat(if ($in ge 16)
         then func:dec-to-hex($in idiv 16)
         else '',
         substring('0123456789ABCDEF',
         ($in mod 16) + 1, 1))"
      />
   </xsl:function>
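
To check the hexadecimal conversion outside a Schematron run, a small standalone stylesheet along these lines might help. It is a sketch under assumptions: the func namespace URI is a placeholder, the template name "main" is arbitrary, and the stylesheet carries its own copy of the conversion function so that it runs by itself in any XSLT 2.0 processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: http://example.org/ns/func is a placeholder namespace. -->
<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   xmlns:func="http://example.org/ns/func"
   exclude-result-prefixes="xs func">
   <xsl:output method="text"/>

   <!-- Input: non-negative integer. Output: hexadecimal string. -->
   <xsl:function name="func:dec-to-hex" as="xs:string">
      <xsl:param name="in" as="xs:integer"/>
      <xsl:sequence
         select="if ($in eq 0) then '0'
                 else concat(
                    if ($in ge 16) then func:dec-to-hex($in idiv 16) else '',
                    substring('0123456789ABCDEF', ($in mod 16) + 1, 1))"/>
   </xsl:function>

   <!-- Start the transformation at this named template; no input document is needed -->
   <xsl:template name="main">
      <xsl:for-each select="(0, 15, 16, 197, 778, 64260)">
         <xsl:value-of select="concat(., ' = ', func:dec-to-hex(.), '&#10;')"/>
      </xsl:for-each>
   </xsl:template>
</xsl:stylesheet>
```

The sample values cover zero, the boundary between one and two hex digits, and three code points relevant to this thread: 197 (U+C5, precomposed Å), 778 (U+30A, combining ring above) and 64260 (U+FB04, the ffl ligature). The expected lines are 0 = 0, 15 = F, 16 = 10, 197 = C5, 778 = 30A and 64260 = FB04.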



From:  Jens Østergaard Petersen <[log in to unmask]>
Date:  Wed, 13 May 2015 08:10:54 +0200
To:  <[log in to unmask]>, <Kalvesmaki>, Joel <[log in to unmask]>
Cc:  <[log in to unmask]>
Subject:  Re: oXygen support for Schematron Quick Fixes


This sounds very interesting. Could you publish your QuickFixes for
normalising Unicode?

In this connection note also that oXygen 17 has added the possibility to
search for canonically equivalent strings. This allows one to search for
precomposed and decomposed characters at the same time, but as far as I
can see, it does not include compatibility distinctions, so (in the terms
of our earlier discussion), it works with "Åström"/"Åström", but not with
"woffle"/"woffle".
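
The canonical/compatibility distinction can also be checked in Schematron itself. The pattern below is only a sketch: it assumes an XSLT 2.0 query binding and a processor that supports the NFKC form, which XPath 2.0 makes optional (only NFC is required). It reports any text node whose NFC and NFKC forms differ, for instance one containing a ligature such as U+FB04 in place of the separate letters f, f, l.

```xml
<!-- Sketch only; assumes queryBinding="xslt2" and NFKC support in the
     processor. Keep it in its own pattern, because within a single
     pattern only the first rule matching a given node is applied. -->
<pattern xmlns="http://purl.oclc.org/dsdl/schematron">
   <rule context="text()">
      <!-- NFC keeps compatibility characters; NFKC replaces them, so the
           two forms differ exactly when such characters are present -->
      <report test="normalize-unicode(., 'NFC')
                    != normalize-unicode(., 'NFKC')">Text contains
         compatibility characters (its NFC and NFKC forms differ).</report>
   </rule>
</pattern>
```

Comparing the two normalized forms, not the raw text, keeps this check independent of whether the text has already been converted to NFC.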

Jens
 
On 13 May 2015 at 02:16:13, Kalvesmaki, Joel ([log in to unmask]) wrote:
 
TEI community,


Since it hasn't yet been mentioned, I thought it worthwhile to highly
recommend oXygen 17's new feature providing support for Schematron Quick
Fixes (http://www.schematron-quickfix.com).


Some of you may recall an earlier discussion on normalizing Unicode, and
the Schematron pattern I offered. That was fine insofar as it identified
and located the problem, but it offered no fixes. SQF does just that.
Tonight I put together a very simple but powerful SQF that allows a user
with two mouse clicks in oXygen to change the errant text of an element
into normalized Unicode. I wrote three more SQF patterns to handle edits
that previously took around a minute per change (to look up a value,
copy, return to where I was originally, and paste it). I think the potential
benefit to TEI projects, especially in communicating choices and options
to project participants, is quite impressive.


Kudos to Syncro Soft! (See their demo video here:
http://www.oxygenxml.com/demo/Schematron_Quick_Fixes.html )


Best wishes,


jk

--

Joel Kalvesmaki

Editor in Byzantine Studies

Dumbarton Oaks

202 339 6435