On Mon, 15 Apr 2002 Wendell Piez <[log in to unmask]> wrote:
> At 05:44 AM 4/13/2002, Sebastian Rahtz wrote:
> >yes, but going through XSLT is a high price to pay:
[...]
> This is a fair warning. Having to work around or make up for the lapses
> (listed) would argue against the approach if you have to do this very
> often. But for a one-off, it may be less pain than engineering a
> "proper" solution.
>
> An interesting case where XSLT is the hack.

Here is a Perl script (below) that converts UTF-8 encoded
characters to XML character references.
Perl is available for virtually all platforms.
The script is a slightly enhanced version of one found
on Roman Czyborra's site (http://czyborra.com/utf/).
You can choose between hexadecimal and decimal references
by setting $format.

Regards,
Hans

Hans van Mourik <[log in to unmask]>
Digitale Bibliotheek voor de Nederlandse Letteren <http://www.dbnl.org/>

#!/usr/bin/perl -p
# Convert UTF-8 encoding to XML character references

## Output Hexadecimal references (like &#x20AC; for Euro)
$format='&#x%04X;';

## Output Decimal references (like &#8364; for Euro)
#  $format='&#%04d;';

# Two-byte sequences (110xxxxx 10xxxxxx): U+0080 to U+07FF
s{([\xC0-\xDF])([\x80-\xBF])}
 {sprintf($format,
  unpack("c",$1)<<6&0x07C0|unpack("c",$2)&0x003F)
  }ge;

# Three-byte sequences (1110xxxx 10xxxxxx 10xxxxxx): U+0800 to U+FFFF
s{([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])}
 {sprintf($format,
  unpack("c",$1)<<12&0xF000|unpack("c",$2)<<6&0x0FC0|unpack("c",$3)&0x003F)
  }ge;

# Four-byte sequences (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx): U+10000 and up
s{([\xF0-\xF7])([\x80-\xBF])([\x80-\xBF])([\x80-\xBF])}
 {sprintf($format,
  unpack("c",$1)<<18&0x1C0000|unpack("c",$2)<<12&0x3F000|
  unpack("c",$3)<<6&0x0FC0|unpack("c",$4)&0x003F)
  }ge;
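
To make the shifts and masks in the substitutions easier to follow, here is a minimal sketch that decodes a single UTF-8 byte sequence into its code point and prints it in both reference formats. The helper name utf8_bytes_to_codepoint is hypothetical and not part of the script above. It assumes well-formed input and does no validation (no checks for overlong forms or stray continuation bytes).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: takes the raw bytes of ONE
# well-formed UTF-8 sequence and returns the code point as an integer.
sub utf8_bytes_to_codepoint {
    my @b = unpack("C*", $_[0]);    # unsigned byte values, 0..255
    my $cp;
    if (@b == 1) {                  # 0xxxxxxx: plain ASCII
        $cp = $b[0];
    } elsif (@b == 2) {             # 110xxxxx 10xxxxxx
        $cp = (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
    } elsif (@b == 3) {             # 1110xxxx 10xxxxxx 10xxxxxx
        $cp = (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
    } else {                        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        $cp = (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
            | (($b[2] & 0x3F) << 6)  |  ($b[3] & 0x3F);
    }
    return $cp;
}

my $euro = "\xE2\x82\xAC";          # UTF-8 bytes of the Euro sign
printf "&#x%04X;\n", utf8_bytes_to_codepoint($euro);   # prints &#x20AC;
printf "&#%04d;\n",  utf8_bytes_to_codepoint($euro);   # prints &#8364;
```

For the Euro sign the three bytes contribute 0x2000, 0x0080 and 0x002C, which combine to 0x20AC (8364 decimal), matching the examples in the format comments of the script.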

#########################################################################