All lossless transcoders in spe...
It turned out that Ralph didn't actually need a lossless transcoder right
now after all, but his query led me to knuckle down to getting the xml-tcs
utility compiled on Windows as well as Linux, so anyone who wants a lossless
transcoder that works "off the shelf" is invited to contact me (see end).
A quick demo to illustrate the sort of thing it can do, so people can decide
whether they want it or not:
The source here is a utf-8-encoded document fragment containing instances of
c cedilla (not in US-ASCII, but present in ISO-8859-1) and an apostrophe,
encoded (please don't ask why!!) as a right single quotation mark (a
character absent from both the US-ASCII and the ISO-8859-1 character sets).
Lossless transcoding from utf-8 of a dictionary sub-entry for par de ça (in
Anglo-Norman) looks like this:
example 1: transcode to US-ASCII using xml NCR notation
[all characters mentioned above are rewritten as numeric character
references]
# ./tcs -n xml -f utf-8 -t us-ascii demo.xml
<form type="locution">
<orth>par de &#xE7;a</orth>
</form>
<sense>
<trans>similarly</trans>
<eg>
<cit>
<quote>l&#x2019;em prent de mauveis dettur aveine pur furment, ausi par de
&#xE7;a nostre seignur prent de nus noz lermes pur soen precious sanc</quote>
<bibl id="AND-201-DFA90067">67.24</bibl>
</cit>
</eg>
</sense>
example 2: transcode to ISO-8859-1
[the c cedillas, being in the target character set also, are output in the
correct binary representation for the target encoding: only the "apostrophe"
is rewritten as a numeric character reference]
# ./tcs -n xml -f utf-8 -t iso-8859-1 demo.xml
<form type="locution">
<orth>par de ça</orth>
</form>
<sense>
<trans>similarly</trans>
<eg>
<cit>
<quote>l&#x2019;em prent de mauveis dettur aveine pur furment, ausi par de
ça nostre seignur prent de nus noz lermes pur soen precious sanc</quote>
<bibl id="AND-201-DFA90067">67.24</bibl>
</cit>
</eg>
</sense>
example 3: transcode to Windows cp1251
[not recommended in normal practice, but since this character set has a
right single quotation mark but no c cedilla, the converse effect of
example 2 can be observed]
# ./tcs -n xml -f utf-8 -t cp1251 demo.xml
<form type="locution">
<orth>par de &#xE7;a</orth>
</form>
<sense>
<trans>similarly</trans>
<eg>
<cit>
<quote>l’em prent de mauveis dettur aveine pur furment, ausi par de
&#xE7;a nostre seignur prent de nus noz lermes pur soen precious sanc</quote>
<bibl id="AND-201-DFA90067">67.24</bibl>
</cit>
</eg>
</sense>
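For anyone who wants to see the principle before asking for the binaries, the behaviour in the three examples can be approximated with a minimal Python sketch. This is not xml-tcs and says nothing about how that utility is implemented: it relies on Python's standard "xmlcharrefreplace" error handler, which writes decimal numeric character references (the notation xml-tcs emits may differ). The function name transcode_lossless and the shortened sample string are illustrative assumptions of mine.

```python
import html


def transcode_lossless(text: str, target: str) -> bytes:
    """Encode text in the target charset; any character the charset
    lacks is written as an XML numeric character reference instead
    of being dropped or replaced by '?'."""
    return text.encode(target, errors="xmlcharrefreplace")


# Shortened stand-in for the dictionary fragment: one right single
# quotation mark (U+2019) and one c cedilla (U+00E7).
sample = "l\u2019em par de \u00e7a"

for target in ("us-ascii", "iso-8859-1", "cp1251"):
    out = transcode_lossless(sample, target)
    print(f"{target:11} {out!r}")
    # Lossless check: decoding the bytes and resolving the references
    # must give back the original string.
    assert html.unescape(out.decode(target)) == sample

# us-ascii    b'l&#8217;em par de &#231;a'
# iso-8859-1  b'l&#8217;em par de \xe7a'
# cp1251      b'l\x92em par de &#231;a'
```

The round-trip assertion is the point of the exercise: whichever target is chosen, nothing in the source is lost, only re-expressed.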
As I explained in an earlier posting, the unclear licensing position means
this utility should not be "distributed" via a generally-available download.
But I see no problem about making it available for academic purposes to
individual scholars who care to email me. I have binaries for both Linux
(libc6) and Win32 (NT, W2K, XP and probably Win9x, though the latter not
tested) and the source. The Win32 version is compiled against the standard
MS libraries (no Cygwin or suchlike needed). All incorporate a small bugfix
of mine to Rick Jelliffe's original xml-tcs patches (which before this fix
caused the lead byte of the original utf-8 sequence to be emitted after a
transcoded NCR).
Michael
---------------------------------------------------------
Michael Beddow http://www.mbeddow.net/
XML and the Humanities page: http://xml.lexilog.org.uk/
The Anglo-Norman Dictionary http://anglo-norman.net/
---------------------------------------------------------