I'm not sure that Sebastian's suggestion of iconv would meet the original
request for "off the shelf" lossless transcoding, though it could doubtless
be made to do the job. So I offer a couple of amplifications, pending the
answers to my questions (which Christian has now in effect repeated):
A What this is all about
(anyone puzzled by terms like "abstract character", "code-point" etc may
find my earlier posting on NO-BREAK-SPACE and utf-8 useful)
1) We are here talking about a character transcoder. Most of what is to be
found on the WWW has to do with transcoders for binary data (especially
graphics) or communication protocols. References to character transcoders
consequently tend to be swamped.
2) "Lossless" character transcoding may need a word of explanation for those
who have encountered the term in the domain of compression technologies,
where it means something rather different. When going from one character
encoding to another, it sometimes happens that there is no code-point in our
target encoding for an abstract character in our input encoding. Some
transcoding utilities respond to this problem by substituting one and the
same output marker for each and every input code-point they cannot transcode.
Often this marker is a pair of question marks or suchlike. Plainly, wherever
this happens, there is information loss, since all the marker shows us is
*that* an untranscodable character was present, not *which* character it was.
3) The appearance of a question mark, a rectangle or suchlike when a
file is displayed after transcoding does not necessarily mean that there has
been transcoding loss. Some rendering systems output such placeholders
simply because they do not have a glyph to represent the abstract character
concerned. In such a case, the information may well have been correctly
transcoded in the underlying file (and so will still be processed correctly)
but is being lost at the rendering stage. This can be inconvenient or
cosmetically unappealing, but it does not signal damage to data integrity.
4) Because XML allows any abstract character to be represented by its
numerical character reference in Unicode using only characters in the 7-bit
ASCII range, "lossless" transcoding of XML documents is feasible even when
the target encoding does not contain a code-point corresponding to an
abstract character in the input. Where such an input character is
encountered, a lossless transcoder outputs a numeric character reference, so
that the source abstract character is still uniquely identified. If a
document containing such a numeric character reference to an abstract
character outside the document's character set is fed to e.g. a browser
compliant with HTML 3 or later, the browser will interpret the numeric
reference as a Unicode code point, and display the correct character if
the system puts a suitable glyph at its disposal.
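To make the difference between points 2) and 4) concrete, here is a minimal
sketch in Python, chosen purely for illustration. It uses the standard error
handlers of Python's built-in codecs, not tcs or iconv, and the sample string
is invented. ISO-8859-1 is assumed as the target encoding because it has a
code-point for "ï" but none for "œ" (U+0153).

```python
import html

# Invented sample: "ï" (U+00EF) exists in ISO-8859-1, "œ" (U+0153) does not.
text = "na\u00efve \u0153uvre"

# Lossy: every untranscodable character becomes the same "?" marker.
lossy = text.encode("iso-8859-1", errors="replace")

# Lossless for XML purposes: untranscodable characters become
# decimal numeric character references.
lossless = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(lossy)     # b'na\xefve ?uvre'
print(lossless)  # b'na\xefve &#339;uvre'

# Resolving the reference recovers the original abstract character.
restored = html.unescape(lossless.decode("iso-8859-1"))
print(restored == text)  # True
```

html.unescape stands in here for what an XML parser or browser does when it
resolves the reference. It also resolves named entities such as &amp;, so for
real documents an XML parser would be the safer tool.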
B The xml-tcs transcoder.
tcs was a utility offering "lossless" transcoding made available by Bell
Labs as part of its Plan9 Unix distribution. In its original Bell Labs form
(dating from 1995 with some revisions in 1997) tcs support for Unicode (1
only) was limited, and clearly it predated XML. In January 1999, Rick
Jelliffe made significant modifications to this utility; above all he added
the facility to generate hexadecimal renderings of "missing" characters with
a variety of user-specifiable delimiters, including those needed to create
numeric character references. A few months later he added a decimal
representation facility to make the utility more usable with SGML also.
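The idea of rendering "missing" characters in hexadecimal or decimal between
user-specifiable delimiters can be sketched as follows. This is not the
xml-tcs code, nor does it reproduce its command-line interface; it is an
illustrative Python sketch built on the standard codecs.register_error hook,
and the names make_handler, "hexref" and "decref" are hypothetical.
ISO-8859-1 is again only an assumed target encoding.

```python
import codecs

def make_handler(prefix, suffix, fmt):
    """Build an encode-error handler that writes each untranscodable
    character as prefix + code-point number + suffix.
    fmt is a format spec: "X" for hexadecimal, "d" for decimal."""
    def handler(err):
        if not isinstance(err, UnicodeEncodeError):
            raise err
        bad = err.object[err.start:err.end]
        rendered = "".join(f"{prefix}{ord(ch):{fmt}}{suffix}" for ch in bad)
        # Return the replacement text and the position at which to resume.
        return rendered, err.end
    return handler

# Hypothetical handler names; the delimiters are those of XML/SGML
# numeric character references.
codecs.register_error("hexref", make_handler("&#x", ";", "X"))
codecs.register_error("decref", make_handler("&#", ";", "d"))

text = "\u0153uvre"  # "œ" (U+0153) has no code-point in ISO-8859-1
hex_out = text.encode("iso-8859-1", errors="hexref")
dec_out = text.encode("iso-8859-1", errors="decref")

print(hex_out)  # b'&#x153;uvre'
print(dec_out)  # b'&#339;uvre'
```

Passing different prefix and suffix strings to make_handler gives other
delimiter conventions; only the two shown yield numeric character references.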
There was, however, a snag. It was (and still is) unclear whether third
parties had the right to distribute either the Bell Labs source for tcs or
any modified source or binaries of their own that incorporated the original
Bell Labs code. Several institutions, firms and individuals specifically
enquired about this, but no response was ever received. That accounts for
the less than ideal way in which the Academia Sinica attempts to make the
modified tcs (xml-tcs) available. Since it is not clear that they or anyone
else are entitled to distribute the original source from their site, they
simply offer a link to its location at Bell Labs. Users are then supposed to
fetch the "patch" file from the AS site (a list of the changes RJ made to
the original source, in a format that the "patch" utility can automatically
apply to the source files, thus creating source for the modified utility).
Alas, the vital link to the source has now been dead for some time, because
the Plan9 distribution to which it once pointed is no longer available in
the same form. Assuming that the Bell Labs source can be tracked down
elsewhere, the problems are still not over. The mechanism for applying
patches, though essentially simple, must be one of the worst-documented
features of Unix systems. And even in the hands of someone who knows that
procedure, the patch file supplied for xml-tcs is in a format that will
produce a string of disheartening error messages if fed to a recent version
of the patch utility with the default settings.
I do myself have the original Bell Labs source, the patched xml-tcs source,
and a working i-686 binary (compiled against libc6) for Linux. I expect that
this stuff could also be compiled on Windows with the appropriate Cygwin/GNU
kit (life is too short to even attempt compiling it under MS C). I am not
sure about the Mac: the original tcs is in the Darwin-GNU distribution for
the Mac in binary form, but I can't find how to obtain the source from
Darwin's SourceForge area (presumably it's in their CVS somewhere).
For the reasons indicated above, I can't simply put these things on one of
my sites for anonymous download, since that would certainly amount to
"distribution". I am, however, pretty confident that it would fall under
fair dealing if I were to provide individual fellow academics with this
material if they specifically requested it for purposes of scholarship or
research, and I will happily respond to any emails that meet those
conditions. I can't, however, offer support for the utility, which is not
without some problems.
Michael Beddow http://www.mbeddow.net/
XML and the Humanities page: http://xml.lexilog.org.uk/
The Anglo-Norman Dictionary http://anglo-norman.net/