> My answer to this is: convert [it] to Unicode.
Indeed. It can't be said loudly, insistently and universally enough that XML
means Unicode. There may be helpful parsers that accept other encodings as
input and helpful serialisers that will transcode back to non-Unicode
encodings before human eyes get to see the output again, but under the hood,
all conformant XML systems use Unicode internally, and if the abstract
characters in your documents don't have a Unicode mapping you are in trouble
(though the revised Guidelines I've already mentioned will help anywhere
between a little and a whole lot, depending on your specific problem).
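
To make the "helpful parser, helpful serialiser, Unicode in the middle" point concrete, here is a minimal sketch using Python's standard-library ElementTree. The sample document and the variable names are my own invention for illustration; nothing here is specific to any particular XML toolchain.

```python
import xml.etree.ElementTree as ET

# Input bytes are ISO-8859-1, and the XML declaration says so.
raw = '<?xml version="1.0" encoding="ISO-8859-1"?><w>caf\u00e9</w>'.encode("iso-8859-1")

# The parser transcodes on the way in: what we get back is a Unicode string.
root = ET.fromstring(raw)
text = root.text
print(repr(text), [hex(ord(c)) for c in text])

# The serialiser can transcode on the way out; the in-memory tree stays Unicode.
out = ET.tostring(root, encoding="iso-8859-1")
print(out)
```

The non-Unicode encoding only ever exists at the edges, as bytes; everything the application sees in between is Unicode code points.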
For those who know what SDATA entities are and how they can help with exotic
writing systems in SGML, *there is no SDATA support in XML*. Systems that
claim to be XML conformant and which support SDATA are NOT in fact XML
conformant systems. Useful maybe, but not XML, even if it says so on the
label. Use them at the peril of seriously damaging the exchangeability of
your documents in an XML world.
> Some people
> say that they love Unicode in principle but that the world's not ready
> for it yet
Indeed they do say that, ad nauseam, and generally they really mean that
*they* aren't ready for it, e.g. because they "have" to use Windows 98, or
because they think they ought to be able to swap UTF-16 encoded text between
Macs and PCs without paying attention to byte-order issues, or because they
do their software development under Borland Turbo Pascal, or because they
need to encode the Tokyo telephone directory. OK, the last group (only) have
a point, but I'm not going to get involved in Japanese cultural politics.
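
For anyone unsure what "paying attention to byte-order issues" amounts to in practice, here is a small sketch in Python. The choice of sample character and the variable names are mine, purely for illustration.

```python
import codecs

text = "\u0f40"  # TIBETAN LETTER KA

le = text.encode("utf-16-le")  # low byte first
be = text.encode("utf-16-be")  # high byte first

# No byte-order mark and the wrong guess: no error is raised,
# you just silently get a different, unrelated character.
wrong = le.decode("utf-16-be")
print(hex(ord(wrong)))

# With a byte-order mark in front, the generic "utf-16" decoder
# works out the byte order for itself and strips the mark.
with_bom = codecs.BOM_UTF16_BE + be
roundtrip = with_bom.decode("utf-16")
print(roundtrip == text)
```

Either write the byte-order mark or agree on the byte order out of band; what you can't do is ignore the question.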
Aagh! I Googled for Sambhota software, and it does indeed seem to be a
proprietary system. It's only 7.55 am here, so I think I'll go back to bed.
While I recuperate in resumed slumbers, behold this mind-boggling
proclamation from the bottom of the home page of the outfit responsible:
Preparing for the Future
Nitartha is following developments in language technology to prepare Tibetan
language publishing for future technology adaptations, such as Unicode.
I'm not sure if signals from the Universe where most of us live can actually
reach such a highly-warped corner of spacetime (apparently in that backwater
called NYC -- those Mutant Ninja Turtles sure did a lot of damage, and not
just to parents' pockets), but if you're reading this in Nitartha Land,
what you think is the future is our present and past, and has been for a
good few years now.
> Unicode will make your life easier for text-processing in the short
> term, and in the long term you'll save yourself a lot of work once the
> world *is* ready for it, and everyone will praise you for your wise
> foresight. And from a glance at Sambhota encoding it appears
> straightforward enough that a home-brewed transcoder will do the trick
> fairly painlessly.
> But I can say that because I haven't done a lot of
> transliterating/transcoding lately -- though when I did, it was
> reasonably painful.
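
A home-brewed transcoder of the simplest kind is just a lookup table plus a loop, roughly as sketched below in Python. Please note the assumptions: the byte values on the left of the table are invented placeholders and are NOT the real Sambhota encoding, which I have not examined; only the Unicode code points on the right are real. A real legacy font encoding may well need multi-byte or context-sensitive rules rather than a one-to-one table.

```python
# HYPOTHETICAL mapping: the legacy byte values are made up for illustration.
LEGACY_TO_UNICODE = {
    0x41: "\u0f40",  # TIBETAN LETTER KA
    0x42: "\u0f41",  # TIBETAN LETTER KHA
    0x43: "\u0f42",  # TIBETAN LETTER GA
}


def transcode(data: bytes, table=LEGACY_TO_UNICODE) -> str:
    """Map each legacy byte to its Unicode string, failing loudly on gaps."""
    out = []
    for offset, byte in enumerate(data):
        try:
            out.append(table[byte])
        except KeyError:
            # Refuse to guess: an unmapped byte is a hole in the table.
            raise ValueError(
                f"no mapping for byte 0x{byte:02x} at offset {offset}"
            ) from None
    return "".join(out)


print(transcode(b"\x41\x43"))
```

Failing on unmapped bytes rather than passing them through is deliberate: it is the quickest way to find out how incomplete your table is.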
> I've found
> which appears to describe something that doesn't exist yet.
> http://iris.lib.virginia.edu/tibet/tools/conv.html mentions
> http://www.babelstone.co.uk/Software/BabelPad.html (page's BabelPad
> link doesn't work), which can convert Extended Wylie to Sambhota, so
> if you can get your Sambhota to give you Wylie you may be able to do
> your conversion in two hops. But I've never done Tibetan and haven't
> tried any of this software.
> I've packaged the most recent milestone of my own indic-transcoding Perl
> module here:
> (Obliterator = "Ob"-ject-based trans-"literator"). It's licensed
> under the redistributor's choice of the GNU GPL or the Perl Artistic
> License. It's cranky, and stupid in some places, but it has worked
> for my needs of transcoding mixed roman and ISCII to UTF-8 (though for
> most uses I must recommend IBM's ICU/uconv for this purpose instead of
> Obliterator). As it is it won't help you with Sambhota in the least,
> but there it is. Contact me offlist (and be patient) if you want to
> go through its vagaries.
> I hope this helps,