TEI-L@LISTSERV.BROWN.EDU

Subject: Unicode 1.0
From: [log in to unmask]
Reply-To: Text Encoding Initiative public discussion list <[log in to unmask]>
Date: Tue, 5 Feb 91 10:56:20 GMT

I recently requested a copy of the draft spec of Unicode 1.0 character
encoding.  Although not able to give it all the time I'd have liked, my
brief look does raise a number of comments.  I'm grateful to have the
opportunity to plug my comments into the general discussion (via TEI,
HUMANIST and the UNICODE team themselves: [log in to unmask]).
 
(a)  There are a number of significant typos; is anyone keeping a master
record of these?
 
(b)  Robin Cover  <ZRCC1001@SMUVM1> has raised the question why there are
not separate encodings for Hebrew SIN and SHIN.  They are certainly at least
as distinct as, say, LATIN E followed by ACUTE and LATIN E ACUTE.  I take
it that the reason the latter case has two encodings is because of
previous ISO encodings; but since those are in any case ASCII encodings
(and Unicode is intended as a replacement for ASCII) how relevant is that?
The question also raises a more fundamental problem in my mind.  There
are a number of situations where a glyph (or conglomerate of glyphs) can
reasonably be encoded in alternative ways; HYPHEN (U+2010=U+002d) would be
a case in point.  We are told that some of these redundancies are there so
that natural pairing can be used "if desired" (page 6).  However, these coded
pairs are not consistently undertaken (eg CAPITAL DOTTED I).  But what worries
me is that two encodings of an identical text may thus turn out to be very
different; and for anyone using computer comparison of texts this could be
quite problematic.  So over against those who complained that, eg, separate
codings for GREEK ALPHA+GRAVE are not available I would voice the opposite
disquiet:  the encodings are too comprehensive.  If ALL accentuation was
added as a separate code I think comparison of texts would be easier.
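
The comparison problem can be sketched in code. This is only an illustration, assuming a current Python whose standard `unicodedata` module implements the normalization forms that later versions of the Unicode standard defined (they are not part of the 1.0 draft discussed here). The helper name `same_text` is hypothetical.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one code)
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT (two codes)

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after decomposing both (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A naive code-by-code comparison treats the two encodings as different texts.
raw_equal = precomposed == decomposed

# Decomposing every accent into a separate code first makes them agree.
normalized_equal = same_text(precomposed, decomposed)

print(raw_equal, normalized_equal)
```

Decomposing before comparing is close in spirit to the suggestion above that all accentuation be carried as separate codes.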
 
The ordering of the accents would then of course be important, and I don't
think the algorithm given (centre-out) is terribly helpful; which is
nearest the centre in GREEK ROUGH BREATHING+ACUTE+IOTA SUBSCRIPT?
Wouldn't an additional algorithm (clockwise starting at twelve o'clock)
be useful?
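
The ordering question can be explored as follows. This is a hedged sketch, assuming the character database shipped with a current Python, in which each combining mark carries a numeric "combining class" and decomposition (NFD) sorts marks by that class; this is the later standard's mechanism, not the centre-out rule of the draft.

```python
import unicodedata

ROUGH = "\u0314"     # COMBINING REVERSED COMMA ABOVE (rough breathing)
ACUTE = "\u0301"     # COMBINING ACUTE ACCENT
IOTA_SUB = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI (iota subscript)
ALPHA = "\u03b1"     # GREEK SMALL LETTER ALPHA

# Combining class of each mark, keyed by its character name.
classes = {unicodedata.name(c): unicodedata.combining(c)
           for c in (ROUGH, ACUTE, IOTA_SUB)}

# Marks of different classes are put into one fixed order ...
a = unicodedata.normalize("NFD", ALPHA + IOTA_SUB + ROUGH + ACUTE)
b = unicodedata.normalize("NFD", ALPHA + ROUGH + ACUTE + IOTA_SUB)

# ... but marks of the same class keep the order in which they were typed.
c = unicodedata.normalize("NFD", ALPHA + ACUTE + ROUGH + IOTA_SUB)

print(classes)
print(a == b, b == c)
```

Under these assumptions the iota subscript is sorted after the two marks above the letter, while breathing and acute are left in the order in which they were entered, so their relative order still matters for comparison.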
 
(c) While we're on Greek, I couldn't find a Greek semicolon (raised dot).
Maybe I just didn't look hard enough, but full punctuation would be useful.
But see my comment (e) below.
I also failed to locate LATIN CAPITAL LETTER WYNN.
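
Whether a given character is present can be checked by name. A minimal sketch, assuming the character database of a current Python rather than the 1.0 draft; the names used are those of the current database and may not match the draft.

```python
import unicodedata

# Look the characters up by their names in the current database.
ano_teleia = unicodedata.lookup("GREEK ANO TELEIA")     # the raised dot
question = unicodedata.lookup("GREEK QUESTION MARK")    # looks like ';'
wynn = unicodedata.lookup("LATIN CAPITAL LETTER WYNN")

for ch in (ano_teleia, question, wynn):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The Greek question mark is treated as equivalent to the ordinary semicolon.
folded = unicodedata.normalize("NFC", question)
```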
 
(d)  In general I approve of the policy that by adding the special Coptic
forms to the Greek alphabet one can generate Coptic text, with hard copy
generated by choosing an appropriate font.  (And mutatis mutandis for
other languages.)  However, there are some drawbacks to this policy; I
foresee the following problems:
  (i)  It may be necessary to indicate to someone (if only the compositor)
where to change font.  Could a coding for change-of-language be incorporated?
  (ii) In some Greek texts it may be important to indicate where ligatures
are used; there seems no way in this encoding to distinguish between
GREEK KAPPA + GREEK ALPHA + GREEK IOTA on the one hand and the ligature
which stood for "kai" on the other.  I am sometimes in the position of
needing to say (as indeed the authors of the manual were) something like
"There are three possible forms of LATIN SMALL LETTER G CEDILLA (U+0123)
and they look like ..."  How could I encode my ellipsis?  Could the whole
of the manual as printed be sensibly encoded in Unicode?  Oddly, there are
some forms which are exclusively graphic variants (ie one would not find
them together in a "natural" text) which do attract separate codings;
GREEK SMALL LETTER SCRIPT THETA for instance.  Perhaps consistency is
unattainable, but to me it is a desideratum.
 
(e) The encoding of special numerals seemed odd.  As well as a select
group of fractions (thirds, quarters and eighths, I think) there is the
top half of fractional 1/nnn (U+215f).  How is its use envisaged?  Wouldn't
a generalised "fractional line" be better (let's call it U+nnnn) so that
<number string1>nnnn<number string2> is to be interpreted as a fraction?
 
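The generalised fraction line proposed above can be sketched like this. It assumes the character database of a current Python, which contains a FRACTION SLASH at U+2044; the helper `parse_fraction` is hypothetical and only illustrates the "<number string1>nnnn<number string2>" reading.

```python
import unicodedata
from fractions import Fraction

FRACTION_SLASH = "\u2044"  # FRACTION SLASH

def parse_fraction(text: str) -> Fraction:
    """Hypothetical helper: read '<digits>FRACTION SLASH<digits>' as a fraction."""
    numerator, denominator = text.split(FRACTION_SLASH)
    return Fraction(int(numerator), int(denominator))

# Numeric values recorded for two of the special codes.
one_eighth = unicodedata.numeric("\u215b")     # VULGAR FRACTION ONE EIGHTH
numerator_one = unicodedata.numeric("\u215f")  # FRACTION NUMERATOR ONE

# In the current database U+215F unfolds to digit one plus the fraction slash.
unfolded = unicodedata.normalize("NFKC", "\u215f")

# Any other fraction can then be written with ordinary digits.
generic = parse_fraction("3" + FRACTION_SLASH + "16")
print(one_eighth, numerator_one, generic)
```
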
Similarly, Roman 12 (XII) is encoded as U+216b, but 13 (XIII) must be
(presumably) U+2169 2162.  Why not a single code for "roman numbers follow
here:" (or just use ROMAN CAPITAL LETTER X &c)?
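
The Roman numeral case can be illustrated in the same way. A minimal sketch, assuming a current Python, whose character database records for each Roman numeral code a compatibility mapping to ordinary Latin capitals; this mapping belongs to later versions of the standard, not to the draft.

```python
import unicodedata

twelve = "\u216b"          # ROMAN NUMERAL TWELVE (a single code)
thirteen = "\u2169\u2162"  # ROMAN NUMERAL TEN + ROMAN NUMERAL THREE

# Compatibility normalization (NFKC) folds both to plain capital letters.
twelve_letters = unicodedata.normalize("NFKC", twelve)
thirteen_letters = unicodedata.normalize("NFKC", thirteen)

print(twelve_letters, thirteen_letters)
```

Folding in this way amounts to the second alternative above: just using the ordinary capital letters.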
 
If codes for general *modes* like "Greek font", "roman numeral", "fraction"
were included, then many ambiguities and problems could be reduced.  My
Greek semicolon, for instance, could be "GREEK FONT + ;"
 
This contribution could be better thought-out, but it was this or nothing.
If the latter seems preferable, please discard!
 
Sincerely,
Douglas de Lacey.
