TEI-L@LISTSERV.BROWN.EDU


Subject: hyphenation (again) proposals
From: Lou
Date: Wed, 24 Mar 2010 16:23:22 +0000

I'd like to get closure on the rather dispersed discussion we've been 
having here on the topic of the soft-hyphen character, how to handle 
hyphenation in old printed texts, and similar. I think some sort of a 
consensus does seem to have emerged, but I'd like to check that this is 
not just wishful thinking on my part. Here's a (rather longwinded) 
attempt at summarizing the consensus:

1. The Unicode soft-hyphen character &#xAD; (U+00AD), though useful as a 
means of indicating potential hyphenation points in a born-digital 
document, should *not* be relied on as a means of encoding actual 
hyphenation points in non-digital source texts. That should be 
accomplished by means of explicit XML markup.
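
As a minimal sketch of how a project might audit its transcriptions for 
this, the following Python locates any soft-hyphen characters in a string 
so that they can be reviewed and replaced by markup. The function name 
and the sample string are invented for illustration.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def soft_hyphen_offsets(text: str) -> list[int]:
    """Return the character offsets of every soft hyphen in text."""
    return [i for i, ch in enumerate(text) if ch == SOFT_HYPHEN]


# Hypothetical sample: "wonderful" with two potential hyphenation points.
sample = "won\u00adder\u00adful"
print(soft_hyphen_offsets(sample))  # [3, 7]
```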

2. An important requirement when source line-breaks are recorded in some 
way is to distinguish those which imply word-endings 
("word-breaking") from those which do not. There is a tacit assumption 
in XML that a newline character in text data is a kind of whitespace, 
and is therefore word breaking. The XML whitespace rules for mixed 
content are notoriously confusing on this issue, but in general it is 
*not* safe to assume that whitespace adjacent to markup tags is always 
going to be preserved. In particular, it is *not* safe to assume that an 
XML markup tag within a string of (non-whitespace) characters 
necessarily divides the string into distinct tokens.
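
To illustrate that last point, here is a sketch of one plausible (and 
quite common) processor behaviour: discard the tags, then split on 
whitespace. It uses Python's standard xml.etree.ElementTree; the function 
name and the un-namespaced <p> wrapper are assumptions made for the 
example, not anything the TEI prescribes.

```python
import xml.etree.ElementTree as ET


def whitespace_tokens(xml_fragment: str) -> list[str]:
    """Strip all tags, then split the remaining text on whitespace."""
    root = ET.fromstring(xml_fragment)
    return "".join(root.itertext()).split()


# The <lb/> tag sits inside the string but does not divide it.
print(whitespace_tokens("<p>wonder<lb/>ful</p>"))  # ['wonderful']
```

A processor of this kind sees a single token, whatever the encoder 
intended the tag to mean.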

3. The hyphen is used in two different ways in source texts: sometimes 
it forms part of a word, sometimes it indicates that -- contrary to what 
might be expected -- a word is not yet complete, but continues on the 
next line (or over the next page or column or other boundary). The 
former case cannot always reliably be distinguished from the latter, but 
most encoders would like some way of unambiguously recording the 
distinction when it *has* been made.

4. Orthography may also be affected by hyphenation. For example in 
Dutch, the word  "opaatje" if broken across the line would appear as 
"opa-tje" -- i.e. one of the "a"s would disappear.  It's not clear to 
what extent this is a matter of concern in the TEI community, but it 
clearly has to be dealt with.

5. In the response we received from the Unicode consortium about this 
issue (which I have put online at 
http://www.tei-c.org/Activities/Council/Working/Softhyphen.pdf if you've 
lost it) Eric Muller makes a distinction between "flowable" and "flowed" 
texts. He defines the former as "text independent of any realization 
into lines", and the latter as "actual realization of the text into 
lines" which is an intellectually coherent way of presenting the 
problem, though it does rather beg the big text-encoding question of 
what exactly is meant by "text independent of any realization".

I propose to sidestep that ontological debate by introducing the notion 
of tokenization (what I called "word-breaking") into the discussion, as 
above. In practice we all agree that the "text data" of which XML 
documents are composed is decomposable into smaller units called 
"words", even if those units are not explicitly indicated by the XML 
markup. Sometimes we need to be explicit about the tokenization rules we 
expect from a processor (in the absence of explicit tagging), whether 
these are being applied in the context of a search engine looking for 
individual things to index, or a display engine trying to make things 
look nice on a printed line.

6. So what should we do? I propose that the recommendation we should 
make (and which should find its way into the TEI Guidelines somewhere) 
is as follows:

* If you want to preserve the lineation of a source document, do so by 
means of the <lb> element. Similarly for pagination, using <pb>. Do not 
assume that the linefeeds etc. in your document will be preserved.

* If you want to delimit *all* the words within your document do so 
using the explicit <w> element. There is however no point in using this 
element for just one or two problematic cases.

* If the words are not delimited by <w> elements,  a processor can 
decide for itself how to identify their boundaries, but the usual 
default (which the TEI therefore expects) is that whitespace characters, 
including newlines in the source, will indicate the end of a token.

* When, therefore, a newline in the source does NOT indicate the end of 
a token (whether or not this is indicated in some way, e.g. by the 
presence of a hyphen) the encoder should (a) record a newline at the end 
of the word affected rather than inside it and (b) indicate where the 
source line ending actually is by means of an <lb> element inside the 
word (again, this is assuming that white space is the only available 
mechanism for tokenization).

7. We're already half (or more) of the way there with the currently 
recommended use for the @type attribute, from which I quote:

"The type attribute may be used to characterize the line break in any 
respect, but its most common use is to specify that the presence of the 
line break does not imply the end of the word in which it is embedded. A 
value such as inWord or nobreak is recommended for this purpose, but 
encoders are free to choose whichever values are appropriate."

So, finally, here are some examples of how I think we should handle the 
cases Eric discusses in his document (I've used \\ here to represent the 
newline)

A. Mrs.\\Norris

The normal case. May be encoded just like that, or as Mrs.<lb/>Norris. 
Expectation is that there are two tokens, "Mrs." and "Norris".

B. children--\\of

The -- represents an mdash, which, in this text, has no space before or 
after it. If tokenizing software understands that the mdash is a word 
separator (which it should), again there is no need to encode
this at all, and again an <lb/> can be safely introduced.

C. to-\\day

The hyphen here is to be preserved (on evidence elsewhere, this text 
regards "to-day" as a hyphenated word) whether it is at the end of a 
line or not. May be encoded as to-day\\ or, preferably, as to-<lb 
type="inWord"/>day\\

D. wonder-\\ful

The hyphen here is a "softie" which should be regarded as a renditional 
matter. May be encoded as wonderful\\ or, preferably, as wonder<lb 
rend="hyphen" type="inWord"/>ful\\  (Or we could define another possible 
@type value)

E. opa-\\tje

The hyphen has to be replaced by an additional "a" before tokenization. 
I don't know enough Dutch to know whether this is a case where we really 
need the full panoply of <choice> structures, or whether a quick fix like
opa<supplied cause="hyph">a</supplied><lb rend="hyphen" type="inWord"/>tje\\
would be acceptable.

I must say I feel a lot less confident about the last case than the others.
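
For anyone who wants to check the token expectations in cases A, C, D 
and E mechanically, here is a rough Python sketch of a tokenizer that 
follows the proposal: an <lb/> counts as word-breaking unless its @type 
is "inWord" or "nobreak" (the two values the Guidelines recommend), and 
everything else is left to whitespace. The function names and the 
un-namespaced <p> wrapper are assumptions for illustration only; real 
TEI documents are namespaced and would need the tag test adjusted. It 
says nothing about whether the <supplied> fix in case E is good practice.

```python
import xml.etree.ElementTree as ET

# @type values that mark a line break as NOT ending a word.
NON_BREAKING = {"inWord", "nobreak"}


def _flatten(el: ET.Element, out: list[str]) -> None:
    """Collect text in document order, adding a space for breaking <lb/>."""
    if el.tag == "lb" and el.get("type") not in NON_BREAKING:
        out.append(" ")
    out.append(el.text or "")
    for child in el:
        _flatten(child, out)
        out.append(child.tail or "")


def tokens(xml_fragment: str) -> list[str]:
    """Tokenize on whitespace, honouring the @type of each <lb/>."""
    out: list[str] = []
    _flatten(ET.fromstring(xml_fragment), out)
    return "".join(out).split()


print(tokens("<p>Mrs.<lb/>Norris</p>"))                 # case A
print(tokens('<p>to-<lb type="inWord"/>day\n</p>'))     # case C
print(tokens('<p>wonder<lb rend="hyphen" type="inWord"/>ful\n</p>'))  # case D
print(tokens('<p>opa<supplied cause="hyph">a</supplied>'
             '<lb rend="hyphen" type="inWord"/>tje\n</p>'))           # case E
```

With the encodings proposed above this gives two tokens for A and a 
single token ("to-day", "wonderful", "opaatje") for each of C, D and E.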
