I'd like to get closure on the rather dispersed discussion we've been
having here on the topic of the soft-hyphen character, how to handle
hyphenation in old printed texts, and similar. I think some sort of a
consensus does seem to have emerged, but I'd like to check that this is
not just wishful thinking on my part. Here's a (rather long-winded)
attempt at summarizing the consensus:
1. The Unicode soft-hyphen character (U+00AD), though useful as a means
of indicating potential hyphenation points in a born-digital document,
should *not* be
relied on as a means of encoding actual hyphenation points in
non-digital source texts. That should be accomplished by means of
explicit XML markup.
2. An important requirement when source line-breaks are recorded in some
way is to distinguish those which imply word-endings
("word-breaking") from those which do not. There is a tacit assumption
in XML that a newline character in text data is a kind of whitespace,
and is therefore word-breaking. XML mixed-space rules are notoriously
confusing on this issue, but in general it is *not* safe to assume that
whitespace adjacent to markup tags is always going to be preserved. In
particular, it is *not* safe to assume that an XML markup tag within a
string of (non whitespace) characters necessarily divides the string
into distinct tokens.
3. The hyphen is used in two different ways in source texts: sometimes
it forms part of a word, sometimes it indicates that -- contrary to what
might be expected -- a word is not yet complete, but continues on the
next line (or over the next page or column or other boundary). The
former case cannot always reliably be distinguished from the latter, but
most encoders would like some way of unambiguously recording the
distinction when it *has* been made.
4. Orthography may also be affected by hyphenation. For example in
Dutch, the word "opaatje" if broken across the line would appear as
"opa-tje" -- i.e. one of the "a"s would disappear. It's not clear to
what extent this is a matter of concern in the TEI community, but it
clearly has to be dealt with.
5. In the response we received from the Unicode consortium about this
issue (which I have put online at
http://www.tei-c.org/Activities/Council/Working/Softhyphen.pdf if you've
lost it) Eric Muller makes a distinction between "flowable" and "flowed"
texts. He defines the former as "text independent of any realization
into lines", and the latter as "actual realization of the text into
lines" which is an intellectually coherent way of presenting the
problem, though it does rather beg the big text-encoding question of
what exactly is meant by "text independent of any realization".
I propose to sidestep that ontological debate by introducing the notion
of tokenization (what I called "word-breaking") into the discussion, as
above. In practice we all agree that the "text data" of which XML
documents are composed is decomposable into smaller units called
"words", even if those units are not explicitly indicated by the XML
markup. Sometimes we need to be explicit about the tokenization rules we
expect from a processor (in the absence of explicit tagging), whether
these are being applied in the context of a search engine looking for
individual things to index, or a display engine trying to make things
look nice on a printed line.
6. So what should we do? I propose that the recommendation we should
make (and which should find its way into the TEI Guidelines somewhere)
is as follows:
* If you want to preserve the lineation of a source document, do so by
means of the <lb> element. Similarly for pagination, using <pb>. Do not
assume that the linefeeds etc. in your document will be preserved.
* If you want to delimit *all* the words within your document do so
using the explicit <w> element. There is however no point in using this
element for just one or two problematic cases.
* If the words are not delimited by <w> elements, a processor can
decide for itself how to identify their boundaries, but the usual
default (which the TEI therefore expects) is that whitespace characters,
including newlines in the source, will indicate the end of a token.
* When, therefore, a newline in the source does NOT indicate the end of
a token (whether or not this is indicated in some way, e.g. by the presence
of a hyphen) the encoder should (a) record a newline at the end of the
word affected rather than inside it and (b) indicate where the source
line ending actually is by means of an <lb> element inside the word
(again, this is assuming that white space is the only available
mechanism for tokenization).
7. We're already half (or more) of the way there with the currently
recommended use for the @type attribute, from which I quote:
"The type attribute may be used to characterize the line break in any
respect, but its most common use is to specify that the presence of the
line break does not imply the end of the word in which it is embedded. A
value such as inWord or nobreak is recommended for this purpose, but
encoders are free to choose whichever values are appropriate."
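To make the tokenization rule in points 6 and 7 a little more concrete,
here is a minimal Python sketch of what a processor might do. It is only
an illustration under stated assumptions, not anything the Guidelines
prescribe: the function name tokenize, the wrapper element <ab>, and the
decision to treat an unmarked <lb/> as a token boundary are all my own
choices, and the elements are assumed to be un-namespaced with no <w>
elements present.

```python
import xml.etree.ElementTree as ET

# Assumption: these @type values mean "this line break is inside a word".
IN_WORD_TYPES = {"inWord", "nobreak"}


def tokenize(fragment: str) -> list[str]:
    """Return the whitespace-delimited tokens of a TEI-like fragment."""
    # Wrap the fragment so that it parses as one well-formed element.
    root = ET.fromstring(f"<ab>{fragment}</ab>")
    pieces: list[str] = []

    def walk(elem: ET.Element) -> None:
        if elem.text:
            pieces.append(elem.text)
        for child in elem:
            if child.tag == "lb":
                # Assumption: an unmarked <lb/> ends a token,
                # while one typed as in-word does not.
                if child.get("type") not in IN_WORD_TYPES:
                    pieces.append(" ")
            else:
                walk(child)
            if child.tail:
                pieces.append(child.tail)

    walk(root)
    # Whitespace, including newlines in the encoded file, delimits tokens.
    return "".join(pieces).split()


print(tokenize("Mrs.<lb/>Norris"))                # ['Mrs.', 'Norris']
print(tokenize('to-<lb type="inWord"/>day\nis'))  # ['to-day', 'is']
```

Note that the newline recorded after "day" ends the token, while the
<lb type="inWord"/> inside the word does not, which is the behaviour the
fourth bullet of point 6 asks for.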
So, finally, here are some examples of how I think we should handle the
cases Eric discusses in his document (I've used \\ here to represent the
newline):
A. Mrs.\\Norris
The normal case. May be encoded just like that, or as Mrs.<lb/>Norris.
Expectation is that there are two tokens, "Mrs." and "Norris".
B. children--\\of
The -- represents an mdash, which, in this text, has no space before or
after it. If tokenizing software understands that the mdash is a word
separator (which it should), again there is no need to encode
this at all, and again an <lb/> can be safely introduced.
C. to-\\day
The hyphen here is to be preserved (on evidence elsewhere, this text
regards "to-day" as a hyphenated word) whether it is at the end of a
line or not. May be encoded as to-day\\ or, preferably, as to-<lb
type="inWord"/>day\\
D. wonder-\\ful
The hyphen here is a "softie" which should be regarded as a renditional
matter. May be encoded as wonderful\\ or, preferably, as wonder<lb
rend="hyphen" type="inWord"/>ful\\ (or we could define another possible
@type value).
E. opa-\\tje
The hyphen has to be replaced by an additional "a" before tokenization.
I don't know enough Dutch to know whether this is a case where we really
need the full panoply of <choice> structures, or whether a quick fix like
opa<supplied cause="hyph">a</supplied><lb rend="hyphen" type="inWord"/>tje\\
would be acceptable.
I must say I feel a lot less confident about the last case than the others.
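As a rough check that encodings C, D and E lose nothing, the following
Python sketch goes the other way, from the encoding back to the lines as
they appear in the source ("flowed" text, in Eric's terms). Everything
here is an assumption for illustration: the function name source_lines,
the <ab> wrapper, the reading of rend="hyphen" as "the source shows a
hyphen before this line break", and the rule that text inside
<supplied cause="hyph"> is absent from the source.

```python
import xml.etree.ElementTree as ET


def source_lines(fragment: str) -> list[str]:
    """Rebuild the source lineation from a TEI-like fragment."""
    # Wrap the fragment so that it parses as one well-formed element.
    root = ET.fromstring(f"<ab>{fragment}</ab>")
    lines: list[list[str]] = [[]]

    def walk(elem: ET.Element) -> None:
        if elem.text:
            lines[-1].append(elem.text)
        for child in elem:
            if child.tag == "lb":
                # Assumption: rend="hyphen" records a hyphen that is
                # visible in the source at the end of the line.
                if child.get("rend") == "hyphen":
                    lines[-1].append("-")
                lines.append([])  # start a new source line
            elif child.tag == "supplied" and child.get("cause") == "hyph":
                pass  # assumption: supplied letters are not in the source
            else:
                walk(child)
            if child.tail:
                lines[-1].append(child.tail)

    walk(root)
    # Newlines in the encoded file are ordinary whitespace, not lineation.
    return [" ".join("".join(parts).split()) for parts in lines]


print(source_lines('to-<lb type="inWord"/>day\n'))
print(source_lines('wonder<lb rend="hyphen" type="inWord"/>ful\n'))
print(source_lines(
    'opa<supplied cause="hyph">a</supplied>'
    '<lb rend="hyphen" type="inWord"/>tje\n'
))
```

Under these assumptions the three calls give back "to-" / "day",
"wonder-" / "ful" and "opa-" / "tje" respectively, so the quick fix for
case E at least keeps the source reading recoverable.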