> It is not clear to me how an automated system
> will know which are hard and which are soft hyphens.
>  the three people in the room were all ex-
>  convicts, but their behaviour was ex-
>  ceptional
> if you see what I mean.

Others cleverer than I will have their ways, but
in my experience hard can be sorted out from soft
only (a) by a few fairly dodgy rules of limited
applicability, e.g. if the second element is
capitalized, it is probably a hard hyphen
("Pilgrim-Soul"); and (b) by use of a hyphenation
dictionary. When converting the Middle English
Dictionary, with its two-column, heavily hyphenated
format, I ended up attempting to resolve EOL
hyphens only in the definition (i.e., Modern
English) sections, not in the ME quotations,
and did so by laborious means. We proceeded volume
by volume, extracting hyphenated words from the
definitions, deciding each case individually, and
converting the results to a script. The script
was then applied to the next volume, the surviving
examples were extracted, resolved, and added to the
script, and so on. I believe we also extracted
line-internal hyphenated compounds and perhaps even
line internal non-hyphenated words to further
'inform' the growing script.
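The volume-by-volume procedure can be sketched roughly as follows. This is a minimal illustration, not the actual MED scripts: the function name, the toy word list, and the exact ordering of checks are all my assumptions; only the capitalization rule and the notion of a hand-grown list of resolutions come from the description above.

```python
# Sketch of EOL-hyphen resolution using a growing list of known
# resolutions plus one "dodgy rule" (capitalized second element =>
# probably a hard hyphen). All names and data are illustrative.

def resolve_eol_hyphen(first, second, known):
    """Join the two halves of a word hyphenated at end-of-line.

    `known` maps a candidate joined form to its resolution, built
    up by deciding earlier cases by hand, volume by volume.
    """
    hyphenated = first + "-" + second
    fused = first + second
    if hyphenated in known:           # decided in an earlier volume
        return known[hyphenated]
    if second[:1].isupper():          # "Pilgrim-Soul": keep the hyphen
        return hyphenated
    if fused.lower() in known:        # fused form already attested
        return fused
    return None                       # undecided: flag for a human

# A toy "script" accumulated from previously resolved cases.
known = {"ex-convict": "ex-convict", "exceptional": "exceptional"}

print(resolve_eol_hyphen("ex", "convict", known))    # ex-convict
print(resolve_eol_hyphen("ex", "ceptional", known))  # exceptional
print(resolve_eol_hyphen("Pilgrim", "Soul", known))  # Pilgrim-Soul
print(resolve_eol_hyphen("tea", "cups", known))      # None (ambiguous)
```

The surviving `None` cases from each volume are exactly the ones that would be decided by hand and folded back into `known` before the next volume.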

As far as the ME quotations go, and our current
work with early printed books, this was my logic;
find fault, please:

Assuming that you either cannot or will not resolve
all EOL hyphens into "-" and null, or into 'soft hyphen'
and 'hard hyphen,' the usual approaches to recording
them may be said to fall into these categories:

-- if you're (otherwise) marking line-breaks, you can treat
    the EOL hyphen as an attribute of some <lb>s.

-- if you're (otherwise) marking words, you can treat the
    EOL hyphen as an attribute of some words.

-- if you're marking both, you can readily do either.

-- if you're marking neither, you can readily do neither.
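In TEI-style markup, for instance, the first two options might look something like the following. This is illustrative only: the break and rend attributes on <lb>, and the type value on <w>, stand in for whatever local convention applies.

```
<!-- (1) EOL hyphen as an attribute of the line-break -->
... were all ex<lb break="no" rend="hyphen"/>convicts ...

<!-- (2) EOL hyphen as an attribute of the word -->
... were all <w type="eol-hyphenated">ex-convicts</w> ...
```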

As Martin knows, and regrets, our TCP practices fall
into the last category: we cannot distinguish special
kinds of words, or special kinds of line-breaks, since
we capture neither words (as such) nor line breaks;
nor can we rely on dictionaries, since our materials
employ a hugely various orthography and many languages.
We therefore treat line-internal hyphens (which are always
hard) and EOL hyphens (which are usually but not always hard)
as two different *characters* that happen to look alike,
and then index them differently. The hyphen indexes as space
and the EOL hyphen indexes as null. Only rarely do
we have the confidence and time to resolve the EOL
hyphens (usually in specific compound-rich titles);
when we do, resolution consists of either removing
them or converting them to the 'real' (i.e. hard)
hyphen character.
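Treating the two look-alike characters differently at indexing time amounts to something like the following minimal sketch. The character choices are assumptions, not TCP's actual inventory: I use ordinary HYPHEN-MINUS for the line-internal (hard) hyphen and U+00AD SOFT HYPHEN as a stand-in for the EOL-hyphen character.

```python
# Line-internal hyphens index as a space, so the elements of a
# compound are searchable separately; EOL hyphens index as null,
# so a word split across lines is searchable whole.

HARD_HYPHEN = "-"        # line-internal: always a real hyphen
EOL_HYPHEN = "\u00ad"    # end-of-line: usually, not always, hard

def index_form(text):
    """Normalize a captured string for the search index."""
    return text.replace(HARD_HYPHEN, " ").replace(EOL_HYPHEN, "")

print(index_form("tea-cup"))             # "tea cup"
print(index_form("excep\u00adtional"))   # "exceptional"
```

The point of the scheme is that no decision is made at capture time; the two characters preserve the distinction, and the index applies a blanket policy to each.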

As to Martin's question about the costs of
omitting (i.e. resolving) hyphens, the obvious answers
are three: time and labour (which can be very considerable);
ambiguity (there are not as many tea-cups/teacups about
as one might expect, but there are certainly some); and
precise alignment of text and source. I do not think
there is much analytical loss, except conceivably
insofar as hyphenation is potentially evidence for
perceived syllabification or morphemic division.
Time and labour are for us the deciding factors
in leaving them in.


Paul Schaffner | [log in to unmask] |
316-C Hatcher Library N, Univ. of Michigan, Ann Arbor MI 48109-1205