> > There is a difference between using a soft hyphen to record where
> > this hyphen *did* actually occur (my original understanding, and
> > apparently everyone else's here) and using a soft hyphen to
> > indicate where a word-break *may* occur (as the html specs seem
> > to imply).
> > Is the HTML specification wrong then? Or am I missing something?
> The HTML spec is right and lynx treats &shy; this way (IIRC).
I don't know why you would say that. I don't claim to have any
authority on this issue, but my immediate inclination is that which
I believe Julia Flanders and Michael Beddow allude to in their
earlier posts on this thread: that neither approach is *right*,
they're just *different*.
But if I were forced to call one *right* and one *wrong*, I would
argue that the HTML spec is incorrect. The semantics "processors
should preferentially put a soft hyphen here if this word needs to be
broken" should not be put on the soft hyphen character, which (I have
read) was intended by ISO 8859-1 to be used to represent a hyphen
that had been stuck at the end of a line by the formatter. The sad
part is, if I understand correctly, there is no correct Unicode
character to bear the semantic HTML -- quite reasonably -- desires.
(The character U+2027 "HYPHENATION POINT" might seem like a good
candidate, but I believe it is intended for use in dictionaries to
represent those spots where an HTMLer might want to put in a
character sequence with the "preferentially break here" semantic, and
TEIers might find a soft hyphen on a page :-) However, it seems quite
reasonable to me to argue that such a spot should be represented by
an empty element, not a character.
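For anyone who wants to check the two characters under discussion, here is a minimal sketch using Python's standard unicodedata module. It reports only the names and general categories the Unicode database assigns; it says nothing about the intended semantics, which is the point in dispute. The helper name describe() is my own, purely for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character.

    Hypothetical helper for illustration only.
    """
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# The two candidates mentioned above: SOFT HYPHEN and HYPHENATION POINT.
for ch in ("\u00ad", "\u2027"):
    print(describe(ch))
```

The soft hyphen is classed as a format character (Cf) and the hyphenation point as punctuation (Po), which at least fits the reading that the latter is a visible dictionary mark rather than a break hint.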
For a discussion that provides evidence that even if wrong, at least
I'm not alone, see, e.g.,
http://mail.nl.linux.org/linux-utf8/2000-09/msg00075.html
> If you are interested in encoding the line break it is best to
> use something like:
> <corr sic="Tren|nung">Trennung</corr>
> and to describe it in the teiHeader. This has the nice side
> effect that simple shell tools like 'grep' will be able to find
> the word, too.
Assuming you have no other use for the character "|" I suppose this
would work, but I really don't like the implication of <corr> that
the soft hyphen in the source is in error. Wouldn't
<reg orig="Tren|nung">Trennung</reg>
or, what I argue is equivalent,
Tren<reg orig="|"></reg>nung
be better as it is neutral as to which is 'correct'?
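To make the equivalence claim concrete, here is a rough sketch in Python (standard library only). It assumes the convention discussed above, namely that "|" in an orig attribute marks the recorded break; the wrapping <p> element and the helper names regularized() and original() are my own inventions for illustration, not anything defined by TEI.

```python
import xml.etree.ElementTree as ET

def regularized(elem):
    """Reading text: element content only, orig attributes ignored."""
    return "".join(elem.itertext())

def original(elem):
    """Source text: every <reg> is replaced by the value of its orig attribute."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "reg":
            parts.append(child.get("orig", ""))
        else:
            parts.append(original(child))
        parts.append(child.tail or "")
    return "".join(parts)

# The two encodings proposed above, wrapped in a <p> so they parse on their own.
whole_word = ET.fromstring('<p><reg orig="Tren|nung">Trennung</reg></p>')
break_only = ET.fromstring('<p>Tren<reg orig="|"></reg>nung</p>')

for doc in (whole_word, break_only):
    print(regularized(doc), original(doc))
```

Under these assumptions both encodings give the same regularized text and the same original text, so a processor that understands <reg> loses nothing either way. The difference is only for tools that do not: a plain grep finds "Trennung" in the first form but not in the second.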
P.S. I think
Tren&shy;<lb/>
nung
is the most correct encoding, although I myself use
Tren&shy;
<lb/>nung
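Either placement of <lb/> can be undone the same way when the word is needed for searching. The following Python sketch (standard library only) assumes that a soft hyphen, U+00AD, occurs in the transcription only where a line-end hyphen was recorded, so it and any whitespace after it can simply be dropped. The function name searchable() and the wrapping <p> are hypothetical, and the character is written as a literal U+00AD rather than an entity so the fragments parse without a DTD.

```python
import re
import xml.etree.ElementTree as ET

SHY = "\u00ad"  # SOFT HYPHEN

def searchable(elem):
    """Rejoin words broken at a recorded line end.

    Assumes every soft hyphen marks such a break; <lb/> contributes no text.
    """
    text = "".join(elem.itertext())
    # Drop the soft hyphen together with the newline (or other whitespace) after it.
    return re.sub(SHY + r"\s*", "", text)

# The two layouts from the P.S.: <lb/> before or after the physical newline.
lb_before_newline = ET.fromstring("<p>Tren\u00ad<lb/>\nnung</p>")
lb_after_newline = ET.fromstring("<p>Tren\u00ad\n<lb/>nung</p>")

print(searchable(lb_before_newline))
print(searchable(lb_after_newline))
```

Both layouts reduce to the same unbroken word, so the choice between them is a matter of taste as far as this kind of processing goes.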