Preserving the line and page information of the
printed text and throwing it away before starting
linguistic analysis might be an option, but you
would have to make provisions for words like "back-up"
or "hand-me-down" to mark the difference between
hard hyphens that belong to the word and soft hyphens
introduced at a line break.
In some languages there may be even more combinations
of hyphens and other punctuation marks inside words
to look after.

Another option could be to resolve the hyphenation
using an attribute of the <w> element, e.g. by making
@norm (from att.lexicographic), or perhaps another
suitable attribute, available to <w>:

> <w norm="handwriting">hand-<pb/>writing</w>
> <w norm="electronically">electro<lb/>nically</w>

and using @norm for linguistic analysis.
This way you stay in line with the printed source
and you do not hand over the decisions to typists.
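
To make the idea concrete, here is a minimal sketch of how an analysis
step might pick up the normalized form. It assumes that @norm really
has been made available on <w>, and it ignores the TEI namespace for
brevity. The function name analysis_tokens and the sample paragraph
are only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <w> carrying a norm attribute, no TEI namespace.
SAMPLE = """<p>
<w norm="handwriting">hand-<pb/>writing</w>
<w norm="electronically">electro<lb/>nically</w>
<w>source</w>
</p>"""


def analysis_tokens(xml_text):
    """Return one token per <w>: the norm value if present, else the text."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter("w"):
        norm = w.get("norm")
        if norm is not None:
            # The encoder has documented the resolved form explicitly.
            tokens.append(norm)
        else:
            # No normalization recorded: fall back to the printed text.
            tokens.append("".join(w.itertext()).strip())
    return tokens


print(analysis_tokens(SAMPLE))
```

The printed form, including the hyphen and the break, stays untouched
in the content of <w>; only the analysis reads the attribute.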

Taking out hyphens in an undocumented way is not an
option. Right?

Quoting Sebastian Rahtz

> does an attribute on the <pb> or <lb> deal with it?
>  hand-<pb/>writing
>  electro<lb rend="softhyphen"/>nically
> (or syntactic sugar of <lbh/> for typists).
> this lets you throw away all <pb> and <lb> before
> analyzing.
> or am I missing something?
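
For comparison, a rough sketch of what "throw away all <pb> and <lb>"
could look like in practice, assuming the encoding quoted above and,
again, no namespace. The function name plain_text and the sample are
my own; since <pb> and <lb> are empty elements, simply concatenating
the text nodes has the effect of discarding them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the quoted encoding: the hard hyphen is
# kept as a character, the soft hyphen exists only as rend on <lb>.
SAMPLE = (
    '<p>hand-<pb/>writing '
    'electro<lb rend="softhyphen"/>nically</p>'
)


def plain_text(xml_text):
    """Concatenate all text nodes, which drops the empty <pb> and <lb>."""
    root = ET.fromstring(xml_text)
    # itertext() yields element text and tails but never the tags,
    # so the break milestones vanish from the result.
    return "".join(root.itertext())


print(plain_text(SAMPLE))
```

With this encoding the hard hyphen survives because it is part of the
text, and the soft hyphen disappears together with its <lb>. Note that
this sketch drops all other markup as well, not only the breaks.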
