As I mentioned in one of my previous mails, we are working on the
Integrated Language Database of 8th-21st Century Dutch, in which
dictionaries, lexica and a diachronic corpus will be linked.
The corpus will be TEI-encoded. We are now defining our minimal
tagging level.
One of the problems we have encountered is the "butter-23fly"
case, i.e. a word hyphenated across a page break: the first half
of "butterfly" is on page 22 and the second half on page 23.
We would like to encode instances like these as follows:

<reg orig='butter- 23 fly'>butterfly <pb n="23"></reg>

since we also intend to tag the entire corpus morphologically,
preferably fully automatically, and for that purpose a complete
wordform presents fewer complications.
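
To make the proposal concrete, here is a minimal sketch in Python of
how such a split word could be rejoined automatically. It is only an
illustration, not our actual tooling: the function name, the
assumption that each page arrives as a separate plain-text string,
and the rule "trailing hyphen plus leading word = one word" are all
hypothetical. In particular, it would wrongly join a genuine
hyphenated compound that happens to break at the page boundary.

```python
import re

# Hypothetical patterns: a word fragment ending in a hyphen at the very
# end of a page, and the first word fragment at the start of the next.
_TRAILING_HYPHEN = re.compile(r"(\w+)-\s*$")
_LEADING_WORD = re.compile(r"^\s*(\w+)")


def join_across_page_break(prev_page: str, next_page: str, next_page_no: int) -> str:
    """Join two pages, wrapping a word split across the break in <reg>.

    The page break marker is written SGML-style as <pb n="...">, following
    the form used in the example above.
    """
    pb = f'<pb n="{next_page_no}">'
    head = _TRAILING_HYPHEN.search(prev_page)
    tail = _LEADING_WORD.match(next_page)

    if not (head and tail):
        # No split word at the boundary: just place the page break between pages.
        return f"{prev_page.rstrip()} {pb} {next_page.lstrip()}"

    first, second = head.group(1), tail.group(1)
    # Record the original reading (fragment, page number, fragment) in 'orig'.
    orig = f"{first}- {next_page_no} {second}"
    reg = f"<reg orig='{orig}'>{first}{second} {pb}</reg>"

    # Keep everything before the first fragment and after the second one.
    return prev_page[: head.start()] + reg + next_page[tail.end():]


if __name__ == "__main__":
    print(join_across_page_break("a butter-", "fly lands", 23))
```

The morphological tagger would then see the complete wordform
"butterfly", while the original reading stays recoverable from the
orig attribute.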
Are there any objections to this solution?


Katrien Depuydt

Katrien A.C. Depuydt
Instituut voor Nederlandse Lexicologie
(Institute for Dutch Lexicology)
redacteur Taalbank
(editor Dutch Language Database)

