On Thu, 2007-05-03 at 22:20 +0100, Elena Pierazzo wrote:
> > Or d) lemma information should not necessarily be encoded in the text
> > stream but encoded elsewhere and pointed at using a reference.
>
> Actually, this is the solution we have adopted for the specific project
> that gave rise to the initial question. This solution was practicable
> because the project includes a sort of dictionary of all the lemmas found
> in the source material (Anglo-Saxon Charters, btw).
>
> But what if we did not have a dictionary at all? I mean, I was
> involved in the past in projects that were lemmatising just for search
> purposes and we did not include any lemma collection (or 'lemmario') at
> all (I'm thinking in particular of the lemmatized Works of Dante). In
> this case a different solution would have been more suitable.
You don't need a dictionary for the pointing to work: a minimal index of
lemmas used in the project would be enough.
The mechanism is a kind of feature structure. The main reason for
preferring it over anything else is that it reduces the information to
atomic form: no matter how many instances of a given "word" you have,
you don't have to worry about variation creeping into your
lemmatisation if you define the lemmas in one place and point to them
from the instances (of course you can get the pointers wrong, but that
is easily checkable). We use a similar structure in the header all the
time to list things we will be pointing to elsewhere in the document.
This wouldn't go in the header, but the principles are the same: keep
the data in a single place, and avoid markup on the attributes.
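
To make the "define once, point many times" idea concrete, here is a
rough sketch in Python rather than in TEI markup. Everything in it is
an assumption made up for illustration: the lemma ids, the word forms,
and the function names are hypothetical and do not come from any real
project. It only shows why a central lemma index keeps the
lemmatisation consistent and why wrong pointers are easy to detect.

```python
# Hypothetical data: lemma ids and word forms are invented for illustration.
# The lemma index is the single place where each lemma is defined.
LEMMAS = {
    "lem-cyning": "cyning",
    "lem-land": "land",
}

# Each instance in the text carries only a pointer to a lemma id,
# never its own spelled-out copy of the lemma.
TOKENS = [
    ("cyninges", "lem-cyning"),
    ("cyninge", "lem-cyning"),
    ("landes", "lem-land"),
    ("londe", "lem-lnad"),  # deliberately mistyped pointer
]


def dangling_pointers(tokens, lemmas):
    """Return the (form, pointer) pairs whose pointer matches no lemma id."""
    return [(form, ref) for form, ref in tokens if ref not in lemmas]


def forms_by_lemma(tokens, lemmas):
    """Group surface forms under their lemma, skipping dangling pointers."""
    grouped = {lemma: [] for lemma in lemmas.values()}
    for form, ref in tokens:
        if ref in lemmas:
            grouped[lemmas[ref]].append(form)
    return grouped


if __name__ == "__main__":
    print(dangling_pointers(TOKENS, LEMMAS))
    print(forms_by_lemma(TOKENS, LEMMAS))
```

Because the lemma is spelled out only once, two instances can never
disagree about its form; the only possible error is a pointer that
matches nothing, and a single pass over the instances finds those.
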
In something huge like Dante, I can see the lemma list being an
extremely useful tool. You are probably building one anyway (one assumes
you aren't composing the lemmas on the spot for each word), so this is
just a way of reusing that work.
>
> So my opinion is that both options (b) and (d) should be carried
> forward together, to give the opportunity to use one or the other (or
> possibly both?) according to the project's needs.
>
> Best,
>
> Elena
--
Daniel Paul O'Donnell, PhD
Department Chair and Associate Professor of English
Director, Digital Medievalist Project http://www.digitalmedievalist.org/
Chair, Text Encoding Initiative http://www.tei-c.org/
Department of English
University of Lethbridge
Lethbridge AB T1K 3M4
Vox +1 403 329-2377
Fax +1 403 382-7191
WWW: http://people.uleth.ca/~daniel.odonnell/