Indeed, Martin refers to the MAF proposal, which actually makes a clear difference between two levels:
- the token level, which represents a (possibly arbitrary) segmentation of the surface linguistic level
- the word form level, which is the lexical interpretation of the token level, in principle expressed as mappings to dictionary entries.
The lemma information thus belongs to the word form level, and since there is no one-to-one relation between tokens and word forms, any attempt to represent word form information at the token level is potentially flawed. The best approximation here is probably to go for @ana + interp, as Lou suggested.
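
As a rough sketch of what the @ana + interp approach could look like for the cases raised in this thread (a fragment, not a complete document; the xml:id values and the type values "lemma", "stem" and "root" are illustrative assumptions, not something agreed here or prescribed by the TEI):

```xml
<!-- Interpretations are declared once and pointed to from the tokens.
     All ids and type values below are hypothetical. -->
<interpGrp type="lemma">
  <interp xml:id="lem-kai">καί</interp>
  <interp xml:id="lem-ek">ἐκ</interp>
  <interp xml:id="lem-bharat" type="stem">bharat</interp>
  <interp xml:id="lem-bhR" type="root">bhR</interp>
</interpGrp>

<!-- One token, two word forms: @ana points to both interpretations -->
<w ana="#lem-kai #lem-ek">κἆκ</w>

<!-- One participle, two levels of lemma: stem and verb root -->
<w ana="#lem-bharat #lem-bhR">bharan</w>
```

Because each interp carries its own identifier and type, lemmata containing whitespace are no longer a problem, and a processor can distinguish "several word forms in one token" from "several analyses of one word form" if the type values are chosen to reflect that.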
On 16 May 2011, at 14:50, Martin Mueller wrote:
> We have done something similar in encoding projects that used Phil Burns'
> Morphadorner. We used the vertical dash to separate lemmata. The common
> 17th century spelling 'Ile' is represented as <w lemma="I|will">Ile</w>,
> which is very much like <w lemma="καί ἐκ">κἆκ</w>.
>
> I take it that Laurent Romary has worked on a more systematic solution for
> the phenomenon that tokens and lemmata do not always map on a one-to-one
> basis. It strikes me as a significant priority for the TEI to create
> consistent and reasonably easy-to-use rules for modeling these
> problems within TEI.
>
> On 5/16/11 5:40 AM, "Gabriel Bodard" <[log in to unmask]> wrote:
>
>> I'm not sure this is a very nice solution, but for an old project a few
>> years ago we used two (whitespace separated) values in the @lemma
>> attribute for alternative (or multiple) lemmata.
>>
>> Problems:
>>
>> (1) if your lemmata can contain whitespace, obviously this breaks.
>>
>> (2) the semantics are probably very wrong
>>
>> (3) this doesn't distinguish semantically between cases of uncertainty,
>> on the one hand (e.g. <w lemma="apple apricot">napply</w>) or multiple
>> lexicographic lemmata for a single orthographic word on the other (e.g.
>> <w lemma="καί ἐκ">κἆκ</w>). For us this wasn't a problem, because we
>> wanted both to behave in the same way, i.e. to index under both words in
>> both cases.
>>
>> (I present this not so much in the hope that this will be a viable
>> solution for you, but rather that reactions against this solution might
>> help to reveal the correct solution. :-) )
>>
>> Best,
>>
>> Gabriel
>>
>> On 2011-05-11 22:04, Arun Prasad wrote:
>>> Hi all,
>>>
>>> I'm currently toying with the idea of using the TEI format to represent
>>> lemmatized Sanskrit texts, much as the Clay Sanskrit Library was doing
>>> some time ago. My question, though, is quite general. I've been using
>>> this sort of structure to represent inflected words:
>>>
>>> <w type="verb" lemma="bhR" ana="#3s #pres #indic">bharati</w>
>>> <w type="noun" lemma="buddha" ana="#masc #ins #sg">buddhena</w>
>>>
>>> and it's worked well so far (although I don't know if this is indeed the
>>> proper way to do things). What I'm having some trouble with, however, is
>>> the representation of participles. A participle has some basic "stem",
>>> but it also comes from a verb root. So, the inflected participle
>>> "bharan" could have one of two values for its lemma: the participle stem
>>> "bharat" and the verb root "bhR."
>>>
>>> If possible, I would like to encode both of these values together. Is
>>> there any easy way to do so in the TEI format?
>>>
>>> Thanks,
>>> Arun Prasad
>>
>> --
>> Dr Gabriel BODARD
>> (Research Associate in Digital Epigraphy)
>>
>> Department of Digital Humanities
>> King's College London
>> 26-29 Drury Lane
>> London WC2B 5RL
>>
>> Email: [log in to unmask]
>> Tel: +44 (0)20 7848 1388
>> Fax: +44 (0)20 7848 2980
>>
>> http://www.digitalclassicist.org/
>> http://www.currentepigraphy.org/
Laurent Romary
INRIA & HUB-IDSL
[log in to unmask]