I think the main problem you have is that the TEI dictionary module was made for encoding machine-readable dictionaries rather than lexical databases. So things like enabling easy automatic generation of the inflectional paradigm from the encoded data are not really in scope. That said, TEI has enough different and generic elements that anything can be encoded, if one is stubborn enough.
> together with another colleague, Gyri Smørdal Losnegaard, I am currently trying to encode in TEI a dictionary of multiword expressions (MWE) of various kinds. The resource will be used in Natural Language Processing (NLP) Applications. Despite having read chapter 9 (Dictionaries) of the TEI guidelines, there are still some issues which are not completely clear to us:
> (1) On page 267 there is a reference to what seems to be an inflectional paradigm:
> <iType type="vbtable">7</iType>
> <!-- … -->
> - Is there any best practice/guideline as to what this vbtable should look like and where it should be included? We could not find that anywhere else in the guidelines.
I think this is just an example of the kind of info you would find in a dictionary, and I imagine if we were to see this vbtable, it would just be an ordinary <table>.
> - How is the root of the verb indicated? If we are assigning inflectional paradigms to words this should also be indicated to allow for the automatic generation of the whole flection of a specific word. Likewise, this is also needed for retrieving the proper analysis of each word in each case.
If you want access to the complete paradigm, I'd suggest simply enumerating all the word-forms, i.e. having an "extensional" rather than an "intensional" lexicon. It is larger (though not drastically so, unless you are into Hungarian), but processing is then trivial.
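To make the "extensional" option concrete, here is a minimal sketch of an entry that simply lists its word-forms, each with its own grammatical description. The verb, the xml:id, and the values inside <tns>, <per> and <number> are illustrative assumptions, not something the Guidelines prescribe:

```xml
<!-- Sketch only: the id and the grammatical values are invented for illustration -->
<entry xml:id="en.dress.v">
  <form type="lemma">
    <orth>dress</orth>
    <gramGrp><pos>verb</pos></gramGrp>
  </form>
  <!-- one <form> per word-form, so no paradigm has to be generated -->
  <form type="inflected">
    <orth>dresses</orth>
    <gramGrp><tns>present</tns><per>3</per><number>singular</number></gramGrp>
  </form>
  <form type="inflected">
    <orth>dressed</orth>
    <gramGrp><tns>past</tns></gramGrp>
  </form>
  <form type="inflected">
    <orth>dressing</orth>
    <gramGrp><gram type="verbform">present participle</gram></gramGrp>
  </form>
</entry>
```

With this layout, looking up the analysis of a word-form is a matter of matching <orth> and reading off the sibling <gramGrp>.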
> (2) How can we encode the morphosyntactic information for each word pertaining to a MWE? On page 271 there is a way of encoding a compound:
> <usg type="dom">Comm</usg>
> <form type="compound">
> <orth>window <oRef/>
> - How could we indicate the morphosyntactic information of each word of the compound/MWE?
<form type="element"><orth>window</orth> <gram>...</gram></form>
<form type="element"><orth>dresser</orth> <gram>...</gram></form>
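Spelled out as a complete, well-formed entry, the two lines above might look roughly like the sketch below. Note the assumptions: type="element" is an ad hoc value carried over from those lines, not one of the values suggested in the Guidelines, and the xml:id and the <pos>/<number> values are placeholders.

```xml
<!-- Sketch only: type="element", the id and the grammatical values are assumptions -->
<entry xml:id="en.window_dresser">
  <form type="compound">
    <orth>window dresser</orth>
    <!-- one nested <form> per word of the compound -->
    <form type="element">
      <orth>window</orth>
      <gramGrp><pos>noun</pos><number>singular</number></gramGrp>
    </form>
    <form type="element">
      <orth>dresser</orth>
      <gramGrp><pos>noun</pos><number>singular</number></gramGrp>
    </form>
  </form>
</entry>
```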
> - In collocations, for instance, there are words that inflect and restrictions that need to be complied with. How can this be specified within any particular entry? Would it be possible to link these to particular patterns similar to the paradigms?
As <form> can be nested, maybe something similar to the above? The restrictions could presumably go into some sort of <gram> or <gramGrp>, and linking (if you really don't want to specify things in-line) could be done via one of the many TEI pointing attributes.
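As a rough illustration of that idea, the sketch below marks which words of an expression inflect and points from the inflecting word to a paradigm stored elsewhere. Everything specific here is an assumption made up for the example: the expression, the <gram type="variability"> convention, the ids, and the target #par.kick, which stands for whatever element holds the paradigm in your resource. Using @corresp for this link is just one of several possible pointing attributes.

```xml
<!-- Sketch only: the "variability" typing and all ids are hypothetical -->
<entry xml:id="en.kick_the_bucket">
  <form type="phrase">
    <orth>kick the bucket</orth>
    <!-- the verb inflects; @corresp points to a paradigm defined elsewhere -->
    <form type="element" corresp="#par.kick">
      <orth>kick</orth>
      <gramGrp>
        <pos>verb</pos>
        <gram type="variability">inflects</gram>
      </gramGrp>
    </form>
    <form type="element">
      <orth>the</orth>
      <gramGrp>
        <pos>determiner</pos>
        <gram type="variability">fixed</gram>
      </gramGrp>
    </form>
    <!-- restriction: the noun stays singular -->
    <form type="element">
      <orth>bucket</orth>
      <gramGrp>
        <pos>noun</pos>
        <number>singular</number>
        <gram type="variability">fixed</gram>
      </gramGrp>
    </form>
  </form>
</entry>
```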
> - How can we encode the inner structure of the compound/MWE, if needed? (This is useful in the case of subsequently using the dictionary for translation purposes, as this inner structure may condition the translation.)
Again, maybe nested forms would suffice.
> - How can we encode the morphosyntactic information and inner structure of the translation? In this case, the French translation is just a word, but in other cases we have several words that have to be inflected correctly.
> (3) Finally, in the case of a multilingual database cross-references to the translations would be desirable. Is there a way of establishing links across the dictionary in a similar way to the one it is used when encoding the sentence alignment of parallel corpora?
Just use @corresp? There is, of course, also <xr>, but I'm not sure you could place it precisely enough.
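A minimal sketch of both options, assuming one entry per expression per language, each with an xml:id (the ids below are invented for the example):

```xml
<!-- Sketch only: the ids and the one-entry-per-language layout are assumptions -->
<entry xml:id="en.window_dresser" xml:lang="en" corresp="#fr.etalagiste">
  <form type="compound"><orth>window dresser</orth></form>
  <!-- alternative to @corresp: an explicit cross-reference inside the entry -->
  <xr type="translation">
    <ref target="#fr.etalagiste">étalagiste</ref>
  </xr>
</entry>

<entry xml:id="fr.etalagiste" xml:lang="fr" corresp="#en.window_dresser">
  <form type="lemma"><orth>étalagiste</orth></form>
</entry>
```

Here @corresp is set in both directions so that the link can be followed from either entry. If the link should attach to one particular sense or form rather than to the whole entry, the same attribute could go on that element instead.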
> We will be very grateful if someone can give us some hints as to how to proceed or, even better, examples of how this has been done in other projects.
As I said, using TEI for a lexical DB is stretching it a bit. Not that I haven't tried to do so, e.g. for our lexicon of historical Slovene (http://nl.ijs.si/imp/imp25k/html-s/), but that has quite a simple structure.
> Thank you very much for your help.
> Carla Parra Escartín
> PhD Candidate
> Forskergruppen for lingvistikk og språkstudier
> Universitetet i Bergen
> Institutt for lingvistiske, litterære og estetiske studier (LLE)
> Rom 437
> Sydnesplassen 7, N-5007 Bergen
> Tlf. +47 55588945