In my opinion, you would do better to put the lemma value into an element of its
own. The attribute value approach is really only suitable for simple cases.

So if it were me, I would define new elements <form> and <lem> as specialised
kinds of <seg> (i.e. as synonyms for <seg type="form"> and <seg type="lem">) and
then mark it up thusly:


<w>
<lem>in primis</lem>
<form>in prrrrrimmmmissss</form>
</w>

This means you can put markup, as well as spaces, into the <lem>.
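
As a rough illustration of how such markup might be consumed downstream, here is a minimal Python sketch that pulls (lemma, form) pairs out of <w> elements. It assumes the <lem>/<form> children suggested above, ignores the TEI namespace for brevity, and uses a made-up sample string and a hypothetical function name (lemma_form_pairs).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample using the <lem>/<form> children suggested above.
# A real TEI file would be in the TEI namespace; that is omitted here.
SAMPLE = """<s>
  <w><lem>in primis</lem><form>in prrrrrimmmmissss</form></w>
</s>"""


def lemma_form_pairs(xml_text):
    """Return (lemma, form) pairs for every <w> that has both children."""
    root = ET.fromstring(xml_text)
    pairs = []
    for w in root.iter("w"):
        lem = w.find("lem")
        form = w.find("form")
        if lem is not None and form is not None:
            # itertext() also collects text nested inside inline markup,
            # so a <lem> containing further elements still yields its text.
            pairs.append((
                "".join(lem.itertext()).strip(),
                "".join(form.itertext()).strip(),
            ))
    return pairs


print(lemma_form_pairs(SAMPLE))
```

Because the lemma is element content rather than an attribute value, the space in "in primis" needs no special handling.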

Alternatively, you could adopt a simple convention like this:

<w lemma="in_primis">....</w>

Redefining the datatype of the @lemma attribute to accept spaces, as you propose,
would be a bit problematic since that changes the definition. Of course, you
could also argue that it *shouldn't* be defined as data.word... but it currently is!
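
If you go with the underscore convention, the conversion in both directions is trivial. The following Python sketch is only an illustration: the function names are hypothetical, and has_no_whitespace is a crude stand-in for the data.word constraint (it checks only that the value is non-empty and contains no whitespace, not the full datatype).

```python
def encode_lemma(lemma):
    """Join the words of a multiword lemma with underscores."""
    return "_".join(lemma.split())


def decode_lemma(value):
    """Turn an underscore-joined attribute value back into a spaced lemma."""
    return value.replace("_", " ")


def has_no_whitespace(value):
    """Crude approximation of the data.word constraint (an assumption):
    non-empty and free of whitespace."""
    return bool(value) and not any(ch.isspace() for ch in value)


print(encode_lemma("in primis"))
print(has_no_whitespace("in primis"), has_no_whitespace("in_primis"))
```

Note that the convention becomes ambiguous if a lemma itself contains an underscore, so it only works if that character never occurs in your entries.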



Elena Pierazzo writes:
> Dear all,
> 
> I'm working on a project with a strong lexicographical component, so we
> are lemmatizing all the words. For this purpose we are using:
> 
> <w lemma="">word</w>
> 
> but we are having trouble with multiword expressions (e.g. "in primis").
> From a lexicographical point of view it is a matter of a single entry
> (separating the expression into "in" and "primis" is simply nonsensical).
> The problem is that
> 
> <w lemma="in primis">in primis</w>
> 
> is not valid as the lemma definition is
> 
> <attList>
>      <attDef ident="lemma" mode="change">
>         <desc>identifies the word's lemma (dictionary entry form).</desc>
>         <datatype minOccurs="1" maxOccurs="1">
>            <rng:ref xmlns:rng="http://relaxng.org/ns/structure/1.0" name="data.word"/>
>         </datatype>
>      ...
>      </attDef>
> </attList>
> 
> 
> I can modify the definition, but I was thinking that my problem may be
> rather common (for instance, the Italian language contains thousands of
> multiword expressions...) and would like to submit the question to
> everybody.
> 
> Bests
> 
> Elena
> 
> 
> 
> -- 
> Elena Pierazzo
> Associate Researcher
> Centre for Computing in the Humanities
> King's College London
> Kay House 7 Arundel St
> London WC2R 3DX
> 
> Phone: 0207-848-1949
> Fax: 0207-848-2980
> 