Dear Frederik,

thanks for your answer. You are right, but... the problem is actually
harder than I sketched in my example, because the different tools
tokenize the text in different ways and differ in other options as well
(some keep punctuation, some delete it). The result is that, for the same
sentence, one tool finds 5 tokens and another only 3 (and then they group
them into phrases differently). So it won't be easy to map the units
produced by one tool onto those of another. That is why I thought about
keeping the original text and then the different annotations as
alternatives, without mapping the tokens from one to another. But I am
also not sure whether that is the best way of doing it. I could of course
use stand-off markup and point to character offsets, something I am
somewhat reluctant to do...
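
To make the character-offset idea concrete, here is a minimal sketch in
Python rather than TEI. Everything in it is an assumption for
illustration: the sample sentence, the tool names and the offsets are
invented, and it is not the output of any real tokenizer. Each tool gets
its own layer of (start, end) offsets into the untouched original text,
so the two segmentations never have to be mapped onto each other.

```python
# Hypothetical sample sentence; tools and offsets are invented for illustration.
text = "Hello, world!"

# One annotation layer per tool: (start, end) character offsets into `text`.
layers = {
    "tool1": [(0, 5), (5, 6), (7, 12), (12, 13)],  # keeps punctuation as tokens
    "tool2": [(0, 5), (7, 12)],                    # drops punctuation
}


def tokens(layer):
    """Recover the token strings of one layer by slicing the original text."""
    return [text[start:end] for start, end in layers[layer]]


def overlapping(layer_a, layer_b):
    """Return pairs of spans from two layers that share at least one character."""
    return [
        (a, b)
        for a in layers[layer_a]
        for b in layers[layer_b]
        if a[0] < b[1] and b[0] < a[1]
    ]


if __name__ == "__main__":
    for name in layers:
        print(name, tokens(name))
    print(overlapping("tool1", "tool2"))
```

The original text stays untouched and the layers can still be compared
span by span, which might keep the evaluation manageable even though
text and annotation are no longer inline.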

Regards,

José

On 10/10/2018 12:06 PM, Frederik Elwert wrote:
> Dear José,
>
> I think this really calls for stand-off markup. TEI might still not
> really be up to this kind of alternative segmentation; if I were free to
> choose, I’d probably use a pure stand-off format like PAULA[1].
>
> In a TEI scenario, I’d probably go for something like this:
>
> <s><w xml:id="t1">token1</w> <w xml:id="t2">token2</w> <w
> xml:id="t3">token3</w></s>
> <joinGrp xml:id="jg1" resp="#tool1" exclude="#jg2">
>     <join result="phr" target="#t1 #t2" />
>     <join result="phr" target="#t3" />
> </joinGrp>
> <joinGrp xml:id="jg2" resp="#tool2" exclude="#jg1">
>     <join result="phr" target="#t1" />
>     <join result="phr" target="#t2 #t3" />
> </joinGrp>
>
> When testing this, I realized that a join with a single target is
> currently not allowed. One could simply leave it out, as it does not
> really join anything. But I think one could make an argument to allow
> one-target-joins, as it still creates a virtual wrapping element (phr in
> this case) that one could want also for a single target.
>
> If you don’t have tokens (w) to point to, you would need to resort to
> XPointer or something like that.
>
> Frederik
>
>
> [1]: http://www.sfb632.uni-potsdam.de/en/paula.html
>
>
> On 10.10.2018 at 07:17, José Calvo Tello wrote:
>> Dear list,
>>
>>
>> I am trying to find a way of encoding different linguistic annotations
>> from several sources in the same document. Different NLP tools would
>> annotate grammatical information, and their divisions could overlap;
>> for example, one tool could analyze a three-word sentence so that the
>> first two words belong to one phrase (<s><phr>token1
>> token2</phr><phr>token3</phr></s>), while a second could conclude
>> that the last two words belong to the same phrase
>> (<s><phr>token1</phr><phr>token2 token3</phr></s>). I thought about
>> using the elements choice, orig and reg for that, although I doubt that
>> was the purpose of the reg element. An example:
>>
>> <s>
>>      <choice>
>>          <orig>token1 token2 token3</orig>
>>          <reg resp="#tool1"><phr>token1 token2</phr> <phr>token3</phr></reg>
>>          <reg resp="#tool2"><phr>token1</phr> <phr>token2 token3</phr></reg>
>>      </choice>
>> </s>
>>
>> Is there a better element? Should I use another strategy? I would like
>> to keep text and annotation close together, so that evaluation is easier.
>> Best regards from Würzburg,
>> José Calvo