Dear Eduard and Piotr,

Thank you for your insights. I do hope that the LingSIG proposal 
[1] is accepted. If useful, you might mention my own Ursus project [2] 
as a use case, but I am sure that there are plenty of already existing 
use cases.

I am currently encoding as follows:

<w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>

so I am not prepending a "#" to "4-S--------". It would take only a 
quick vi find/replace to prepend the "#", and minor changes to the JS 
and Python scripts so that they strip it again when processing.
But I am reluctant to do so, because I agree with the argument in the 
ticket that it is a kludge.
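
To make the cost of that change concrete, here is a minimal Python sketch of both directions: prepending the "#" in the XML source and stripping it again before processing. The names ANA_RE, prepend_hash and strip_hash are hypothetical and are not taken from the Ursus scripts. The sketch also assumes double-quoted attribute values and treats the file as plain text, as a find/replace would.

```python
import re

# Hypothetical helper names; not taken from the actual Ursus scripts.
# Assumes attribute values are double-quoted, as in the <w> example above.
ANA_RE = re.compile(r'\bana="([^"]*)"')


def prepend_hash(xml_text: str) -> str:
    """Prefix each whitespace-separated @ana token with '#' unless it already has one."""
    def fix(match: re.Match) -> str:
        tokens = match.group(1).split()
        fixed = [t if t.startswith("#") else "#" + t for t in tokens]
        return 'ana="{}"'.format(" ".join(fixed))
    return ANA_RE.sub(fix, xml_text)


def strip_hash(ana_value: str) -> str:
    """Recover the bare tag from an @ana value, e.g. '#4-S--------' -> '4-S--------'."""
    return " ".join(t[1:] if t.startswith("#") else t for t in ana_value.split())


sample = '<w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>'
converted = prepend_hash(sample)
```

Running prepend_hash twice leaves the text unchanged, so the conversion could be applied safely to a file that is already partly converted.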

No linter or parser has reported a validation failure because of this.
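
As a rough illustration of Piotr's point about bare fragment identifiers, the sketch below resolves both forms against a document URI using Python's standard urllib.parse. The base URI and the name resolve are made up for the example. Under generic RFC 3986 reference resolution both values yield a URI, which is one possible reason no tool complained. That is only a guess: how a given validator actually checks teidata.pointer is not verified here.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical base URI standing in for the location of the TEI file itself.
BASE = "http://example.org/ursus/text.xml"


def resolve(ana_value: str, base: str = BASE) -> str:
    """Resolve an @ana value against the document URI, as a generic URI reference."""
    return urljoin(base, ana_value)


# With '#': a same-document reference; the tag becomes the fragment.
with_hash = resolve("#4-S--------")
# Without '#': read as a relative path, so it points to a sibling resource.
without_hash = resolve("4-S--------")

# Split the first result back into document URI and fragment.
doc_uri, fragment = urldefrag(with_hash)
```

Here with_hash stays inside the current document (text.xml plus a fragment), while without_hash points to a different, presumably non-existent, resource next to it. Neither is meaningful as a pointer, which is the semantic objection raised in the ticket.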

Do you still suggest that I prepend the "#"?

Best,
Paolo

[1] Ticket https://github.com/TEIC/TEI/issues/1670
[2] http://www.unipa.it/paolo.monella/ursus



On 05/01/2018 17:41, Piotr Bański wrote:
> Dear Paolo,
> 
> One more question/nitpick. You say:
> 
>  > "#p-acp" is no valid pointer (no valid URI)
> 
> Well, it is not, but it's a valid fragment identifier (see [1]), and 
> somewhere in the maze of W3C specs, there is a statement on interpreting 
> bare fragment identifiers as being virtually appended to the URI of the 
> current document, yielding a correct (longer) URI. So I think that you 
> are fine, syntactically (or have you actually got a failed validation 
> result? I'd be very curious to see a test case then), but obviously not 
> semantically (we address this "pretend that POS values are fragIDs, just 
> for the sake of the tei.pointer datatype" issue in the text of the 
> github ticket to which I pointed you, alongside other arguments against 
> using @ana for this purpose).
> 
> Best regards,
> 
>    Piotr
> 
> [1]: https://tools.ietf.org/html/rfc3986#appendix-A
> 
> 
> 
> On 01/05/18 16:39, Piotr Bański wrote:
>> Dear Paolo,
>>
>> Please have a look at the proposal addressing this at 
>> https://github.com/TEIC/TEI/issues/1670
>>
>> It avoids the "POS-in-@ana" issue, and provides arguments for that. 
>> You will also see there a list of projects that use the proposed 
>> format, some of them based on MorphAdorner.
>>
>> The practical question for you now, I guess, is either to keep the 
>> existing TEI skeleton and disobey the @ana datatype or adopt the 
>> changes we have suggested in the ticket and put the POS information 
>> where it belongs, hoping that the Council will address the issue 
>> before the end of the world. It's a gamble... :-)
>>
>> Best wishes,
>>
>>   Piotr
>>
>>
>> On 01/02/18 21:11, Paolo Monella wrote:
>>> Dear all,
>>>
>>> I ran a lemmatizer/PoS tagger (TreeTagger) on a TEI P5-encoded file 
>>> and want to encode the result in attributes of <w>.
>>>
>>> I searched the TEI-L archives and the Internet. I found that 
>>> MorphAdorner [1] uses @lemma for lemmata and @ana for the PoS output 
>>> (e.g. "adjective, positive genitive plural masculine"):
>>>
>>> <w lemma="in" ana="#p-acp" reg="in" xml:id="A88624-000740">in</w>
>>>
>>> I had tried this encoding:
>>>
>>> <w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>
>>>
>>> The main difference is that MorphAdorner prepends a "#" to the value 
>>> of @ana because this value should be a teidata.pointer [2].
>>>
>>> In any case, also "#p-acp" is no valid pointer (no valid URI), so do 
>>> you think I should leave my encoding as it is, or prepend "#" as in 
>>> @ana="#4-S--------"?
>>>
>>> Thank you,
>>> Paolo
>>>
>>> [1] See paragraph "Simplified TEI P5-like output" in 
>>> http://morphadorner.northwestern.edu/morphadorner/documentation/xmloutput/ 
>>>
>>> [2] 
>>> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.analytic.html 
>>>
>>>
>>