Dear Paolo,

Thanks for the link, impressive work! It's going to be a handy reference.

As for your question, on whether or not to prepend the '#', I would say 
that it's a kludge either way, for different reasons, and I think in 
such cases it's the practical factors that come to the fore. If it's 
more work and maintenance for you to prepend the '#' only to cut it off 
for querying/visualization, then I'd say don't bother...
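
If you did prepend the '#', the stripping step on the processing side could be roughly as small as the sketch below. It is in Python only because you mention Python scripts; the function name pos_from_ana and the second <w> (w316) are made up for illustration, and none of this is taken from your Ursus code.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def pos_from_ana(ana: str) -> str:
    """Return the POS tag from an @ana value, with or without a leading '#'."""
    # Hypothetical helper: drops one leading '#' and leaves everything else alone.
    return ana[1:] if ana.startswith("#") else ana


# Two tokens, one with the '#' and one without (w316 is an invented example).
sample = (
    '<s xmlns="http://www.tei-c.org/ns/1.0">'
    '<w ana="#4-S--------" lemma="in" n="in" xml:id="w315">in</w>'
    '<w ana="4-S--------" lemma="in" n="in" xml:id="w316">in</w>'
    "</s>"
)

root = ET.fromstring(sample)
# Both variants normalize to the same tag, so downstream code need not care.
tags = [pos_from_ana(w.get("ana", "")) for w in root.iter(TEI + "w")]
print(tags)
```

A normalizing step like this accepts both variants, so the choice between them would not have to be final.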

It's a perfect illustration of part of our motivation for creating the 
ticket: a corpus creator, upon looking at this sort of "dilemma" on 
which kludge to use, may simply decide not to use the TEI at all, or 
will hack it his way, and we're going to see yet another variation where 
there could be a simple standardized approach. But maybe we need 15 more 
cases of a similar sort to begin to sound convincing? I wonder.

Best regards,

   Piotr



On 01/05/18 19:01, Paolo Monella wrote:
> Dear Eduard and Piotr,
> 
> thank you for your insights. I do hope that the proposal of the LingSIG 
> [1] is accepted. If useful, you might mention my own Ursus project [2] 
> as a use case, but I am sure that there are plenty of already existing 
> use cases.
> 
> I am currently encoding as follows:
> 
> <w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>
> 
> so I am not prepending a "#" to "4-S--------". It would take only a 
> little VI find/replace to prepend the "#", and minor changes in the JS 
> and Python scripts to make them process it (by removing it).
> But I am reluctant to do so because I agree with the argument in the 
> ticket that it is a kludge.
> 
> No linter or parser reported a validation failure because of this.
> 
> Do you still suggest that I prepend the "#"?
> 
> Best,
> Paolo
> 
> [1] Ticket https://github.com/TEIC/TEI/issues/1670
> [2] http://www.unipa.it/paolo.monella/ursus
> 
> 
> 
> On 05/01/2018 17:41, Piotr Bański wrote:
>> Dear Paolo,
>>
>> One more question/nitpick. You say:
>>
>>  > "#p-acp" is no valid pointer (no valid URI)
>>
>> Well, it is not, but it's a valid fragment identifier (see [1]), and 
>> somewhere in the maze of W3C specs, there is a statement on 
>> interpreting bare fragment identifiers as being virtually appended to 
>> the URI of the current document, yielding a correct (longer) URI. So I 
>> think that you are fine, syntactically (or have you actually got a 
>> failed validation result? I'd be very curious to see a test case 
>> then), but obviously not semantically (we address this "pretend that 
>> POS values are fragIDs, just for the sake of the tei.pointer datatype" 
>> issue in the text of the github ticket to which I pointed you, 
>> alongside other arguments against using @ana for this purpose).
>>
>> Best regards,
>>
>>    Piotr
>>
>> [1]: https://tools.ietf.org/html/rfc3986#appendix-A
>>
>>
>>
>> On 01/05/18 16:39, Piotr Bański wrote:
>>> Dear Paolo,
>>>
>>> Please have a look at the proposal addressing this at 
>>> https://github.com/TEIC/TEI/issues/1670
>>>
>>> It avoids the "POS-in-@ana" issue, and provides arguments for that. 
>>> You will also see there a list of projects that use the proposed 
>>> format, some of them based on MorphAdorner.
>>>
>>> The practical question for you now, I guess, is either to keep the 
>>> existing TEI skeleton and disobey the @ana datatype or adopt the 
>>> changes we have suggested in the ticket and put the POS information 
>>> where it belongs, hoping that the Council will address the issue 
>>> before the end of the world. It's a gamble... :-)
>>>
>>> Best wishes,
>>>
>>>   Piotr
>>>
>>>
>>> On 01/02/18 21:11, Paolo Monella wrote:
>>>> Dear all,
>>>>
>>>> I ran a lemmatizer/PoS tagger (TreeTagger) on a TEI P5-encoded file 
>>>> and want to encode the result in attributes of <w>.
>>>>
>>>> I searched the TEI-L archives and the Internet. I found that 
>>>> MorphAdorner [1] uses @lemma for lemmata and @ana for the PoS output 
>>>> (e.g. "adjective, positive genitive plural masculine"):
>>>>
>>>> <w lemma="in" ana="#p-acp" reg="in" xml:id="A88624-000740">in</w>
>>>>
>>>> I had tried this encoding:
>>>>
>>>> <w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>
>>>>
>>>> The main difference is that MorphAdorner prepends a "#" to the value 
>>>> of @ana because this value should be a teidata.pointer [2].
>>>>
>>>> In any case, also "#p-acp" is no valid pointer (no valid URI), so do 
>>>> you think I should leave my encoding as it is, or prepend "#" as in 
>>>> @ana="#4-S--------"?
>>>>
>>>> Thank you,
>>>> Paolo
>>>>
>>>> [1] See paragraph "Simplified TEI P5-like output" in 
>>>> http://morphadorner.northwestern.edu/morphadorner/documentation/xmloutput/
>>>>
>>>> [2] http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.analytic.html
>>>>
>>>