Thank you James: I hadn't thought of the fact that '4-S--------' is as 
valid a pointer as '#4-S--------', though both of them are lies, since 
there is no '4-S--------' file just like there is no '#4-S--------' 
fragment.
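
A minimal sketch of why both forms are formally pointers, assuming Python's standard urllib.parse and a made-up base URI for the edition file (the example.org address is purely hypothetical): the bare value resolves to a sibling file, the '#' value to a fragment of the current document.

```python
from urllib.parse import urljoin

# Hypothetical base URI of the TEI file; not the real location of the edition.
BASE = "http://example.org/corpus/ursus.xml"

def resolve_ana(value, base=BASE):
    """Resolve one @ana value against the document's base URI (RFC 3986 style)."""
    return urljoin(base, value)

# Bare value: read as a relative path, i.e. a file next to the document.
print(resolve_ana("4-S--------"))
# '#' value: read as a fragment identifier inside the same document.
print(resolve_ana("#4-S--------"))
```

Neither target exists in practice, which is the sense in which both are "lies".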

So I'll accept the suggestion of Piotr, not to bother about prepending a 
'#' only to cut it off later.
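
For anyone who has to read both conventions, here is a rough sketch of a tolerant reader, assuming Python's xml.etree.ElementTree; the helper name ana_tags is made up for illustration and is not taken from the Ursus scripts. It accepts @ana with or without the leading '#' and returns the bare tag(s).

```python
import xml.etree.ElementTree as ET

def ana_tags(w):
    """Return the @ana values of a <w> element with any leading '#' removed.

    @ana may hold several whitespace-separated pointers, so a list is returned.
    """
    raw = w.get("ana", "")
    return [v[1:] if v.startswith("#") else v for v in raw.split()]

# The two encodings discussed in this thread.
plain = ET.fromstring('<w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>')
hashed = ET.fromstring('<w ana="#4-S--------" lemma="in" n="in" xml:id="w315">in</w>')

print(ana_tags(plain))
print(ana_tags(hashed))
```

With a reader like this, the choice between the two encodings has no effect on querying or visualization.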

I'm happy to hear that the Council Face2Face meeting liked @msd.

All best,
Paolo


On 09/01/2018 18:36, James Cummings wrote:
> 
> Hi Paolo and Piotr, etc.,
> 
> 
> In the use of @ana do remember that this is an attribute with a datatype 
> of 1-inf teidata.pointer values. Thus when you have '4-S--------' you 
> are really saying there is a file in the filesystem in this directory 
> called that. I know that you know this, just noting it for completeness.
> 
> 
> I agree with the intent of the referenced issue 1670 in saying that the 
> use of @ana in the linguistic examples is a kludge (though if I were 
> doing that kind of thing I'd be pointing to a <category> of a <taxonomy> 
> rather than <interp>, but that is probably because I'm not a linguist 
> and like the hierarchical flexibility of nested <category> elements). I'm
> 
> Since you mention it, there was significant discussion on issue 1670 at 
> the Council Face2Face meeting in Victoria but the ticket wasn't updated 
> then because it wasn't done as part of the ticket-processing sessions 
> but as a main discussion item (as we recognise its importance ... there 
> are much older thornier tickets out there!). The ticket owner should 
> update it when he gets time. My unreliable memory of this is that 
> att.linguistic was strongly supported, including having @lemma and 
> @lemmaRef in it, that @pos and @msd were also thought ok. I seem to 
> remember that the concept of @join was acceptable but people wondered 
> about whether there was a better name (and I think I wondered what 
> happens if two adjacent words have some form of conflicting @join, i.e. 
> is this an error and should we add schematron for it or something). From 
> my recollection most of the discussion was about the proposed @reg 
> (whose name I certainly don't like for historical reasons). I'm sure I 
> would have argued against the reintroduction of a @reg attribute fearing 
> people would abuse this for what <reg> was created for 
> in editorial transcription and negating the whole war on text-bearing 
> attributes and creation of the <choice> element. I know from the ticket 
> that you think imposing use of <choice> creates too much of a burden for 
> regularisation, but you actually argue more in favour of it when you 
> note that the proposed @reg might need to store multi-word sequences... 
> exactly what we don't want in an attribute! Though your @reg attribute 
> issue 2 on that issue seems to ignore that <w> can self nest? Surely 
> that would be the solution for multi-word units needing a single @reg? 
> And I'm not against the introduction of new linguistic attributes, 
> though think this often ignores the power of XML child hierarchies. 
> Personally, I want to avoid the storage of any free text of any sort in 
> any attribute, that is I like attribute values to be strongly tied to 
> processable, checkable, datatypes. (Thus I dislike @lemma for the same 
> reason and think @lemmaRef should be used instead wherever feasible!) 
> So my memory of this ticket is that it was going to be moved to status Go 
> (or this and Needs Discussion simultaneously to reflect a need to change 
> a couple aspects of it).
> 
> 
> Best wishes,
> 
> James
> 
> 
> --
> 
> Dr James Cummings, [log in to unmask]
> 
> School of English Literature, Language, and Linguistics, Newcastle 
> University
> 
> ------------------------------------------------------------------------
> *From:* TEI (Text Encoding Initiative) public discussion list 
> <[log in to unmask]> on behalf of Piotr Bański <[log in to unmask]>
> *Sent:* 06 January 2018 19:57:49
> *To:* [log in to unmask]
> *Subject:* Re: PoS tagging in <w> with @ana: pointer?
> Dear Paolo,
> 
> Thanks for the link, impressive work! It's going to be a handy reference.
> 
> As for your question, on whether or not to prepend the '#', I would say
> that it's a kludge either way, for different reasons, and I think in
> such cases it's the practical factors that come to the fore. If it's
> more work and maintenance for you to prepend the '#' only to cut it off
> for querying/visualization, then I'd say don't bother...
> 
> It's a perfect illustration for part of our motivation for creating the
> ticket: a corpus creator, upon looking at this sort of "dilemma" on
> which kludge to use, may simply decide not to use the TEI at all, or
> will hack it his way, and we're going to see yet another variation where
> there could be a simple standardized approach. But maybe we need 15 more
> cases of a similar sort to begin to sound convincing? I wonder.
> 
> Best regards,
> 
>     Piotr
> 
> 
> 
> On 01/05/18 19:01, Paolo Monella wrote:
>> Dear Eduard and Piotr,
>> 
>> thank you for your insights. I do hope that the proposal of the LingSIG 
>> [1] is accepted. If useful, you might mention my own Ursus project [2] 
>> as a use case, but I am sure that there are plenty of already existing 
>> use cases.
>> 
>> I am currently encoding as follows:
>> 
>> <w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>
>> 
>> so I am not prepending a "#" to "4-S--------". It would take only a 
>> little VI find/replace to prepend the "#", and minor changes in the JS 
>> and Python scripts to make them process it (by removing it).
>> But I am reluctant to do so because I agree with the argument in the 
>> ticket that it is a kludge.
>> 
>> No lint or parser gave me a failed validation because of this.
>> 
>> Do you still suggest that I prepend the "#"?
>> 
>> Best,
>> Paolo
>> 
>> [1] Ticket https://github.com/TEIC/TEI/issues/1670
>> [2] http://www.unipa.it/paolo.monella/ursus
>> 
>> 
>> 
>> On 05/01/2018 17:41, Piotr Bański wrote:
>>> Dear Paolo,
>>>
>>> One more question/nitpick. You say:
>>>
>>>  > "#p-acp" is no valid pointer (no valid URI)
>>>
>>> Well, it is not, but it's a valid fragment identifier (see [1]), and 
>>> somewhere in the maze of W3C specs, there is a statement on 
>>> interpreting bare fragment identifiers as being virtually appended to 
>>> the URI of the current document, yielding a correct (longer) URI. So I 
>>> think that you are fine, syntactically (or have you actually got a 
>>> failed validation result? I'd be very curious to see a test case 
>>> then), but obviously not semantically (we address this "pretend that 
>>> POS values are fragIDs, just for the sake of the tei.pointer datatype" 
>>> issue in the text of the github ticket to which I pointed you, 
>>> alongside other arguments against using @ana for this purpose).
>>>
>>> Best regards,
>>>
>>>    Piotr
>>>
>>> [1]: https://tools.ietf.org/html/rfc3986#appendix-A
>>>
>>>
>>>
>>> On 01/05/18 16:39, Piotr Bański wrote:
>>>> Dear Paolo,
>>>>
>>>> Please have a look at the proposal addressing this at 
>>>> https://github.com/TEIC/TEI/issues/1670
>>>>
>>>> It avoids the "POS-in-@ana" issue, and provides arguments for that. 
>>>> You will also see there a list of projects that use the proposed 
>>>> format, some of them based on MorphAdorner.
>>>>
>>>> The practical question for you now, I guess, is either to keep the 
>>>> existing TEI skeleton and disobey the @ana datatype or adopt the 
>>>> changes we have suggested in the ticket and put the POS information 
>>>> where it belongs, hoping that the Council will address the issue 
>>>> before the end of the world. It's a gamble... :-)
>>>>
>>>> Best wishes,
>>>>
>>>>   Piotr
>>>>
>>>>
>>>> On 01/02/18 21:11, Paolo Monella wrote:
>>>>> Dear all,
>>>>>
>>>>> I ran a lemmatizer/PoS tagger (TreeTagger) on a TEI P5-encoded file 
>>>>> and want to encode the result in attributes of <w>.
>>>>>
>>>>> I searched the TEI-L archives and the Internet. I found that 
>>>>> MorphAdorner [1] uses @lemma for lemmata and @ana for the PoS output 
>>>>> (e.g. "adjective, positive genitive plural masculine"):
>>>>>
>>>>> <w lemma="in" ana="#p-acp" reg="in" xml:id="A88624-000740">in</w>
>>>>>
>>>>> I had tried this encoding:
>>>>>
>>>>> <w ana="4-S--------" lemma="in" n="in" xml:id="w315">in</w>
>>>>>
>>>>> The main difference is that MorphAdorner prepends a "#" to the value 
>>>>> of @ana because this value should be a teidata.pointer [2].
>>>>>
>>>>> In any case, also "#p-acp" is no valid pointer (no valid URI), so do 
>>>>> you think I should leave my encoding as it is, or prepend "#" as in 
>>>>> @ana="#4-S--------"?
>>>>>
>>>>> Thank you,
>>>>> Paolo
>>>>>
>>>>> [1] See paragraph "Simplified TEI P5-like output" in 
>>>>> http://morphadorner.northwestern.edu/morphadorner/documentation/xmloutput/
>>>>>
>>>>> [2] 
>>>>> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-att.global.analytic.html 
>>>>>
>>>>>
>>>>