Dear Frederik (and all),

This is an interesting exchange -- thanks.

Let me address a minor detail from your message:

 > * There is a generic @ana whose semantics are almost entirely project
 > specific.

Its semantics are merely to provide a pointer (or pointers) to another 
container; in particular, this can be a dictionary entry (given that 
you mentioned <gramGrp>).
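
For instance (just a sketch, with an invented xml:id and word, and 
whether a plain dictionary entry is a legitimate target for @ana is of 
course a project decision), a text word could simply point at the entry 
that describes it:

    <entry xml:id="lex.bank">
      <form><orth>bank</orth></form>
      <gramGrp><pos>noun</pos></gramGrp>
    </entry>

    <w ana="#lex.bank" lemma="bank">banks</w>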

And I believe that you meant to contrast <gramGrp> with <fs> below:

 > When I started looking into this, my first naive expectation was
 > something like <gramGrp> in <w>: The dictionary module allows encoding
 > POS, gender, number, etc., but only for dictionary entries, not for
 > text words. And I must say I still find this compelling, as it would
 > provide explicit semantics instead of generic containers.

But note that the semantics of <gramGrp> are not all that explicit:

* from the point of view of containers, take <gender>: does it refer to 
sex [semantic], to inflection class [lexical], or to agreement class 
[syntactic]? Similarly with <iType>, which is not enough for some 
languages (see e.g. 
http://sourceforge.net/p/tei/feature-requests/276/ )

* from the point of view of content, chaos may reign inside <gramGrp> 
elements, whereas you can use a feature structure declaration to 
restrict the content of feature values in various ways (see the sketch 
below).
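
For concreteness, here is a minimal sketch of both points (the feature 
and value names are invented): a word points at a feature structure, 
and the project's feature system declaration pins down what a value may 
be:

    <w ana="#msd.n-fem-sg">Frau</w>

    <fs xml:id="msd.n-fem-sg" type="msd">
      <f name="pos"><symbol value="noun"/></f>
      <f name="agreementGender"><symbol value="feminine"/></f>
      <f name="number"><symbol value="singular"/></f>
    </fs>

    <fsDecl type="msd">
      <fDecl name="agreementGender">
        <fDescr>agreement class, not sex or inflection class</fDescr>
        <vRange>
          <vAlt>
            <symbol value="feminine"/>
            <symbol value="masculine"/>
            <symbol value="neuter"/>
          </vAlt>
        </vRange>
      </fDecl>
    </fsDecl>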

-- this is all just to say that <fs> is a powerful and maybe somewhat 
underestimated tool. Naturally, not all projects need all this power, 
and it may be enough to use @type for some subtler distinctions and 
possibly ODD to define some further constraints, so please treat the 
above as a side remark.

Best regards,

   Piotr


On 13/10/14 19:05, Frederik Elwert wrote:
> On 13.10.2014 at 18:08, Serge Heiden wrote:
>> Dear Roberto,
>>
>> On 11/10/2014 09:23, Roberto Rosselli Del Turco wrote:
>>> You cite a different case, that of computational linguistics
>>> annotation: as you note, there are specialized formats that would
>>> probably serve you better than converting everything into TEI XML, so I
>>> think that the strategy of providing TEI-encoded texts for "general"
>>> use and a specific format for linguistic analysis makes perfect sense.
>> *"general" use versus linguistic analysis*
>> Today, it is not inconceivable that NLP tools can automatically project
>> linguistic knowledge into texts with sufficiently good results - lemma
>> and POS annotations, for example - so that they are more and more used
>> by all disciplines of the humanities - say, for content analysis - not
>> only for linguistic analysis. Typically, historians do content analysis
>> on NLP-lemmatized texts established by classical philologists
>> (articulating three different levels of objects and disciplinary goals).
>
> Yes, I completely agree.
>
>> *"general" TEI format versus linguistic analysis specific format*
>> The gap between a "general" TEI format and a format specific to
>> linguistic analysis doesn't come primarily from their purpose, but from
>> the fact that the latter activity typically concerns ALL the words of a
>> given text. In "general" TEI encoding, we can "generally" consider that
>> only some specific words need encoding attention.
>> The fact is that an XML text encoded at the word level for EVERY word
>> is difficult to manipulate without suitable tools and user interfaces.
>> So, generally, you don't use the same tools, and each tool tends to
>> prefer an efficient format. But nothing prevents those tools from
>> sharing a common or compatible format.
>> Secondly, you should consider that it is often not possible to directly
>> compute the "words" (tokenize) of a "general" TEI-encoded text, because:
>> - the 'base text' can be tricky to separate from the rest of the XML
>> - some words can have a whole encoding tree inside their graphical form
>> so it can be difficult to get a "surface form" right
>> - the <choice> deus ex machina beast
>> - etc.
>> In the TXM software, we develop tokenizers by specifying which TEI
>> elements may contain 'base text' content and delimit or break word or
>> sentence linguistic levels. This must be tuned for each TEI idiom.
>
> I think this is a major point. This tuning could be an argument for
> having project-specific translators of TEI into more streamlined
> formats for further processing/analysis, which then represent one
> purpose-specific interpretation of the variety of information encoded
> in TEI.
>
> But I think one could distinguish two use cases here:
>
> * Extracting a base text for further processing/annotation
> (tokenization, tagging, etc.).
> * Representing linguistic annotation for further processing in TEI.
>
> In our project, we currently have two corpora that each pose one
> challenge: In one corpus, we have only structural information in TEI, so
> we extract a base text and add linguistic annotations using tools from
> computational linguistics. We currently use the results as one-off
> intermediate steps for further analysis and don’t store them back in
> TEI. But the MorphAdorner approach that Martin Mueller mentioned in the
> NLTK thread looks interesting.
>
> In the other corpus, we already have linguistic annotations, and we’d
> like to keep them in our TEI version. We have found a way to do so, but
> I don’t think the current ways of representing linguistic annotations
> are very satisfactory yet.
>
> * There are @lemma and @lemmaRef, which work really well.
> * There is a generic @ana whose semantics are almost entirely project
> specific.
> * There is <fs>, which is like a verbose version of @ana – i.e., a
> generic entry point for arbitrary data. With one exception: *If* your
> project chooses to use the ISOcat system, the datcat system allows you
> to specify linguistic categories in a linked data fashion (which we do).
>
> When I started looking into this, my first naive expectation was
> something like <gramGrp> in <w>: The dictionary module allows encoding
> POS, gender, number, etc., but only for dictionary entries, not for
> text words. And I must say I still find this compelling, as it would
> provide explicit semantics instead of generic containers.
>
> Best,
> Frederik
>
>
>
>> *tightening here and there*
>>> As a side note, looking at texts encoded by colleagues using the
>>> transcr module, I noticed that often I would have made (almost) exactly
>>> the same choices, so that the end product looks remarkably similar.
>>> Except for some cases where there are too many different ways to do
>>> the same thing ... but I guess not all of TEI may become SIMPLE ;)
>>> (although some tightening here and there would be a good thing!).
>> Every tightening has a purpose:
>> - TEI light
>> - TEI tight
>> - Bare bones TEI
>> - TEI simple
>> What is yours?
>>
>> Best,
>> Serge
>>
>
>
>