On 11/10/2014 09:23, Roberto Rosselli Del Turco wrote:
> You cite a different case, that of computational linguistics
> annotation: as you note, there are specialized formats that would
> probably serve you better than converting everything to TEI XML, so I
> think that the strategy of providing TEI encoded texts for "general"
> use and a specific format for linguistic analysis makes perfect sense.
*"general" use versus linguistic analysis*
Today, NLP tools can automatically project linguistic knowledge into
texts with sufficiently good results - lemma and part-of-speech (POS)
annotations, for example - so they are increasingly used across all
disciplines of the humanities, say for content analysis, not only for
linguistic analysis.
Typically, historians do content analysis on NLP-lemmatized texts
established by classical philologists (articulating three different
levels of objects and disciplinary goals).
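To make this concrete, here is a minimal sketch of such automatic
annotation using the spaCy library and its small French model (my
choice of tool, model and sentence here is only an illustration; any
tagger/lemmatizer would do):

# Minimal sketch: automatic lemma and POS annotation with spaCy.
# Assumes: pip install spacy && python -m spacy download fr_core_news_sm
import spacy

nlp = spacy.load("fr_core_news_sm")
doc = nlp("Les historiens analysent les textes.")

for token in doc:
    # Each token carries its surface form, lemma and POS tag,
    # projected automatically by the statistical model.
    print(token.text, token.lemma_, token.pos_, sep="\t")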
*"general" TEI format versus linguistic analysis specific format*
The gap between a "general" TEI format and a format specific to
linguistic analysis doesn't come first from their purpose, but from the
fact that the latter activity typically concerns ALL the words of a
given text. In "general" TEI encoding, we can "generally" consider that
only some specific words need encoding attention.
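To see what "ALL the words" means in practice, here is a minimal
sketch of the word-level encoding that linguistic analysis requires
(the @lemma and @pos attributes follow the TEI att.linguistic class;
the words themselves are invented):

# Minimal sketch: building word-level TEI, one <w> element per word,
# each carrying @lemma and @pos (TEI att.linguistic attributes).
import xml.etree.ElementTree as ET

words = [("Les", "le", "DET"),
         ("historiens", "historien", "NOUN"),
         ("analysent", "analyser", "VERB")]

s = ET.Element("s")  # one <s> sentence element
for form, lemma, pos in words:
    w = ET.SubElement(s, "w", lemma=lemma, pos=pos)
    w.text = form

print(ET.tostring(s, encoding="unicode"))
# <s><w lemma="le" pos="DET">Les</w><w lemma="historien" ...>...</s>

Three words already produce three elements and six attributes; a whole
text encoded this way quickly dwarfs its 'base text'.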
The fact is that an XML text encoded at the word level, for EVERY word,
is difficult to manipulate without adapted tools and user interfaces.
So, in general, you don't use the same tools for both purposes, and
each tool tends to prefer an efficient format. But nothing prevents
those tools from sharing a common or compatible format.
Secondly, you should consider that it is often not possible to directly
compute the "words" of a "general" TEI-encoded text (that is, to
tokenize it), because:
- the 'base text' can be tricky to separate from the rest of the XML
- some words can have a whole encoding tree inside their graphical form,
so it can be difficult to get their "surface form" right
- the <choice> deus ex machina beast (see the sketch below)
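A minimal sketch of that last problem (the <w>/<orig>/<reg> fragment is
invented, and the policy of always preferring <reg> is just one
possible choice): a naive text traversal concatenates BOTH alternatives
of a <choice>, yielding two tokens for one word:

# Minimal sketch: why <choice> breaks naive surface-form extraction.
import xml.etree.ElementTree as ET

w = ET.fromstring("""<w>
  <choice>
    <orig>roy</orig>
    <reg>roi</reg>
  </choice>
</w>""")

# Naive extraction keeps BOTH alternatives:
print("".join(w.itertext()).split())   # -> ['roy', 'roi']

# A tokenizer must pick ONE branch per <choice>, e.g. the regularized one:
def surface(elem):
    if elem.tag == "choice":
        reg = elem.find("reg")          # keep only the <reg> child
        return surface(reg) if reg is not None else ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(surface(child))
        parts.append(child.tail or "")
    return "".join(parts)

print(surface(w).split())               # -> ['roi']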
In the TXM software, we develop tokenizers by specifying which TEI
elements may contain 'base text' content and which delimit or break the
word or sentence linguistic levels. This must be tuned for each TEI
idiom.
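As a rough illustration of the idea (not TXM's actual code; the element
classification below is an invented example of such a specification):

# Minimal sketch: a tokenizer driven by a per-idiom TEI element spec.
# The three sets below are a made-up classification, to be tuned per idiom.
import re
import xml.etree.ElementTree as ET

TEXT_BEARING = {"p", "l", "head", "hi", "reg"}  # may contain 'base text'
WORD_BREAKING = {"lb", "pb", "milestone"}       # end any word in progress
SKIP = {"note", "orig", "fw", "teiHeader"}      # never part of the base text

def flush(buf, tokens):
    if buf:
        tokens.append("".join(buf))
        buf.clear()

def consume(text, tokens, buf):
    # Whitespace ends the current word; non-whitespace extends it.
    for i, piece in enumerate(re.split(r"\s+", text)):
        if i > 0:
            flush(buf, tokens)
        if piece:
            buf.append(piece)

def tokenize(elem, tokens, buf):
    if elem.tag in SKIP:
        return                          # drop the whole subtree
    if elem.tag in WORD_BREAKING:
        flush(buf, tokens)
    bearing = elem.tag in TEXT_BEARING
    if bearing and elem.text:
        consume(elem.text, tokens, buf)
    for child in elem:
        tokenize(child, tokens, buf)
        if bearing and child.tail:      # a child's tail belongs to the parent
            consume(child.tail, tokens, buf)

p = ET.fromstring('<p>Le <hi>roy</hi><note>gloss</note> dort<lb/>encore</p>')
tokens, buf = [], []
tokenize(p, tokens, buf)
flush(buf, tokens)
print(tokens)  # -> ['Le', 'roy', 'dort', 'encore']

Note how an element glued to surrounding text without whitespace would
extend the current word rather than break it, which is exactly the
word-internal markup problem mentioned above.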
*tightening here and there*
> As a side note, looking at texts encoded by colleagues using the
> transcr module I noticed that often I would have made (almost) exactly
> the same choices, so that the end product looks remarkably similar.
> Except for some cases where there are too many different ways to do
> the same thing ... but I guess not everything TEI may become SIMPLE ;)
> (although some tightening here and there would be a good thing!).
Every tightening has a purpose:
- TEI Lite
- TEI Tite
- Bare Bones TEI
- TEI Simple
What is yours?
Dr. Serge Heiden, [log in to unmask], http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883