On 13.10.2014 at 18:08, Serge Heiden wrote:
> Dear Roberto,
>
> On 11/10/2014 09:23, Roberto Rosselli Del Turco wrote:
> > You cite a different case, that of computational linguistics
> > annotation: as you note, there are specialized formats that would
> > probably serve you better than converting everything in TEI XML, so I
> > think that the strategy of providing TEI encoded texts for "general"
> > use and a specific format for linguistic analysis makes perfect sense.
> *"general" use versus linguistic analysis*
> Today, it is no longer far-fetched to say that NLP tools can project
> linguistic knowledge into texts automatically with sufficiently good
> results, such as lemma and POS annotations, so that they are more and
> more used by all disciplines of the humanities - say for content
> analysis - not only for linguistic analysis.
> Typically, historians do content analysis on NLP-lemmatized texts
> established by classical philologists (articulating three different
> levels of objects and disciplinary goals).

Yes, I completely agree.

> *"general" TEI format versus linguistic analysis specific format*
> The gap between a "general" TEI format and a linguistic analysis
> specific format comes, first, not from their purpose, but from the fact
> that the latter activity typically concerns ALL the words of a given
> text. In "general" TEI encoding, we can "generally" consider that only
> some specific words need encoding attention.
> The fact is that an XML text encoded at the word level for EVERY word
> is difficult to manipulate without adapted tools and user interfaces.
> So, generally, you don't use the same tools, and each tool tends to
> prefer an efficient format. But nothing prevents those tools from
> sharing a common or compatible format.
> Secondly, you should consider that it is often not possible to directly
> compute the "words" (tokenize) of a "general" TEI encoded text, because:
> - the 'base text' can be tricky to separate from the rest of the XML
> - some words can have a whole encoding tree inside their graphical form
> so it can be difficult to get a "surface form" right
> - the <choice> deus ex machina beast
> - etc.
> In the TXM software, we develop tokenizers by specifying which TEI
> elements may contain 'base text' content and delimit or break word or
> sentence linguistic levels. This must be tuned for each TEI idiom.

I think this is a major point. This tuning could be an argument for
having project-specific translators from TEI into more streamlined
formats for further processing/analysis, which then represent one
purpose-specific interpretation of the variety of information encoded
in TEI.
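
To make this concrete with a made-up snippet (not from our corpora, and
the regularization is just an example):

    <w>
      <choice>
        <orig>vnd</orig>
        <reg>und</reg>
      </choice>
    </w>

One translator might emit the surface form "vnd", another "und"; both
are defensible, and which one you want depends entirely on the purpose
of the downstream analysis.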

But I think one could distinguish two use cases here:

* Extracting a base text for further processing/annotation
(tokenization, tagging, etc.).
* Representing linguistic annotation for further processing in TEI.

In our project, we currently have two corpora that each pose one of
these challenges: In one corpus, we have only structural information in
TEI, so we extract a base text and add linguistic annotations using
tools from computational linguistics. We currently use the results as
one-off intermediate steps for further analysis and don’t store them
back in TEI. But the MorphAdorner approach that Martin Mueller
mentioned in the NLTK thread looks interesting.
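
Just to sketch what storing such annotations back might look like (a
made-up sentence; the @ana values would point to a project-defined
tagset, here STTS labels purely for illustration):

    <p>
      <w lemma="der" ana="#ART">Der</w>
      <w lemma="Hund" ana="#NN">Hund</w>
      <w lemma="bellen" ana="#VVFIN">bellt</w>
      <pc>.</pc>
    </p>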

In the other corpus, we already have linguistic annotations, and we’d
like to keep them in our TEI version. We have found a way to do so, but
I don’t think the current ways of representing linguistic annotations
are very satisfactory yet.

* There is @lemma and @lemmaRef, which work really well.
* There is a generic @ana whose semantics are almost entirely
project-specific.
* There is <fs>, which is like a verbose version of @ana – i.e., a
generic entry point for arbitrary data. With one exception: *if* your
project chooses to use the ISOcat system, the datcat mechanism allows
you to specify linguistic categories in a linked data fashion (which we
do); a sketch follows below.
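
A simplified sketch of how these pieces can be combined (not our actual
markup; the ISOcat PID is a placeholder):

    <w lemma="bellen" ana="#vvfin">bellt</w>

    <!-- elsewhere in the document, or in stand-off markup: -->
    <fs xml:id="vvfin">
      <f name="partOfSpeech"
         datcat="http://www.isocat.org/datcat/DC-nnnn"> <!-- placeholder PID -->
        <symbol value="finiteVerb"/>
      </f>
    </fs>

The point being: apart from @lemma, the markup itself says nothing
about what the pointer targets and feature names mean unless you add
the datcat links.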

When I started looking into this, my first naive expectation was
something like <gramGrp> in <w>: the dictionary module allows encoding
POS, gender, number, etc., but only for dictionary entries, not for
words in the text. And I must say I still find this idea compelling, as
it would provide explicit semantics instead of generic containers.
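
Something along these lines - which, to be clear, is not valid TEI
today, precisely because <gramGrp> and its children (<pos>, <number>,
<per>, ...) are only available in dictionary entries:

    <w lemma="bellen">bellt
      <gramGrp>
        <pos>verb</pos>
        <number>singular</number>
        <per>3</per>
      </gramGrp>
    </w>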

Best,
Frederik



> *tightening here and there*
> > As a side note, looking at texts encoded by colleagues using the
> > transcr module I noticed that often I would have made (almost) exactly
> > the same choices, so that the end product looks remarkably similar.
> > Except for some cases where there are too many different ways to do
> > the same thing ... but I guess not everything TEI may become SIMPLE ;)
> > (although some tightening here and there would be a good thing!).
> Every tightening has a purpose:
> - TEI light
> - TEI tight
> - Bare bones TEI
> - TEI simple
> What is yours?
>
> Best,
> Serge
>



-- 
Frederik Elwert M.A.

Research Associate
Project Coordinator, SeNeReKo
Centrum für Religionswissenschaftliche Studien
Ruhr-Universität Bochum

Universitätsstr. 150
D-44780 Bochum

Room FNO 01/180
Tel. +49-(0)234 - 32 24794