> > PS : I have developped a software for dealing with such milestone
> > elements: that is, converting them back into "normal" markup (at
> > the cost of converting into milestones the conflicting elements),
> > so that it may still be processed with standard XML tools. I will
> > be hapy to help if you're interested.
>
> It is quite likely that the entire special interest group on overlap
> would be interested in hearing more about this software. Would you be
> interested in continuing this discussion on their mailing list? (For
> those not on that list but interested, It is at
> http://listserv.brown.edu/archives/tei-ol-sig.html.)
I will try to describe the choices I have made for encoding overlapping hierarchies. My terminology is probably very imprecise (or, more exactly, wrong...).
The program "CorpusReader" (http://panini.u-paris10.fr/~sloiseau/CR) includes a function for dealing with intersecting hierarchies. The documentation of this function is here:
http://panini.u-paris10.fr/~sloiseau/CR/filtres/ExtractMilestone.html#d2e4604
(the documentation is in French, but the example is commented in English).
This filter is still quite limited: for instance, you cannot express the relation between a start boundary and an end boundary by means of pointers in an attribute:
<q xml:id="q1"/> ... <q corresp="#q2"/>
But you can deal with nested milestone hierarchies:
<q type="start"/> ... <q type="start"/> ... <q type="end"/> ... <q type="end"/>
I have chosen not to try to preserve the expression of dominance. So I make massive use of milestone elements (in fact, most often, "typed segment boundary delimiters", as I just learned :-)) [1]: I even mark words with (true) milestones.
My rationale is that, while milestone/paired elements give the greatest expressivity (since the data are not coerced into a tree), I still need to turn them temporarily into normal elements for processing purposes (using high-level languages, etc.). And with a streaming API like SAX, where the document is viewed in "precedence order" [I'm not sure I can say that; in any case, a depth-first walk of the tree] rather than through dominance-first access [?], this conversion appears to be quite easy. I just need to define which start and end boundaries are to be converted into start and end tags, and to decide what to do with the conflicting markup: discard it, turn it into milestones, etc., depending on the actual processing need.
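To make the conversion step concrete, here is a minimal Python sketch of the idea. It is not the CorpusReader implementation: the names MilestonePromoter and promote are hypothetical, and the policy for conflicting markup is fixed to "turn it into milestones". It assumes boundaries are written as <name type="start"/> and <name type="end"/>, it buffers output tokens rather than writing them as they arrive (so that an element already opened can still be demoted when a conflict shows up later), and a demoted element that already carries a type attribute would have it overwritten.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class MilestonePromoter(xml.sax.ContentHandler):
    """Promote <name type="start"/> ... <name type="end"/> to a real element."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.tokens = []       # output tokens, serialized at the end
        self.out_stack = []    # elements currently open in the output
        self.src_stack = []    # regular elements currently open in the source
        self.skip_end = False  # SAX also reports the end of an empty element

    def _open(self, name, attrs, promoted):
        token = {"kind": "open", "name": name, "attrs": attrs, "promoted": promoted}
        self.tokens.append(token)
        self.out_stack.append(token)
        return token

    def startElement(self, name, attrs):
        attrs = dict(attrs.items())
        if name == self.name and attrs.get("type") in ("start", "end"):
            self.skip_end = True
            if attrs.pop("type") == "start":
                self._open(name, attrs, promoted=True)
            else:
                self._close_promoted()
            return
        self.src_stack.append(self._open(name, attrs, promoted=False))

    def _close_promoted(self):
        # Find the nearest promoted element that is still open.
        for i in range(len(self.out_stack) - 1, -1, -1):
            if self.out_stack[i]["promoted"]:
                break
        else:
            raise ValueError("end milestone without a matching start")
        # Regular elements opened since then overlap it: demote them.
        for token in self.out_stack[i + 1:]:
            token["kind"] = "ms_start"
        del self.out_stack[i:]
        self.tokens.append({"kind": "close", "name": self.name})

    def endElement(self, name):
        if self.skip_end:
            self.skip_end = False
            return
        token = self.src_stack.pop()
        if token["kind"] == "open" and self.out_stack[-1] is token:
            self.out_stack.pop()
            self.tokens.append({"kind": "close", "name": name})
            return
        if token["kind"] == "open":
            # A promoted element is still open inside this one: demote it.
            token["kind"] = "ms_start"
            self.out_stack = [t for t in self.out_stack if t is not token]
        self.tokens.append({"kind": "ms_end", "name": name, "attrs": {}})

    def characters(self, content):
        self.tokens.append({"kind": "text", "data": content})

    def endDocument(self):
        if self.out_stack:
            raise ValueError("start milestone without a matching end")

    def serialize(self):
        parts = []
        for t in self.tokens:
            if t["kind"] == "text":
                parts.append(escape(t["data"]))
            elif t["kind"] == "close":
                parts.append(f"</{t['name']}>")
            else:
                attrs = dict(t["attrs"])
                if t["kind"] != "open":
                    attrs["type"] = "start" if t["kind"] == "ms_start" else "end"
                text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
                slash = "" if t["kind"] == "open" else "/"
                parts.append(f"<{t['name']}{text}{slash}>")
        return "".join(parts)


def promote(xml_text, name):
    """Return xml_text with the milestones called `name` turned into elements."""
    handler = MilestonePromoter(name)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.serialize()
```

For instance, promote('<p><q>a<s type="start"/>b</q>c<s type="end"/>d</p>', "s") turns the sentence into a regular element and the quotation that overlaps it into a pair of milestones. Other policies (discarding the conflicting markup, for example) would only change what happens at the two "demote" points.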
The inherent limitation of this solution is obviously that I can *not* turn all the markup into regular elements, which was the initial problem... But if regular elements are seen as a processing need, not a goal /per se/, it appears (in my limited experience...) that I do not need to turn all the markup into regular elements at the same time. For instance, when I want to make frequency lists of given phenomena in all the sentences, I turn the sentences into regular elements so that iterating over them (via dominance) is easy, and I can then count the elements they contain; whether these are expressed as milestones or not doesn't matter (as in the example: http://panini.u-paris10.fr/~sloiseau/CR/filtres/ExtractMilestone.html#d2e4604). So it amounts to a distinction between a state for encoding and a state for processing.
I don't try to preserve the expression of dominance because the "tree data model" perhaps is not expressive enough for complex data and should not be used as a model, but only as a format. (I would be interested to know to what extent the "OHCO" model, still referred to in TEI P5, is still relevant in the TEI: it seems to me that it defines the text (the data) by the structure of the format. Thus, it is not surprising that this representation "turns out to be very effective for a large number of purposes", since the goal has perhaps been defined in terms of its solution):
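As an illustration of the "state for processing", here is a small Python sketch that counts, for each sentence, the elements it contains once sentences are regular elements. The function name count_per_sentence and the element names in the usage example are assumptions made for the illustration, not part of CorpusReader. A milestone pair is counted once, at its start boundary, so a phenomenon is counted in the sentence where it begins.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_per_sentence(xml_text, sentence="s"):
    """Return one Counter of element names per sentence element."""
    root = ET.fromstring(xml_text)
    counts = []
    for s in root.iter(sentence):
        counter = Counter()
        for el in s.iter():
            # Skip the sentence itself and the end boundary of milestone pairs.
            if el is s or el.get("type") == "end":
                continue
            counter[el.tag] += 1
        counts.append(counter)
    return counts


sample = (
    '<text>'
    '<s><w type="start"/>A<w type="end"/> <q>b</q></s>'
    '<s><q type="start"/>c</s><q type="end"/>'
    '</text>'
)
per_sentence = count_per_sentence(sample)
```

Here the first sentence contains one word (as a milestone pair) and one quotation (as a regular element), and the second contains the start of a quotation that ends outside it; both kinds of markup are counted the same way.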
http://www.tei-c.org/release/doc/tei-p5-doc/html/SG.html#SG152
I wonder whether the various solutions for expressing overlapping hierarchies could be classified according to how much emphasis they put on expressing precedence or dominance:
- For instance, stand-off markup fully preserves dominance (at least within each stand-off document viewed separately), but aligning several stand-off documents seems complex. If I understand correctly, one can only access alignments that have been defined /a priori/ (by providing pointers for aligning the different trees), which is a very heavy limitation on the empirical richness of the data.
- In "Reconstitution of Virtual Elements", dominance is preserved too, at the cost of reconstituting elements. The main drawback, in my view, is that this solution tends to grow in complexity with the density of the markup. For instance, if you have several segmentations into sentences (performed by two different linguistic parsers) that frequently diverge slightly, it may entail very heavy markup. Its usability seems restricted to areas with few overlapping hierarchies. It looks a little like a patch: the tree is no longer really the model of the data, but dominance is preserved at high cost.
- The milestone (or Horse? I have some articles from Extreme Markup to read...) model uses only precedence and does not express dominance, if I understand correctly. In fact I was wondering whether the underlying model is not a graph rather than a tree, close to what Bird and Liberman describe in "A formal framework for linguistic annotation". XML is used as a file format, but we give up encoding the data as if they were hierarchical. I wonder whether it would be interesting to try to express the data model described by Bird & Liberman in TEI vocabulary.
I'm interested in comments, since I plan to write something about this in an academic work, and would be glad not to make too many false assumptions...
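To show what reading milestones as a graph rather than a tree could look like, here is a rough Python sketch, only loosely in the spirit of Bird & Liberman and not their formalism: each start/end pair becomes a labelled arc between two character offsets in the text. The names ArcCollector and to_arcs are hypothetical, and the sketch assumes boundaries written as <name type="start"/> and <name type="end"/>, pairing each end with the most recent unmatched start of the same name.

```python
import xml.sax


class ArcCollector(xml.sax.ContentHandler):
    """Collect (start_offset, end_offset, name) arcs from milestone pairs."""

    def __init__(self):
        super().__init__()
        self.offset = 0    # number of text characters seen so far
        self.text = []     # text chunks, joined at the end
        self.pending = {}  # name -> stack of unmatched start offsets
        self.arcs = []

    def startElement(self, name, attrs):
        kind = attrs.get("type")
        if kind == "start":
            self.pending.setdefault(name, []).append(self.offset)
        elif kind == "end" and self.pending.get(name):
            self.arcs.append((self.pending[name].pop(), self.offset, name))

    def characters(self, content):
        self.text.append(content)
        self.offset += len(content)


def to_arcs(xml_text):
    """Return the plain text and the sorted list of arcs over it."""
    handler = ArcCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.text), sorted(handler.arcs)


text, arcs = to_arcs(
    '<t><s type="start"/>ab<q type="start"/>cd<s type="end"/>ef<q type="end"/></t>'
)
```

In this representation the overlapping sentence and quotation are simply two arcs over the same text, with no dominance relation between them.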
-------
[1]
In fact I found the formulation in the TEI ambiguous: the sentence "For example, if quotations are marked using (user-defined) empty elements given the names ‘qb’ and ‘qe’ [...]" (http://www.tei-c.org/release/doc/tei-p5-doc/html/NH.html#NHMI) may be read as a definition of milestones as paired elements, or am I misunderstanding?
--
Sylvain Loiseau
http://panini.u-paris10.fr/~sloiseau