We have a very similar problem, which so far we have tried not to think 
about too much, but one day we will want or have to deal with it.

We identify the stretches of text that we annotate using XPath 
expressions and offsets:

<j.1:startsAt>/TEI.2[1]/text[1]/body[1]/div[1]/lg[2]/l[3]</j.1:startsAt>
<j.1:offsetAtBegin>26</j.1:offsetAtBegin>
<j.1:endsAt>/TEI.2[1]/text[1]/body[1]/div[1]/lg[2]/l[3]</j.1:endsAt>
<j.1:offsetAtEnd>59</j.1:offsetAtEnd>
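
As a rough illustration of how a pointer like the one above could be 
resolved, here is a minimal Python sketch. It rests on assumptions that 
are not stated in this message: that the offsets are character offsets 
into the text content of the element (descendant text included), that 
the end offset is exclusive, and that start and end lie in the same 
element, as in the example. The names resolve and annotated_text are 
hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# One path step of the form name[n], e.g. "lg[2]" (the only form handled here).
STEP = re.compile(r"^([^\[\]/]+)\[(\d+)\]$")


def resolve(root, path):
    """Follow an absolute path made only of name[n] steps, starting at root."""
    steps = []
    for raw in path.strip("/").split("/"):
        m = STEP.match(raw)
        if m is None:
            raise ValueError(f"unsupported step: {raw!r}")
        steps.append((m.group(1), int(m.group(2))))

    # The first step must name the root element itself.
    name, pos = steps[0]
    if root.tag != name or pos != 1:
        raise LookupError(f"root element is <{root.tag}>, not {name}[{pos}]")

    node = root
    for name, pos in steps[1:]:
        # name[n] selects the n-th child element with that tag (1-based).
        same = [child for child in node if child.tag == name]
        if not 1 <= pos <= len(same):
            raise LookupError(f"no {name}[{pos}] under <{node.tag}>")
        node = same[pos - 1]
    return node


def annotated_text(root, starts_at, offset_begin, ends_at, offset_end):
    """Return the annotated stretch of text (same-element case only)."""
    if starts_at != ends_at:
        # Assumption: ranges spanning several elements are not handled here.
        raise NotImplementedError("start and end must be in the same element")
    text = "".join(resolve(root, starts_at).itertext())
    return text[offset_begin:offset_end]
```

With the values from the example, the call would look like 
annotated_text(root, path, 26, path, 59), where root comes from 
ET.parse(...).getroot() and path is the startsAt value.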

To guard against changes in the annotated file, we currently store a 
checksum of that file with the annotations. We also store (part of) the 
annotated text with the annotations, which gives us a second way of 
verifying the integrity of the annotated file.
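
The two checks could be combined roughly as follows. This is only a 
sketch under assumptions: SHA-256 stands in for whatever checksum is 
actually used, the stored text is taken to be a prefix of the annotated 
stretch (since only part of it may be stored), and the function names 
are hypothetical.

```python
import hashlib


def file_checksum(data: bytes) -> str:
    """Checksum of the whole annotated file (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def check_annotation(data, stored_checksum, current_text, stored_text):
    """Classify an annotation against the current state of the file.

    data            -- bytes of the annotated file as it is now
    stored_checksum -- checksum saved when the annotation was made
    current_text    -- text the pointer selects in the current file
    stored_text     -- (part of) the annotated text saved with the annotation
    """
    if file_checksum(data) == stored_checksum:
        return "unchanged"
    # The file differs; fall back to comparing the text the pointer selects.
    if stored_text and current_text.startswith(stored_text):
        return "file changed, annotated text intact"
    return "stale"
```

The second branch is the case where the file was edited somewhere else 
and the pointer still selects the same text.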

In the future we'll want to be able to deal with changes in the 
annotated file. I do not expect this to be a very frequent occurrence; 
usually the annotators will be able to work with the older version of 
the annotated XML.

I have been thinking of several solutions:

(1) require the presence of id attributes in the annotated XML, and use 
these to identify the annotated nodes (this will not help when the text 
itself changes, and it would severely limit the usefulness of our 
annotation application)

(2) write a comparison program for the old and new XML files that 
produces a delta file describing the changes; this delta file would 
then be the input for a second program that updates the pointers in the 
annotation file.
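
For the offset part of option (2), a minimal sketch could use Python's 
difflib to compute the delta between the old and new text of one 
element and then shift the offsets. It assumes character offsets with an 
exclusive end, and it does not attempt to repair XPath expressions when 
elements are added or removed. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def build_delta(old, new):
    """Delta between old and new text as (tag, i1, i2, j1, j2) opcodes."""
    return SequenceMatcher(None, old, new, autojunk=False).get_opcodes()


def map_offset(delta, offset, is_end):
    """Map an offset in the old text to the new text, or None if it is lost.

    A start offset must fall inside an unchanged block (i1 <= offset < i2);
    an end offset must close one (i1 < offset <= i2). This keeps text that
    was inserted right next to the annotation outside the annotated range.
    """
    for tag, i1, i2, j1, j2 in delta:
        if tag != "equal":
            continue
        inside = (i1 < offset <= i2) if is_end else (i1 <= offset < i2)
        if inside:
            return j1 + (offset - i1)
    return None


def update_range(old, new, begin, end):
    """Return the updated (begin, end) pair, or None if it cannot be mapped."""
    delta = build_delta(old, new)
    new_begin = map_offset(delta, begin, is_end=False)
    new_end = map_offset(delta, end, is_end=True)
    if new_begin is None or new_end is None or new_begin > new_end:
        return None
    return new_begin, new_end
```

A pointer whose boundaries survive is moved even if the text between 
them was edited, so the stored copy of the annotated text is still 
needed to detect that case. A None result means the annotation has to 
be reviewed by hand.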

I suppose we'll do (2). I'm not happy with it, but I can't really think 
of another solution.

Peter