At first glance, it seems to me that the existing TEI @key 
attribute has almost exactly the semantics of your "@altid". I wouldn't 
recommend <index> for this purpose -- it's something different.

L


  On 07/11/13 10:02, Christian Chiarcos wrote:
> Dear list members,
>
> I am currently working on a massive corpus of verse-aligned religious 
> texts (Bibles, mostly, but also Qur'an editions) for linguistic and 
> NLP purposes. Initially, I adapted the CES 
> specifications Philipp Resnik developed decades ago for a similar, 
> small-scale project (in XML, not his SGML, of course). As we have 
> far outgrown the scale of his project, it is about time to 
> update our format to a more recent standard, and TEI might be the 
> format of choice.
>
> Yet, there are certain aspects specific to a parallel corpus of 
> bibles, and I was wondering how to represent them with TEI:
>
> - All bibles share the same set of verse identifiers, but 
> occasionally, a set of verses is not translated literally, but loosely 
> translated within a larger segment. We introduced an additional 
> attribute altid (alternate id), a sequence of NMTOKENS, each of which 
> represents a regular bible ID (we did not choose IDREFS because they 
> are not defined within the document). What would be the most efficient 
> way to represent this properly?
>
> e.g. a multi-verse segment from a Low German (Westphalian) bible (in 
> our CES-adaptation):
>
> <seg altid="b.MAT.17.22 b.MAT.17.23">
>     Os soe sik in Galiläa uphoelen, sia Jesus: Doe Minskensuone
>     sall baule den Hännen fan den Minsken iutliewert weren. Soe
>     weret en dautmaken, owwer am drüdden Dage sall hoe wir upston.
>     Do woören soe olle bedroöwet.
> </seg>
>
> vs. a verse segment in another Low German (Plautdietsch) bible
>
> <seg id="b.MAT.17.22" type="verse">
>     Aus see enn Galilaea eromm jinje, saed Jesus to an: "De
>     Menschesaen woat boolt enn Mensche aeare Henj jejaeft woare,
> </seg>
> <seg id="b.MAT.17.23" type="verse">
>     en dee woare am doot moake, oba aum drede Dach woat hee fomm
>     Doot oppstone." En siene Jinja weare seeha truarich do aewa.
> </seg>
>
> We query with XQuery across all bibles for a verse ID to compare 
> differences across languages and language stages. The altids are 
> inspected if a seg with the corresponding ID isn't found.
>
> - Not only seg, but also div elements may carry the altid attribute, 
> e.g., for non-literal poetic bible adaptations where we have chapter- 
> or book-level alignment only, but where smaller structures (e.g., l) 
> exist.
>
> - altid also comes in handy if we want to mark cross-references to 
> other bible passages that contain literal repetitions, e.g. (from the 
> 1611 King James Version):
>
> <seg id="b.EXO.20.12" altid="b.DEU.5.16" type="verse">
>     Honour thy father and thy mother: that thy dayes may bee long
>     vpon the land, which the Lord thy God giueth thee.
> </seg>
>
> <seg id="b.DEU.5.16" altid="b.EXO.20.12" type="verse">
>     Honour thy father and thy mother, as the Lord thy God hath
>     commanded thee, that thy daies may be prolonged, and that it
>     may goe well with thee, in the land which the Lord thy God
>     giueth thee.
> </seg>
>
> With our querying strategy, these altids become relevant when we want 
> to retrieve matches from a Bible where the exact verse is lost but a 
> close analogue is nevertheless found. This specific verse is, for 
> example, also quoted several times in the New Testament, and for 
> languages with an NT only, we would like to have these matches if we 
> query for b.EXO.20.12 or b.DEU.5.16.
>
> In TEI, the id would correspond to an xml:id, but what would be a good 
> strategy to preserve the altid information without creating a large 
> overhead (as using the index element would entail)?
>
> Thanks a lot,
> Christian Chiarcos
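
The lookup strategy described in the quoted message (match on id first, inspect altid only if no seg with that id is found) could be sketched roughly as follows. This is only an illustration in Python rather than the XQuery the project actually uses; the function name find_verse, the wrapping <text> element, and the placeholder verse text are assumptions made for the example, while the attribute names id and altid and the verse identifiers come from the CES-style samples above.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in document (assumed structure): one multi-verse segment
# carrying only altid, and one verse carrying both id and altid.
SAMPLE = """
<text>
  <seg altid="b.MAT.17.22 b.MAT.17.23">(multi-verse text)</seg>
  <seg id="b.EXO.20.12" altid="b.DEU.5.16" type="verse">(verse text)</seg>
</text>
"""


def find_verse(root, verse_id):
    """Return elements matching verse_id by id, else by altid.

    The altid attribute is treated as a whitespace-separated list of
    tokens and is consulted only when no exact id match exists.
    """
    exact = [el for el in root.iter() if el.get("id") == verse_id]
    if exact:
        return exact
    return [el for el in root.iter()
            if verse_id in el.get("altid", "").split()]


root = ET.fromstring(SAMPLE)
hits = find_verse(root, "b.MAT.17.22")  # falls back to the altid match
```

In a TEI encoding, the same two-step logic would carry over with xml:id in place of id and whichever attribute ends up holding the alternate identifiers in place of altid.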