Dear list members,

I am currently working on a massive corpus of verse-aligned religious  
texts (Bibles, mostly, but also Qur'an editions) for linguistic and NLP  
purposes. Initially, I adapted the CES specifications that
Philip Resnik developed decades ago for a similar, small-scale project
(in XML, not his SGML, of course). As we have far outgrown the scale of
his project, it is about time to update our format to a more recent
standard, and TEI might be the format of choice.

Yet, there are certain aspects specific to a parallel corpus of bibles,  
and I was wondering how to represent them with TEI:

- All bibles share the same set of verse identifiers, but occasionally, a  
set of verses is not translated literally, but loosely translated within a  
larger segment. We introduced an additional attribute altid (alternate  
id), a sequence of NMTOKENS, each of which represents a regular bible ID  
(we did not choose IDREFS because the referenced IDs are not defined within
the document). What would be the most efficient way to represent this properly?

e.g. a multi-verse segment from a Low German (Westphalian) bible (in our  
CES-adaptation):

<seg altid="b.MAT.17.22 b.MAT.17.23">
	Os soe sik in Galiläa uphoelen, sia Jesus: Doe Minskensuone
	sall baule den Hännen fan den Minsken iutliewert weren. Soe
	weret en dautmaken, owwer am drüdden Dage sall hoe wir upston.
	Do woören soe olle bedroöwet.
</seg>

vs. two verse segments in another Low German (Plautdietsch) bible:

<seg id="b.MAT.17.22" type="verse">
	Aus see enn Galilaea eromm jinje, saed Jesus to an: "De
	Menschesaen woat boolt enn Mensche aeare Henj jejaeft woare,
</seg>
<seg id="b.MAT.17.23" type="verse">
	en dee woare am doot moake, oba aum drede Dach woat hee fomm
	Doot oppstone." En siene Jinja weare seeha truarich do aewa.
</seg>

We query with XQuery across all bibles for a verse ID to compare  
differences across languages and language stages. The altids are inspected  
if a seg with the corresponding ID isn't found.
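
The lookup described above can be sketched roughly as follows. This is an illustration in Python, not the XQuery actually used; the function name find_verse and the two abbreviated sample documents are hypothetical, and the altid value is assumed to be a whitespace-separated list of verse IDs.

```python
import xml.etree.ElementTree as ET


def find_verse(root, verse_id):
    """Return the elements aligned to verse_id in one bible.

    Elements whose id equals verse_id are preferred. Only if there are
    none do we fall back to elements whose altid lists verse_id.
    """
    exact = [el for el in root.iter() if el.get("id") == verse_id]
    if exact:
        return exact
    # Fallback: treat altid as a whitespace-separated token list.
    return [el for el in root.iter()
            if verse_id in el.get("altid", "").split()]


# Hypothetical, abbreviated sample data in the CES-style format shown above.
westphalian = ET.fromstring(
    '<div>'
    '<seg altid="b.MAT.17.22 b.MAT.17.23">Os soe sik ...</seg>'
    '</div>'
)
plautdietsch = ET.fromstring(
    '<div>'
    '<seg id="b.MAT.17.22" type="verse">Aus see ...</seg>'
    '<seg id="b.MAT.17.23" type="verse">en dee woare ...</seg>'
    '</div>'
)

for name, bible in (("Westphalian", westphalian),
                    ("Plautdietsch", plautdietsch)):
    for el in find_verse(bible, "b.MAT.17.23"):
        print(name, el.get("id"), el.get("altid"))
```

Because the function iterates over all elements rather than over seg only, the same fallback would also apply to div elements that carry an altid.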

- Not only seg, but also div elements may carry the altid attribute, e.g.,  
for non-literal poetic bible adaptations where we have chapter- or  
book-level alignment only, but where smaller structures (e.g., l) exist.

- altid also comes in handy if we want to mark cross-references to other  
bible passages that contain literal repetitions, e.g. (from the 1611 King  
James Version):

<seg id="b.EXO.20.12" altid="b.DEU.5.16" type="verse">
	Honour thy father and thy mother: that thy dayes may bee long
	vpon the land, which the Lord thy God giueth thee.
</seg>

<seg id="b.DEU.5.16" altid="b.EXO.20.12" type="verse">
	Honour thy father and thy mother, as the Lord thy God hath
	commanded thee, that thy daies may be prolonged, and that it
	may goe well with thee, in the land which the Lord thy God
	giueth thee.
</seg>

With our querying strategy, these altids become relevant when we want to
retrieve matches from a Bible where the exact verse is missing but a
close analogue exists. This specific verse is, for example,
also quoted several times in the New Testament, and for languages with an  
NT only, we would like to have these matches if we query for b.EXO.20.12  
or b.DEU.5.16.

In TEI, the id would correspond to an xml:id, but what would be a good  
strategy to preserve the altid information without creating a large  
overhead (as using the index element would entail)?

Thanks a lot,
Christian Chiarcos
-- 
Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: [log in to unmask]
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931