Thank you for pointing that out. I think their "isRelatedTo" and its
subproperties (especially isCloseRenderingOf and isLooseRenderingOf) would
be well-suited. I'm still hesitating, though, because if I understand it
correctly, it requires one additional element with at least two attributes
for every verse I'd like to address, adding 3*r nodes to the document
(with r being the number of cross-references).
<relation ref="saws:isCloseTranslationOf" active="#div1.i001"/>
As soon as multiple types of relations are to be distinguished (which is not
yet the case, as I cannot tell them apart automatically), this seems to be the
solution. Until then, something more compact would be preferable.
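To make clear what "compact" has to support, here is a minimal sketch of the
lookup described in my original message (quoted below): search for an element
with the requested verse ID and inspect altid only if none is found. We
actually run this in XQuery; the Python below is only an illustration, and the
sample data, the function name find_verse and the variable names are made up
for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus; the verse texts are placeholders.
SAMPLE = """
<text>
  <seg altid="b.MAT.17.22 b.MAT.17.23">multi-verse placeholder</seg>
  <seg id="b.MAT.17.24" type="verse">single-verse placeholder</seg>
  <seg id="b.EXO.20.12" altid="b.DEU.5.16" type="verse">placeholder</seg>
</text>
"""


def find_verse(root, verse_id):
    """Return elements whose id equals verse_id.

    If there is no exact match, fall back to elements that list
    verse_id among the whitespace-separated tokens of altid.
    """
    exact = [e for e in root.iter() if e.get("id") == verse_id]
    if exact:
        return exact
    return [e for e in root.iter()
            if verse_id in e.get("altid", "").split()]


root = ET.fromstring(SAMPLE)
for hit in find_verse(root, "b.MAT.17.23"):
    print(hit.attrib)
```

Whatever TEI encoding replaces altid would need to keep this two-step lookup
cheap, which is why the number of additional nodes per cross-reference matters
to me.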
On Thu, 07 Nov 2013 12:01:51 +0100, Gabriel Bodard
<[log in to unmask]> wrote:
> Dear Christian,
> I can't speak in great detail to the TEI markup and data model you
> suggest below, but it occurs to me that there might be parallels with,
> and value in exploring compatibility with, the markup devised for similar
> purposes by the Sharing Ancient Wisdoms (SAWS) project: see
> One of the aims of this project is to encode multiple mediaeval and
> ancient texts, some of which are collections of fragments of earlier
> texts, align them to various translations (close or loose) and to other
> texts of which their various segments might be copies, paraphrases,
> translations, or merely influenced by.
> To this end they used (1) CTS URNs (as URIs) for all texts and segments
> of texts, enabling pointing in both directions with minimal overhead in
> terms of intervention and insertion of ids in the text; (2) an ontology
> of text object and relationship types, described at
> (probably overkill for your purposes, but a minimal subset of it would
> be easy to devise); (3) a series of `tei:relation` elements to define
> the relationships between texts, places, persons, and other objects in
> the corpus.
> I'm not involved in either project, but at a glance it seems to me that
> a model along these lines might well work for the issues you are
> describing too. If you're interested in more information, I believe one
> or two of the SAWS developers are on this list (and they can probably
> correct some of my comments above, too).
> On 2013-11-07 10:02, Christian Chiarcos wrote:
>> Dear list members,
>> I am currently working on a massive corpus of verse-aligned religious
>> texts (Bibles, mostly, but also Qur'an editions) for linguistic and NLP
>> purposes. In the beginning, I've been adapting the CES specifications
>> Philip Resnik developed decades ago for a similar, small-scale project
>> (in XML, not his SGML, of course). As we have outgrown the scale of his
>> project by far, it is about time to update our format to a more
>> recent standard, and TEI might be the format of choice.
>> Yet, there are certain aspects specific to a parallel corpus of bibles,
>> and I was wondering how to represent them with TEI:
>> - All bibles share the same set of verse identifiers, but occasionally,
>> a set of verses is not translated literally, but loosely translated
>> within a larger segment. We introduced an additional attribute altid
>> (alternate id), a sequence of NMTOKENS, each of which represents a
>> regular bible ID (we did not choose IDREFS because they are not defined
>> within the document). What would be the most efficient way to represent
>> this properly?
>> e.g. a multi-verse segment from a Low German (Westphalian) bible (in our
>> <seg altid="b.MAT.17.22 b.MAT.17.23">
>> Os soe sik in Galiläa uphoelen, sia Jesus: Doe Minskensuone
>> sall baule den Hännen fan den Minsken iutliewert weren. Soe
>> weret en dautmaken, owwer am drüdden Dage sall hoe wir upston.
>> Do woören soe olle bedroöwet.
>> </seg>
>> vs. a verse segment in another Low German (Plautdietsch) bible
>> <seg id="b.MAT.17.22" type="verse">
>> Aus see enn Galilaea eromm jinje, saed Jesus to an: "De
>> Menschesaen woat boolt enn Mensche aeare Henj jejaeft woare,
>> </seg>
>> <seg id="b.MAT.17.23" type="verse">
>> en dee woare am doot moake, oba aum drede Dach woat hee fomm
>> Doot oppstone." En siene Jinja weare seeha truarich do aewa.
>> </seg>
>> We query with XQuery across all bibles for a verse ID to compare
>> differences across languages and language stages. The altids are
>> inspected if a seg with the corresponding ID isn't found.
>> - Not only seg, but also div elements may carry the altid attribute,
>> e.g., for non-literal poetic bible adaptations where we have chapter- or
>> book-level alignment only, but where smaller structures (e.g., l) exist.
>> - altid also comes in handy if we want to mark cross-references to other
>> bible passages that contain literal repetitions, e.g. (from the 1611
>> King James Version):
>> <seg id="b.EXO.20.12" altid="b.DEU.5.16" type="verse">
>> Honour thy father and thy mother: that thy dayes may bee long
>> vpon the land, which the Lord thy God giueth thee.
>> </seg>
>> <seg id="b.DEU.5.16" altid="b.EXO.20.12" type="verse">
>> Honour thy father and thy mother, as the Lord thy God hath
>> commanded thee, that thy daies may be prolonged, and that it
>> may goe well with thee, in the land which the Lord thy God
>> giueth thee.
>> </seg>
>> With our querying strategy, these altids will be relevant if we want to
>> retrieve matches from a Bible where the exact verse is lost, but a
>> near analogue is nevertheless found. This specific verse is, for
>> example, also quoted several times in the New Testament, and for
>> languages with an NT only, we would like to have these matches if we
>> query for b.EXO.20.12 or b.DEU.5.16.
>> In TEI, the id would correspond to an xml:id, but what would be a good
>> strategy to preserve the altid information without creating a large
>> overhead (as using the index element would entail)?
>> Thanks a lot,
>> Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
office: Robert-Mayer-Str. 10, #401b
mail: [log in to unmask]