Hello Alpo,

The first idea (admittedly somewhat kludgey in flavour) is to use <seg 
type="s">, which can then contain the <cit>s.

Another is: since you focus on linguistic markup, and are going to use 
<w>s, make sure to equip them all with @xml:id attributes, and then 
point from outside to the sequences of <w>s that make up quotations. 
This way your linguistic markup is going to be clean, and the quotations 
will be identified and described outside of it. An elaboration on this 
would be to use <seg type="quote"> (with @xml:id) for the relevant 
sequences, *inside <s>*. Then your processing software knows immediately 
what to ignore, and your visualisation software can easily join the 
outside description of your quotation (the <ref> or <bibl>) with the 
relevant <seg>.
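
A minimal sketch of how that elaboration might be processed, in Python with the standard library only. The sample markup is an assumption made for illustration: the @xml:id values, the use of @corresp on <bibl> to point at the <seg>, and the placement of <listBibl> are hypothetical choices, not something checked against your schema.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: <seg type="quote"> sits inside <s>, and the
# description of the quotation lives outside the linguistic markup.
SAMPLE = """\
<body xmlns="http://www.tei-c.org/ns/1.0">
  <p>
    <s>
      <w xml:id="w1">pius</w>
      <w xml:id="w2">Christus</w>
      <seg type="quote" xml:id="q1">
        <w xml:id="w3">qui</w>
        <w xml:id="w4">fecit</w>
        <w xml:id="w5">celum</w>
        <w xml:id="w6">et</w>
        <w xml:id="w7">terram</w>
      </seg>
    </s>
  </p>
  <listBibl>
    <bibl corresp="#q1">Bible, Ps 123 (124), 8</bibl>
  </listBibl>
</body>
"""


def quote_segs(root):
    """Map the xml:id of every <seg type="quote"> to its element."""
    return {
        seg.get(XML_ID): seg
        for seg in root.iter(TEI + "seg")
        if seg.get("type") == "quote"
    }


def running_words(root):
    """Return the text of all <w>s that are not inside a quotation <seg>."""
    skip = {
        w.get(XML_ID)
        for seg in quote_segs(root).values()
        for w in seg.iter(TEI + "w")
    }
    return [w.text for w in root.iter(TEI + "w") if w.get(XML_ID) not in skip]


def quotation_sources(root):
    """Join each outside <bibl corresp="#id"> with the words of its <seg>."""
    segs = quote_segs(root)
    joined = {}
    for bibl in root.iter(TEI + "bibl"):
        target = bibl.get("corresp", "").lstrip("#")
        if target in segs:
            words = [w.text for w in segs[target].iter(TEI + "w")]
            joined[bibl.text] = " ".join(words)
    return joined


root = ET.fromstring(SAMPLE)
print(running_words(root))      # ['pius', 'Christus']
print(quotation_sources(root))  # {'Bible, Ps 123 (124), 8': 'qui fecit celum et terram'}
```

The same lookup works whether the quotation is one word or several sentences, provided each <seg type="quote"> stays within a single <s>; a quotation that crosses sentence boundaries would need one <seg> per sentence, linked in some further way.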

HTH,

   Piotr

On 30/06/15 13:43, Alpo Honkapohja wrote:
> Dear TEI-list,
>
> I work for a corpus project called Medieval Latin from Anglo-Saxon
> Sources at the University of Zurich.
> http://www.research-projects.uzh.ch/p15805.htm
>
> The corpus is based on editions, which typically identify quoted
> passages in footnotes and separate the quoted passage either by
> quotation marks or italics. A high priority in compiling our corpus is
> to keep the quotations separate from the running text, so that someone
> wanting to carry out a corpus analysis on anglisms in the Latin of
> Byrhtferth of Ramsey will not end up having long stretches of text
> quoted straight from the Vulgate Bible or Venerable Bede in their results.
>
> […] et de profundis clara uoce huius seculi proclamare, quoadusque
> benediceret illum pius Christus, ‘qui fecit celum et terram’. [footnote
> in the edition: quotation from Bible, Ps 123 (124), 8].
>
> For the time being, we have been using the following encoding:
>
> […] et de profundis clara uoce huius seculi proclamare, quoadusque
> benediceret illum pius Christus, <cit><quote>qui fecit celum et
> terram</quote><ref>quotation from Bible, Ps 123 (124), 8</ref></cit>.
>
> However, I have currently been adding <s>-tags for sentences, for
> reasons of citation and as a point of compatibility with Toronto
> Dictionary of Old English corpus, which encodes everything as sentences.
> Since the resource is intended for linguistic research, the plan is
> eventually to add <w> tags for individual words and POS-information
> produced by an automatic parser/tagger.
>
> This leads to the problem that <cit>, <quote> or <q> tags are not
> allowed inside the tags used for linguistic segment categories (whereas
> <s> tags are allowed inside quotations), so the following is not valid:
>
> <s>[…] et de profundis clara uoce huius seculi proclamare, quoadusque
> benediceret illum pius Christus, <cit><quote>qui fecit celum et
> terram</quote><ref>quotation from Bible, Ps 123 (124), 8</ref></cit>.</s>
>
> As an interim solution, I have been adding <note> tags around the <cit>
> tags, but this strikes me as clumsy, and creates occasional nesting
> problems like this:
>
> Bede said he wanted: ‘to leave the monastery. It was just too hot in the
> summer.’
>
> ** <s>Bede said he wanted: <quote>‘to leave the monastery.</s> <s>It was
> just too hot in the summer.’</s></quote>
>
> I would be looking for a solution which would:
>
> - clearly keep the quoted material separate so that anything tagged as
> quotation can be left out in a corpus search,
>
> - not interfere with the tags used for linguistic annotation, which
> are of major importance (words inside quotes will be POS-tagged as well),
>
> - be sufficiently ‘one-size fits all’, so the same tags could be used
> for quotations of various length from one word to several sentences to
> entire paragraphs. We’ve got 300000+ words and hundreds if not thousands
> of quotations.
>
> Thanks in advance!
>
> Best Wishes,
>
> Alpo Honkapohja, post-doc
>
> University of Zurich
>