In the following I would like to address a problem I have with joining
fragmented "spans" of direct speech marked with the Q-tag.

This is not (really) about linking pieces of quotation!

This email includes some basic questions about the structure of TEI as
well.


I will be using the following terms:

Speech:
        a speaker's complete utterance, possibly interrupted by an
        inquit-formula ('...,' he/she said, '...'), possibly extending
        across several paragraphs or text divisions


Span (of speech):
        anything that appears between opening and closing quotation MARKS
        ('...');
        thus, not necessarily a complete speech;
        belongs to the TEXTUAL level


Fragment (of speech):
        anything that appears between opening and closing quotation TAGS
        (<q> ... </q>);
        belongs to the TEI level


I want to specifically join/link "spans" of quotation, and in fact do a
bit more than just link them.


To make it clearer why I want to do this, and for which purposes,
and what the problem is, here's the example case:


The text to be transcribed and encoded is a passage from a 16th-century
Welsh translation of the Bible.

The divisions occurring are chapters and verses, to be encoded as div1
and div2 respectively.

The text, of course, has passages of direct speech, quite often
extending across several verses (i.e. divisions). Because tags have to
nest properly, quotations frequently have to be fragmented.

Now, and this is the important bit, direct speech is NOT MARKED in the
text itself.

Nevertheless, we want to encode direct speech, using, if possible, the
Q-tag.

We do not want to editorially insert quotation marks at this stage (in
the TEI file) - thus keeping content and style apart.

However, we would like to be able to insert quotation marks later on,
via a style-sheet for instance.

To be able to insert the quotation marks at the right places, these
places (beginning of quote, end of quote, and possibly
continuation of quote in between) have to be identifiable, of course.


As may have become clear, the target entity, so to speak, is the
"span" of quotation (the text to appear between opening and
closing quotation marks), not the speech as a whole.


Now I'm looking for a simple, straightforward way to do this: to join
the fragments of a "span" of quotation and to mark its beginning, middle,
and end.


Possible solutions, and problems with them:

Solution 1
Only link the fragments of a SPAN, not all the pieces of a whole SPEECH,
using prev/next for instance. (Thus leave different spans, which make up a
speech, unlinked.)

  Problems with this:
  - Linking seems to be basically meant to join pieces of the speech as a
    whole, thus the target entity is the speech, not the span.
    Thus, linking only the spans effectively seems to distort the data.
  - More importantly, beginning, middle and end of a span cannot easily be
    identified, using linking.
    The beginning *could* be identified by the fact that only "next"
    appears as a linking attribute; a middle part has both "next" and
    "prev" attributes; the end has only a "prev" attribute. This is very
    complicated, not straightforward, and inelegant.
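
To make the objection concrete, here is a minimal sketch of what a
processor would have to do under Solution 1. It is written in Python
purely for illustration; the sample markup, the id values and the function
name position() are invented, and the tags are lower-cased so that the
fragment parses as XML.

```python
import xml.etree.ElementTree as ET

# Invented sample: one span split across three verses, plus one whole span.
SAMPLE = """
<div1>
  <div2 n="1"><q id="q1" next="q2">first part</q></div2>
  <div2 n="2"><q id="q2" prev="q1" next="q3">middle part</q></div2>
  <div2 n="3"><q id="q3" prev="q2">last part</q> narrative <q id="q4">whole span</q></div2>
</div1>
"""

def position(q):
    """Infer the place of a fragment from which linking attributes it carries."""
    has_prev = "prev" in q.attrib
    has_next = "next" in q.attrib
    if has_next and not has_prev:
        return "beginning"
    if has_prev and has_next:
        return "middle"
    if has_prev:
        return "end"
    return "unfragmented"

root = ET.fromstring(SAMPLE.strip())
positions = {q.get("id"): position(q) for q in root.iter("q")}
print(positions)
```

The structure is recoverable, but only by inference from the presence or
absence of two attributes; nothing in the markup states it directly.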

The main issue here is STRUCTURE.
Linking is not about structure. Although it may be used to convey structure,
it does so in an indirect and rather inefficient way.
It just doesn't seem to be the right tool for the task. (If you've got a
can, you may try to open it with a knife, if you insist, but a can-opener
would be a better idea, wouldn't it?)


Solution 2
Replace Q with SEG TYPE="quote-span". SEG has a part attribute,
which could easily identify the structure (beginning, middle and end) of a
span of quotation.

   Problem:
   I wouldn't like to give up the Q-tag for such a comparatively
   simple task, and doing so is time-consuming. It doesn't
   really look appealing to me.
   After all, using SEG avoids the issue: it would only cure a
   symptom, not solve the problem where it arises.
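
If the fragments were already linked with prev/next, the retagging itself
could in principle be automated. The following is a minimal sketch under
that assumption, in Python for illustration only; the sample markup and
the helper name part_value() are invented, and the I/M/F/N values are
those of the part attribute as I understand it (initial, medial, final,
not fragmented).

```python
import xml.etree.ElementTree as ET

# Invented sample: one span fragmented across two verses.
SAMPLE = """
<div1>
  <div2 n="1"><q id="q1" next="q2">first part</q></div2>
  <div2 n="2"><q id="q2" prev="q1">last part</q></div2>
</div1>
"""

def part_value(q):
    """Map the presence of prev/next onto a value for the part attribute."""
    has_prev = "prev" in q.attrib
    has_next = "next" in q.attrib
    if has_prev and has_next:
        return "M"
    if has_next:
        return "I"
    if has_prev:
        return "F"
    return "N"

root = ET.fromstring(SAMPLE.strip())
# Collect the elements first, since their tag names are changed in the loop.
for q in list(root.iter("q")):
    q.set("part", part_value(q))
    q.set("type", "quote-span")
    q.tag = "seg"

converted = ET.tostring(root, encoding="unicode")
print(converted)
```

This only addresses the time needed for the conversion; the objection that
the Q-tag is given up remains.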


Solution 3
Modify the DTD by adding a PART attribute to the Q-tag itself.

   Advantages:
   - allows you to stick to the Q-tag;
   - tackles the problem where it arises;
   - links the span-fragments AND identifies beginning etc. of a quotation
     in a straightforward manner;
   - allows you to distinguish fragments from spans [and thus spans from
     speeches, depending on the point of view], if you ever need to
     link both separately, for different purposes.

   Problems:
   - Possible interference with other elements or attributes?
   - I've only got a very vague idea of how to modify the DTD. I know that
     modifications go into a separate file. But what EXACTLY would this
     file look like with the above-mentioned modification? How and where
     do I include it in my text file? (And let's assume I want to use TEI
     Lite.)


So far, this has been about my personal problems (in terms of TEI) and
possible personal solutions.

I would, however, like to take the point further and make some remarks on
TEI in general. I may be completely wrong, but I see the following basic
problem with the way TEI handles quotations, for instance, and there may be
consequences to be drawn on a larger scale:

- TEI does not distinguish between fragments (of quotation-spans), which
  belong to the TEI level, as they are TEI-induced, and quotation-spans
  (of a speech), brought about by inquit-formulae interrupting the
  speech, which belong to the textual level.

  This seems to be, in my view, conceptually fuzzy and possibly
  problematic.
- It is TEI that forces you to fragment.
- The direct "victim" of the TEI-induced fragmentation in this case is the
  "span" of quotation, NOT the speech as a whole.
- Although it is TEI that causes the fragmentation, it offers no way to
  specifically balance this fragmentation again. In a way, TEI doesn't
  take the responsibility for the consequences of what it causes.
- Because it lumps them together, TEI can't address the different
  "entities" of spans and fragments separately.

The cases where this happens may be few and far between, and I could
understand that TEI Lite wouldn't care, but at least the full TEI should
take account of it.

I would like to point out, however, that a part attribute for the
Q-element would be very helpful on a much larger scale (not just for some
obscure Welsh text from the 16th century):

Quotation marks are a big problem in terms of text-interchange, as they
are indistinguishable from apostrophes (as has been pointed out time and
again).

There are different approaches to the problem, none of them really
convincing in my view.

1. Mark up the quotation with the Q-tag and keep the quotation marks as
they are.

This solution contains some measure of redundancy, still mixes content
with style and disambiguates the quotation marks only indirectly.


2. Use the Q-tag and replace the quotation marks with their entity names
(ldquo etc.)

Still redundant, mixes style with content, but at least disambiguates the
quotation marks. The problem is that entity references are error-prone
(typos), or at least cumbersome. And, unlike other special characters, you
can't start by using the "real" characters in your first transcript and
search and replace them later, because they are indistinguishable from
apostrophes, and because opening and closing quotation marks (for which you
may want to use different characters, after all) often cannot safely be
told apart.


3. Just use the Q-tag and omit the quotation marks altogether (thus, let
the Q-tag do the job on its own). This is basically the most appealing
solution.

Problem:
As I've said above, you may wish to re-insert the quotation marks at a
later stage, and for the reasons given above you can't do this in a
straightforward manner, or in a way which guarantees that you end up with
the right quotation marks at the right places, because the Q-element has
no part attribute.

So, quite generally speaking, a part attribute for the Q-element would
come in handy. It would make the Q-element a lot more powerful than it is
now, and could help to separate content from style. Quotation marks could
safely be omitted.
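
As an illustration of what such an attribute would make possible, here is
a minimal sketch of re-inserting quotation marks at processing time. It is
in Python purely for illustration and rests on loud assumptions: the part
attribute on q is the proposed modification, not something TEI currently
provides; its values I/M/F/N are borrowed from SEG; and the sample text,
the chosen quotation characters and the function name render() are
invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: one span of speech fragmented across three verses.
# NOTE: part on q is hypothetical (the modification proposed here).
SAMPLE = """
<div1>
  <div2 n="1">He said: <q part="I">Go now,</q></div2>
  <div2 n="2"><q part="M">take the road,</q></div2>
  <div2 n="3"><q part="F">and do not return.</q> And he went.</div2>
</div1>
"""

OPEN, CLOSE = "\u2018", "\u2019"  # typographic single quotation marks

def render(elem):
    """Return the text of elem, adding marks to q according to part."""
    out = elem.text or ""
    for child in elem:
        out += render(child) + (child.tail or "")
    if elem.tag == "q":
        part = elem.get("part", "N")  # no part attribute: treat as a whole span
        if part in ("I", "N"):
            out = OPEN + out          # a span opens only at its first fragment
        if part in ("F", "N"):
            out = out + CLOSE         # and closes only at its last fragment
    return out

root = ET.fromstring(SAMPLE.strip())
verses = [render(verse).strip() for verse in root.iter("div2")]
for verse in verses:
    print(verse)
```

With this sample, the opening mark lands only in verse 1 and the closing
mark only in verse 3; the medial fragment in verse 2 receives none.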


This argument could be taken further: It may seem desirable to equip all
TEI-elements that are likely to be fragmented on the TEI-level with the
part-attribute, as a standard TEI-remedy for a TEI-created problem. As
structure will probably be an issue in most of the cases, linking is not
the proper tool for this.


The Q-element poses another problem: quotes within quotes (within quotes
[within quotes]).

Although TEI allows quotations to be nested within quotations, it does
not offer a straightforward way to make the hierarchy of these quotations
explicit, should you want to.

(If you think about what will happen when you try to re-insert quotation
marks previously omitted or not present in the first place, you'll see the
problem.)
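
To show what is involved, here is a minimal sketch in which the processing
software has to work out the level of each quotation for itself by
counting enclosing Q-elements. It is in Python purely for illustration;
the sample sentence, the alternating single/double house style and the
function name render() are invented, and the sketch assumes that none of
the nested quotations is itself fragmented.

```python
import xml.etree.ElementTree as ET

# Invented sample with three levels of quotation and no quotation marks.
SAMPLE = "<p>She said: <q>He told me <q>I heard <q>Stop</q> twice</q> yesterday</q>.</p>"

# Hypothetical house style: single marks at odd levels, double at even levels.
MARKS = {1: ("\u2018", "\u2019"), 0: ("\u201c", "\u201d")}

def render(elem, depth=0):
    """Return the text of elem, choosing marks by the nesting level of each q."""
    if elem.tag == "q":
        depth += 1  # the level is not stated in the markup; it is counted here
    out = elem.text or ""
    for child in elem:
        out += render(child, depth) + (child.tail or "")
    if elem.tag == "q":
        opening, closing = MARKS[depth % 2]
        out = opening + out + closing
    return out

text = render(ET.fromstring(SAMPLE))
print(text)
```

The level is thus derivable as long as the nesting is intact, but it is
implicit in the tree rather than stated on the element, and the counting
breaks down as soon as a quotation has to be fragmented across divisions.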

I find this slightly surprising, as TEI is all about structures and
hierarchies and making them explicit so that dumb computers can
understand them. (And some of them are dumber than others, not having been
granted a first-rate education, i.e. software.)

I, personally, would welcome the possibility to "hierarchize"
quotation-elements. Say, alongside Q, I'd like to have Q2, Q3, Q4 as well
(with Q doubling, if possible, as both an un-numbered Q-division,
nestable within itself, and as a first-level quotation - a virtual Q1 -
for subsequent q2, q3, q4; but this is a different issue).


Again, it may be worthwhile thinking about introducing hierarchizations
for other elements which can be nested within themselves. So far, this
hierarchization only seems to exist for elements above the paragraph
level.


As may have become obvious, I'm trying to make a case for including the
following in future versions of TEI:

- the part-attribute for a number of elements, Q amongst them, which are
likely to get fragmented on the TEI-level, in order to balance the
fragmentation, link the fragments AND efficiently convey the structure of
what has been fragmented; and

- the (optional) hierarchization of paragraph-level elements which are
nestable within themselves, again to explicitly convey their structure
and hierarchy so that it can be targeted by processing software.


I hope I have given sufficient practical and general reasons for both, and
shown why these modifications may be helpful for a wider audience.


There remains the other issue: TEI being as it is, how can individual
users implement modifications like these for themselves?


I would very much welcome comments and advice.


Ingo Mittendorf

University of Cambridge,
Department of Linguistics,
Sidgwick Avenue,
Cambridge CB3 9DA
UK