Breakout group decisions from TEI meeting in Myrdal (Sunday, Nov. 16).
Elli Mylonas, David Robey, David Barnard.
 
[This contains questions 1 and 3. Question 2 is coming from David
Robey. EM]
 
1. Lists
1a. Embedded enumerators:
 
The problem: There are cases where enumerated lists are embedded in a
continuous text. These may be full lists, or they may indicate the
beginning of a list but then peter out without concluding as
apparently intended. An example of the former is a newspaper account of
an election, where the results of various districts are interspersed with
commentary in a prose paragraph, but separated by
enumerators [courtesy of Dan Greenstein]. An example of the latter is the
sentence "First of all, I want to thank everyone very much, and
second...oh dear I forgot what I was going to say, but..." [courtesy of Lou
Burnard].
 
The group's solution:
 
<!element     ENUM.START    - O EMPTY                        >
<!attlist     ENUM.START    END           IDREF   #IMPLIED  >
 
<!element     ENUM          - O EMPTY                        >
<!attlist     ENUM          N             CDATA   #IMPLIED
                            END           IDREF   #IMPLIED  >
 
An <enum.start> element is used to indicate that one of these free form
"lists" is about to begin, and the <enum> tag is used to explicitly mark
where an enumerated item appears. The <enum> tag has implied N and END
attributes; <enum.start> has only an implied END attribute. The N
attribute was made optional. An alternative solution is to make it
required, and to suggest that "0" be used as a default "unenumerated"
or automatically numbered value. It might make sense to require it,
since without some indication of a number or listing, there would be
no embedded enumerated list.
 
The optional END attribute is used when the extent of an embedded
enumerated list, or of an enumerated item, is known and the tagger
wishes to indicate it. It always has as its value an IDREF that points to
the ID of an <anchor> element. <anchor> elements are a mechanism
developed by the editors to indicate spans of text; they are zero content
elements with a required ID attribute.
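 
As an illustration only, the second example above might be tagged as
follows. The <p> element and the ID value "a1" are assumptions made for
this sketch and are not part of the group's proposal. The first item is
given an END because its extent is clear; the second is not, since it
peters out.
 
<p><enum.start><enum n="1" end="a1">First of all, I want to thank
everyone very much,<anchor id="a1"> and <enum n="2">second...oh dear I
forgot what I was going to say, but...</p>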
 
1b. Are glossary and enumerated lists really different enough to warrant
separate tags?
 
The problem: Glossary and enumerated lists have traditionally been
considered as separate items, both because the gloss part of a glossary
list is a meaningful element in itself, unlike the number or typographical
marker used in other lists, and because they have to be treated
differently typographically. For example, a glossary list needs
substantially more indentation and tabbing than a simple enumerated
list, and the gloss cannot be generated automatically.
 
The group's solution:
 
<!element  LIST   - - (head?, (label?, item)+)                   >
<!attlist  LIST   TYPE  (ordered|simple|gloss|labeled|bullet)  simple >
 
All lists share the same basic structural elements, an optional label and
the list item itself. The label of a list can take different forms, which may
result in changes in the list's formatting. For example, a list label may be
a number, it may be a bullet or other dingbat, it may be a verbal
identifier, or it may be a gloss. We decided that it is best to have one list
structure, and to indicate the difference in list type by the value of a
TYPE attribute. This attribute is optional, and its default value is "simple".
This leaves the responsibility of determining how to process or display a
list to the user of the text and her software.
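 
A sketch of how two list types might look under this declaration
follows. The content of <label> and <item> is assumed here to be plain
text, since the fragment above does not declare them, and the sample
wording is invented for illustration.
 
<list type="ordered">
<label>1.</label><item>Embedded enumerators</item>
<label>2.</label><item>Glossary and enumerated lists</item>
</list>
 
<list type="gloss">
<label>enumerator</label><item>a number or word that introduces a list item</item>
<label>anchor</label><item>an empty element that marks a point in the text</item>
</list>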
 
3. Embedded text
 
The problem: A mechanism is necessary in order to specify when an
entire text, or a substantial part of one, is embedded in another. There
are several ways in which this phenomenon occurs. An obvious example
is when a novel contains an embedded or quoted play, poem or other text
within it. For example, a whole chapter of _Moby Dick_ is a play.
A less clearly defined case is that of quotation. In this case,
the embedded text is only a part of a complete work.
Finally, there is the case of texts made up of several different
literary forms. These are not strictly embedded texts, although some
of them may be seen as subordinate to
others. Examples of such texts are a Menippean satire such as the
"Contest of Homer and Hesiod", the collection of the short stories, poems
and theater reviews of Dorothy Parker, or the short stories of Rudyard
Kipling, which consist of paired stories and poems. It is not clear how
these should be dealt with. Embedding is also not necessarily the ideal way to
handle data types like Shakespeare plays, which mix prose and verse
structures in the same speech.
 
The group's solution:
 
<!entity   % body    "(DIV)"                                             >
<!entity   % drama   -- "drama elements" --                              >
<!entity   % prose   -- "prose elements" --                              >
<!entity   % verse   -- "verse elements" --                              >
<!element  TEXT        - - (FRONT?, BODY , BACK?)                        >
<!element  BODY        - - (%body;+)                                     >
 
<!element  DIV         - - (DIV+ | (%drama;|%prose;|%verse;|EMBEDDED.TEXT)+) >
 
<!element  EMBEDDED.TEXT  - - (FRONT?, %body;+ , BACK?)                  >
 
or
 
<!entity   % body   "(DIV | DDIV | VDIV)"                               >
<!-- div, vdiv and ddiv are special cases of the generic div          -->
<!-- element. div contains elements belonging to the prose base, ddiv -->
<!-- contains elements from the drama base, and vdiv contains elements -->
<!-- from the verse base.                                             -->
<!entity   % drama   -- "drama elements" --                              >
<!entity   % prose   -- "prose elements" --                              >
<!entity   % verse   -- "verse elements" --                              >
 
<!element  TEXT           - - (FRONT?, BODY, BACK?)                      >
<!element  BODY           - - (%body;+)                                  >
 
<!element  DIV            - - (DIV+ | (%prose; | EMBEDDED.TEXT )+)       >
<!element  VDIV           - - (VDIV+ | (%verse; | EMBEDDED.TEXT )+)      >
<!element  DDIV           - - (DDIV+ | (%drama; | EMBEDDED.TEXT )+)      >
<!element  EMBEDDED.TEXT  - - (FRONT?, %body;+ , BACK?)                  >
 
The best solution to this problem is to allow the document hierarchy
to start all over again, triggered by the appearance of an <embedded.text> tag.
For taggers who need a more restricted structure, it is possible to
define typed divisions that will invoke a different base set of tags.
This way the embedded structure can contain a new document,
including front and back matter if necessary,
which in turn can contain any elements allowed in
the main document. The tag also marks the special, dependent status of
the embedded document, since it will always be possible to identify the
embedded elements as being children of <embedded.text>. This tag can
be used not only for the case of the chapter of _Moby Dick_, but also for
smaller embedded texts, like a quoted poem or newspaper account. [Lou
had some good counter-examples to this point, which he should repeat if
possible, since I do not remember them in sufficient detail to do them
justice. EM]
 
This solution is probably not the best way to handle text made up of
mixed types of content model, such as the _Collected Works of Dorothy
Parker_, or even Menippean Satire, unless some forms are considered
subordinate to, or more obviously embedded than others.  Of course, the
Dorothy Parker example could be seen as a document made up of a series
of embedded documents...
 
As defined above, the first DTD fragment allows all types of base content
model (verse, prose, drama or mixed) to appear at any level of <div>, as
well as at any level of embedded text. This permits <embedded.text> to
have a different set of base elements from the main document, so that
drama may be embedded in prose, etc. It also allows succeeding <div>s at
the same level to have different base sets of elements.
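 
For example, under the first fragment a play embedded in a prose
chapter might be tagged roughly as follows. The <p>, <sp> and <speaker>
elements are assumptions standing in for members of %prose; and %drama;,
which the fragment leaves undefined, and the text content is elided.
 
<text>
<body>
<div>
<p>... prose narrative ...</p>
<embedded.text>
<div>
<sp><speaker>...</speaker> ... a speech ...</sp>
</div>
</embedded.text>
<p>... prose narrative resumes ...</p>
</div>
</body>
</text>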
 
The second DTD fragment allows more control over the base set of tags
that may be used within <div>s within a document. It defines
three different but parallel <div> elements, one for each of the base sets:
drama, verse and prose. Since each of these types of <div> can only
contain the same type of <div>, it allows the tagger control over what
structures can appear in which places. Using the differentiated <div>s, it
is possible (although not necessary) to change to a different base only
when embedded text occurs. A disadvantage of the differentiated <div>s
is that it may be desirable to change to a different base within a
document in successive <div>s, as in the Rudyard Kipling example given
above. Differentiating between types of base structure by defining
different elements also implies that a different specialized and renamed
<div> is necessary for any new type of base.
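 
The same case under the second fragment might look roughly like this.
Again <p>, <sp> and <speaker> are assumed stand-ins for members of
%prose; and %drama;. The drama elements must now sit inside a <ddiv>,
and a <ddiv> can appear within the prose <div> only by way of
<embedded.text>.
 
<text>
<body>
<div>
<p>... prose narrative ...</p>
<embedded.text>
<ddiv>
<sp><speaker>...</speaker> ... a speech ...</sp>
</ddiv>
</embedded.text>
</div>
</body>
</text>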
 
Another solution to this problem that was suggested, but not
adopted by the group, was simply to use the generic <div> element
to mark embedded texts.
The problems that arise with this solution are that it would be necessary to
allow <div> to appear anywhere within paragraphs and other low level
elements, because embedded text is not necessarily at the <div> level. Since
<div> is a basic structuring element, permitting it to be used freely
anywhere would obscure any information about the structure of a
document.  Without the <embedded.text> tag, it is impossible to
differentiate between <div>s that belong to the main document, and
<div>s that belong to the embedded text.  So, for example, it would be
impossible to recover what level of <div> was in the embedded text, if
there was embedded text at different levels of the main document hierarchy.
What about the case where the embedded text is itself an outline? It would
be impossible to tell what the levels of outline were simply from their
nesting level in the <div> hierarchy.
 
To forestall an obvious suggestion, the use of the SGML SUBDOC feature
is not an option in the case of embedded text, since it implies a totally
new document, which bears no relation to and has no connection or
interdependence with the main document. This is not the case with
embedded texts.