[ I am forwarding these comments from Stig Johansson on the recent
detailed comments by Geoffrey Sampson at his request: Stig has
been having trouble getting mail through to TEI-L. If this has been
affecting anyone else on EAN, please would they let me know -- LB]
Delivery-date: Mon, 22 Oct 1990 14:59:48 UTC+0100
From: Stig Johansson
Subject: Geoffrey Sampson's comments
First of all, a brief comment on SGML-stripping, which has been the subject
of much recent discussion on TEI-L. It seems to me that it should be far easier
to strip off SGML tags, which are represented in a highly consistent manner
and are accompanied by proper documentation, than to eliminate all sorts of
ad hoc and/or inadequately documented markup.
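To illustrate why consistently delimited tags are easy to remove, here is a
minimal sketch in Python. It is only an illustration under stated
assumptions: the tag names in the sample are invented, the pattern assumes
that no literal '<' or '>' occurs inside a tag, and entity references are
left untouched.

```python
import re

# One pattern covers start tags, end tags and declarations, because all of
# them are delimited in the same way.
# Assumption: no literal '<' or '>' appears inside a tag.
TAG = re.compile(r"<[^<>]+>")


def strip_tags(text: str) -> str:
    """Return the text with every <...> tag removed."""
    return TAG.sub("", text)


# Hypothetical sample; the tag names are illustrative only.
sample = "<p><s n=1>It was late.</s> <s n=2>Nobody came.</s></p>"
print(strip_tags(sample))
```

Ad hoc markup offers no single pattern of this kind; each convention has to
be discovered and handled separately.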
But markup is there for a purpose, and it is surely more important to
consider how it should be organised. The current TEI proposals are clearly
a step in the right direction. They can certainly be improved, however.
Detailed comments are essential, as in the recent posting from Geoffrey
Sampson (GS). I will take up a couple of the points he raises.
GS says that S-units are inappropriate as a basis for reference systems and
that they do not apply to speech. The special term 'S-unit' was introduced
to indicate that these objects are not identical to sentences defined by
grammatical criteria. They are orthographic sentences and apply only to
written material. If they have been found to be convenient in writing, they
will most certainly also be so in machine-readable versions of written
texts (and compilers of machine-readable texts have indeed often introduced
special markers to identify such units). I do not deny that there are
problems of demarcation, but they do not seem insurmountable. Whether
S-units or some other mechanisms are used as a basis for the reference system
of a text cannot be decided once and for all. The current TEI proposals outline
a variety of mechanisms. The main thing is that there IS a consistent
reference system and that there is documentation on it in the file header.
GS is correct in drawing attention to the last comment on p. 103, which
suggests that you can predict the final punctuation mark of an S-unit. This
is clearly wrong and conflicts with what is said about the same matter on
p. 105 (last paragraph). The inconsistency must be put down to the
circumstances in which the draft was produced (multiple authorship, time
The representation of spoken texts
GS is right in pointing out that little attention is given to speech in
the current TEI proposals. The idea is not to neglect the representation of
spoken texts but to leave it to a later stage in the project. While we
have accepted conventions for the representation of written texts which
we can build on in developing guidelines for machine-readable texts, there are
no such conventions for speech. On the one hand, we have edited transcripts
of speech which look more or less like writing. On the other, there are
spoken corpora produced by linguists which give a faithful transcription
(indicating false starts, overlapping speech, pauses, stress, intonation, etc)
and do not impose written conventions. I am thinking particularly of the
London-Lund Corpus of spoken British English. It will be a major task in the
next stage of the TEI project to suggest guidelines for the representation
of spoken texts.
Incidentally, S-units do not apply to speech, but there is of course a need
for some sort of segment of this kind. The London-Lund Corpus uses tone
units (also used for reference purposes). Perhaps, by analogy with S-units,
we could speak of T-units. T-units are shorter (normally) and are more
indirectly related to grammar than S-units. They are realised by phonological
means, while S-units are identified orthographically.
Not everybody accepts the notion of the tone unit. Some have suggested
pause-defined units. Some Scandinavian spoken corpus projects have used
the notion of the 'macro-syntagm' (defined grammatically). Whatever unit
is chosen for speech, this should be stated in the file header. Perhaps it is
sufficient, as the TEI guidelines suggest (p. 103), to define a SEGMENT tag,
with a TYPE attribute.
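As a rough sketch of what a single SEGMENT tag with a TYPE attribute might
look like in practice, the Python fragment below wraps a list of units in
such tags. Everything specific here is an assumption made for illustration:
the lower-case element name, the "n" numbering attribute, the type value
"tone-unit" and the sample units are not taken from the TEI draft.

```python
def mark_segments(units, seg_type):
    """Wrap each unit in a numbered <segment> element of the given type.

    The element name, the attribute names and the type values are
    hypothetical; the unit actually chosen should be documented in the
    file header.
    """
    return "\n".join(
        f'<segment type="{seg_type}" n="{i}">{unit}</segment>'
        for i, unit in enumerate(units, start=1)
    )


# Hypothetical tone units from a spoken text.
print(mark_segments(["well I think", "he did"], "tone-unit"))
```

The same function would serve for S-units, pause-defined units or
macro-syntagms; only the value of the type attribute changes.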
The bulkiness of SGML-type tagging
GS mentions the problem of "the very bulky and hard-to-read format imposed by
the TEI standards". This bothers me as well, although I realise that there
is supposed to be software which can suppress tags or convert the text to
a more readable format. Checking a tagged text is extremely difficult, even
a text with fairly simple tagging. I have a lot of experience of tagging
the LOB Corpus (where each word was provided with a word-class tag). Some
decisions are very hard to make and they may become even harder with a
bulky tagging system. Validating the tagging with reference to a DTD
would only solve part of the problem.
I expressed my feelings in an impetuous note last summer which was put on
TEI-L (although it was not intended for this forum). I appreciate the
responses from Robert Amsler, Robin Cover, and And Rosta. My mind is not
quite at rest, however. The next question worries me even more.
To what extent can we predict text structure?
GS draws attention to "the fairly anarchic formats found in most kinds of
real-life documents". To what extent does this represent a problem for
SGML-type tagging? In a recent posting Robert Amsler says about the
machine-readable version of the OED that there was no DTD "since the printed
work ... was written before the tags were invented and added into its text".
But surely this applies to texts more generally?
To what extent can document structure be defined in advance? Does nobody
else see a problem here?
Oslo October 1990