Martin Mueller wrote:
> I have been thinking about retroactively tagging spoken sections in
> various fiction corpora, and I wonder whether anybody has advice on
> the utility or feasibility of such a project.
The classic example of this, long before TEI -- or SGML -- was thought of,
was John Burrows' encoding, in a database-like format, of Jane Austen's
novels. [see J.B., Computation into Criticism, Oxford 1987]. Burrows'
encoding allowed direct speech to be distinguished by character, and, more
interestingly perhaps, allowed free indirect speech nominally "uttered" by
the narrator to be assigned to the character whose "erlebte Rede" it was,
and plainly distinguished from strictly third-person narrative passages.
Basically he segmented the text into separate records and prefixed each of
them with a series of descriptors. This is surprisingly easy to do with Austen,
but I wouldn't like to try it with James Joyce or Musil.
Perhaps the best-remembered of Burrows' findings (by which I suppose I
really mean the one I remember, because I used to drag it into most of my
introductory lectures on text encoding and analysis) came from his initially
accidental failure to enable a stop-list when looking for differentiators
between the reported speech of different characters. He had expected the
unintended inclusion of "the", "and", "but", etc. to swamp any differentiating
features, but when he looked at the results of the run, he found to his
surprise that he got statistically stronger differentiations by considering
the frequency and collocations/colligations of such "stopwords" in the
speech assigned to specific characters than he did from more apparently
"characteristic" stylistic or lexical features.
This has always, to my mind, somewhat weakened the standard argument that
co-occurrences with articles or determiners mustn't be called "collocations"
because they are mere side effects of the preponderance of such forms, to be
filtered out by t- or z-scoring methods. But then I'm not altogether convinced
by the statistical basis of much textual analysis, since it seems to me that
the null hypothesis of a text which is a purely random assemblage of tokens
would not be a text at all. Fine, if the aim of our efforts is to provide
statistical evidence that something is a text rather than a random collection
of words; but most of us can more or less detect that anyway, without applying
markup and resorting to word-crunching. Not so fine, though, if we are
supposed to discern the statistical significance of co-occurrences or
sequences against the background of a null hypothesis that could hold good
only if there were no textuality present at all. Of course, the measurement
of collocation was
first introduced, long before grander schemes of corpus linguistics, and
indeed before machine computation, when stylometrics were developed as a
means of authorship attribution, an application where you do indeed have an
unimpeachable null hypothesis (correspondence with the observed
frequencies and co-occurrences in a document of known authorship) and hence
a falsifiable claim. Similarly, by working on the level of putatively
different "signatures" in the speech of the characters, whether given
directly or via style indirect libre, Burrows has a distinct null hypothesis
that lets his statistics bite.
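For readers who haven't met the z-scoring mentioned above: here is a simplified sketch of one common formulation (the exact formula varies by author; the corpus figures below are invented). Under the random-assemblage null hypothesis, the expected count of a collocate within a window around a node word follows from the collocate's overall relative frequency, and the z-score measures how far the observed count departs from it.

```python
import math

def collocation_z(observed, node_freq, collocate_freq, corpus_size, span=4):
    """z-score for observed co-occurrences of a collocate within
    +/- span tokens of a node word, against a random-placement null."""
    p = collocate_freq / corpus_size        # collocate's overall probability
    window = 2 * span * node_freq           # tokens inspected around the node
    expected = p * window                   # co-occurrences expected by chance
    sd = math.sqrt(expected * (1 - p))
    return (observed - expected) / sd

# Invented figures: a very frequent collocate like "the" needs a large
# observed count before the score rises much above chance.
z = collocation_z(observed=300, node_freq=500,
                  collocate_freq=60_000, corpus_size=1_000_000)
```

The trouble the paragraph above points at is visible in the formula itself: the whole calculation is anchored to `p`, a probability that only makes sense if tokens really were placed independently at random.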