I have been thinking about retroactively tagging spoken sections in
various fiction corpora, and I wonder whether anybody has advice on
the utility or feasibility of such a project.
As for feasibility, it's certainly going to be a tedious business.
You have to look at files one by one and figure out whether through a
combination of authorial pointers ("she said") and typographical
devices (quotation marks, dashes, etc.) you could get good enough
results (whatever 'good enough' means in that context). And you'd have
to keep your fingers crossed that a script that works for one work or
author will, with little extra labor, handle other texts well enough. Does
anybody have experience with that kind of work?
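
To make the question concrete, here is a minimal sketch of the kind of script I have in mind, combining the two cues mentioned above. Everything in it is an assumption for illustration: the names tag_speech and has_pointer are hypothetical, the list of reporting verbs is a tiny placeholder, and the pattern assumes double quotation marks that neither nest nor run across paragraph breaks. Texts that mark speech with dashes or single quotes would need their own rules.

```python
import re

# Straight or curly double quotes. Assumption: quotes do not nest and
# do not span paragraph boundaries.
QUOTE = re.compile(r'["\u201c]([^"\u201c\u201d]+)["\u201d]')

# Hypothetical, deliberately short list of reporting verbs; a real list
# would have to be tuned per author and period.
SPEECH_VERB = re.compile(r'\b(said|asked|replied|cried|answered)\b',
                         re.IGNORECASE)


def tag_speech(paragraph):
    """Split a paragraph into (label, text) pairs.

    label is 'speech' for text inside quotation marks and
    'narrative' for everything else.
    """
    segments = []
    pos = 0
    for m in QUOTE.finditer(paragraph):
        if m.start() > pos:
            segments.append(("narrative", paragraph[pos:m.start()]))
        segments.append(("speech", m.group(1)))
        pos = m.end()
    if pos < len(paragraph):
        segments.append(("narrative", paragraph[pos:]))
    return segments


def has_pointer(segments):
    """True if any narrative segment contains a reporting verb."""
    return any(label == "narrative" and SPEECH_VERB.search(text)
               for label, text in segments)


sample = '"Come in," she said. "It is late."'
for label, text in tag_speech(sample):
    print(label, repr(text))
```

The authorial pointer is used here only as a sanity check on what the typography suggests: a paragraph with quoted material but no reporting verb nearby might be flagged for a human to look at rather than tagged automatically.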
As for utility, it is a reasonable assumption that narrative and
speech will differ significantly in just about every text. I learned
this with Homer, where narrative and speech seem on the surface quite
continuous. There was a study some years ago that claimed to
distinguish between the authors of the Iliad and Odyssey on the basis
of the distribution of common words. But what that study measured was
mainly the fact that characters talk more in the Odyssey.
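
The confound is easy to check once segments carry a label. The sketch below is only an illustration under assumptions: rel_freq is a hypothetical helper, the tokenizer is a crude one for English, and the two sample segments are invented toy data, not drawn from any corpus. It computes the relative frequency of chosen common words separately for speech and narrative, so that two works can be compared within each mode rather than on pooled counts.

```python
import re
from collections import Counter


def rel_freq(segments, words):
    """Relative frequency of each word in `words`, per label.

    `segments` is a list of (label, text) pairs, e.g. with labels
    'speech' and 'narrative'. Returns {label: {word: frequency}}.
    """
    counts = {}
    totals = Counter()
    for label, text in segments:
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[label] += len(tokens)
        counts.setdefault(label, Counter()).update(
            t for t in tokens if t in words)
    return {label: {w: counts[label][w] / totals[label] for w in words}
            for label in counts if totals[label]}


# Invented toy data, for illustration only.
toy = [("speech", "I think I know"),
       ("narrative", "He said he knew")]
print(rel_freq(toy, {"i", "he"}))
```

If the per-label profiles of two works agree while their pooled profiles differ, the difference is more likely due to the proportion of speech than to authorship.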
Are there stylometric or thematic analyses for which scholars would
like to have tagged fiction corpora where narrative and speech are
tagged with sufficient accuracy? By sufficient accuracy I mean a
level that would allow a scholar interested in a particular smaller
set of works to bring them up to snuff himself over the course of a