*That* is a really excellent question. And I'm very interested to hear
that PG is thinking of switching to TEI: it will be a very interesting
experience in TEI project management!
For a project on the scale of PG, I'd think myself that the real
problems would come when it gets to the level of the phrase rather than
the div and the chunk. All volunteers will know what a paragraph is, for
example, but <mentioned> can be a bit of a hard concept for people, and
spread over thousands of volunteers it seems to me very likely to run into
real problems with differences of interpretation.
Keeping in the spirit of something Peter said, I wonder if the best
approach might be to use relatively generic tags for things like bold
and italics and underlining in texts where the meaning is not clear or
open to the least ambiguity, certainly when mass tagging. These could be
later extracted and looked at by people who need a finer level of
detail. In this sense, <hi rend="italic"> is a legitimate structural
understanding of a string in italics in a document. Since books do not
themselves distinguish explicitly between different uses of
italics--neat examples include grammars from the 1950s, where italics is
used for both <foreign> and <mentioned> and sometimes both at the same
time--distinguishing among uses in mass markup like this can be seen as
marking distinctions that may or may not be actually intended by the
print encoder (i.e. the typesetter). If you see the PG work as
diplomatics--i.e. transcription of primary sources--then interpreting
what the marks may mean semantically is really an editorial decision
that belongs to someone else.
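As a rough illustration of the "extract later" idea, here is a minimal Python sketch that pulls every <hi rend="italic"> span out of a fragment so that a later editor can decide which ones are really <foreign>, <mentioned>, or something else. The sample paragraph and the names SAMPLE and italic_spans are my own assumptions for illustration, not anything PG uses. It also assumes an un-namespaced fragment; namespaced TEI files would need the namespace added to the tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a volunteer has tagged every highlighted span
# generically, without deciding what the italics "mean".
SAMPLE = """<p>The word <hi rend="italic">weltanschauung</hi> is often glossed as
<hi rend="italic">world view</hi>, and <hi rend="bold">not</hi> as anything narrower.</p>"""


def italic_spans(xml_text):
    """Return the text of every <hi rend="italic"> element, in document order."""
    root = ET.fromstring(xml_text)
    return [
        "".join(el.itertext())          # include text of any nested elements
        for el in root.iter("hi")       # assumes un-namespaced <hi> elements
        if el.get("rend") == "italic"   # skip bold, underline, etc.
    ]


print(italic_spans(SAMPLE))
```

A list like this could then be handed to someone with the expertise to make the editorial call, without asking the original volunteers to make it.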
If this approach makes sense, then a legitimate question might be how
far to carry the use of <hi rend="italic"> instead of semantic
markup. I confess my first instinct would be to say: restrict your code
set and use it even when the sense is relatively obvious--i.e. as in
most uses of <foreign>--because even these cases will not always be
obvious. It might make sense to avoid having some words that are arguably
foreign marked up as <foreign> while others that are probably
foreign, but less certainly so, are marked up as <hi>.
On Tue, 2007-06-03 at 12:44 -0500, Joshua Hutchinson wrote:
> On 3/6/07, Julia Flanders wrote:
> > Since I don't know very much about Jon's project, it's hard for me to
> > say at this point whether the semantic nuance he asks about is
> > pointless, essential, or somewhere in between, but it's certainly an
> > interesting area to explore.
> I'd hazard that Jon's question was prompted (at least in large part)
> due to conversations we've been having about Project Gutenberg's
> efforts to switch to a TEI-based master encoding.
> So, knowing that our "markup editors" will be volunteers coming from a
> largely book-loving background and not a scholarly background (and
> hence tend to think in terms of layout vs in terms of semantics), how
> would you approach this type of issue? I.e., how strictly would you
> like to see PG stick to a "semantic markup only" philosophy? Where is
> the balance between ease of markup and good strict practices?