On 8/28/2011 2:01 AM, Stuart Yeates wrote:
> However, there are clear semantic differences between HTML and TEI.
Indeed. I would also say there is a difference between them in what
their "semantics" consist of. In particular, I would say that TEI tends
to be (or aspires to be) for the most part "descriptive", whereas HTML
tends to be "application-oriented".
This is despite the fact that HTML's origins are in a document
description format not unlike TEI (albeit poorer). Once upon a time we
might have supposed an HTML 'dl' would be assigned (and only assigned)
to definition lists, and blockquote to block quotes. But that moment was
fairly short; finally a dl is what a dl does, and likewise for
blockquotes and the rest. HTML, notwithstanding efforts to get it to do
more, is rather like SVG or XSL-FO, a tag set more or less meaningless
outside its own application domain. (This is partly because so much more
effort has gone the other way. Consider how microdata proposals for HTML
are generally intended to tie it down more tightly to particular
application semantics, not to loosen it up.)
In contrast, a tag set like TEI along with some others (DocBook, DITA,
NLM/NISO) has a rather weaker relation to particular application
semantics, and this is a source of its strength ... not because such a
binding is not possible or necessary but because it can be asserted at
need. You can do more with TEI texts than just display them; you can (or
at least this is the theory) even do more than the document encoders
specifically provided for you to do.
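To make the idea of a binding "asserted at need" concrete, here is a minimal sketch, not anything the TEI prescribes. The fragment, the function names, and the mapping of 'ab' to an HTML div are all assumptions made for illustration; the only point is that the same descriptive markup can be given two unrelated application bindings after the fact.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical TEI-like fragment (namespace omitted for brevity).
SAMPLE = "<div><p>First paragraph.</p><ab>An anonymous block.</ab></div>"


def to_html(root):
    """One binding: display. Map each child element to an HTML element."""
    html_names = {"p": "p", "ab": "div"}  # an assumed mapping, not a standard one
    parts = []
    for child in root:
        name = html_names.get(child.tag, "span")
        text = escape(child.text or "")
        parts.append(f'<{name} class="tei-{child.tag}">{text}</{name}>')
    return "".join(parts)


def word_counts(root):
    """Another binding: analysis. Count words per element type."""
    counts = {}
    for child in root:
        words = len((child.text or "").split())
        counts[child.tag] = counts.get(child.tag, 0) + words
    return counts


root = ET.fromstring(SAMPLE)
print(to_html(root))
print(word_counts(root))
```

Neither function is privileged: the markup says only "p" and "ab", and each application decides what to do with that.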
> All the semantics additions to HTML consist of tagging either the entire
> document or some portion of the document text with additional semantics
> over and above the baseline semantics of HTML. This additive model is
> great, because it allows tools to understand HTML without understanding
> all the semantic additions---tools which rely on just the semantics of HTML
> can safely ignore all the rest.
> Alas, this additive approach assumes only addition.
> When you have tei:ab defined using phrases like "...analogous to, but
> without the semantic baggage of, a paragraph", you have wording that is
> categorically not representable in any of the HTML semantic systems
> that I'm aware of.
But this conflates adding a label with adding "semantics". A label can
say anything. That "anything" can be subtractive.
In particular, there is a wonderful sort of ambiguity in the TEI's
definitions here. "ab" is defined with reference to "p" (implicitly, by
way of the word "paragraph"), but when you look to "p" to get a sense of
what the semantic baggage is that ab does without, you learn only that p
"marks paragraphs in prose", and (when you turn to Chapter 3.1) that:
    The paragraph is the fundamental organizational unit for all
    prose texts, being the smallest regular unit into which prose can
    be divided.
"Prose" is not defined here. One suspects that the definition would be
circular, as in "a sort of text made up of paragraphs".
Composition theory isn't much help. Prescriptive definitions of
"paragraph" such as the one proposed by Strunk and White (see
http://home.ccil.org/~cowan/style-revised.html#9) are interestingly
uninterested in describing paragraphs as they actually appear. (What's
even more fun: Strunk and White's own paragraphs, which contain
formatted lists, would not be valid HTML if marked as HTML p elements.)
But if this isn't the "semantic baggage" referenced in the definition of
"ab", what is? (I guess if the Guidelines don't say, we can infer
whatever we like.) Maybe TEI's semantics are, in fact, ostensive, as in
"a paragraph is whatever I am choosing to mark up as 'p' (and an ab is
something else)". I certainly don't think it would be very difficult to
find TEI 'p' elements that contain only fragments of statements
continued elsewhere, or nothing at all (in a common kind of tag abuse in
which empty p elements are used to introduce vertical whitespace).
Given all this, it hardly makes sense to try and reconcile TEI's
semantics with those sought by some systems that perform "semantic"
inferencing by traversing from one label to another across a network of
In particular, you could have HTML div[@class="tei:p"] and
div[@class="tei:ab"] and either could say just as much, or as little, as
you decide they should.
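A small sketch of what that looks like in practice, using only Python's standard html.parser. The sample markup, the LabelCollector class, and the INTERPRETATION table are hypothetical; nothing here is defined by HTML or by TEI. The class values are plain strings, and whatever they "mean" is supplied by the application that reads them.

```python
from html.parser import HTMLParser

# Hypothetical HTML using class values as labels borrowed from TEI.
SAMPLE = (
    '<div class="tei:p">Some prose.</div>'
    '<div class="tei:ab">A block that claims nothing about prose.</div>'
)


class LabelCollector(HTMLParser):
    """Collect the text of each (non-nested) div, keyed by its class label."""

    def __init__(self):
        super().__init__()
        self.found = []     # [label, text] pairs in document order
        self._open = False  # True while inside a div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.found.append([dict(attrs).get("class"), ""])
            self._open = True

    def handle_endtag(self, tag):
        if tag == "div":
            self._open = False

    def handle_data(self, data):
        if self._open:
            self.found[-1][1] += data


def collect(html_text):
    parser = LabelCollector()
    parser.feed(html_text)
    parser.close()
    return [tuple(pair) for pair in parser.found]


# What each label "means" is decided here, by the application, not by HTML.
INTERPRETATION = {
    "tei:p": "paragraph of prose",
    "tei:ab": "block with no claim to be a paragraph",
}

for label, text in collect(SAMPLE):
    print(label, "->", INTERPRETATION.get(label, "(uninterpreted)"), ":", text)
```

A tool that knows nothing of these labels still sees two ordinary divs; a tool that does know them can read the second label as saying less than the first.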
But maybe that's exactly Stuart's point. :-)
Wendell Piez
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
Mulberry Technologies: A Consultancy Specializing in SGML and XML