Robert Cover has pointed out what he thinks is an ambiguity in the W3C
definition of xml:lang. He also mentions TEI. Has his criticism been
answered? Is it valid?
Since a number of us are working on multilingual texts, let me cite the
main part of his argument here (more on:
All the best
Evaluation of xml:lang:
The implied inheritance of a language property (viz., the xml:lang
value) by subelements in the instance hierarchy may be considered a
very useful feature. However, the prescribed semantic "is considered to
apply to all attributes and content of the element where it is
specified" may be regarded (arguably) as suboptimal for tagging
multilingual text, or even for annotating a text in a single language
"foreign" to the markup specialist. In some settings, xml:lang may
simply be unusable if the semantic prescription of the XML 1.0
specification is to be honored. Details follow.
Section 2.12 describes the use and meaning of xml:lang as follows:
* ... to specify the language used in the contents and
attribute values of any element...
* ...The intent declared with xml:lang is considered to apply
to all attributes and content of the element where it is specified,
unless overridden with an instance of xml:lang on another element
within that content.
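
To make the scoping rule concrete, here is a minimal sketch (not part of the quoted text) that assumes Python's standard xml.etree.ElementTree module. The helper name languages_in_scope and the sample document are made up for illustration. It walks a tree and reports which xml:lang value is in force for each element, with a nested xml:lang overriding the inherited one.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages_in_scope(elem, inherited=None):
    """Yield (tag, language) for elem and all of its descendants.

    An element's own xml:lang wins; otherwise the value inherited
    from the nearest ancestor applies (None if nothing was declared).
    """
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from languages_in_scope(child, lang)


# Hypothetical sample document, invented for this illustration.
doc = ET.fromstring(
    '<text xml:lang="en">'
    '<p>He said <q xml:lang="de">bei mir</q>.</p>'
    '<p>Fine.</p>'
    '</text>'
)
scopes = list(languages_in_scope(doc))
for tag, lang in scopes:
    print(tag, lang)
```

The second <p> falls back to "en" because the override on <q> ends with that element.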
DTD authors will naturally want to design markup constructs (e.g.,
element type names, attribute names, attribute value name-tokens in an
enumerated attribute type) for their users in terms of the users'
native language. That is: users want markup labels (XML "names") to be
in their first language. Even more critically: if users are required to
supply a short phrase-level descriptor as CDATA content for an
attribute, they naturally want to think and write in their own
language. The XML specification seems not to allow this in cases where
the element content is declared to be in some other language. The
phrase "all attributes and content" seems to require that a global
language assertion would be made by the use of xml:lang in any element.
Example #1: The TEI (P4) DTD defines a <q> element for quoted
speech; this element has two CDATA attributes ('who' and 'type') as
well as an enumerated-type attribute 'direct' with attribute type and
default value (y | n | unspecified) "unspecified". Using the TEI P4
'lang' attribute (a global IDREF attribute indicating the language,
writing system, and character set associated with a given element), the
following <q>...</q> encoding would be sensible for an English-speaking
student wishing to mark up a German quoted phrase: <q lang="de"
who="Hans" type="spoken" direct="unspecified">bei mir</q>. The
following would not: <q xml:lang="de" who="Hans" type="spoken"
direct="unspecified">bei mir</q>. The prescribed meaning of xml:lang
seems to require a declaration that the terms "spoken" and
"unspecified" (at least) are in German, as well as "bei mir." This is
not a boundary case, as the TEI DTD has dozens or maybe hundreds of
CDATA attributes which invite substrings "in" the native language of
the encoder, which would conflict with the semantic for xml:lang in a
bilingual or multilingual encoding environment. It is unclear how the
TEI editors could entertain a proposal to substitute the xml:lang
attribute of XML 1.0 for the TEI P4 lang attribute in the P4 XML DTD,
given the scope specification for xml:lang.
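
A rough sketch of what Example #1 amounts to if the "all attributes and content" wording is taken literally, assuming Python's standard xml.etree.ElementTree module. The function name claimed_languages is my own. It simply lists every attribute value and the text content of the <q> element together with the language that xml:lang would assert for them.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined "xml:" prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def claimed_languages(elem):
    """List (item, value, language) under a literal reading of XML 1.0, 2.12.

    Every attribute value other than xml:lang itself, plus the text
    content, is reported as being in the language declared on elem.
    """
    lang = elem.get(XML_LANG)
    claims = [
        ("@" + name, value, lang)
        for name, value in elem.attrib.items()
        if name != XML_LANG
    ]
    claims.append(("content", elem.text, lang))
    return claims


q = ET.fromstring(
    '<q xml:lang="de" who="Hans" type="spoken" '
    'direct="unspecified">bei mir</q>'
)
claims = claimed_languages(q)
for item, value, lang in claims:
    print(item, repr(value), lang)
```

On this reading "spoken" and "unspecified" are reported as German alongside "bei mir", which is exactly the conflict the example describes. Whether processors are actually obliged to read the sentence this way is the open question.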