Juan A. Trujillo's thoughtful posting on linguistic annotation described
these methods he'd tried or considered in his current work:
| METHOD ONE:
|
| BNC-type tags in the form of entity refs like: perro&NMS;, which could
| be expanded to "Noun Masculine Singular" or something similar.
|
| [...]
|
| METHOD TWO:
|
| Go ahead and use the <W> element, but modify the attribute list,
| adding part of speech, tense, number, gender, etc. and defining
| possible values for each.
|
| [...]
|
| METHOD THREE:
|
| Add elements for each part of speech that would be contained by the
| standard TEI <W> element as defined in the guidelines. Define attribute
| lists tailored to those parts of speech--no illegal combinations.
|
| [...]
|
| METHOD FOUR:
|
| Feature structures. [...]
I'd be inclined to say that Method Two (<W> with separate attributes for
different linguistic properties), the method Juan says he's leaning
towards, is a reasonable approach to the task, especially in view of the
time pressure involved; it creates fairly tractable data, and its lack
of validation of some features of the data doesn't seem to me to be a
severe failing.
Method Two models the information by using SGML features in a
straightforward way, so it would be a simple task to convert such a
document into the existing forms used in linguistic research, such as
Method One. (Going the other way, from Method One to Method Two, would
be rather more work.) So the goal of making your data useful with
various systems across time with minimal pain would be reasonably
achieved.
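As a rough illustration of how mechanical that conversion is, here is a
minimal Python sketch that rewrites Method Two markup as Method One
entity references. The attribute names (pos, gender, number), their
one-letter values, and the function name to_entity_tags are assumptions
made up for this example; the real names would come from whatever
modified <W> attribute list is actually defined. It also assumes quoted
attribute values and no nested <W> elements, which real SGML does not
guarantee.

```python
import re

# Hypothetical markup: <W pos="N" gender="M" number="S">perro</W>
# Assumes quoted attribute values and no nesting of <W> elements.
W_ELEMENT = re.compile(r'<W\b([^>]*)>(.*?)</W>', re.IGNORECASE | re.DOTALL)
ATTRIBUTE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def to_entity_tags(text, order=("pos", "gender", "number")):
    """Rewrite each <W ...>word</W> as word&TAG; in the style of Method One.

    The tag is built by concatenating the attribute values in `order`;
    attributes that are absent contribute nothing.
    """
    def replace(match):
        attrs = {name.lower(): value
                 for name, value in ATTRIBUTE.findall(match.group(1))}
        tag = "".join(attrs.get(name, "") for name in order)
        word = match.group(2)
        # A <W> with none of the listed attributes is left as the bare word.
        return f"{word}&{tag};" if tag else word

    return W_ELEMENT.sub(replace, text)


print(to_entity_tags('el <W pos="N" gender="M" number="S">perro</W> ladra'))
```

The reverse conversion would need a table saying how each entity name
decomposes into separate properties, which is part of why that
direction is more work.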
(Method Two's form could also be converted quite readily into one that
uses feature structures, as in the example in section 16.10 of the TEI
guidelines. Indeed, if you wanted to use feature structures, something
like Method Two could be a good way to create the data initially, since
feature structures were intended to be a powerful system for encoding
structured information, and not something that would be simple to
type.)
The concern Juan states about this approach, and a motivating factor
behind the suggestion of Method Three (using subelements rather than
attributes to express at least some of the linguistic structure), is
that when you use attributes there isn't anything to stop you from
specifying nonsensical combinations. But, valuable as the validation
features of SGML are, there is always a point at which you'll find it
necessary, or even just more convenient, to write a program to examine
the data for problems, and I think that's the case here.
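Such a checking program need not be elaborate. The following Python
sketch flags <W> start-tags whose attributes make no sense for their
part of speech. Everything specific in it is an assumption for the sake
of illustration: the attribute names, the part-of-speech codes, the
table of permitted combinations, and the function name find_problems. A
real table would have to be drawn up from the language being described.
Like any regular-expression approach it assumes quoted attribute values.

```python
import re

W_START = re.compile(r'<W\b([^>]*)>', re.IGNORECASE)
ATTRIBUTE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

# Hypothetical rules: attributes permitted for each part-of-speech code.
ALLOWED = {
    "N": {"gender", "number"},
    "V": {"tense", "number", "person"},
}


def find_problems(text):
    """Return (start_tag, message) pairs for suspicious <W> start-tags."""
    problems = []
    for match in W_START.finditer(text):
        tag = match.group(0)
        attrs = {name.lower(): value
                 for name, value in ATTRIBUTE.findall(match.group(1))}
        pos = attrs.pop("pos", None)
        if pos is None:
            problems.append((tag, "missing pos"))
            continue
        if pos not in ALLOWED:
            problems.append((tag, "unknown pos: " + pos))
            continue
        # Anything left over that the table does not permit is reported.
        extra = sorted(set(attrs) - ALLOWED[pos])
        if extra:
            problems.append(
                (tag, "not allowed for " + pos + ": " + ", ".join(extra)))
    return problems


for tag, message in find_problems('<W pos="N" tense="PAST">perro</W>'):
    print(tag, "->", message)
```

A table like ALLOWED is easier to revise than a DTD with one element
per part of speech, and it can express constraints on attribute values
as well as on attribute names if that turns out to be needed.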
John Lavagnino
Women Writers Project
Brown University