>> Section B is relative to the various parameters that could be needed
>> to process the text by a particular tool.
>> For example, in the analysis tools we build here, we would need to encode
>> the following parameters :
>> - specific tokenization rules, including character classes definitions
>> particular element roles definition : elements composing or delimiting
>> words (num, abbr, w...), reading choosing rules (choice, corr, sic...),
>> etc. ;
>> - specific text processing parameters :
>> -- like "not to be processed" rule definitions : focusing on elements
>> like gap, note... ;
>> -- like "specific indexes" rule definitions : focusing on elements like
>> head, foreign, hi... ;
>> - application specific parameters :
>> -- like text/corpus partitioning definitions, section referencing
> Why could your application not record, in every
> encodingDesc/tagsDecl/tagUsage element, a fs of your own, containing a
> REFID toward an ID identifying your application, and where your application
> may record the information it needs?
OK, let's forget about the <processingDesc> element proposal and try to use
the <encodingDesc>/<tagsDecl> element instead, as you propose.
We can probably do a lot using <tagUsage> to describe particular processing
instructions bound to each element.
Nevertheless, I can see limitations in doing only that:
- an application needs to express its parameters from its own point of view.
Typically there is no reason why the data model of an application should
embrace only or exactly the TEI content model expressed in a document
to process, especially if, as you say, a TEI document should not be bound
to any specific application - and I agree that we should do the maximum
toward that. On the other hand, we all know that being able to use
a particular TEI element to encode something in a text does not mean that
every TEI text is encoded with every possible element, at saturation of its
content model.
That is, we have to be able to process texts at every stage of their life.
So what do we do if a particular TEI document has not encoded
some specific information needed by the application data model? It will be
difficult to express processing parameters on particular elements, whether
they alone encode all the information needed or just contribute to it,
if they are not present in the document at all.
Let's take the example of an application which has to model sentences
containing words. There are tools to encode this in TEI, and there are also
tools that try to discover this in TEI texts, with or without the help of
previous encoding.
What do we do if the application has to discover/read sentence boundaries
when there is no precise encoding of them in the document? For us, in that
case, we think that it would be a pity not to try to discover sentence
boundaries, possibly with the help of processing instructions coming from
the header, in particular when there is not sufficient information available
in the body of the text itself.
For example, when there is no sentence encoding available, an example rule
we use is:
even if at the end of the content of a <div>/<head> element there is no such
thing as a "hard" punctuation character (which is often the case) - which is
a heuristic we use to find the end of sentences - force the end of sentence
there.
This kind of rule is active if we decide that section titles are part of
the text to index.
I am not saying here that section titles must be made of sentences, but
the fact is that sentence boundaries are used a lot by tools like POS
taggers, chunkers, etc., and if it is decided that section titles must be
indexed after [...] example, the need for a sentence context arises
artificially, so we have to [...]
In that example, the <head> element may be declared as a potential "implicit
sentence splitter". But there are other places with potential "implicit
sentence splitters", and it can become tricky to declare this precisely and
exhaustively from the sole point of view of each TEI element class.
- the processing of an element often/always depends on its surrounding
context at the precise place where it is in a document. In the TEI universe,
this can be determined by siblings and ancestors of the element, if you know
and trust the encoding of those elements. But it can also depend on
information coming from the text around the element, and it can become
tricky to express this in a general declaration bound to only an element
class in the header.
For example, if the <div> element is used a lot hierarchically with various
structural meanings in the same document, it can become cumbersome to
declare precisely different processing parameters, relative to indexing or
table of contents construction for example, with just the <div> class usage
declaration.
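
To make the <head> rule described above concrete, here is a minimal sketch in
Python. It is only an illustration, not the tool we actually use: the names
IMPLICIT_SENTENCE_SPLITTERS, HARD_PUNCTUATION and sentences() are hypothetical,
the sample document is un-namespaced, and the "hard punctuation" test is
deliberately naive (it would wrongly split on abbreviations such as "e.g.").

```python
import xml.etree.ElementTree as ET

# Hypothetical processing parameter, as it might be read from the header:
# elements whose end forces an end of sentence.
IMPLICIT_SENTENCE_SPLITTERS = {"head"}
# Naive definition of "hard" punctuation (an assumption for this sketch).
HARD_PUNCTUATION = ".!?"


def sentences(root, splitters=IMPLICIT_SENTENCE_SPLITTERS):
    """Return the sentences found in document order under `root`."""
    out, buf = [], []

    def flush():
        # Normalize whitespace and close the current sentence, if any.
        s = " ".join("".join(buf).split())
        buf.clear()
        if not s:
            return
        if out and not any(c.isalnum() for c in s):
            out[-1] += s  # stray punctuation (e.g. "..") joins the previous sentence
        else:
            out.append(s)

    def feed(text):
        for ch in text or "":
            buf.append(ch)
            if ch in HARD_PUNCTUATION:
                flush()  # ordinary heuristic: hard punctuation ends a sentence

    def walk(el):
        feed(el.text)
        for child in el:
            walk(child)
            feed(child.tail)
        if el.tag in splitters:
            flush()  # forced boundary, even without hard punctuation

    walk(root)
    flush()
    return out


SAMPLE = """<div><head>Tokenization rules</head>
<p>Words are delimited by spaces. Some elements delimit words too</p></div>"""

root = ET.fromstring(SAMPLE)
print(sentences(root))
# ['Tokenization rules', 'Words are delimited by spaces.',
#  'Some elements delimit words too']
print(sentences(root, splitters=set()))
# ['Tokenization rules Words are delimited by spaces.',
#  'Some elements delimit words too']
```

With the rule switched off (second call), the section title runs into the
first sentence of the paragraph, which is exactly what disturbs tools working
on a sentence context. Note that the parameter here is a flat set of element
names; it says nothing about where the element sits in the document.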
So, it seems that <tagUsage> alone would be cumbersome to use to
declare various processing parameters (my initial section B, section A
being encoded in the <revisionDesc> element).
Fortunately, <tagUsage> has, in P4 and P5, a sibling called <rendition>, which
"supplies information about the intended rendition of one or more elements".
If the <rendition> element was created to somewhat compensate for a TEI
encoding practice oriented more toward logical than presentational
encoding, then it is a cousin of a <processing> element we could create,
maybe to replace <rendition>, whose role would be to encode processing
parameters not only for presentation, as the "rendition" name suggests (aka
CSS or XSLT style sheets), but also for any specific processing.
On the other hand, if the "rendition" of a document was the only process
planned at one time for TEI documents, maybe it is time to
generalize a bit.
>> Please note that *any* software should be able to store information
>> there. For example, even general editors could store information like:
>> - printed by
>> - last print date
>> - editing duration
>> - total editing duration
>> - document model used
>> - autoload on/off
>> - etc.
>> as can be seen in the metadata part of the ODT file format for example.
> I think that every comparison with an existing format is dangerous because
> the TEI is not associated with one application (or at least one
> well-defined class of application), as the ODT format is. Moreover, I'm
> not sure that a word-processor format's goal is that close to the TEI
> format goal.
We are at a moment when the book is no longer the only technology
available to be, at the same time, the physical support of a document and
a help to access it through reading.
If the TEI <rendition> element is a reminiscence of the "access through
reading" aspect of the initial, and still needed, reading device, then if we
decide to access documents through other means, we have to decide how, in a
community-wide standard way. I am not so sure that "word-processor" is
a well-defined class of application. But I can see the size of the community
supporting the (ISO-approved) ODT file format standard. And I think it
would be a pity not to situate our discussion with respect to a standard
that will store the metadata of the majority of our documents for a long
time from now, even if it's "only" for word processing.
Thank you Martin for raising that debate.
Thank you Sylvain for taking the role of the one who tries to keep for the
longest time possible the (difficult) decisions that have already been made