Hi James,

On 02/11/2016 at 17:49, James Cummings wrote:
> One of the approaches I think more generalistic TEI software development should take is specification of input formats and provide generalised conversions from tei_all to the subset that the software does something useful with. (Ok, this is perhaps less relevant for things like TXM, general editors, or database frameworks.) But if we imagine a new tool to display, visualize, or process TEI there is no reason it should necessarily cope with the whole of the TEI. It can use the TEI ODD customisation language to specify a meta-schema that it can handle. (And as Magdalena was noting, including processing model information in that so it could act as a sort of configuration file for that processing.)
TEI Simple is a step in that direction, for a specific processing model, whose semantics we could perhaps augment.
What is difficult for us is:
1) to know our own processing model (we also have chameleons on our side...)
2) to represent our own processing model in such a container.
> If your software won't do anything with <w> elements and just ignore their existence, then don't include them in the schema and let people get errors/warnings about them.
> If you have a fixed list of @type values your software expects on <name>, then document that in TEI ODD. And then, through schema errors or Schematron warnings, a user can test whether their source documents are processable by that bit of software. Even better if there is then a tei_all-to-MySpecialSoftware conversion script which throws away all the stuff this piece of software is going to ignore or fail on.
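As a sketch of this approach, a minimal ODD customisation could remove <w> from the schema and close the list of @type values the software understands on <name>. The module selection and the value list below are illustrative assumptions, not TXM's actual input specification:

```xml
<schemaSpec ident="mySoftware_input" start="TEI">
  <!-- Illustrative module selection; a real tool would pick its own. -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <!-- Include the analysis module but drop <w>, which the software ignores. -->
  <moduleRef key="analysis" except="w"/>
  <!-- Close the value list of @type on <name> to what the software handles. -->
  <elementSpec ident="name" mode="change">
    <attList>
      <attDef ident="type" mode="change">
        <valList type="closed" mode="replace">
          <valItem ident="person"/>
          <valItem ident="place"/>
          <valItem ident="other"/>
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>
```

A schema generated from such a specification would then give users exactly the errors and warnings described above, before they ever run the software.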
Currently in TXM we allow you to use <w> or not:
a) all words are encoded with <w> (generally some tools have probably already done some work on a previous representation of the text)
b) some words are encoded with <w> (very useful for a progressive, lazy encoding strategy: an encoding/analysis cycle)
c) no word is encoded with <w> (the most frequent case)
Case b) is the most difficult to process. But for a) we still have to decide on a decoding strategy to get the word form and other properties, and for c) it is very often still difficult to decide which 'base text' character stream we must rely on to build words.
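Case b) might look like the following fragment (a hypothetical example, not taken from a real corpus), where some words carry prior annotation in <w> while the rest of the stream is plain text:

```xml
<p>The <w lemma="judge" pos="NN">judge</w> spoke softly to the
crowd <w lemma="gather" pos="VBN">gathered</w> outside.</p>
```

An indexer then has to merge two tokenization regimes: trust the existing <w> boundaries and properties, and apply its own tokenizer to the untagged character stream between them.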
We also have to decide in which textual plane to index the words found in the stream: the titles plane, the stage directions plane, the notes plane, the body plane, etc.
So it is not only a matter of expressing constraints on the input representation, but also of tuning the flexibility of its processing.
The constraints can also rely on information expressed outside of the representation. For example, if the <w> elements encode morpho-syntactic information produced by a specific NLP tool, we may need to check constraints on the morpho-syntactic tagset, which is very rarely defined in the teiHeader or anywhere in the text (typically because NLP tools change and rarely declare their outputs formally).
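Such a constraint could be expressed as a Schematron rule. The tagset values below are a hypothetical example, since, as noted, the actual tagset of an NLP tool is rarely declared anywhere:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
  <pattern>
    <rule context="tei:w[@pos]">
      <!-- Hypothetical closed tagset; replace with the tool's real one. -->
      <assert test="@pos = ('NOM', 'VER', 'ADJ', 'ADV', 'DET', 'PRO', 'PRP', 'CON', 'INT')">
        Unknown part-of-speech value "<value-of select="@pos"/>" on a w element.
      </assert>
    </rule>
  </pattern>
</schema>
```

Because such a rule lives outside the schema proper, it can be updated when the NLP tool changes without touching the ODD customisation itself.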
Another example is expressing constraints on encodings of ontological information in texts linking to ontology repositories.

> I know the next question will be why are we encoding it if we then throw it away -- and clearly the answer is we may throw it away for _this_ bit of processing or visualization or whatnot, but that doesn't mean it isn't crucial for other bits of analysis and research. So your TEI Zero, for example: I can validate against its schema and if I don't get any warnings then I know that your software won't have a problem with it. If I do, I can judge if they are errors, or warnings like "Your <name type="thingy"> will be treated as <name type="other"> in our software" -- then I can make an informed decision about how the software will work with my texts or whether I should convert them to match your values. I realise, of course, some people already do this, but it may be worth reiterating as it seems a lot more practical than people trying to develop software that will cope with any of the TEI vocabulary (never mind new things a project adds...).
The TEI vocabulary is fine; it is the flexibility of its usage that is difficult to cope with.
Currently, what we call 'TEI Zero' in the TXM universe has no formal definition yet. It is just an under-constrained specification of the tiny subset of TEI encodings that we have processed the most in recent years for analysis and publishing, and that we want to help people work with.

-slh


-- 
Dr. Serge Heiden, [log in to unmask], http://textometrie.ens-lyon.fr
ENS de Lyon/CNRS - IHRIM UMR5317
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883