This is a wonderful idea. I'll give it a good workout next week -- I
have several projects that can really make use of it, and one in
particular has several thousand TEI files, so it'll be a serious stress
test. I can throw 6 or 7 GB of memory at Java if necessary.
Sebastian Rahtz wrote:
> I have entertained myself recently by writing a utility
> which attempts to work out the minimal TEI customization
> needed to validate a collection of files.
> What I have done is create an XSLT (version 2) stylesheet which
> traverses a nominated directory tree looking for
> *.xml files which have <TEI> or <teiCorpus> root
> elements. It analyzes the collection of elements
> and attributes in the resulting corpus, and compares
> that to the whole of TEI P5. An ODD file is then generated which:
> * loads the required modules
> * deletes any elements which are not used
> * deletes, for each element, any attributes (including
>   class attributes) which are not used
> * for every attribute which has a TEI "data.enumerated" datatype,
> constructs a closed <valList> enumerating the values actually used.
> From this you can construct a target schema, obviously.
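For what it's worth, here is how I picture the corpus scan working in XSLT 2. This is my own sketch, not Sebastian's actual oddbyexample.xsl; the `?select=...;recurse=yes` collection() URI syntax is Saxon-specific, and the parameter name `dir` is my invention:

```xml
<!-- Hypothetical sketch of the corpus-scanning step (not the real code) -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:tei="http://www.tei-c.org/ns/1.0"
                version="2.0">

  <!-- directory tree to scan; parameter name is illustrative only -->
  <xsl:param name="dir" select="'.'"/>

  <!-- every *.xml file under $dir whose root is tei:TEI or tei:teiCorpus -->
  <xsl:variable name="docs"
    select="collection(concat($dir, '?select=*.xml;recurse=yes'))
            [tei:TEI | tei:teiCorpus]"/>

  <xsl:template name="main">
    <!-- the distinct TEI element names actually used across the corpus;
         comparing this list against p5subset.xml tells you which
         elements the generated ODD can delete -->
    <xsl:for-each select="distinct-values($docs//tei:*/local-name())">
      <xsl:message select="."/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The same distinct-values() trick presumably applies per element to attribute names and to attribute values, which would give the closed valList enumerations.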
> Is this of interest to anyone apart from me? If so,
> I could do with some testing and feedback.
> Memory capacity is an issue, obviously. My test set
> is the XML files in the TEI P5 Guidelines "Test" directory,
> and it can run over all the Shakespeare plays in a few seconds,
> but it's not going to read a giant corpus unless you
> assign a big chunk of memory to Java. Caveat emptor.
> Want to try? Grab getfiles.xsl and oddbyexample.xsl from Sourceforge
> and run it something like this:
> saxon -o my.odd oddbyexample.xsl oddbyexample.xsl
> The script assumes you have the TEI package which has a file
> called "/usr/share/xml/tei/odd/p5subset.xml". If you don't
> have that, grab http://www.tei-c.org/release/xml/tei/odd/p5subset.xml,
> put the file somewhere, and add a "tei" parameter to point
> at it.
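For anyone else following along, here is my reading of the invocation spelled out. The `saxon` command above is whatever wrapper launches Saxon 9 on your system ("java -jar saxon9.jar" is one common form), the -Xmx heap size and paths are only examples, and the stylesheet appears twice presumably because it serves as its own dummy source document:

```shell
# Example only: launcher, heap size, and paths will vary by installation.
# -Xmx gives Java extra heap for a large corpus; adjust to taste.
java -Xmx6g -jar saxon9.jar -o my.odd oddbyexample.xsl oddbyexample.xsl \
     tei=/usr/share/xml/tei/odd/p5subset.xml
```

The tei= parameter is only needed if p5subset.xml is not already at the default path mentioned above.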
>  Warning: I don't think I can face
> adding the code to handle any or all of
> * deriving simplified content models (beyond what Roma already does)
> * adding new elements and deriving a content model
> * dealing with non-TEI namespaces
> * generating attribute datatypes with complex regexps
> * working out Schematron constraints etc
> but of course you are welcome to try yourself :-}
>  (No, not literally! It's open source, free, etc.)