It might interest you to know about the LEGEBIDUNA project.
We've created a corpus of administrative (not legal) documentation:
the Official Bilingual Journals of the Basque Administration (almost 10
million words in each language, Basque and Spanish). We're now tagging
the texts, i.e. recognizing administrative and legal formulae and
terminology, and their distribution in the texts' structure. Our DTDs
are deduced from the tagged corpora, i.e. first we tag the texts, then
we construct the DTDs.
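
To give a rough idea of what "tag first, construct the DTD afterwards"
can look like, here is a minimal sketch in Python. It is not our actual
tool, nor the algorithm of any of the systems mentioned in this message.
It assumes the tagged texts are well-formed XML (our own material is
SGML), and the element names (decree, heading, formula, body, term) are
hypothetical, chosen only for illustration. For each element it records
which child elements and text were observed and emits a deliberately
over-general content model.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def infer_dtd(samples):
    """Infer a naive, over-general DTD from already-tagged sample texts."""
    children = defaultdict(set)   # element name -> child element names seen
    has_text = defaultdict(bool)  # element name -> True if it ever held text
    order = []                    # element names, in order of first appearance

    for sample in samples:
        root = ET.fromstring(sample)
        for elem in root.iter():
            if elem.tag not in order:
                order.append(elem.tag)
            # Text directly after the start tag belongs to this element.
            if elem.text and elem.text.strip():
                has_text[elem.tag] = True
            for child in elem:
                children[elem.tag].add(child.tag)
                # Text following a child (its "tail") also belongs to the parent.
                if child.tail and child.tail.strip():
                    has_text[elem.tag] = True

    declarations = []
    for name in order:
        kids = sorted(children[name])
        if kids and has_text[name]:
            model = "(#PCDATA | " + " | ".join(kids) + ")*"  # mixed content
        elif kids:
            model = "(" + " | ".join(kids) + ")*"  # any observed child, any order
        elif has_text[name]:
            model = "(#PCDATA)"
        else:
            model = "EMPTY"
        declarations.append(f"<!ELEMENT {name} {model}>")
    return "\n".join(declarations)


# Hypothetical tagged samples, invented for this illustration.
SAMPLES = [
    "<decree><heading>Order 12</heading>"
    "<formula>It is hereby resolved</formula>"
    "<body>Text with a <term>grant</term> inside.</body></decree>",
    "<decree><heading>Order 13</heading><body>Plain text.</body></decree>",
]

if __name__ == "__main__":
    print(infer_dtd(SAMPLES))
```

The content models produced this way accept far more than the samples
show (any observed child, in any order, any number of times). A serious
approach would generalize the observed child sequences into tighter
models with ordering and occurrence indicators.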
Similar experiments have been reported in "Automatic generation of
SGML content models", Electronic Publishing, vol.8:195-206, by Helena
Ahonen.
You can also have a look at Keith Shafer's Fred parser for automatic
DTD creation at http://www.oclc.org/fred/docs/papers/
For our project, we have a page in Spanish at:
Joseba Abaitua http://www.deusto.es/~abaitua
Facultad de Filosofia y Letras, Universidad de Deusto, Apartado 1
E-48007 Bilbao || Tel: +34-4-4139092 (Ext. 2292) || Fax: +34-4-4458916