I am contemplating a new text-encoding project for private purposes. In case it is relevant to this discussion, I have a collection of digital texts in economics, investment, finance, markets, risk, etc., in a range of file formats. I intend to encode these texts to define document structure and content, to assist with indexing, searching, cross-referencing/linking, data extraction, and any other uses I can think of in the future for general learning and content re-use.
Over the past couple of months I have familiarized myself with the TEI P5 guidelines as well as other XML schemas including DAISY/DTBook, DITA, DocBook, EPUB, XHTML (I am aware of NLM but haven't yet looked at it), and XML generally, so I feel I now broadly understand the XML concepts of markup, schema modularity/customization, conformance, validation, format conversion/transformation and so on. I have come across the idea of single-source publishing and found this presentation interesting (http://www.stm-assoc.org/2012_12_06_EProduction_Kasdorf_Upfront_XHTML.pdf), amongst others, which got me thinking about what may be the best base/master/root format for my purposes.
For the sake of clarity (and I hope this does clarify as intended), by "base/master/root format" I mean the following. Suppose the project combines a number of existing schemas (e.g. TEI, MathML, SVG, DTBook, DITA, XHTML, etc.), and perhaps others yet to be defined, across a collection, such that any single encoded document may include elements from any combination of schemas. The base/master/root format is then the one that supplies the common root element of each encoded (single-source) document. If all encoded documents have a TEI root element, TEI has been chosen as the base/master/root format.
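To make that definition concrete, here is a minimal sketch in Python (my choice purely for illustration). It assumes a hypothetical sample document with a TEI root element and an embedded MathML formula; the sample content and the helper names split_tag and describe are mine, not taken from any schema or tool. Whether a particular TEI customization actually permits MathML inside a formula element is a separate validation question that this sketch does not address. It only reports the namespace of the root element (the "base" format in my sense) and any other namespaces used inside the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a TEI root element with one embedded MathML formula.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample</title></titleStmt>
      <publicationStmt><p>Unpublished</p></publicationStmt>
      <sourceDesc><p>Born digital</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Discount factor:
        <formula notation="mathml">
          <math xmlns="http://www.w3.org/1998/Math/MathML"><mi>d</mi></math>
        </formula>
      </p>
    </body>
  </text>
</TEI>"""


def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return "", tag


def describe(xml_text):
    """Return (base namespace, root element name, sorted foreign namespaces)."""
    root = ET.fromstring(xml_text)
    base_ns, root_name = split_tag(root.tag)
    # Collect the namespace of every element, then drop the root's own.
    all_namespaces = {split_tag(el.tag)[0] for el in root.iter()}
    foreign = sorted(all_namespaces - {base_ns})
    return base_ns, root_name, foreign


if __name__ == "__main__":
    base_ns, root_name, foreign = describe(SAMPLE)
    print("root element:", root_name)
    print("base namespace:", base_ns)
    print("other namespaces:", foreign)
```

For this sample the root element is TEI in the TEI namespace, and MathML shows up as the only "foreign" namespace. Swapping the root for an XHTML html element would make XHTML the base format in the sense I describe, with TEI elements (if any) then appearing as foreign content.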
With that in mind, how can a new text-encoding project best go about determining which format to settle on as its base/master/root format?
If those of you who are now experienced with TEI were starting your first encoding project today, how would you go about assessing whether TEI was the most appropriate schema to build around, given the range of alternatives now available? What process/methodology might you use?
Put another way, what advantages does TEI have as the base format for encoding a collection of digital texts over other formats/schemas (e.g. DITA, DocBook, NLM, XHTML, etc.)?
For general use cases (i.e. defining the basic structure of a document), are there any clearly identifiable reasons to use TEI? Are there specific use cases that only TEI can address?
Given W3C efforts in recent years to modularize XHTML, and given the similarities between XHTML and various other formats including TEI (both are XML-based standards for representing texts in digital form, both consist of modules, and both can be subsetted or extended), why would a new text-encoding project settle on TEI instead of XHTML or some other schema?
Although TEI is extensible and customizable (like many other schemas), is it likely to be used increasingly for specialized domains or purposes (e.g. the verse, linguistic analysis, and manuscript description modules not found elsewhere)?
And a final related question: how likely is it that TEI's core modules/elements that have equivalents in other schemas will be standardized further with those equivalents, and is there any such roadmap or work already under way?
While the guidelines and the website are great at explaining TEI itself, I have been unable to find any discussion of why a project would use TEI in lieu of the alternatives. I came across a mailing list discussion that broadly raises this issue without any clear resolution:
I'm sure I am not the only one to ponder these questions, as these issues seem to me enormously important for anyone contemplating a non-trivial encoding project, well before any commitment is made to a particular schema, workflow, or combination.
Thanks in advance for any robust discussion on any or all of the above.