Grateful thanks to Peter, Syd and Martin for taking the trouble to answer, but I seem to have given everyone the impression that I want to transform a TEI text containing <c> tags into another text, or even two other texts.  That wasn't what I had in mind at all.

What I envisage is inputting a text containing <c> tags to a TEI-aware indexing or concordancing program.  Xaira is a program of this type, but when it is extracting indexing terms (tokenising) I have not been able to make it handle the <c> tags in the way I would expect "non-lexical characters" to be handled, even when it is told that the text is TEI-conformant, not just XML-conformant.

Briefly, a concordancing program (for example), written in a programming language, will read a text, extract each token (dropping non-lexical characters within it), note the token's offset within the text, and write a record to a file, which is then sorted alphabetically on the tokens.  The sorted file is then read back, and for each record we display the token (still without non-lexical characters) and, going back to the text, a segment from around the offset (this time retaining the non-lexical characters).  The output is a concordance, not another XML version of the text.
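
To make that concrete, here is a minimal sketch of the two passes in Python.  The token pattern, the set of padding characters, the fixed context width and the sample sentence are all my own assumptions, purely for illustration; a real program would sort on disk and do a great deal more.

    import re

    PADDING = set("-")   # characters declared non-lexical ("padding"), for illustration

    def tokens_with_offsets(text):
        # Pass 1: extract each token and note its offset in the raw text,
        # dropping padding characters from the token itself.
        for m in re.finditer(r"[\w-]+", text):
            token = "".join(ch for ch in m.group() if ch not in PADDING)
            yield token.lower(), m.start()

    def concordance(text, width=15):
        records = sorted(tokens_with_offsets(text))   # sort alphabetically on the tokens
        for token, offset in records:
            # Pass 2: display the token (still without padding) and a segment
            # of the original text around the offset (padding retained).
            left = text[max(0, offset - width):offset]
            right = text[offset:offset + width]
            print(f"{token:>12}  ...{left}{right}...")

    concordance("the self-same text, the very text")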

Concordance programs, which have been around for many decades, routinely handle non-lexical characters, which they call "padding".  The OCP manual (1979) puts it concisely: "padding letters will be printed but otherwise ignored".  With these programs, any character declared as padding (a hyphen, say) is so treated at *every* occurrence.
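
The rule is easy to state in code.  Again a Python sketch, and again the padding set and the word list are my own assumptions: the padding character is stripped when the sort or merge key is formed, but each word is printed exactly as it occurs.

    PADDING = set("-")

    def key(word):
        # "otherwise ignored": padding plays no part in sorting or merging
        return "".join(ch for ch in word if ch not in PADDING).lower()

    for w in sorted(["selfsame", "self-same", "Self-same"], key=key):
        print(f"{key(w):>10}  <-  {w}")   # one headword, three printed forms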

With TEI markup, we can declare each instance of the character individually to be non-lexical or not, which is something I need to be able to do.  But few concordance programs can handle TEI markup, other than by stripping out the tags altogether.

A TEI-aware concordance program would do "the right thing" with every tag, including <c>.  If "non-lexical character" means anything, the right thing with <c> must be to omit or include its content depending on the operation: tokenisation demands omission, display of context demands inclusion, at different points in the concordancing or indexing process.
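
To show what I mean, here is a minimal sketch of that behaviour for <c> inside <w>, using Python's standard XML parser.  The fragment is made up, and I am assuming that <c> is here being used to mark a single non-lexical character.

    import xml.etree.ElementTree as ET

    w = ET.fromstring("<w>self<c>-</c>same</w>")

    def index_form(w):
        # tokenisation: omit the content of every <c> child
        parts = [w.text or ""]
        for child in w:
            if child.tag != "c":
                parts.append(child.text or "")
            parts.append(child.tail or "")
        return "".join(parts)

    def display_form(w):
        # context display: include everything, <c> content and all
        return "".join(w.itertext())

    print(index_form(w))    # selfsame
    print(display_form(w))  # self-same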

The background to all this is that I have texts in non-TEI markup, and programs which index and retrieve them; an essential feature of these programs is that they take account of non-lexical characters.  I was considering writing a conversion from my own markup to TEI, with the object of making the texts more widely usable.  But unless there is a TEI construct for non-lexical characters, and off-the-shelf TEI-aware programs for indexing, concordancing, etc. that implement it, not only outside <w> but also within <w>, there would be little point in such a conversion.