On 04/03/18 14:59, Ciarán Ó Duibhín wrote:
> Grateful thanks to Peter, Syd and Martin for taking the trouble to
> answer, but I seem to have given everyone the impression that I want
> to transform a TEI text containing <c> tags into another text, or
> even two other texts. That wasn't what I had in mind at all.
I suspected that might be the case, hence my cagey wording.
WARNING: you need to use a fixed-width font like Courier to read the
examples I give below.
> What I envisage is inputting a text containing <c> tags
Can we be clear; do you mean a valid (or at least well-formed) TEI XML
document which allows character-level linguistic markup? Or do you mean
just a chunk of text with pointy brackets around the letter 'c'?
> to a TEI-aware indexing or concordancing program. Xaira is a program
> of this type, but, when it is extracting indexing terms (tokenising),
OK, another point of clarity needed. "Tokenising" in that sense may or
may not be the same thing as the operation performed by the XSLT2
function tokenize(). The XSLT2 function returns a sequence of atomic
values, split wherever the specified delimiter occurs. So
tokenize($string, ' ') when $string is the
sentence "All is discovered. Flee at once!" will return six words,
keeping the case and the punctuation. This may not be what Xaira means.
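To be concrete, that call is (XPath, fixed-width font again):

  tokenize('All is discovered. Flee at once!', ' ')
  (: yields ("All", "is", "discovered.", "Flee", "at", "once!") :)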
> I haven't been able to make it handle the <c> tags in the way which I
> might expect "non-lexical characters" to be handled, even when it is
> informed that the text is TEI-conformant, not just XML-conformant.
To be frank, I'd give up on what now appears to be an unsupported
utility if it can't do what you want. You just need to define, in
enough detail for (eg) XSLT2 to act on, exactly what you want done.
> Briefly, a concordancing program (for example), written in a programming
> language, will read a text, extracting each token
"Token" being what in this case? A word?
> (dropping non-lexical characters within it)
OK, those identified by the c element type, or a list of characters to skip?
> and noting the token's offset within the text,
Ah. That's an entirely different <insert your own cultural meme: mine is
a kettle of fish or a pair of sleeves>. Is the text normalised (all
multiple spaces and newlines converted to single spaces) first? Is the
presence of preceding non-lexical characters to be included in the
offset or not (presumably yes, otherwise it will never align)? And is
the additional space occupied by the TEI markup itself also to be taken
into account? Does the offset re-zero itself at points in the document
(eg start of new sections)?
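For reference, the normalisation I have in mind is what XPath's
normalize-space() does, and it is exactly why naive offsets drift:

  normalize-space('All   is   discovered.')
  (: returns 'All is discovered.' -- four characters shorter, so any
     offset computed against the normalised text no longer matches
     the raw text :)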
> and putting a record into a file,
What kind of record is this? A single line of unmarked characters? What
determines the start and end of a record?
> which is then sorted alphabetically on the tokens.
You mean the *content* of the record (presumably tokens with their
associated offsets) is sorted? Or the records themselves (on what)?
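If it's the records, the sort itself is the trivial part in XSLT2; a
sketch, assuming a hypothetical record element carrying token and
offset attributes:

  <xsl:for-each select="$records/record">
    <xsl:sort select="@token"/>
    <xsl:copy-of select="."/>
  </xsl:for-each>

(case order and collation are just further attributes on xsl:sort).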
> This sorted file is then read back, and for each record, we display
> the token (still without non-lexical characters)
The implication here is that 1 record = 1 token = 1 word. Is that
correct? In other words, would my earlier example, sorted, come out
like the KWIC list I give below?
> and, going back to the text, display a segment from around the offset
> (this time retaining the non-lexical characters). The output is a
> concordance, not another XML version of the text.
OK, now we are getting somewhere. This is called KWIC format (KeyWord In
Context), and was (is?) the standard output of text searches in the days
of unmarked text, and into SGML days (in the CELT project we used PAT
for searching SGML TEI P2; it was [a] blindingly fast, and [b] returned
KWIC). In the above example, with a span of 20 characters either side,
we would get
1. All:        ...ng is the sentence "All is discovered. Fl...
2. at:         ...is discovered. Flee at once!" will return...
3. discovered: ...he sentence "All is discovered. Flee at o...
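Producing that is a one-liner once the indexing pass has handed you
the pieces; a sketch, assuming variables $text, $token and a 1-based
$offset at which the token starts:

  <xsl:value-of select="concat('...',
      substring($text, $offset - 20, 20), $token,
      substring($text, $offset + string-length($token), 20), '...')"/>

(substring() truncates quietly at either end of $text, which is what
you want near the edges.)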
> Concordance programs, which have been around for many decades, routinely
> handle non-lexical characters, which they call "padding".
Normally you would define a list of these: comma, period, semicolon,
etc. I think what confused the issue was that you were giving an
alphabetic letter in the c element.
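In XPath terms such a list is just the second argument to translate()
(my list here is illustrative, not canonical):

  translate($token, '.,;:!?()', '')
  (: characters named in argument 2 with no counterpart in argument 3
     are deleted, which is exactly the padding-stripping you describe :)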
> With TEI markup, we can declare each instance of the character
> individually to be non-lexical or not, which is something I need to be
> able to do. But few concordance programs can handle TEI markup, other
> than by stripping out the tags altogether.
Right. But it doesn't sound terribly difficult, and XSLT2 is IMNSHO
ideal for the purpose.
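A minimal sketch of what I mean, assuming the TEI P5 namespace and
that every c element is non-lexical (tighten the match pattern if
only some of them are):

  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tei="http://www.tei-c.org/ns/1.0">

    <!-- tokenising view: suppress non-lexical characters -->
    <xsl:template match="tei:c" mode="tokenise"/>

    <!-- display view: keep them -->
    <xsl:template match="tei:c" mode="display">
      <xsl:value-of select="."/>
    </xsl:template>

    <!-- recurse through w (and whatever else matters to you) in the
         current mode; the built-in rules copy plain text in both -->
    <xsl:template match="tei:w" mode="tokenise display">
      <xsl:apply-templates mode="#current"/>
    </xsl:template>

  </xsl:stylesheet>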
> A TEI-aware concordance program would do "the right thing" with every
> tag, including <c>.
I suspect the definition of "the right thing" is different for every TEI
project of any significant magnitude. The CELT project has gazillions of
instances of the character-level element types used in linguistic markup
combined with the standard TEI features for editorial intervention,
semantic correction, lemmatisation and parallel readings, and physical
aspects like "the rest of the name has been gnawed by rats". And every
project has its own list of "weird stuff", like "we need lg within head
because some titles include fragments of poetry".
> If "non-lexical character" means anything, the right thing with <c>
> must be to omit or include the content depending on the operation.
> Tokenisation demands omission, display of context demands inclusion,
> at different points in the concordancing or indexing process.
Yep. All doable once "the right thing" has been defined.
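With a mode-based stylesheet like the sketch above, the two
operations are just two invocations over the same w element:

  <xsl:apply-templates select="." mode="tokenise"/>
    <!-- the index key: c content omitted -->
  <xsl:apply-templates select="." mode="display"/>
    <!-- the context form: c content included -->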
> The background to all this is that I have texts in non-TEI markup,
> and programs which index and retrieve them, an essential feature of
> which is to take account of non-lexical characters.
I haven't had to do this at corpus level for many years; I would be
surprised if someone hasn't already done this in XSLT2 for TEI.
> I was considering writing a conversion from my own markup to TEI,
> with the object of making the texts more widely usable.
That would be a very generous and public-spirited action.
> But unless there is a TEI construct for non-lexical characters, and
> off-the-shelf TEI-aware programs for indexing, concording, etc. that
> implement it, not only outside <w> but also within <w>, there would
> be little point in such a markup conversion.
Apart from Xaira I don't know of anything off-the-shelf. But as Syd
implied, handling the text is not the problem; the problem is defining
what needs to be done for every element type in the TEI schema/DTD that
you are using.