On 03/03/18 15:23, Ciarán Ó Duibhín wrote:
> May I repeat this request, hopefully more clearly.
>  
> I would like to locate any program (preferably for Windows) for making
> indexes, word lists, or concordances from TEI text, and which will
> interpret the <c> tag in the following way, which I hope is in
> accordance with its description as "non-lexical character":  the content
> of the <c> tag is to be dropped in extracting tokens, but is to be
> included in displaying segments of text. 
>  
> For example, the text "an b<c>h</c>ean" should yield tokens "an" and
> "bean", but should be displayed as "an bhean".

I appear to have missed your first post on this, sorry.

Can you please clarify "extracting tokens" vs "displaying segments of
text" a little more?

In your example, if the normalized character data content is tokenized
on a space, it yields the two tokens "an" and "bhean".

In XSLT2, it is fairly trivial to pre-parse the original content before
normalization in order to omit the c element, yielding "an" and "bean".
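
As a rough illustration of the two outcomes above, here is a minimal
sketch in Python rather than XSLT2 (a stand-in chosen only so the
example is easy to run). The helper name text_without_c is hypothetical,
and the sample assumes an un-namespaced <c>; in a real TEI file the tag
would carry the TEI namespace and the comparison would need adjusting.

```python
import xml.etree.ElementTree as ET


def text_without_c(elem):
    """Return the character data of elem, skipping the content of any <c>
    child (at any depth) but keeping the text that follows it."""
    parts = [elem.text or ""]
    for child in elem:
        # Assumption: un-namespaced <c>; real TEI would use
        # "{http://www.tei-c.org/ns/1.0}c" here.
        if child.tag != "c":
            parts.append(text_without_c(child))
        # The tail is the text after the child's end tag, always kept.
        parts.append(child.tail or "")
    return "".join(parts)


p = ET.fromstring("<p>an b<c>h</c>ean</p>")

# Full character data, as it would be displayed.
display = "".join(p.itertext())

# Tokenizing the full character data on whitespace.
display_tokens = display.split()

# Tokenizing after dropping the content of <c>.
index_tokens = text_without_c(p).split()

print(display)
print(display_tokens)
print(index_tokens)
```

The display string and the index tokens come from the same element, so
a concordance could in principle key on the second while showing the
first.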

But there is nothing to suggest that "an b<c>h</c>ean" is itself
contained in such a way as to create a token "an bhean" (for example, if
it were <p>an b<c>h</c>ean</p> rather than a paragraph containing a much
longer phrase of which "an b<c>h</c>ean" was merely part).

What you appear to want to do is easy with XSLT2 but it would be useful
to see a much larger example if you can post one (or send it privately).

///Peter
-- 
Peter Flynn | Human Factors Research Group | School of Applied
Psychology | 🏫 University College Cork | 🇮🇪 Ireland | ☎ +353 21 490
2609 | ✉ [log in to unmask] | 🌍 [log in to unmask]