
After thinking about some off-list responses to this query, I realize
now that I should have framed this issue more as a question about
search software than about specific textual features like lineation
and hyphenation. I don't have any experience configuring search
software, and it's not obvious to me whether indexing the content of  
certain elements while ignoring other elements is a common and  
straightforward configuration to make, or something only minimally or  
inconsistently supported. I'm speaking here of the more widely used  
open-source search products, not commercial ones. Is it reasonable to
assume that any searching/indexing software worthy of the name can be
instructed to ignore some elements, based on an XPath (like
choice/orig or orig[@type='eol']), while indexing others (like <reg>)?
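
To make the question concrete, here is a minimal sketch in Python of the behavior I have in mind. It is not how any particular search product is configured: it only uses the standard library's xml.etree.ElementTree to collect the text a hypothetical indexer would see. The names SAMPLE, SKIP, and indexable_text are my own, and the sample leaves out the character entities so that it parses without a DTD.

```python
import xml.etree.ElementTree as ET

# Simplified sample (assumption: entities such as &ldquo; omitted so the
# fragment parses without a DTD).
SAMPLE = """<p>By art is created that great Leviathan, called a
<choice>
 <orig type="eol">Common-<lb/>wealth</orig>
 <reg type="eol" subtype="removed">Commonwealth<lb/></reg>
 <reg type="eol" subtype="retained">Common-wealth<lb/></reg>
</choice>
or State (in Latin, Civitas) which is but an artificial<lb/>
man.</p>"""

# Elements whose content should be left out of the index.
SKIP = {"orig"}


def indexable_text(elem):
    """Return the text under elem, omitting the content of SKIP elements."""
    parts = []
    if elem.tag not in SKIP:
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            parts.append(indexable_text(child))
    # Tail text comes after the element's end tag, so it belongs to the
    # parent and is kept even when the element itself is skipped.
    if elem.tail:
        parts.append(elem.tail)
    return "".join(parts)


root = ET.fromstring(SAMPLE)
tokens = indexable_text(root).lower().split()
print(tokens)
```

With whitespace tokenization, both regularized forms end up in the token list while the two halves of the original line-broken form do not. Whether a real indexer can be told to do the equivalent through configuration alone is exactly what I am asking.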

Thanks,
Greg


On May 15, 2010, at 12:58 PM, Greg Murray wrote:

> For TEI texts derived from printed books, I'm trying to come up with  
> a better way (better than what I used in P4) to deal with end-of- 
> line hyphenation. I'm thinking that something originally transcribed  
> (by a keyboarding/encoding vendor) to retain original lineation:
>
> &ldquo;By art is created that great Leviathan, called a Common-<lb/>
> wealth or State&mdash;(in Latin, Civitas) which is but an  
> artificial<lb/>
> man.&rdquo;
>
> could be converted programmatically to this:
>
> &ldquo;By art is created that great Leviathan, called a
> <choice>
>  <orig type="eol">Common-<lb/>wealth</orig>
>  <reg type="eol" subtype="removed">Commonwealth<lb/></reg>
>  <reg type="eol" subtype="retained">Common-wealth<lb/></reg>
> </choice>
> or State&mdash;(in Latin, Civitas) which is but an artificial<lb/>
> man.&rdquo;
>
> (This assumes that we've added @type and @subtype to our TEI schema  
> for orig and reg.)
>
> It seems to me that this kind of markup circumvents the whole  
> problem of project editors needing to decide (on a laborious case-by- 
> case basis) which spelling is "correct," and it avoids guessing  
> which form users will search for. Instead, any self-respecting  
> search software should be able (by instructing it to do so, via  
> configuration) to index both regularized forms, while ignoring the  
> original form -- allowing the search software to find the word  
> whether the user searches for "commonwealth" or "common-wealth".  
> (For display, we would probably retain the original hyphenation and  
> lineation while hiding the two regularized forms.)
>
> Are these assumptions about indexing/searching correct? Do others on  
> this list have a preferred method for making EOL hyphenation  
> amenable to searching?
>
> Many thanks!
>
> Greg Murray
> University of Virginia Library
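
The programmatic conversion described in the quoted message could be sketched roughly as follows. This is only an illustration under simplifying assumptions: it treats the transcription as a plain string and uses a regular expression, so it would not cope with markup nested inside the hyphenated word, and it wraps every end-of-line hyphen without judging whether the hyphen is a real one. The names EOL_HYPHEN and wrap_eol_hyphens are hypothetical, and the @type and @subtype values are the ones proposed in the message.

```python
import re

# A word fragment, a hyphen, <lb/>, optional whitespace, then the rest of
# the word on the next line.
EOL_HYPHEN = re.compile(r"(\w+)-<lb/>\s*(\w+)")


def wrap_eol_hyphens(text):
    """Wrap each end-of-line hyphenated word in a <choice> element."""

    def build_choice(match):
        first, second = match.group(1), match.group(2)
        return (
            "<choice>"
            f'<orig type="eol">{first}-<lb/>{second}</orig>'
            f'<reg type="eol" subtype="removed">{first}{second}<lb/></reg>'
            f'<reg type="eol" subtype="retained">{first}-{second}<lb/></reg>'
            "</choice>"
        )

    return EOL_HYPHEN.sub(build_choice, text)


source = (
    "called a Common-<lb/>\n"
    "wealth or State which is but an artificial<lb/>\n"
    "man."
)
print(wrap_eol_hyphens(source))
```

Line breaks that do not follow a hyphen, such as the one after "artificial", are left untouched.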