On 06/01/12 05:26, Mylonas, Elli wrote:
> I wish you all a happy 2012.   May all your documents parse!
>
> Note: inscription texts can have letter-accent combinations that don't
> normally appear in Classical Greek, but we'll burn that bridge when we
> come to it... :)
>
> Have any of you used or created a Solr tokenizer or filter that
> handles in-word markup?
>
> I have found a tokenizer+filter class called
> GreekLowerCaseFilterFactory. This apparently handles the ignoring of
> accents. However, it is likely to have been written for Modern Greek.
> Does anyone have any experience with the appropriate filters? Maybe
> you've written one? Am I on the right track?

We do quite an expensive transformation prior to importing into Solr.

One of the features of the transformation is that we duplicate every
paragraph containing a name that is in our authority control system.
Thus the fourth paragraph of
http://www.nzetc.org/tm/scholarly/tei-HydParl-t1-g1-t24.html appears
twice, once with the name "Mr. Isitt" and once with the name "Leonard
Monk Isitt". We also add "Leonard Monk Isitt" to a field for faceting.
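
As a rough illustration only (not our actual code), the duplication step
could look something like the sketch below. The authority table, the
function name and the field names "text" and "name_facet" are all made up
for the example, and plain string matching stands in for whatever name
markup the source documents really carry.

```python
# Hypothetical authority table: name as written -> authorised form.
AUTHORITY = {"Mr. Isitt": "Leonard Monk Isitt"}


def build_solr_doc(doc_id, paragraphs, authority=AUTHORITY):
    """Build a dict suitable for indexing, duplicating paragraphs that
    contain a known name and collecting authorised names for faceting."""
    text = []
    facets = []
    for para in paragraphs:
        # Always keep the paragraph as written.
        text.append(para)
        for surface, authorised in authority.items():
            if surface in para:
                # Add a second copy carrying the authorised form of the name.
                text.append(para.replace(surface, authorised))
                # Record the authorised form once in the facet field.
                if authorised not in facets:
                    facets.append(authorised)
    return {"id": doc_id, "text": text, "name_facet": facets}


doc = build_solr_doc("example-1", ["Mr. Isitt spoke.", "No names here."])
```

Searching the text field then matches either form of the name, while the
facet field only ever holds the single authorised form.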

I can't overemphasise the importance of normalisation for Solr faceting.

cheers
stuart
-- 
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/