I wish you all a happy 2012. May all your documents parse!
We are implementing the latest version of our US Epigraphy project
using Solr as the indexer/search engine, and would like to use the
Solr tokenizing/filtering mechanism to properly tokenize highly
marked-up text, as well as to handle classical Greek.
Note: inscription texts can have letter-accent combinations that don't
normally appear in Classical Greek, but we'll burn that bridge when we
come to it... :)
Have any of you used or created a Solr tokenizer or filter that
handles in-word markup?
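To frame the question a bit: one approach I could imagine is stripping the markup with a char filter before tokenization, so that tags inside a word don't split it into separate tokens. A minimal sketch of what that might look like in schema.xml (the field type name is just a placeholder, and this assumes the markup is XML/HTML-like):

```xml
<!-- Sketch only: strip inline markup before tokenizing -->
<fieldType name="text_inscription" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- removes tags so in-word markup doesn't break tokens apart -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Whether a char filter is flexible enough for epigraphic markup (which may carry meaning we want to keep) is exactly the kind of thing I'm hoping someone has experience with.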
I have found a filter class called
GreekLowerCaseFilterFactory, which apparently lowercases and
ignores accents. However, it was likely written for Modern Greek.
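For reference, wiring that filter into an analyzer chain would look something like this (field type name is a placeholder; I haven't yet verified how it behaves on polytonic accents):

```xml
<!-- Sketch: GreekLowerCaseFilterFactory in a schema.xml analyzer chain -->
<fieldType name="text_greek" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lowercases Greek text and normalizes certain Greek characters -->
    <filter class="solr.GreekLowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```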
Does anyone have any experience with the appropriate filters? Maybe
you've written one? Am I on the right track?
Any information on this subject would be of great help!
Center for Digital Scholarship