It's coming this week; I was on vacation in May.
On 05/22/2012 11:59 PM, Stuart A. Yeates wrote:
> On Wed, May 23, 2012 at 1:28 AM, Peter MacDonald<[log in to unmask]> wrote:
>> We have a new repository here at our small college that uses
>> Islandora/Fedora as its delivery system and one of the collections consists
>> of 456 TEI files. We use Solr to index our TEI files, but it has been a
>> challenge to configure Solr to meet our delivery needs. You can see it at
>> work here:
>> So far we only index roleName, forename, surname, orgName, geogName,
>> geogFeat, and sic. We are also experimenting with full persName and
>> placeName elements, but those are not quite ready for primetime because our
>> TEI files are not yet 100% consistent in their encoding of these elements.
>> We eventually want to use the placeName elements in a mapping interface.
>> Anyway, I wonder if you would be willing to share the file you use for
>> configuring Solr for TEI. I've never seen it done by anyone but us and there
>> might be a more efficient way of doing it. I would be happy to send you mine
>> if you were interested.
> I'm not sure that our particular config will help you very much, but
> I'm happy to share what I learnt.
> * We use the same XSLT tool chain for Solr as for HTML, so we
> reuse the authority control, normalisation and page layout code we
> already have. We generate .solr files (the naive Solr XML format) from
> TEI XML files in an off-line manner; it takes about four weeks to do
> all our TEI, but we usually only have to redo changed TEI. The Solr XML
> files don't understand xml:lang or xml:id, alas.
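> For reference, the "naive" format is just Solr's XML update syntax: an
> <add> wrapping <doc> elements full of flat <field> values. A minimal
> .solr file (the field names here are invented for illustration) looks
> something like:
> <add>
>   <doc>
>     <field name="id">tei-example-001</field>
>     <field name="title">An example title</field>
>   </doc>
> </add>
> Every value is just a named string, which is why any xml:lang or xml:id
> information has to be baked in upstream by the XSLT.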
> * I put everything I could possibly think of into our Solr index.
> Less than a third of those facets are actually used in the live web
> search, but they're all there in the live index. I initially did this
> because I was unsure of what facets we wanted to expose (at that stage
> 'improve multilingual search' was as far as we'd got in terms of
> planning a UI). It's turned out to be very useful for building a
> handful of custom searches.
> * For proper names (i.e. any with authority control), we create two
> fields, so for ship we have:
> <field name="ship" type="string" indexed="true" stored="true"/>
> <field name="ship_label" type="string" indexed="true" stored="true"/>
> The first is based on the @key and the second on the
> authority-controlled preferred term for the ship. The name as written
> is added
> to the field which holds the full text. When we're building the UI,
> the nth ship goes with the nth ship_label. We use AACR2 rules for
> authority control, with the long term aim of aligning with other parts
> of the library. Solr is very adept at exposing the quality of our
> authority control and data normalisation.
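> So a single indexed page might carry parallel entries along these
> lines (the values are invented for illustration):
> <field name="ship">nzetc-ship-0042</field>
> <field name="ship_label">Endeavour (Ship)</field>
> and the UI just zips the two lists together positionally.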
> * We map <change> tags to:
> <field name="change" type="string" indexed="true" stored="true"/>
> This lets us query on particular changes which have been made to the
> text (there is no UI for this field, so it's inaccessible unless you
> can craft special URLs). I had grand plans to use this to implement a
> workflow tool, but it never happened, alas.
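> If you can craft the URL, a query against that field is just ordinary
> Solr request syntax, something like (the value is invented):
> /solr/select?q=change:"proofread"&fl=id,change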
> * The core of our multilingual search is done using:
> <dynamicField name="Text_*" type="text" indexed="true" stored="true"/>
> The '*' matches a language code ('en' 'mi' 'rar' 'rap' etc). No new
> config is needed when we add a small amount of a new language to the
> corpus. Thus in the files to index we see:
> <field name="Text_mi">whakatika</field>
> <field name="Text_en">straighten</field>
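> The 'text' field type behind that dynamicField is declared in
> schema.xml in the usual way; a plausible minimal version (the analyzer
> chain here is an assumption, not our actual config) would be:
> <fieldType name="text" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>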
> * Our default search is managed with:
> <solrQueryParser defaultOperator="AND"/>
> <copyField source="*" dest="all"/>
> This copies all fields in the record into a field called 'all', which
> is used as the default search field, with AND as the default operator.
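> The destination field itself also has to be declared; something like
> this (the exact attributes are an assumption):
> <field name="all" type="text" indexed="true" stored="false" multiValued="true"/>
> <defaultSearchField>all</defaultSearchField>
> multiValued matters because copying every field into one inevitably
> produces multiple values per document.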
> * We index documents (=TEI files), pages (=textual web pages on our
> site) and images (=image pages on our site) in different ways. The
> document is _just_ the stuff in the TEI header. The page is the full
> text + references in the text + the TEI header. The image is the image
> caption / text alternative + the TEI header.
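> If you're doing something similar, one simple way to keep the three
> kinds apart in a single index is a discriminator field (the name is
> invented here):
> <field name="doctype">document</field>
> which then gives you the facet you need to restrict a feed or search
> to documents only.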
> * We found it pretty trivial to create an RSS feed from the search
> results rather than an HTML file, and indeed this is our main way of
> serving ePubs. See the RSS link at the bottom of
> http://www.nzetc.org/tm/scholarly/facets/search . RSS feeds for ePub
> purposes generally need to be faceted to only include documents or you
> get duplicates.
> * Designing a UI to solr is surprisingly hard. Everyone has their own
> ideas about what the most important facets are and which small
> sub-collections are worthy of greater prominence. Our current UI
> silently ignores quite a bit of stuff (such as text in Russian or
> Greek, and ships) in favour of stuff we're currently being
> funded to produce (Pacifika ethnography, VUW history, etc.).
> * We made an explicit decision not to compete with Google. When we
> asked people what they wanted from search of our site, many of the
> suggestions we got back were easily doable with a simple Google
> search plus site:nzetc.org. We silently ignored those.