> our criteria:
> - understand the XML documents (not just ignore the tags)
> - display search results with highlighting of some sort in file
> - run in Linux
> - basic functions of search engines
All this can be done by eXist, which is Open Source and GPL licensed.
The project site is a little confusing, since there are several modes of
deploying the core eXist engine, and the current documents don't clearly
distinguish between them.
The easiest mode to get going is to embed eXist within Cocoon. If you don't
have Cocoon already, installation under Linux or Win32 is very easy: all you
need is in the eXist downloadable package. If you are already running Cocoon
for some other purpose, integrating eXist into it is slightly trickier, but
can be done.
The second possibility is to connect to eXist to index documents, send
queries and receive results using XML-RPC, SOAP or even plain HTTP. eXist
can be run as a self-contained server that supports these protocols, thus
letting you use eXist with any programming language or applications
framework that supports one or more of them.
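To give a feel for the plain HTTP route, here is a minimal Python sketch that only builds the query URL, without contacting a server. The host, port, servlet path, collection name and the `_query` parameter name are all assumptions on my part and may well differ in your eXist version, so check them against the documentation shipped with your download.

```python
from urllib.parse import urlencode

# Assumed values: adjust to your own installation.
BASE_URL = "http://localhost:8080/exist/rest"  # hypothetical server address
COLLECTION = "/db/tei"                         # hypothetical collection path


def build_query_url(xpath, base_url=BASE_URL, collection=COLLECTION,
                    param="_query"):
    """Return a GET URL carrying the XPath expression as a query parameter.

    The parameter name is an assumption; it is passed in so that it can be
    changed to whatever your server actually expects.
    """
    return f"{base_url}{collection}?{urlencode({param: xpath})}"


url = build_query_url("//entry[contains(., 'chevaler')]")
print(url)
```

Any language that can issue an HTTP GET and read the XML that comes back could do the same; nothing here is specific to Python.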
The third possibility, a cut-down version of the first which actually
functions like the second, is to wrap eXist in a stand-alone Java servlet
(Jetty is supplied with the download package) operating on a dedicated port
and again call it via HTTP on the localhost from any programming environment
that can handle HTTP.
The fourth possibility is to treat eXist as an embeddable library for a
dedicated Java app. This is the most flexible and powerful approach, but
unlike the others it presupposes you have ongoing access to a fairly high
level of Java programming expertise.
The one drawback in some people's eyes is that eXist, as the above makes
plain, does require Java to operate (though in all but the last mode I
described the Java components can be used as a "black box", with no
knowledge of that language needed to use eXist). On the other hand, this
also means that eXist will run on any platform that has a current Java
runtime.
eXist, though still under active development, is in my view
production-ready, and indeed currently provides the search functionality
over TEI-XML documents within the Anglo-Norman Dictionary project.
You would probably need to cost in a little technical assistance to marry up
eXist to your data, including creation of scripts to index your documents,
but once that's done there need be no significant ongoing costs.
After a lot of evaluation, the only other contender that is still in the
running for the role of core search engine within the AND project in the
longer term is Sleepycat's dbxml.
This does not require Java (though it has a full Java API). But it is not
yet production ready, and is at present simply a low-level C++ library
requiring considerable development skills and knowledge to wrap it up into a
functioning search engine. This means that although it is also Open Source
(though *not* GPL -- it is part of the Berkeley DB system and subject to the
same licensing conditions) there may be a considerable overhead in funding
local programming expertise to deploy it.
dbxml does not as yet provide the match-tagging functionality which eXist
has built in: that would need to be provided by locally programmed
post-processing of results. Moreover, its internal indexation is
trigram-based, whereas eXist indexes on discrete lexical tokens (or "words"
as less pretentious folk have been known to call them), making it more
readily amenable to concordancing and related uses.
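The difference between the two indexing styles is easy to see in a toy example. The Python sketch below is only an illustration of the general idea, not a description of how either dbxml or eXist actually builds its indexes; the sample texts and function names are made up.

```python
import re
from collections import defaultdict


def word_index(docs):
    """Map each whole word (lexical token) to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


def trigram_index(docs):
    """Map each run of three characters to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for i in range(len(lowered) - 2):
            index[lowered[i:i + 3]].add(doc_id)
    return index


docs = {1: "le chevaler vint", 2: "la dame"}  # invented sample data
words = word_index(docs)
trigrams = trigram_index(docs)

print(sorted(words))          # keys are whole words
print(sorted(trigrams)[:5])   # keys are three-character fragments
```

With the word index the keys are already the headwords a concordance needs; with the trigram index the words would have to be recovered from the documents afterwards.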
dbxml has a highly standards-conformant implementation of XPath as its query
language. eXist also uses XPath as the basis of its query language, but its
implementation of XPath is currently less complete than dbxml's, though it
adds certain extensions which make it extremely valuable for querying
precisely the kinds of documents most TEI projects use. XPath is excellent
at specifying location via document structure, but (in 1.0 at least) is not
particularly rich in functions for matching text content. eXist adds a full
regular expression library which can be used within its extended XPath
queries to match both element content and attribute values.
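To show why the combination is useful, here is a small Python sketch of the same idea: a path expression picks out elements by structure, and a regular expression then filters on element content or on an attribute value. This is not eXist's query syntax, which I am not reproducing here; it uses the limited XPath subset in Python's standard library, and the sample document and the `select` helper are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample data, loosely dictionary-shaped.
SAMPLE = """<text>
  <entry id="e1"><form>chevaler</form><sense>knight</sense></entry>
  <entry id="e2"><form>chival</form><sense>horse</sense></entry>
  <entry id="x3"><form>dame</form><sense>lady</sense></entry>
</text>"""


def select(root, path, pattern, attribute=None):
    """Return elements found by `path` whose text (or attribute) matches."""
    regex = re.compile(pattern)
    hits = []
    for elem in root.findall(path):
        if attribute:
            value = elem.get(attribute, "")
        else:
            value = "".join(elem.itertext())
        if regex.search(value):
            hits.append(elem)
    return hits


root = ET.fromstring(SAMPLE)
forms = select(root, ".//form", r"^ch[ei]v")                   # element content
entries = select(root, ".//entry", r"^e\d+$", attribute="id")  # attribute value

print([f.text for f in forms])
print([e.get("id") for e in entries])
```

In eXist the filtering happens inside the query itself rather than in a second pass like this, which is the point of its extensions.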