David Sewell wrote:

>   <doc>
>      <name>Joe Smith</name>
>      <!-- lots of stuff -->
>      <name>Joe
>      Smith</name>
>   </doc>

[...]

> using the Pathan "xgrep" program:
>
>    xgrep '//name[normalize-space(text())="Joe Smith"]' foo.xml
>
> returns both <name> nodes. Or using the search function
> in XMLStarlet:
>
>    xml sel -t -m "//name[normalize-space(text())='Joe Smith']" -c . foo.xml
>
> does the same thing.
>

The trouble is that this doesn't scale well for large volumes of data
(whether huge docs or numerous smaller ones). Both Pathan and XMLStarlet (or
rather the libraries they call) have to build a DOM tree before they can
fire off the XPath expression at it. And though libxml2 does this with
miraculous swiftness and Xerces is catching up, it's still a non-negligible
overhead, especially on the memory-starved hardware which is still often
regarded as good enough for Humanities users.

For examples like the one David provides, there might be some mileage in a
SAX-based approach, since the XML-savviness needed to perform David's
example grep is confined to knowing how to combine the content of
proximate text nodes and recognise "ignorable" whitespace.
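
A minimal sketch of that idea in Python (purely illustrative; the element
name and target string are David's, everything else is mine) shows how
little XPath machinery is really involved: accumulate adjacent text nodes,
then apply the equivalent of normalize-space():

    import sys
    import xml.sax

    class NameGrep(xml.sax.ContentHandler):
        """Report <name> elements whose normalized text equals the target."""
        def __init__(self, target):
            super().__init__()
            self.target = target
            self.buf = None               # None outside <name>, a list inside

        def startElement(self, name, attrs):
            if name == "name":
                self.buf = []

        def characters(self, content):
            if self.buf is not None:
                self.buf.append(content)  # proximate text nodes pile up here

        def endElement(self, name):
            if name == "name":
                # normalize-space(): trim ends, collapse internal whitespace
                if " ".join("".join(self.buf).split()) == self.target:
                    print("matched a <name> node")
                self.buf = None

    xml.sax.parse(sys.argv[1], NameGrep("Joe Smith"))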

But if the targets had to be located via a more complex specification of
their hierarchical location and/or their attribute values, then an xmlgrep
built on a SAX parse would need to maintain massive amounts of state, so
relying on well-written DOM parsing in libxml2 and similar libraries is a
safer bet, and the overhead seems inescapable.
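
To see where the state comes from, extend the toy handler above to a single
hard-wired path with one attribute predicate (the path
//entry[@lang='la']//gloss is invented purely for the illustration):
already it must keep a stack of every open element and its attributes, and
each further axis or predicate means keeping more of the tree alive:

    import sys
    import xml.sax

    class PathGrep(xml.sax.ContentHandler):
        """Grep for "est" inside //entry[@lang='la']//gloss, nothing else."""
        def __init__(self):
            super().__init__()
            self.stack = []   # (name, attributes) of every open element

        def startElement(self, name, attrs):
            self.stack.append((name, dict(attrs)))

        def endElement(self, name):
            self.stack.pop()

        def characters(self, content):
            in_latin_entry = any(n == "entry" and a.get("lang") == "la"
                                 for n, a in self.stack)
            # crude substring test: "best" would match too; it's a sketch
            if in_latin_entry and self.stack[-1][0] == "gloss" \
                    and "est" in content:
                print("match:", content.strip())

    xml.sax.parse(sys.argv[1], PathGrep())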

Except that, for my purposes at least, I have found an escape, or rather a
pragmatic evasion. Take a task like: locate and display, within only those
dictionary entries whose lemma begins with "ca", all Anglo-Norman citations
accompanied by parallel glosses containing the Latin word "est" taken from
the Latin version of the Hebrew Psalter (but ignoring such glosses from the
Vulgate Psalter, or any other Latin source, and excluding all the instances
of "est" in Anglo-Norman passages). The sort of thing I need to do several
times before breakfast (OK, as Lou remarked yesterday, it's very hot here
in the UK at the moment).
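
Against purely hypothetical element and attribute names (the real AND
markup is rather more involved, and the source attribution certainly isn't
spelled @source="HebrewPsalter"), the XPath has roughly this shape:

    //cit[@lang="xno"]
         [ancestor::entry[starts-with(form/orth, "ca")]]
         [gloss[@lang="la"][@source="HebrewPsalter"][contains(., "est")]]

(And yes, contains() as written would also hit "best"; the real query has
to be fussier about word boundaries.)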

The tool that does this for me is eXist (i.e. its core command-line client
with its embedded interface to the backend store, not the spiffy but
somewhat bloated Cocoon-wrapped multi-tier package in its website shop
window). And it does so (from a repository with currently around 10,000
articles and several hundred other miscellaneous documents of varying sizes)
in around 300ms. Devising the XPath expression and typing it into the
console takes much longer than executing it and reviewing the results. Of
course, this is cheating, in the sense that eXist isn't grepping the
documents at all, it's scanning its own indices created when the documents
were stored into its repository. (So it's actually more like a
structure-aware counterpart of the now-defunct command-line glimpse than a
pure grep.) All the resource-intensive DOM building and walking has been
done beforehand and the results "frozen" in eXist's internal maps. But the
result is the same. I couldn't manage (and specifically I couldn't manage
the tortuous multi-lingual mixed-content data of the Anglo-Norman
Dictionary) without it.

[Those with vivid XPath imaginations will realise that to specify the above
target the backward-facing ancestor axis is required, which wasn't supported
in the last stable release of eXist (0.9.1). But that support is in the CVS
or the development snapshot, and will be in 1.0 Real Soon Now.]

Michael Beddow