> Much of digital humanities production has traditionally involved a lot of
> hand work, especially as it relates to higher level markup. I would be
> interested in knowing what processes people have developed 'tools' to
> automate. I'd also appreciate pointers to public conversations or published
> materials about just what kinds of 'intelligence adding' we can automate and
> what processes will probably always take human intervention.
Wendell gives a great introduction to the challenge of precisely
defining what we want our automated tools to do. His example is about
the two-part process of (1) identifying an entity, such as a URL, in a
stream of plain text, and (2) enriching it with either a link or a
description derived from some lookup method, such as a glossary.
This process is called "entity enrichment," and there are tools that
specialize in this; they will find URLs, dates, people, organizations,
etc. They typically use a blend of approaches. Finding a URL is a
simple pattern matching exercise. Finding dates requires a little
more sophistication, given the variety of forms for expressing dates,
not to mention languages; these tools often have ways of determining
what language a text is in. Finding abstract entities like people or
organizations requires natural language processing: algorithms that
split text into sentences, tokenize words, assign parts of speech, and
then classify the nouns as belonging to a given entity type.
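The simple end of that spectrum is easy to sketch. Here is a minimal
Python illustration of pattern matching for URLs and dates; the
regexes are my own rough approximations, not any real tool's rules:

```python
import re

# Rough, illustrative patterns -- production tools use far more
# robust rules and handle many more date forms and languages.
URL_RE = re.compile(r'https?://[^\s<>"]+')
DATE_RE = re.compile(
    r'\b(?:\d{1,2}\s+)?'                        # optional day first
    r'(?:January|February|March|April|May|June|July|'
    r'August|September|October|November|December)'
    r'(?:\s+\d{1,2},?)?\s+\d{4}\b'              # optional day, then year
)

text = "See http://www.tei-c.org for the Guidelines, revised 15 March 2011."
print(URL_RE.findall(text))   # finding URLs is simple pattern matching
print(DATE_RE.findall(text))  # dates already need more varied patterns
```

Even this toy date pattern shows why dates take more sophistication
than URLs: every added date form means another alternative in the
pattern, and other languages multiply the month names.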
For examples of such tools, paste a few paragraphs of text into
OpenCalais (http://viewer.opencalais.com/) or Stanford NLP
(http://nlp.stanford.edu:8080/corenlp/). Many NLP or entity
enrichment tools focus on a single language or a small number of
languages; one very multilingual commercial offering is Basis Tech's
Rosette Linguistics Platform.
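To make the contrast with pattern matching concrete, here is a toy
caricature of entity recognition: capitalized token runs become
candidate entities, classified against a tiny hand-made gazetteer.
The gazetteer and heuristic are invented for this example; real tools
replace them with statistical models trained on annotated corpora:

```python
import re

# Candidate entities: runs of capitalized tokens.
TOKEN = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")
# A tiny, invented gazetteer standing in for a trained classifier.
GAZETTEER = {"Stanford": "organization", "Wendell": "person"}

def toy_ner(text):
    results = []
    for match in TOKEN.finditer(text):
        candidate = match.group()
        first = candidate.split()[0]
        etype = GAZETTEER.get(first, "unknown")
        results.append((candidate, etype))
    return results

print(toy_ner("Wendell mentioned that Stanford offers a demo."))
```

The fragility of this sketch (sentence-initial words, names not in the
gazetteer) is exactly why real systems need full NLP pipelines.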
Having spent some time evaluating these tools, I think they are
promising only in that they can take "raw" text and give it a "crude
first pass" at adding metadata and structure, fairly analogous to how
OCR helps one derive a usable (or not) bit of text from an image. Just
as with OCR, if accuracy is a concern at all, humans always need to
review the output of such automated tools. Also, it's one thing to
identify a string of characters as being a "person"; it's another
thing to identify which person it is and establish a firm link to an
entry in a biographical database. It's up to you / your project to
decide whether it's more helpful to start with the crude first pass
and review the machine's choices, or whether it's better to just
enrich/annotate the text yourself from scratch.
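That second step, linking a recognized string to a specific record,
amounts to a lookup against an authority file. A minimal sketch, with
an invented authority file and a policy of flagging ambiguity for
human review rather than auto-resolving it:

```python
# Toy entity linking against a made-up biographical authority file.
# Record identifiers here are invented for illustration.
AUTHORITY = {
    "John Adams": ["person/john-adams-1735", "person/john-adams-1875"],
    "Abigail Adams": ["person/abigail-adams-1744"],
}

def link_entity(name):
    candidates = AUTHORITY.get(name, [])
    if len(candidates) == 1:
        return ("linked", candidates[0])
    if candidates:
        return ("ambiguous", candidates)  # a human must disambiguate
    return ("unknown", None)

print(link_entity("Abigail Adams"))  # -> ('linked', 'person/abigail-adams-1744')
print(link_entity("John Adams"))     # ambiguous: two records, needs review
```

The "ambiguous" branch is the whole problem in miniature: the machine
can narrow the choices, but establishing the firm link is human work.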
I think there is promise here for tools that take raw text, identify
parts of speech and/or entities, and facilitate the
review/correction/verification of the resulting text by humans.
Unfortunately I'm a bit out of date as to the current state of such
tools for assisting in the review. There may well be something out
there that does all of the above.
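The review workflow such a tool would support can be sketched very
simply: machine proposals sit in a queue until a human accepts or
rejects each one. The data model below is invented; a real tool would
present this in an editing interface:

```python
# Toy review queue: machine-proposed annotations are applied only
# after a human decision. The proposal format is invented.
def review(proposals, decisions):
    """Keep only the proposals a human has accepted."""
    accepted = []
    for proposal in proposals:
        if decisions.get(proposal["text"]) == "accept":
            accepted.append(proposal)
    return accepted

proposals = [
    {"text": "Abigail Adams", "type": "person"},
    {"text": "March", "type": "person"},   # a machine mistake
]
decisions = {"Abigail Adams": "accept", "March": "reject"}
print(review(proposals, decisions))
```

The design point is simply that the machine's output is treated as a
queue of suggestions, never as final annotation.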
(Googling for TEI and NLP led me to this exciting project:
- but the link to the demo doesn't seem to be accessible.)
As to publications, others may have better suggestions, but we can look
forward to the forthcoming issue of J-TEI, which will focus on
NLP (see the description here
http://markmail.org/message/52gwhti4kxfhwsbo); these papers will be
coming from the 2010 TEI conference, where NLP was a theme.
You can also find past discussions on this list at
Joseph C. Wicentowski, Ph.D.
Office of the Historian
U.S. Department of State