Dear Elizabeth,

On Thu, Oct 11, 2012 at 6:38 AM, Elizabeth H Dow <[log in to unmask]> wrote:
>      Much of digital humanities production has traditionally involved a lot
> of hand work, especially as it relates to higher level markup. I would be
> interested in knowing what processes people have developed 'tools' to
> automate. I'd also appreciate pointers to public conversations or published
> materials about just what kinds of 'intelligence adding' we can automate and
> what processes will probably always take human intervention.

I'm not aware of any published materials addressing the principles of
automatability and automation, though I'd be curious to hear of any,
as it's an interest of mine. (Longish post follows.)

However, I think the principles are fairly well understood in
practice. For example, in the XSLT domain we make a distinction
between "down" and "up" conversion. The metaphor is a slope -- going
down is easy, going up isn't. Also, there are shallow slopes and
steeper ones. A down conversion is one in which all the information to
be represented in the target format is present in the source, and only
has to be manipulated in some fashion. Note that the target does not
have to be simpler than the source, as in this example:

source:
<url></url>

target 1:
<a href=""></a>

In this conversion, everything in the target is derivable from the
source, and although it's more elaborate, the conversion is completely
mechanical.

But consider this case:

target 2:
<a href=""></a>

Here, a bit of string manipulation has happened: the "http://" has
been chopped off the value in one of the places it appears in the
result. We may infer that this manipulation is rule-based and
automatable ("chop off the leading 'http://'"), but we don't know for
sure unless we actually have the spec.

The need to perform operations of this sort is what makes this,
potentially, an upconversion. (I say "potentially" because how steep
the up-slope is here depends on the complexity of the specified
operation, and it could be really trivial.)
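A rule like this is easy to state and easy to implement. A minimal sketch in Python (the URL here is invented for illustration; only the chop rule itself comes from the inferred spec):

```python
import xml.etree.ElementTree as ET

def url_to_anchor(url_el):
    """Target-2 style conversion: the 'href' keeps the full value of
    the 'url' element; the link text has any leading 'http://' removed."""
    value = url_el.text or ""
    a = ET.Element("a", href=value)
    a.text = value[len("http://"):] if value.startswith("http://") else value
    return a

src = ET.fromstring("<url>http://www.example.org/</url>")
print(ET.tostring(url_to_anchor(src), encoding="unicode"))
# -> <a href="http://www.example.org/">www.example.org/</a>
```

The point is not the code but that the manipulation is fully determined once the rule is written down.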

And then there's this case:

target 3:
<a href="">School of Library and
Information Science, Louisiana State University</a>

We'd probably say this wasn't automatable at all -- unless maybe we
had a lookup table from which we could get the value in the result
that isn't present in the source. (A list of resources with their web
sites, for example.)

Notice the "unless" -- which is critical. Even the first case is only
automatable -- is only a simple down-conversion -- if we assume a
specification such as we might infer from the example:

Target 1 specification:
  For a 'url' element, generate an 'a' element whose 'href' attribute
  is the value of the 'url' and whose value is also the value of the 'url'.

Target 2 specification:
  For a 'url' element, generate an 'a' element whose 'href' attribute
  is the value of the 'url' and whose value is a substring of the value
  of the 'url', modified by removing any leading substring 'http://'.

Target 3 specification:
  For a 'url' element, generate an 'a' element whose 'href' attribute
  is the value of the 'url' and whose value is provided by reference to
  the table of institutions with their web sites (provided).

All these are automatable.
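To make that concrete, here is a minimal Python sketch of all three specifications side by side. The URL and the lookup table are invented stand-ins for the example values; the three rules themselves are the ones stated above:

```python
import xml.etree.ElementTree as ET

# Illustrative lookup table for the Target-3 spec; the entry is invented.
INSTITUTIONS = {"http://www.example.org/": "Example Institute"}

def target1(value):
    # Spec 1: 'href' and link text are both the 'url' value, verbatim.
    a = ET.Element("a", href=value)
    a.text = value
    return a

def target2(value):
    # Spec 2: same 'href'; any leading "http://" removed from the link text.
    a = ET.Element("a", href=value)
    a.text = value.removeprefix("http://")
    return a

def target3(value, table=INSTITUTIONS):
    # Spec 3: same 'href'; link text comes from the (provided) lookup table.
    a = ET.Element("a", href=value)
    a.text = table[value]
    return a

for convert in (target1, target2, target3):
    print(ET.tostring(convert("http://www.example.org/"), encoding="unicode"))
```

Each function is trivial once its spec exists; the lookup table simply makes the third spec's external dependency explicit.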

What isn't automatable is a process without an explicit specification,
or a process whose specification is incomplete or ambiguous. (One
might call the latter sort a process without a specification, or a
process whose spec is only nominal and not actual.)

And indeed, most of the work in building automated systems is not in
implementation but in specification -- we have to write the spec,
usually based on prior knowledge, observation and inferences, and then
negotiate with stakeholders to see that it is correct.

Of course this process itself is not automatable, since although we
know its broad outlines, we can't ever know it in detail before we are
faced with the case.

It's also frequently an iterative process, since stakeholders
frequently don't know what they want until they see something that is
somewhat correct but not entirely.

And this speaks directly to your larger question: which processes
people have developed tools to automate.

In fact, here I think we have three categories -- there's an
in-between category of "partially automated", which is very important
and growing more so.

We have automated many processes for which we can develop specs. For
example, converting TEI into HTML, operational over a finite and known
data set.

We have semi-automated many processes for which we have developed
partial specs. For example, converting TEI into HTML, operational over
an unspecified and open-ended data set conformant to the TEI
Guidelines and valid to TEI-all.
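A common shape for such a semi-automated process is a converter that handles whatever the partial spec covers and flags everything else for a person, rather than guessing. A minimal sketch, with an invented tag mapping standing in for real TEI-to-HTML rules:

```python
import xml.etree.ElementTree as ET

# Hypothetical partial spec: only these elements are covered.
# The mapping is invented, not the real TEI-to-HTML rules.
HANDLED = {"p": "p", "hi": "em", "head": "h2"}

def convert(el, needs_review):
    """Convert specified elements mechanically; wrap anything else
    in a marked 'div' and queue its tag for human review."""
    if el.tag in HANDLED:
        out = ET.Element(HANDLED[el.tag])
    else:
        out = ET.Element("div", {"class": "unhandled"})
        needs_review.append(el.tag)  # human intervention required here
    out.text, out.tail = el.text, el.tail
    for child in el:
        out.append(convert(child, needs_review))
    return out

queue = []
doc = ET.fromstring("<p>see <hi>this</hi> and <foo>that</foo></p>")
print(ET.tostring(convert(doc, queue), encoding="unicode"))
print(queue)
```

The review queue is where the human work concentrates: each entry is, in effect, a gap in the spec waiting to be specified.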

We have not automated any processes for which we can't have specs. For
example, converting any XML into HTML, when we don't know in advance
anything about the XML other than that it's well-formed.

Certain interesting and hard problems, such as converting Word
documents into XML tag sets with established semantics (such as TEI or
NLM/NISO JATS), are starting to move from the "not automatable" to the
"semi-automatable" category.

In conclusion I note two things:

1. The middle category, as I said, is growing as we acquire more and
better tools. While the need to specify correctly does not diminish,
some problems and some categories of problems become easier because
the specs for some of their parts, in effect, become embedded in tools
to which we can commit up front. For example, if our spec can say
"correct the incoming HTML using HTML Tidy", we don't have to specify
what HTML Tidy will do. So certain problems, for practical purposes,
move out of category 3 and into category 2.

2. Nevertheless, the really interesting problems, including the
meta-problems such as "Design a TEI application capable of describing
X", will remain in category 3. Again, the problem here isn't in the
automation but in the specification.

This is a fascinating topic and I'd be happy to correspond with you
about it off-list.

Best regards,

Wendell Piez
XML | XSLT | electronic publishing
Eat Your Vegetables