Elizabeth,

Just a quick second to Wendell's point:

> What isn't automatable is a process without an explicit specification,
> or a process whose specification is incomplete or ambiguous. (One
> might call the latter sort a process without a specification, or a
> process whose spec is only nominal and not actual.)

and to point out that, automated or not, 'intelligence adding' will not be 
consistent or useful to others without a specification that is as complete 
and unambiguous as possible (there are always theoretical or practical limits).

Like Wendell, I am also interested in this conversation, here or elsewhere.

Hope you are having a great day!

Patrick

On 10/11/2012 10:10 AM, Wendell Piez wrote:
> Dear Elizabeth,
>
> On Thu, Oct 11, 2012 at 6:38 AM, Elizabeth H Dow <[log in to unmask]> wrote:
>>       Much of digital humanities production has traditionally involved a lot
>> of hand work, especially as it relates to higher level markup. I would be
>> interested in knowing what processes people have developed 'tools' to
>> automate. I'd also appreciate pointers to public conversations or published
>> materials about just what kinds of 'intelligence adding' we can automate and
>> what processes will probably always take human intervention.
> I'm not aware of any published materials addressing the principles of
> automatability and automation, although I'd be curious to know as it's
> an interest of mine. (Longish post follows.)
>
> However, I think the principles are fairly well understood in
> practice. For example, in the XSLT domain we make a distinction
> between "down" and "up" conversion. The metaphor is a slope -- going
> down is easy, going up isn't. Also, there are shallow slopes and
> steeper ones. A down conversion is one in which all the information to
> be represented in the target format is present in the source, and only
> has to be manipulated in some fashion. Note that the target does not
> have to be simpler than the source, as in this example:
>
> source:
> <url>http://sites01.lsu.edu/wp/slis/</url>
>
> becoming
>
> target 1:
> <a href="http://sites01.lsu.edu/wp/slis/">http://sites01.lsu.edu/wp/slis/</a>
>
> In this conversion, everything in the target is derivable from the
> source, and although it's more elaborate, the conversion is completely
> automatable.
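>
> A minimal XSLT 1.0 sketch of that rule (only an illustration -- the
> element names are taken from the example above, nothing else is assumed)
> might be:
>
>   <xsl:stylesheet version="1.0"
>       xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>     <!-- Target 1: the url value becomes both the link target and the link text -->
>     <xsl:template match="url">
>       <a href="{.}">
>         <xsl:value-of select="."/>
>       </a>
>     </xsl:template>
>   </xsl:stylesheet>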
>
> But consider this case:
>
> target 2:
> <a href="http://sites01.lsu.edu/wp/slis/">sites01.lsu.edu/wp/slis/</a>
>
> Here, a bit of string manipulation has happened: the "http://" has
> been chopped off the value in one of the places it appears in the
> result. We may infer that this manipulation is rule-based and
> automatable ("chop off the leading 'http://'"), but we don't know for
> sure unless we actually have the spec.
>
> The need to perform operations of this sort is what makes this,
> potentially, an upconversion. (I say "potentially" because how steep
> the up-slope is here depends on the complexity of the specified
> operation, and it could be really trivial.)
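>
> As a sketch (again assuming the rule really is "chop off any leading
> 'http://'"), the extra string manipulation amounts to one function call
> in XSLT 1.0:
>
>   <!-- Target 2: the link target keeps the full url; the link text drops the scheme -->
>   <xsl:template match="url">
>     <a href="{.}">
>       <xsl:choose>
>         <xsl:when test="starts-with(., 'http://')">
>           <xsl:value-of select="substring-after(., 'http://')"/>
>         </xsl:when>
>         <xsl:otherwise>
>           <xsl:value-of select="."/>
>         </xsl:otherwise>
>       </xsl:choose>
>     </a>
>   </xsl:template>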
>
> And then there's this case:
>
> target 3:
> <a href="http://sites01.lsu.edu/wp/slis/">School of Library and
> Information Science, Louisiana State University</a>
>
> We'd probably say this wasn't automatable at all -- unless maybe we
> had a lookup table from which we could get the value in the result
> that isn't present in the source. (A list of resources with their web
> pages.)
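>
> If such a table did exist, the lookup itself would be routine. A sketch,
> assuming a hypothetical external document 'institutions.xml' whose
> 'institution' entries carry the web address in a 'url' attribute and the
> name as their content:
>
>   <!-- Target 3: the link text comes from a lookup table, not from the source -->
>   <xsl:variable name="table" select="document('institutions.xml')"/>
>   <xsl:template match="url">
>     <xsl:variable name="u" select="string(.)"/>
>     <a href="{$u}">
>       <xsl:value-of select="$table/institutions/institution[@url = $u]"/>
>     </a>
>   </xsl:template>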
>
> Notice the "unless" -- which is critical. Even the first case is only
> automatable -- is only a simple down-conversion -- if we assume a
> specification from the example:
>
> Target 1 specification:
>    For a 'url' element generate an 'a' element, whose 'href' attribute
>    is the value of the 'url' and whose value is also the value of the 'url'.
>
> Target 2 specification:
>    For a 'url' element generate an 'a' element, whose 'href' attribute
>    is the value of the 'url' and whose value is a substring of the value
>    of the 'url', modified by removing any leading substring 'http://'.
>
> Target 3 specification:
>    For a 'url' element generate an 'a' element, whose 'href' attribute
>    is the value of the 'url' and whose value is provided by reference to
>    the table of institutions with their web sites (provided).
>
> All these are automatable.
>
> What isn't automatable is a process without an explicit specification,
> or a process whose specification is incomplete or ambiguous. (One
> might call the latter sort a process without a specification, or a
> process whose spec is only nominal and not actual.)
>
> And indeed, most of the work in building automated systems is not in
> implementation but in specification -- we have to write the spec,
> usually based on prior knowledge, observation and inferences, and then
> negotiate with stakeholders to see that it is correct.
>
> Of course this process itself is not automatable, since although we
> know its broad outlines, we can't ever know it in detail before we are
> faced with the case.
>
> It's also frequently an iterative process, since stakeholders
> frequently don't know what they want until they see something that is
> somewhat correct but not entirely.
>
> And this speaks directly to your larger question: which processes
> people have developed tools to automate.
>
> In fact, here I think we have three categories -- there's an
> in-between category of "partially automated", which is very important
> and growing more so.
>
> We have automated many processes for which we can develop specs. For
> example, converting TEI into HTML, operational over a finite and known
> data set.
>
> We have semi-automated many processes for which we have developed
> partial specs. For example, converting TEI into HTML, operational over
> an unspecified and open-ended data set conformant to the TEI
> Guidelines and valid to TEI-all.
>
> We have not automated any processes for which we can't have specs. For
> example, converting any XML into HTML, when we don't know in advance
> anything about the XML other than that it's well-formed.
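>
> The shape of a semi-automated conversion shows up directly in the code.
> A sketch of a partial TEI-to-HTML spec (the element choices here are
> purely illustrative): handle the elements the spec covers, and flag
> anything it does not yet cover for a human to look at:
>
>   <xsl:stylesheet version="1.0"
>       xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>       xmlns:tei="http://www.tei-c.org/ns/1.0">
>     <!-- elements the spec covers -->
>     <xsl:template match="tei:p">
>       <p><xsl:apply-templates/></p>
>     </xsl:template>
>     <xsl:template match="tei:hi[@rend='italic']">
>       <em><xsl:apply-templates/></em>
>     </xsl:template>
>     <!-- elements the spec does not cover: pass through, but report them -->
>     <xsl:template match="tei:*">
>       <xsl:message>unspecified element: <xsl:value-of select="name()"/></xsl:message>
>       <xsl:apply-templates/>
>     </xsl:template>
>   </xsl:stylesheet>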
>
> Certain interesting and hard problems, such as converting Word
> documents into XML tag sets with established semantics (such as TEI or
> NLM/NISO JATS), are starting to move from the "not automatable" to the
> "semi-automatable" category.
>
> In conclusion I note two things:
>
> 1. The middle category, as I said, is growing as we acquire more and
> better tools. While the need to specify correctly does not diminish,
> some problems and some categories of problems become easier because
> the specs for some of their parts, in effect, become embedded in tools
> to which we can commit up front. For example, if our spec can say
> "correct the incoming HTML using HTML Tidy", we don't have to specify
> what HTML Tidy will do. So certain problems, for practical purposes,
> move out of category 3 and into category 2.
>
> 2. Nevertheless, the really interesting problems, including the
> meta-problems such as "Design a TEI application capable of describing
> X", will remain in category 3. Again, the problem here isn't in the
> automation but in the specification.
>
> This is a fascinating topic and I'd be happy to correspond with you
> about it off-list.
>
> Best regards,
> Wendell
>

-- 
Patrick Durusau
[log in to unmask]
Former Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau