Thank you, Martin, for an excellent segue to close a cycle: I was
engaged in some of this janitor work when I started.
I was originally looking at the EEBO/TCP TEI docs to help me write a
sample TEI header for the GITenberg project, a git-based effort to
convert Project Gutenberg texts into a reusable format (see
https://groups.google.com/forum/#!forum/gitenberg-project). They're
currently trying to decide on a file format.
I have a partly done TEI header at
based on http://eco.canadiana.ca/view/oocihm.07539/7?r=0&s=1 /
http://www.gutenberg.org/ebooks/48562, which it doesn't look like I'm
going to have time to finish soon (due to the recent birth of my
child). If anyone else would like to complete the header and/or
otherwise engage with the GITenberg community, I'd encourage you to.
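For anyone picking it up, the rough shape I have in mind is below
(the structure is standard TEI P5; the values are placeholders, not
settled GITenberg decisions):

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title><!-- title from the PG catalogue record --></title>
        <author><!-- author, as given by PG/Canadiana --></author>
      </titleStmt>
      <publicationStmt>
        <publisher>GITenberg</publisher>
        <availability>
          <p><!-- Project Gutenberg license statement --></p>
        </availability>
      </publicationStmt>
      <sourceDesc>
        <bibl>
          <ref target="http://www.gutenberg.org/ebooks/48562">Project
            Gutenberg eBook #48562</ref>
        </bibl>
      </sourceDesc>
    </fileDesc>
  </teiHeader>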
...let us be heard from red core to black sky
On Fri, Mar 27, 2015 at 10:46 PM, Martin Mueller
<[log in to unmask]> wrote:
> This exchange between Stuart Yeates and Paul Schaffner has been very
> useful in drawing attention to some of the many curatorial problems that
> future TCP users will need to engage with, in differently collaborative ways, as
> the texts move from a small and tightly controlled production environment
> into an environment where people may live and do "as they like"
> (Aristotle's pithy definition of democracy). There is also an increasingly
> powerful and mobile technological environment that supports forms of
> circulation hardly envisaged when the encoding of the texts began. I
> recently spent about $40.00 on a 32 GB USB stick in the shape of a key for
> a padlock or similar thing. It is so small that it takes a while to
> retrieve it from the clutter in my pants pocket, but is only half full
> when all the TCP texts have been loaded on to it. And they all fit
> comfortably on an iPhone or iPad. Hamlet's table of memory in a new world.
> I've just come from a three-day work sprint in Berlin on TEI Simple,
> which has the TCP corpora as one of its main targets. There was quite a bit of
> discussion about how to enrich the teiHeader with "performance indicators"
> that will tell users what types of information can be extracted from this
> or that text. Tomorrow I'll give a talk at the RSA about collaborative
> curation of "Shakespeare His Contemporaries," a subset of EEBO texts. I'll
> moderate a session that includes a very interesting project at McGill that
> targets the EEBO corpus. A lot is happening "out there."
> I am fond of pronouns, and especially of a little jingle by Senator
> Russell Long, a famous chair of the Senate Finance Committee:
> Don't tax you and don't tax me,
> Tax that fellow behind the tree
> As good a way as any of describing a bad way of doing things. In terms of
> pronouns, "data prep," to borrow a term from house painting, boils down to
> this: Can (or should) "I" prepare "my" data in such a way that "you" or
> "they" can make ready use of them? There is a lot of useful work that
> needs doing. Some of it involves the header. Seemingly trivial and
> mechanical things (xml:ids etc) can make life easier. Different forms of
> Named Entity Recognition would greatly enhance the usefulness of the
> corpus, especially if they are moderately interoperable.
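> To make "moderately interoperable" concrete, one sketch (and only a
> sketch, not anything agreed upon): each project's NER pass could emit
> TEI name elements carrying shared authority identifiers, e.g.
>   <persName ref="http://viaf.org/viaf/96994048">Shakespeare</persName>
> so that names tagged by "me" and names tagged by "you" can at least
> be joined on the same identifiers.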
> A useful buzzword in the life sciences is "research data lifecycle
> management." Some years ago, Brian Athey, the chair of computational
> medicine and bio informatics at the University of Michigan talked about it
> at Princeton workshop and said
> 1) agile data integration is an engine that drives discovery
> 2) It¹s difficult to incentivize researchers to share data.
> A distinguished endocrinologist at Northwestern laughed when she heard
> this and said "researchers would rather share their toothbrushes than
> their data."
> A few months ago the New York Times ran a good article about the boring
> but necessary work of "data prep": "For Big-Data Scientists, 'Janitor
> Work' Is Key Hurdle to Insights."
> There is a lot of undone "janitor work" in the EEBO corpus that needs doing
> before the texts will unlock their full query potential. Managing the
> promise of Athey's first and the threat of his second statement in a
> half-way tolerable manner will be a big challenge for the data communities
> of Early Modern Studies. Nobody will do it for them. We'll need to find
> ways in which what "I" do to "my" data will be useful to others, and what
> others do to "their" data will be useful to me. Above all it needs a
> recognition that this is work that needs doing and that it needs to be
> done by the researchers who will work with, and generate insights from,
> appropriately curated data.
> Martin Mueller
> Professor emeritus of English and Classics
> Northwestern University
> On 3/27/15, 03:27, "Paul Schaffner" <[log in to unmask]> wrote:
>>> Thank you for the explanation. Is there an authoritative document on
>>> the EEBO process which can be linked to from the <encodingDesc> ?
>>Aside from the keying guidelines and their chaotic supplements
>>here: http://www.textcreationpartnership.org/docs/ ... probably not.
>>And even those only cover the transcriptional capture, saying nothing
>>about the source of the bibliographic information, the storage of
>>metadata, the various conversions to XML, etc. They do, however,
>>include a list of SDATA character entities and their recommended
>>equivalents for purposes of effective display and (alternatively)
>>lossless round-tripping. Maybe when we finish this project, we
>>will know enough to be able to say what we did, as well as what
>>we did wrong. ... The details of Sebastian's conversions to P5
>>are bound up in his Ant file and stylesheets, which should perhaps
>>also be linked to, at least indirectly.
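>>(Purely by way of illustration: even the partial docs could be
>>pointed to from the header, along the lines of
>>  <encodingDesc>
>>    <projectDesc>
>>      <p>Keyed and encoded per the TCP guidelines at
>>        <ref target="http://www.textcreationpartnership.org/docs/"/>.</p>
>>    </projectDesc>
>>  </encodingDesc>
>>and the SDATA distinction comes out in P5 roughly as follows: an
>>invented entity &pstroke; might be rendered as U+A751 in a display
>>copy, but kept losslessly as <g ref="char:pstroke"/> pointing at a
>><charDecl> entry in the header.)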
>>> I think I interpreted the placing of the documents on github as an
>>> invitation to improve them, or at least suggest ways in which they
>>> could be improved.
>>Yes indeed. Thank you. Didn't mean to suggest otherwise.
>>In fact, I have no personal stake in the P5 headers at all,
>>merely feeling defensive about the shortcomings of the source
>>on which they draw. Improvements welcome.
>>* * *
>>> In most presentational scenarios, xml:ids end up as the anchors to
>>> which systems and end users can link. Introduction of ids (in both the
>>> body and the header) would seem to encourage the creation and
>>> persistence of reliable anchors for fine-grained linking and analysis.
>>> To this end, I'd have put xml:ids on at least all <div>s and all
>>> free-standing <p>s (<p>s that are not direct children of a <div>).
>>Right, you want div-level IDs. Not unreasonable, I should think,
>>though maybe something difficult to coordinate among different
>>users with very different ideas of granularity. Some, for example,
>>are already adding IDs at the word-token level. Others are interested
>>only in drama (IDs on <sp>) or verse (on <l> and <lg>) etc.
>>Hard to know how to please them all. I think it fair to say that we
>>didn't put them in to begin with (even in the SGML) for the same
>>reason that we didn't put them into our (Michigan's) earlier text
>>projects like American Verse and the Corpus of Middle English, or
>>even (below the entry level) into the Middle English Dictionary,
>>namely that we operated within a tradition of light markup,
>>not all that far removed from the traditional library ideal of bulk
>>and unbiased presentation, or even the usual ignorant appeal
>>for 'plain text.' Adding IDs as hooks isn't in the same category as
>>interpretive markup, but it is ... messy, and the sort of thing we
>>traditionally left to after-market providers.
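>>For concreteness, Stuart's proposal might come out like this (the ID
>>scheme here is invented, not anything we have settled on):
>>  <body>
>>    <p xml:id="A12345-p1">A free-standing paragraph gets its own
>>      anchor ...</p>
>>    <div xml:id="A12345-div1" type="chapter">
>>      <p>... while a paragraph inside a div can be reached via the
>>        div's ID.</p>
>>    </div>
>>  </body>
>>The granularity question is then just how far down to push such
>>anchors.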
>>> I'm initially interested primarily [in the] header. A quick survey of a
>>> number of documents suggests a relatively small set of names used quite
>>> frequently that could be targets for intervention.
>>The only names I can think of that could be described as forming
>>a small set are those of the parties responsible for the keying,
>>editing, and publication of the e-texts (on the order of 100 people).
>>I take it you don't mean to include authors, publishers, etc.
>>of the original books? Which amount to an authority list for all of
>>early English print.
>>> Is there a TCP bestiary of unusual documents and corner cases to test
>>> one's assumptions against?
>>A sort of sampler of everything likely to appear anywhere? Including
>>all the nasty exceptions that are more than likely mistakes?
>>No. But I could think about how one might create such a thing.
>>Paul Schaffner | Digital Library Production Service
>>[log in to unmask] | http://www.umich.edu/~pfs/