On Mon, Jun 8, 2015, at 07:56, Eric Lease Morgan wrote:
> > On Jun 7, 2015, at 2:15 PM, Sebastian Rahtz <[log in to unmask]> wrote:
> > 
> > I don’t want to stop you having fun, Eric, but some of this is already done for you. From 
> > 
> >
> > 
> > you can download CSV and JSON versions of the whole metadata catalogue, and has a browsable sortable HTML table with all that data. Perhaps not all the data categories you want. The ability of jQuery datatables to cope with 60000 rows of data is rather awesome.
> The CSV file available from GitHub is very much like the one I created,
> and I have a question about the identifiers it contains. More
> specifically, is it possible to embed any of those identifiers (TCP,
> EEBO, VID, STC) into an actionable URL and get something meaningful back.
> Do they point to various incarnations of the texts? —Eric M., University
> of Notre Dame


The identifiers that we associate with EEBO TCP texts all mean
something, but some are machine-actionable and some are not.


1. The TCP identifier identifies the TCP file in all its incarnations.
Given a TCP ID of A69506, the original SGML version will be called
A69506.sgm, the P4 XML with header will be called A69506.headed.xml,
and so forth. Those versions of the files have not been mounted for
individual download, though if you had downloaded them all, this ID is
the way to retrieve them from your local repository.
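
As a rough illustration of that naming convention, here is a minimal
Python sketch. It assumes a local directory holding the downloaded
files; the function names and the "sgml" / "p4_headed" labels are
made up for the example, and only the two suffixes mentioned above are
included, since the others are not spelled out here.

```python
from pathlib import Path

# Suffixes taken from the two examples given in the message. Other
# versions ("and so forth") are not listed there, so none are guessed.
KNOWN_SUFFIXES = {
    "sgml": ".sgm",               # original SGML version
    "p4_headed": ".headed.xml",   # P4 XML with header
}


def tcp_filename(tcp_id: str, version: str) -> str:
    """Build the expected file name for one version of a TCP text."""
    return tcp_id + KNOWN_SUFFIXES[version]


def find_local_versions(tcp_id: str, repo_dir: str) -> list:
    """List every file under repo_dir whose name is '<tcp_id>.<anything>'.

    repo_dir is a hypothetical local directory of downloaded TCP files.
    """
    return sorted(p for p in Path(repo_dir).rglob(tcp_id + ".*") if p.is_file())
```

For example, tcp_filename("A69506", "sgml") gives "A69506.sgm", and
find_local_versions("A69506", "/path/to/tcp") would pick up whatever
incarnations of that text happen to be in the directory.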

Sebastian has indicated how to use the TCP ID in a URL to get his files.
(And I previously indicated how to use it in a script to download
the TCP P5 files.)

It may also be used to access the online versions of the files on the
TCP sites at Michigan and Oxford. Given a TCP ID A69506, this
URL will access the top-level page at the Michigan site:

and this URL will fetch the corresponding page on the Oxford Digital
Library site:

;cc=eebo;view=toc;idno=A69506.0001.001

Both of these pages are HTML generated in the traditional way from
indexed XML.

[One used to be able to pull down the corresponding page at the
PhiloLogic site (Chicago/Northwestern) using just the TCP ID, but I'm
not sure if that is still true. At least, I can't immediately see how
to do it if it is.]

2. The 'VID' is an image-set identifier for ProQuest's EEBO product. If
your institution is an EEBO member, and your VID is 94927, then the URL
for image 1 of image set 94927 is

or (in an old-fashioned shorthand that still mostly works):

Every EEBO page image may thus be uniquely identified by a combination
of VID and sequence number ("94927/1").
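
To make the "VID/sequence" pairing concrete, here is a small Python
sketch that parses and re-formats a reference such as "94927/1". The
class and function names are invented for the example, and it assumes
both parts are plain decimal integers, as in the example above; it says
nothing about how ProQuest itself handles these values.

```python
import re
from typing import NamedTuple


class PageImageRef(NamedTuple):
    vid: int   # image-set identifier
    seq: int   # sequence number of the image within that set


def parse_page_image_ref(text: str) -> PageImageRef:
    """Parse a 'VID/sequence' string such as '94927/1'."""
    match = re.fullmatch(r"(\d+)/(\d+)", text.strip(), flags=re.ASCII)
    if match is None:
        raise ValueError(f"not a VID/sequence reference: {text!r}")
    return PageImageRef(int(match.group(1)), int(match.group(2)))


def format_page_image_ref(ref: PageImageRef) -> str:
    """Turn a PageImageRef back into the 'VID/sequence' form."""
    return f"{ref.vid}/{ref.seq}"
```

So parse_page_image_ref("94927/1") yields PageImageRef(vid=94927,
seq=1), and a string with no slash raises ValueError.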

3. The EEBO ID (which in some of our metadata is called a "BIBNO") is
the ProQuest ID for the bibliographic item (i.e. for the catalog record
for the title, as opposed to the image set for the copy). Given an EEBO
ID of 12880700, this URL:

fetches the bibliographic record for this item from the ProQuest EEBO
site.

In many cases (but not all) these numbers are actually OCLC accession
numbers, and therefore may be used in OCLC (e.g. in OCLC Connexion) to
retrieve the OCLC version of the same record.

4. The STC numbers (sub-typed as Wing, Pollard & Redgrave, Evans, ESTC,
etc.) are all forms of short-title catalog or (in the case of ESTC)
full record catalog. Most of these are only human-actionable, unless
you happen to have an electronic copy of the catalog in question. The
ESTC number, however, can be used to access the ESTC version of the
bibliographic record for the item in question. I don't recall if this
can be embedded in a URL. Try it and see.


Sebastian indicates that his page counts are generated by counting
PBs. This can only be an approximation, since in many books the
same page may be referenced multiple times (i.e. one page may generate
multiple PB tags).
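
To show why the raw PB count overshoots, here is a minimal Python
sketch that compares the number of PB elements with the number of
distinct pages they point to. It assumes each PB carries an attribute
naming the page it refers to; the attribute name "REF", the tag name
"PB", and the sample markup are assumptions for the example, not a
description of Sebastian's script or of any particular TCP file.

```python
import xml.etree.ElementTree as ET


def count_page_breaks(xml_text: str, tag: str = "PB", ref_attr: str = "REF"):
    """Return (number of PB elements, number of distinct referenced pages).

    tag and ref_attr are assumptions; adjust them to the markup at hand.
    """
    root = ET.fromstring(xml_text)
    breaks = list(root.iter(tag))
    # PBs lacking the attribute are counted as breaks but not as pages.
    refs = {el.get(ref_attr) for el in breaks if el.get(ref_attr) is not None}
    return len(breaks), len(refs)


# Made-up sample: page "2" is referenced by two separate PB tags.
SAMPLE = (
    '<TEXT><PB REF="1"/><P>a</P>'
    '<PB REF="2"/><P>b</P>'
    '<PB REF="2"/><P>c</P></TEXT>'
)

total, distinct = count_page_breaks(SAMPLE)
```

On the sample, total is 3 and distinct is 2. For namespaced XML the
tag argument would need the namespace in ElementTree's "{uri}name"
form.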

Our headers also contain image counts (and I think Sebastian retained
those in his?). If he did, they too are an approximate surrogate for an
actual page count, since they count the total number of page images in
the image set, but (1) some of them may be duplicates; and (2) most but
not all of them (in EEBO at least) are two-up images. Some are one-up
(one page per image), and in a few cases a single broadsheet page may
be distributed over as many as a dozen images.

Paul Schaffner  Digital Library Production Service
[log in to unmask] |