Hi All,
Just to add a few more data points, here’s how we currently do things at the SCTA (http://scta.info).
First, here’s a short write up of some modelling decisions: http://lombardpress.org/2016/08/09/surfaces-canvases-and-zones/
This post is basically a description of why I separate the concept of a Manifestation Surface from an Item Surface and from a IIIF Canvas, and then of how I link them together.
Then, at the TEI level, our editors embed the Manifestation Surface id into milestone elements, like so: <pb ed="#S" n="2-r"/> and <surface n="2-r">.
All surfaces are recorded as RDF resources that a client can de-reference using the information embedded in the TEI, as seen above. See for example http://scta.info/resource/sorb/2r.
In the end, this means that when a user is reading a TEI text, a click event on a milestone element triggers a request to the RDF triple store for the corresponding surface. From there, the SPARQL query looks up the default ISurface, and from that the default Canvas ID. Since, at present, we cannot rely on canvases in the IIIF world to be "de-referenceable" themselves, I ingest all canvas information into the triple store as well. The SPARQL query thus continues from the Canvas ID to the Service ID of the image itself, which is then retrieved and displayed.
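The chain of lookups just described can be sketched as a single SPARQL query. To be clear, the predicate names below (sctap:hasISurface, sctap:hasCanvas, sctap:hasImageService) are hypothetical placeholders, not the actual SCTA ontology terms:

```python
# Sketch only: the predicate names are HYPOTHETICAL stand-ins for whatever
# the SCTA ontology actually defines.
def surface_to_service_query(surface_uri):
    """Walk surface -> default item surface -> canvas -> image service."""
    return f"""
PREFIX sctap: <http://scta.info/property/>
SELECT ?service WHERE {{
  <{surface_uri}> sctap:hasISurface ?isurface .
  ?isurface sctap:hasCanvas ?canvas .
  ?canvas sctap:hasImageService ?service .
}}"""

query = surface_to_service_query("http://scta.info/resource/sorb/2r")
```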
See for example: http://scta.lombardpress.org/text/lectio1. Click on the link “S2ra” (or any other folio marker) and you should see the corresponding image appear on your screen retrieved from distributed libraries all over the world via IIIF. If you select the paragraph menu at the end of any paragraph and then select “Manuscript Images”, the same query is happening, but this time coordinate regions of the target zone are being used to request only the desired region of interest from the IIIF server. (This coordinate information is originally stored in the surface element in the TEI header, but gets converted to RDF when the text gets crawled and aggregated.)
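A region-of-interest request like the one behind the "Manuscript Images" menu follows the IIIF Image API's {region}/{size}/{rotation}/{quality}.{format} URL pattern. The endpoint below is a made-up example, not an actual SCTA service:

```python
# Build a IIIF Image API URL for a rectangular region of interest.
# Region syntax is x,y,w,h in pixels, per the IIIF Image API spec.
def iiif_region_url(endpoint, x, y, w, h, size="full", rotation=0,
                    quality="default", fmt="jpg"):
    """Request only the rectangle (x, y, w, h) from a IIIF image server."""
    return f"{endpoint}/{x},{y},{w},{h}/{size}/{rotation}/{quality}.{fmt}"

url = iiif_region_url("https://images.example.org/iiif/ms-2r", 120, 340, 800, 95)
# -> https://images.example.org/iiif/ms-2r/120,340,800,95/full/0/default.jpg
```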
That, at least, is what I currently do.
Thoughts and questions welcome.
Jeff Witt
On 6/29/17, 12:00 AM, "TEI (Text Encoding Initiative) public discussion list on behalf of TEI-L automatic digest system" <[log in to unmask] on behalf of [log in to unmask]> wrote:
There are 4 messages totaling 565 lines in this issue.
Topics of the day:
1. IIIF and facs (and TEI) (3)
2. IIIF and facs
----------------------------------------------------------------------
Date: Wed, 28 Jun 2017 22:11:35 +0900
From: KANZAKI Masahide <[log in to unmask]>
Subject: Re: IIIF and facs (and TEI)
Hello all,
I have a small experiment that connects TEI/XML data and images via
IIIF, which might be of interest to you.
Linked First Folio [1] allows users to search words in Shakespeare's
plays and to reach a page that contains the results as well as a
facsimile image of the page. It uses TEI and IIIF from the
Bodleian First Folio [2] to associate a word/phrase in the XML with
the page image.
It does not directly use the facs attribute values in the Bodleian TEI.
Instead, it uses a pre-generated mapping between a page range in the TEI
and an image resource in the IIIF manifest, expressed as a Web Annotation, e.g.
{
  "id": "p152",
  "type": "Annotation",
  "label": "Hamlet: Act 1, Scene 1, p152",
  "body": {
    "id": "nn4v",
    "format": "application/tei+xml",
    "source": "http://firstfolio.bodleian.ox.ac.uk/download/xml/F-ham.xml",
    "selector": {
      "type": "RangeSelector",
      "startSelector": {
        "type": "XPathSelector",
        "value": "//pb[@n='152']"
      },
      "endSelector": {
        "type": "XPathSelector",
        "value": "//pb[@n='153']"
      }
    }
  },
  "target": "http://iiif.bodleian.ox.ac.uk/iiif/image/e6ad69d4-9b90-4afc-a32d-d4af0889f1b8/full/full/0/default.jpg"
}
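A client could resolve such a RangeSelector by walking the TEI in document order between the two <pb/> milestones. The snippet below is only a toy sketch of that idea: the sample text is invented and flattened, and the real First Folio TEI is namespaced and far larger:

```python
import xml.etree.ElementTree as ET

# Invented, flattened sample; real TEI would need namespace handling.
tei = ET.fromstring(
    "<text>"
    '<pb n="151"/><l>out of range</l>'
    '<pb n="152"/><l>To be, or not to be</l><l>that is the question</l>'
    '<pb n="153"/><l>after the range</l>'
    "</text>"
)

def lines_on_page(root, start_n):
    """Collect <l> elements after //pb[@n=start_n], up to the next <pb/>."""
    lines, inside = [], False
    for el in root:  # document order over the flat siblings
        if el.tag == "pb":
            inside = el.get("n") == start_n
        elif inside and el.tag == "l":
            lines.append(el.text)
    return lines

print(lines_on_page(tei, "152"))
# -> ['To be, or not to be', 'that is the question']
```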
I hope this is relevant to the discussion.
best regards,
[1] http://www.kanzaki.com/works/ld/firstfolio
[2] http://firstfolio.bodleian.ox.ac.uk/
2017-06-27 21:26 GMT+09:00 Stutzmann Dominique
<[log in to unmask]>:
> Dear Peter, Georg, and all,
>
> since Georg invited me to contribute, here I share some thoughts on several
> connected issues, esp. formats (TEI, PAGE, IIIF), granularity, text and
> image, text as annotation and its visualisation, and software engineering.
>
>
> 1) Formats for linking text and image
>
> a) TEI and annotation coordinates
>
> In several projects, colleagues from linguistics and palaeography, including
> Alexei Lavrentiev and me, have felt the need to link images and (analyzed)
> text closely at the level of words and characters.
>
> This led us to specify a stand-off TEI format to deal with <facsimile> and
> <text>. The format developed in the Oriflamms project is described here (in
> French):
> (part 1) http://oriflamms.hypotheses.org/1442
> (part 2) http://oriflamms.hypotheses.org/1510
>
> The main principles are that
> - the texts are encoded in TEI (or teiCorpus > TEI) > text and tokenized
> with <c> and <w> with @xml:id
> - the corresponding <facsimile> and <zone> declarations are in separate
> files, with @xml:id
> - there is one file per image linking the @xml:id from the textual content
> (at character, word, line, column, page level) to the graphic content.
>
> These <facsimile> declarations are stored in one file per image (in a
> distinct folder) and we create <zone> elements for page (as stated in the
> discussion, you may have several pages reproduced on one image), columns,
> line, word and character or punctuation. A word can cross a
> page/column/line break. A character can cross a word boundary (this is
> quite rare, but it happens, e.g. an st ligature across two words).
>
> Several corpora in this format are on the project's GitHub instance:
> https://github.com/oriflamms (start with
> https://github.com/oriflamms/Test_Fontenay/).
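The stand-off principle described above can be sketched as follows. The element names echo TEI, but the ids, coordinates, and the @start linking attribute are invented for illustration; the actual Oriflamms link files differ:

```python
import xml.etree.ElementTree as ET

# Text and facsimile live in separate documents, joined only by shared ids.
# All ids and coordinates here are invented.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

text_doc = ET.fromstring(
    '<body><w xml:id="w1">In</w> <w xml:id="w2">principio</w></body>'
)
facs_doc = ET.fromstring(
    "<facsimile>"
    '<zone start="#w1" ulx="10" uly="20" lrx="60" lry="45"/>'
    '<zone start="#w2" ulx="70" uly="20" lrx="210" lry="45"/>'
    "</facsimile>"
)

# Index the tokenized text by xml:id.
words = {w.get(XML_NS + "id"): w.text for w in text_doc.iter("w")}

def zone_for(word_id):
    """Return the bounding box of the zone pointing at the given word id."""
    for z in facs_doc.iter("zone"):
        if z.get("start") == "#" + word_id:
            return tuple(int(z.get(k)) for k in ("ulx", "uly", "lrx", "lry"))
    return None

print(words["w2"], zone_for("w2"))
# -> principio (70, 20, 210, 45)
```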
>
>
> b) PAGE and TEI
>
> The PAGE format is dedicated to describing what is on a page. It does have
> some structured information that TEI cannot render in the same structured
> way. This fact derives naturally from an image-oriented format vs. a
> text-oriented one. The description of the layout has a very fine level of
> granularity in PAGE: for example there are attributes @colourDepth or
> @bgColour to give the information about "The colour bit depth required for
> the region" or "The background colour of the region".
> One can transfer this type of information into TEI, but often only in a
> non-structured or non-explicit way.
>
> For example, PAGE's @indented in RegionType may correspond to:
> - @rend="indent" at different possible levels: the level of layout (<pb/> or
> <cb/>, though applying @rend='indent' to these elements would be an abuse)
> or the level of textual analysis (<p> or <l>); or, more neutrally, since
> PAGE is neutral and does not provide any analysis, converting a block should
> create <ab> rather than <p>.
> - if there is a fully aligned text: shorter lines at the beginning of
> paragraphs in TEI text + facsimile
>
> The main difference, from an intellectual perspective, is that PAGE is used
> to store data from HTR or OCR, so any part of "understanding" has to be
> encoded additionally. For example, the reading order is implicit in TEI but
> explicit in PAGE (<ReadingOrder>).
>
> As a matter of fact, in all instances of PAGE files that I have seen, there
> was no information that we could not transfer straightforwardly from one
> format to the other without using unstructured <desc> elements. This could
> require additions to TEI (while remembering what "T" stands for in "TEI").
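As a hedged sketch of the conversion direction discussed here, a PAGE-like TextRegion might be mapped to a neutral TEI <ab> rather than a <p>, since PAGE carries no textual analysis. The element and attribute names below are simplified from the real PAGE schema:

```python
import xml.etree.ElementTree as ET

# Simplified, un-namespaced stand-in for a PAGE TextRegion.
page_region = ET.fromstring(
    '<TextRegion id="r1" indented="true">'
    "<TextEquiv><Unicode>In principio erat uerbum</Unicode></TextEquiv>"
    "</TextRegion>"
)

def region_to_ab(region):
    """Map a region to a neutral TEI <ab>, carrying @indented as @rend."""
    ab = ET.Element("ab")
    if region.get("indented") == "true":
        ab.set("rend", "indent")
    ab.text = region.findtext(".//Unicode")
    return ab

ab = region_to_ab(page_region)
print(ET.tostring(ab, encoding="unicode"))
# -> <ab rend="indent">In principio erat uerbum</ab>
```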
>
> c) IIIF and TEI
> As evidenced in Ben Brumfield's excellent contribution, one might find it
> hard to see the upside of a very verbose format that stores very small bits
> of information without being able to encode them in the full sense of the
> word, that is, to analyze them. One strength is that IIIF (like PAGE) makes
> it possible to make the order of text elements explicit (via annotations).
>
> 2) Granularity and big data: consequences for alignment and visualization
>
> As mentioned above, my colleagues and I are working on the text as image at
> word and character level. This level of granularity has consequences for the
> format and the software we use. Indeed, manually annotating several hundred
> thousand characters is very time-consuming and challenging, and the
> coordinates of the zones have to be modified. The Oriflamms format described
> above does not make use of @facs but only uses @xml:id and links. This helps
> keep the textual analysis in one file and the graphic analysis in other
> files and systems.
>
> The consequence for software engineering is that we have
> - a routine in the TXM software to prepare a corpus from an edited text. At
> this stage, the routine requires a @facs on <pb/> to indicate the beginning
> of a new image, but you can then have several pages on one image;
> - a software tool to align TEI-encoded texts with images and to visualize,
> correct and validate the results in linear and tabular form. This software
> is open source (https://github.com/Liris-Pleiad/oriflamms), and an .exe
> version is available.
>
> In the more recent European project HIMANIS, we have used TEI editions to
> feed HTR (Handwritten Text Recognition) systems and to provide our scholarly
> community with an indexed corpus of 147 medieval manuscripts in Latin and
> French. The result is a giant index in which you can search and set
> parameters about word confidence. Each word region on an image may be
> indexed with several recognition hypotheses (typically ten), each having a
> confidence level.
> If you haven't seen it yet, please have a look at (search engine)
> http://prhlt-kws.prhlt.upv.es/himanis/ and (instructions)
> http://himanis.hypotheses.org/105 (and please, don't forget to validate or
> reject the hits that are found). The project is not finished and we will add
> a lot of things, but, by now, you can search for words and spot them with
> coordinates on the image.
> From a data modeling point of view, the result of Key Word Spotting is
> typically an "annotation" in the sense of IIIF. There is no stringent
> reading order (in the workflow, there is a line recognition step and key
> words are typically spotted on the lines, but having a false line
> segmentation does not prevent key word spotting from being accurate). For
> "sequence search", we assume a top to bottom, left to right reading order,
> and "graphic proximity" on the page, but this is not a "phrase" search. User
> feedback on KWS results is annotation on annotations. The beta interface
> does not provide a visualisation of all annotations, but IIIF, despite being
> very verbose, would be a "natural" format for exchanging those annotations
> on images.
>
> References:
> (on the software)
> http://ieeexplore.ieee.org/document/6981046/?reload=true&arnumber=6981046
> (on the purpose of the research)
> https://www.cairn.info/revue-document-numerique-2013-3-page-81.htm
> (on the alignment)
> http://dh2015.org/abstracts/xml/STUTZMANN_Dominique_From_Text_and_Image_to_Histor/STUTZMANN_Dominique_From_Text_and_Image_to_Historical_R.html
>
> 3) Data conversion and software
>
> A software tool like Transkribus (https://transkribus.eu/Transkribus/),
> developed by READ, in which the University of Valencia (Spain) is a partner,
> can deal with the PAGE format and TEI very effectively and can export
> correctly from one format to the other.
>
> Implementing IIIF to visualize all annotations is an obvious target. But,
> going back to the discussion, we also wish to provide one linear transcript
> (the string built from the sequence of hypotheses with the best confidence),
> reintegrated into TEI to allow for correction, validation and semantic
> encoding, representing the text as a text, and not only as a conundrum of 40
> automated annotations with confidence levels plus one or several annotations
> from (human) scholars or users, with or without reading order. From a
> logical perspective, it is not the same to identify, let's say, a quote in
> the text and to mark a sequence of canvases as being a quote. To me, a
> sequence of canvases is not a meaning; it is graphic content that can be read.
>
> With some of the same partners, plus the Library of Poitiers and Teklia, and
> with some funding from Biblissima, we want to make our developments in HTR
> IIIF-compliant. That is: to produce text transcriptions (from HTR) in TEI
> format, as said above, with <text>, <facsimile> and /links/ (even if the
> result is less efficient than in PAGE or IIIF for that matter, because it
> opens the way to linguistic and paleographic analysis); then to publish them
> as a manifest and annotations for IIIF (each transcribed word or character
> being the content of an "annotation" on a particular "canvas" of the image);
> to use the IIIF API to present the results and collect feedback and
> corrections; and, keeping the needed ids, to re-nest the results into TEI
> files (tokenized at word/character level).
>
> So, wrapping up, both in TEI and IIIF there are ways to link at a finer
> level of granularity than the page. If I understand correctly, one could
> probably do everything in IIIF that has been done in TEI, by annotating a
> sequence of annotations at word or character level and marking this sequence
> as being "tei:p" or "tei:l" etc., but I am not sure that it would be a great
> benefit for either community. In our projects, we have gone in both
> directions: starting from TEI editions to create data on images, and
> starting from image analysis to create textual content. As a paleographer
> working on both text and image, I am really convinced of the need to have
> text analysis as well as annotations. The proposed strategy would be to use
> each format for what it is most useful for, and to implement automated
> mechanisms to let our formats communicate in a seamless way; working at the
> finer level of granularity probably makes this easier.
>
> Best regards,
> Dominique
>
>
> Le Samedi 24 juin 2017 20h00, "Robinson, Peter" <[log in to unmask]> a
> écrit :
>
>
> Time for a little context, I think.
>
> The IIIF community is large, growing, and multifaceted (sound familiar,
> anyone?). For some time now, several of us (beginning with Domhnall Ó
> h’Éigheartaigh, Patrick Cuba, myself and various others) have been looking
> at how IIIF and complex texts might play together. This group now includes
> (among others) John Bryant, Ben Brumfield (whose post on this list sparked
> this discussion), Jeffrey Witt, John Howard, Rafaelle Vigilante and Nick
> Laicona. Many of us were at the recent IIIF conference in Rome, where we
> presented a series of ruminations on the potential (great!), technical
> issues (multiple) and possible strategies (far too many) on how we might
> link complicated texts, typically referencing information extending far
> beyond the page-based model of IIIF, with IIIF.
>
> No firm answers yet. Anyone who wants to join our group as we wrestle with
> all this, please email any one of us (in the distribution list on this
> email). I can imagine that some time in the future (the November TEI members
> meeting?) the TEI itself might want to look at linkages betwixt TEI and
> IIIF.
>
> Peter
>
> On Jun 23, 2017, at 9:38 PM, Christian-Emil Smith Ore <[log in to unmask]>
> wrote:
>>
>>
>> Hi
>> We are looking for a decent viewer for the facsimiles of Henrik Ibsen's
>> manuscripts, whose texts were all transcribed as TEI XML documents by the
>> large project Henrik Ibsen's Writings, mostly in the 1990s. It is clear that
>> something like the Universal Viewer (http://universalviewer.io/) may do the
>> job. This is a IIIF thing. I studied the IIIF specification before I read
>> Ben's report from the IIIF conference. It is an easy match to view the
>> facsimiles, but it is harder to add advanced (meta)data outside the simple
>> open-annotation universe. I read Ben's restored Vatican talk and also the
>> notes indicating Peter Robinson's view. A text is not a series of pages. In
>> any case, I assume that it is easy to link from some viewer of
>> TEI-XML-encoded text to an instance of the Universal Viewer, but maybe not
>> so easy the other way round. The question is whether the data model in IIIF
>> is well suited for modelling texts in the way TEI recommends.
>> I will be interested in participating in a discussion about this.
>>
>> Best
>> Christian-Emil
>>
>> _____________________
>> From: TEI (Text Encoding Initiative) public discussion list
>> <[log in to unmask]> on behalf of Martin Holmes <[log in to unmask]>
>> Sent: 18 June 2017 19:27
>> To: [log in to unmask]
>> Subject: Re: IIIF and facs
>>
>> Hi Ben,
>>
>> I'd say there's a great deal more you can do than simply using pb/@facs
>> to point at the highest-res image; the Representation of Primary Sources
>> chapter has examples of using <surface> and <zone> to link components of
>> a transcription to areas on an image, and of linking to multiple images
>> at different resolutions:
>>
>> <facsimile>
>>   <graphic url="page1.png"/>
>>   <surface>
>>     <graphic url="page2-highRes.png"/>
>>     <graphic url="page2-lowRes.png"/>
>>   </surface>
>>   <graphic url="page3.png"/>
>>   <graphic url="page4.png"/>
>> </facsimile>
>>
>> <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/PH.html#PHFAX>
>>
>> Cheers,
>> Martin
>>
>> On 2017-06-18 04:24 AM, Ben Brumfield wrote:
>>> Dear Colleagues,
>>>
>>> Two weeks ago, Patrick Cuba, John Howard, Peter Robinson, Jeffrey Witt
>>> and I organized a discussion session on Connecting Text and IIIF at the
>>> IIIF Conference at the Vatican. While we each have different
>>> perspectives expressed by our lightning talks, we agree on the need for
>>> the TEI community to be involved in conversations about modeling text in
>>> IIIF.
>>>
>>> My own talk, "Text Beyond Annotations" is online at
>>> http://content.fromthepage.com/text-beyond-annotations-at-iiif-vatican/
>>>
>>> I'd be interested in discussing best practices for linking from TEI
>>> documents to page facsimiles hosted on IIIF image services. At the
>>> moment I think that the only option we have is to insert a URL to a
>>> maximum-resolution image into the *facs* element of *pb*. I'd like to
>>> preserve that option for TEI viewers that don't support IIIF, but is
>>> there anything better we could do?
>>>
>>> Ben
>>>
>>> Ben W. Brumfield
>>> Partner, Brumfield Labs
>>> Creators of FromThePage <https://fromthepage.com/>
>
>
>
--
@prefix : <http://www.kanzaki.com/ns/sig#> . <> :from [:name
"KANZAKI Masahide"; :nick "masaka"; :email "[log in to unmask]"].
------------------------------
Date: Wed, 28 Jun 2017 23:26:14 +0100
From: Peter Flynn <[log in to unmask]>
Subject: Re: IIIF and facs
On 06/18/2017 12:24 PM, Ben Brumfield wrote:
[...]
> I'd be interested in discussing best practices for linking from TEI
> documents to page facsimiles hosted on IIIF image services. At the
> moment I think that the only option we have is to insert a URL to a
> maximum-resolution image into the *facs* element of *pb*. I'd like
> to preserve that option for TEI viewers that don't support IIIF, but
> is there anything better we could do?
I seem to have missed or misunderstood something in the ensuing
discussion. Admittedly, I am looking at this from the point of view of
an implementer, not an encoder, so I may have a different focus.
I am assuming that:
a) an image-set for a document is on a server;
b) each page-image is addressable by a unique URI;
c) the URI uses some kind of counting-token for each page,
eg page number, folio, sheet, frame, etc;
d) this token is part of the accepted scheme scholars use
for this document.
It is (IMHO) the business of the encoder to ensure that the relevant
milestones recognised by the user community as the canonical reference
method for each document are included in the TEI markup for the
document, so that users can find out where they are.
Then the technology (eg XSLT) that serves up search results can
trivially locate preceding::mls[1] or preceding::pb[1] or whatever for
any given hit, and form the URI for the image that by definition will
include the location in question.
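That lookup is easy to sketch outside XSLT as well. Here it is in Python over a toy document; the URI pattern is a hypothetical image-server addressing scheme, not any real library's:

```python
import xml.etree.ElementTree as ET

# Toy document: two pages, one hit on each.
doc = ET.fromstring(
    '<text><pb n="2r"/><p>first hit</p>'
    '<pb n="2v"/><p>second hit</p></text>'
)

def image_uri_for(root, hit, pattern="https://images.example.org/ms/{n}.jpg"):
    """Find the nearest preceding <pb/> for a hit element and form the URI."""
    last_pb = None
    for el in root.iter():  # document order
        if el is hit:
            break
        if el.tag == "pb":
            last_pb = el.get("n")
    return pattern.format(n=last_pb)

hit = doc.findall(".//p")[1]
print(image_uri_for(doc, hit))
# -> https://images.example.org/ms/2v.jpg
```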
This separates the two mechanisms, allowing the adoption of different
server technologies on either side in the future with minimal recoding.
It does, however, depend on the canonical reference (milestone) data being
encoded for the document. There are of course documents with more than one
such reference method, and many with none at all (though these presumably at
least have page numbers or folios; scrolls are a different problem); and it
depends on the creator of the image-set doing the same.
Do those two criteria present particular difficulties where IIIF image
hosting is concerned?
///Peter
------------------------------
Date: Wed, 28 Jun 2017 21:47:27 -0400
From: Ben Brumfield <[log in to unmask]>
Subject: Re: IIIF and facs (and TEI)
I'm delighted to see the interest from the TEI community in connecting/converting IIIF, TEI and related formats like PAGE, and have been following the discussion with interest.
I'd like to return to a more tactical question about TEI and the IIIF Image API. While TEI zones correspond well to IIIF regions, neither standard really requires us to use such subsets of a page image: facs can point to a whole page image, and a IIIF canvas's image resource will generally display the entire page of a manuscript. I'd like to know more about what I suspect will be the most common case -- associating a page transcript with a page facsimile, using facs to point to a IIIF-hosted document.
We can certainly point our facs attributes at a IIIF-compliant URL, but how do we indicate to a IIIF-aware TEI viewer that there is a IIIF image endpoint which can be used for deep zooming by a client like OpenSeadragon? I gather that the value of a facs attribute can refer to nearly anything, and need not be a URL. Is there a way to add IIIF-specific data to facs? Should that be better addressed by another attribute on pb?
I'm imagining that something basic like <pb facs="$ENDPOINT/full/full/0/default.jpg"> (which would work for a viewer unaware of IIIF) could be expanded along the lines of
<pb facs="$ENDPOINT/full/full/0/default.jpg" iiif="$ENDPOINT"> or perhaps <pb facs="$ENDPOINT/full/full/0/default.jpg; iiif=$ENDPOINT">, but this is really new territory for me, and I could use advice on existing practice.
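One reading of the first suggestion: @facs stays a plain image URL for non-IIIF viewers, and an IIIF-aware viewer recovers the endpoint by stripping the Image API suffix. A sketch of that heuristic, not an established convention:

```python
# Heuristic sketch: recognize a IIIF Image API URL by its canonical
# /full/full/0/default.jpg tail and recover the service endpoint from it.
IIIF_SUFFIX = "/full/full/0/default.jpg"

def iiif_endpoint_from_facs(facs):
    """Return the Image API base if @facs looks like a IIIF URL, else None."""
    if facs.endswith(IIIF_SUFFIX):
        return facs[: -len(IIIF_SUFFIX)]
    return None

print(iiif_endpoint_from_facs(
    "https://iiif.example.org/img/ms-12/full/full/0/default.jpg"
))
# -> https://iiif.example.org/img/ms-12
```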
Thanks,
Ben
------------------------------
Date: Thu, 29 Jun 2017 02:29:00 +0000
From: Martin Mueller <[log in to unmask]>
Subject: Re: IIIF and facs (and TEI)
An interesting thread. A good opportunity for someone with larger technical chops than I possess to write a digest along the lines of TEI and IIIF in 2017: the State of the Art. I know next to nothing about the underlying technologies, but I sense from conversations with librarians that things are on the cusp of moving. So quite a few readers of this list might appreciate a digest of this thread.
On 6/28/17, 8:47 PM, "TEI (Text Encoding Initiative) public discussion list on behalf of Ben Brumfield" <[log in to unmask] on behalf of [log in to unmask]> wrote:
> [...]
------------------------------
End of TEI-L Digest - 27 Jun 2017 to 28 Jun 2017 (#2017-148)
************************************************************