I went through yesterday's very lively exchange about "html from tei" and
put all the postings into the memo below in the order in which they
appeared in my email program. I read Sebastian's initial question in the
context of the TEI Simple project, for which I will be drafting
documentation. So this is a useful exercise, at least for me.
Sebastian sent me an XML file that extracts the 28,000 structured notes
from 2,500 TCP texts. I spent a little time browsing through them. A large
number of them are like the note below from Erasmus' Apothegms and weigh
in below Twitter size:
<note place="marg" anchored="true">
<p>A mannes fame is the chief odour y^t he smelleth of.</p>
<p>Contynually to smelle of sweet odours is an eiuill
sauour in a manne.</p>
</note>
There is a question in my mind whether the <p> elements in notes of this
type are actually necessary and whether one could treat them as if they
were instances of
<note place="marg" anchored="true">
A mannes fame is the chief odour y^t he smelleth of. Contynually to smelle
of sweet odours is an eiuill sauour in a manne.
</note>
In which case they would seem to fall in a category of marginal notes for
whose rendering there appear to be plausible solutions. I will spend part
of my weekend working on some form of a complexity test. What are the
types (and tokens) of notes that pose difficult challenges and are there
ways of algorithmically identifying them by length and structure? I'll
tell you my answer when I have it.
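One possible shape for such a complexity test, as a minimal Python sketch. The set of block-level element names, the 140-character threshold, and the names local_name and classify_note are all assumptions made for illustration; nothing here is fixed by the TCP texts or by TEI Simple.

```python
import xml.etree.ElementTree as ET

# Assumption: TEI elements treated as "block-level" when found inside a note.
BLOCK_TAGS = {"p", "list", "table", "lg", "l", "q", "floatingText"}

def local_name(el):
    """Strip a namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return el.tag.rsplit("}", 1)[-1]

def classify_note(note, max_chars=140):
    """Return 'simple' for a short note whose only blocks are <p>, else 'complex'."""
    blocks = [local_name(el) for el in note.iter()
              if el is not note and local_name(el) in BLOCK_TAGS]
    # Collapse runs of whitespace before measuring the length of the note text.
    text = " ".join("".join(note.itertext()).split())
    if all(tag == "p" for tag in blocks) and len(text) <= max_chars:
        return "simple"
    return "complex"

sample = ET.fromstring("""<note place="marg" anchored="true">
<p>A mannes fame is the chief odour y^t he smelleth of.</p>
<p>Contynually to smelle of sweet odours is an eiuill
sauour in a manne.</p>
</note>""")
print(classify_note(sample))
```

Notes classified as "simple" could then have their <p> wrappers flattened and be rendered inline; where to draw the threshold would have to come from looking at the corpus.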
if this isn't your area of interest, stop reading now.
I want some ideas about rendering marginal notes in HTML.
EEBO TCP has _many_ examples of very complex structured notes
with paragraphs, lists, tables etc within them. So naturally I expect to
make the <note> into a <div> with CSS properties to make it float left or
right. Fine so far. But the <note> in the TEI XML occurs inside (shall we
say) a <hi> element inside a <p>, which I would naturally be rendering as
a <span>. But a <div> inside a <span> is not allowed in the tiny brain of
HTML, because it doesn't realize it's being floated off.
So what's a boy to do?
a) forget about HTML validity and let the browsers just do it. problem:
epub checking fails, and it's possible some epub renderers will therefore
b) make everything, everywhere, be a <div> and sort it out with
display-style in CSS
c) make everything, everywhere, be a <span> and sort it out with
display-style in CSS
d) use HTML5 <aside>, but no, that's a flow-level element too
e) move the <div> outside the <span>, up enough levels until it's valid.
but that means it loses context
f) split the <span> (and any ancestor <span>) in two, insert the <div>,
and restart the <span>
g) scream and shout and kick
I would do e), i.e. move the <div> outside the <span>, up enough
levels until it's valid. Put the notes into a separate container, create
hyperlinks between text and note, and if display in context or something
Ah, you mean I could put all the notes at the end, and float them into the
margin at run time using JS. That's not a bad idea.
Semantically purer too!
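A rough Python sketch of what option e) in this "separate container" form might look like. It assumes the notes have already been rendered as <div class="note"> elements; the function name move_notes_to_end, the class names "noteref" and "notes", and the id scheme "note1", "note2", ... are all invented for illustration. Floating the notes back into the margin at run time is not covered here.

```python
import xml.etree.ElementTree as ET

def move_notes_to_end(body):
    """Replace each <div class="note"> with a link; collect the notes in a trailing <div>."""
    parent_of = {child: parent for parent in body.iter() for child in parent}
    notes = [el for el in body.iter("div") if el.get("class") == "note"]
    container = ET.Element("div", {"class": "notes"})
    for n, note in enumerate(notes, start=1):
        parent = parent_of[note]
        link = ET.Element("a", {"href": f"#note{n}", "class": "noteref"})
        link.text = f"[{n}]"
        link.tail = note.tail          # keep the text that followed the note in place
        note.tail = None
        position = list(parent).index(note)
        parent.remove(note)
        parent.insert(position, link)  # the link marks the original point of attachment
        note.set("id", f"note{n}")
        container.append(note)
    body.append(container)
    return body

src = ('<div><p>I have <span class="hi">raised<div class="note"><p>a phantom</p></div>'
       ' the Ghost</span> of an Idea.</p></div>')
body = move_notes_to_end(ET.fromstring(src))
print(ET.tostring(body, encoding="unicode"))
```

The link keeps a pointer from the text to the note, which is the only context the note retains once it has been moved.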
Editor in Byzantine Studies
I would do b), i.e. make everything, everywhere, be a <div> and sort it
out with display-style in CSS
As I understand it HTML was designed to let <html:div> be a generic
segmenting tool for anything that might (but not necessarily) deserve new
lines (blocks), and <html:span> be anything that definitely didn't (inline
segments). I think it would be a mistake to assign <tei:p> to <html:span>,
unless you're using <tei:p> to annotate text segments that don't have
new lines. (I've seen this sort of use of <tei:p> before, legitimately, most
notably of biblical texts.)
Despite the disallowance of html:span/html:div, in CSS rendering you can
treat either <html:div> or <html:span> as block, inline, etc. For this
reason I now tend to avoid <html:span> altogether in any HTML I create,
and simply make sure I type any <html:div> via @class. In fact, one could
even dispense with <html:p>, <html:h1>, <html:h2>, etc., which are, at
heart, just simply types of divisions.
But isn't it just as abusive to map <tei:hi> to <html:div>? swings and
I am operating on the assumption that we had better be prepared for the
CSS not being applied as we intended (someone may swap in a different one
for audio rendering, for example), so preserving the basic div/span
distinction in HTML seems pretty important to me. I may well be
the downside of the more semantically pure "put the notes in a separate
container anyway" is that marginal labels like "Tim. II 21" or "234 BC"
are not <aside>s in the same sense as normal marginal notes.
I think it's a reasonable assumption that the CSS will not always be
applied as intended. E.g., a web archive environment might leave it out, or
a tool for text analysis.
responding to Sebastian's comment that "marginal labels like 'Tim. II 21'
or '234 BC' are not <aside>s in the same sense as normal marginal notes":
Which is why God gaue us the @type attribute on <note>
Perhaps a dumb question. But in that case why not deemphasise the html?
I.e. serve out the xml and either style as is or do client side
It seems to me that the problem you are setting yourself is rapidly
becoming how can I preserve the semantic granularity of the original TEI
in an HTML text that is used for interchange without negotiation, and I'd
guess the answer is going to be 'you should design something like the TEI
to do it.'
How do I do this in HTML in this specific context is one question. How do
I do it so that it maintains its semantic integrity in unknown contexts is
a whole other one.
responding to Daniel O'Donnell's question "why not deemphasize the html"?
because I want to deliver ebooks, where client side transform isn't really
an option. and even then, it doesn't seem safe to dynamically create HTML
which messes up its fundamental concepts about flow and inline
[Responding to the second part of O'Donnell's post:]
yes and no. but I am not talking about _complete_ semantic interchange,
but lossy downgrading to the universal interchange format while trying to
keep the limited semantics which HTML _does_ offer. Which is why, for
example, I am inclined to see if I can follow Ron's idea, but implement it
I don't think I can go as far as purely presentational HTML using only div/span
responding to SR's question "isn't it just as abusive to map <tei:hi> to
<html:div>?":
I'd say no: <tei:hi> is but another way in TEI to segment/divide text,
which is exactly what <html:div> was meant to do.
I've been transforming a collection of structured texts with analogues to
<tei:hi> elements. I've been very pleased with the flexibility and
expressiveness in <html:div>. IMO, ordinary human intelligibility is not
as important in HTML as it is in TEI (after all, what portion of readers
actually look at a page's source?). Plus with this approach you can write
more concise and expressive css that doesn't require you to worry about
the hierarchy, or even the name of the element. For example, maybe you
want to provide the same background to your floating callouts as you do to
named entities. All you need is to assign something like teiHi and teiName
to html:div/@class, then in css do something like this:
Plus with this approach think of all the cool things you could do with css
selectors, e.g., to pick every tei-derived element in your html just use
this selector in your css:
responding to SR's original post:
There's no reason, surely, that you can't create a span with display:
block and float: right? I do that all the time. Another option for
positioning is not to float, but to set it as position: absolute, specify
a width, and right: 2em or something like that, so the thing appears as a
block next to the right edge.
responding to Martin Holmes
I think the problem may be that those notes may themselves contain
block-level structures that won't fit into HTML <span>
responding to Martin Holmes
no. but when that <span> has <p> and <ul> inside it, the validator howls.
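The complaint can be made concrete with a small Python sketch that lists every block-level element sitting somewhere inside an inline one. The INLINE and BLOCK sets are deliberately incomplete assumptions, and block_inside_inline is a made-up helper, not part of any validator or epub checker.

```python
import xml.etree.ElementTree as ET

# Assumption: small, incomplete samples of inline and block-level HTML elements.
INLINE = {"span", "em", "i", "b", "a"}
BLOCK = {"div", "p", "ul", "ol", "table", "aside"}

def block_inside_inline(root):
    """Return (inline_ancestor, block_tag) pairs for blocks nested inside inline elements."""
    found = []

    def walk(el, inline_ancestor):
        for child in el:
            if child.tag in BLOCK and inline_ancestor is not None:
                found.append((inline_ancestor, child.tag))
            # Remember the nearest inline ancestor while descending.
            walk(child, child.tag if child.tag in INLINE else inline_ancestor)

    walk(root, root.tag if root.tag in INLINE else None)
    return found

bad = ET.fromstring(
    "<p><span>text<div><p>a</p><ul><li>b</li></ul></div></span></p>")
print(block_inside_inline(bad))
```

A check along these lines could run over generated output to show how often the span/div problem actually arises in a given text.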
Martin Holmes: responding to Peter Gorman:
In that case, it's either divs all the way down, spans all the way down,
or an arbitrary point at which you move from divs to spans, and then check
the ancestry in every template to see whether you've passed that point.
Rendering the notes at the end is simpler, though. You'd have to put an
so you can retrieve the right offset. There's also the problem of notes
and labels overlapping if the margin is not wide enough or the font size
is too big.
Sebastian Rahtz: responding to Martin Holmes' first point:
right. I can sort of see how jQuery position() or offset() is going to
help do this. if anyone is bored enough to write up a proof of concept of
that, I'd be very happy :-}
responding to Martin Holmes' point about overlapping notes and labels:
that's much, much harder. makes me feel mildly ill even thinking about the
chaos which could result
For most of the typically very short and simple marginal notes in Early
Modern texts, display in the margin is a real benefit. Where you have
complex marginal notes--as for instance in Ben Jonson's Works--moving the
notes to the end may be the better solution, and displaying them as
marginal notes on a screen may be a nuisance. I think there are
algorithmic ways of cleanly dividing the cases.
Sebastian Rahtz responding to Martin Mueller
What would your algorithm be? You are suggesting a much simpler solution,
which is to say that all complex side notes should be converted to endnotes,
without trying to move them at all. But is this solely based on whether
they have internal
Martin Mueller responding to Sebastian Rahtz
I need to have a look at more examples. Paul Schaffner probably has all
the cases in his head. But my hunch, to be confirmed by trawling through a
sample of the TCP corpus, is that very few marginal notes have internal
block level components
Sebastian Rahtz responding to Martin Mueller
You may be surprised. I can detect 28504 occurrences in 2448 texts from
the 61k texts in EEBO/ECCO/Evans. That's occurring in 1 out of 28 books, then.
Stuart Yeates responding to Sebastian Rahtz's original questions
Personally I'd love to do (a), forget about HTML validity, but I'm not
sure what level of browser support there is. Maybe a couple of sample
cases could be run through http://netrenderer.com/ or
http://browsershots.org/ and checked for serious issues?
I'm also aware that increasingly ePub and other HTML-containing
TEI-output formats may be the inputs into third-party toolchains. It
may be more robust to have a semantically correct HTML output option
for such cases.
Louis-Dominique Dubeu responding to Stuart Yeates:
I advise against this. When you give a browser invalid HTML, you are
venturing into "undefined behavior" territory. If it works when you test
it *now*, it's just luck. If it works with version X of browser A, there's
no guarantee that it will work with X-1 or X+1. I've recently run into an
Chrome 34. It was fixed for Chrome 35 **only because** there is a standard
and a defined behavior that people had been relying on and this was
promptly brought up in the bug report. In the case of bug reports where
the change in behavior does *not* run afoul of a standard, good luck
getting speedy resolution, or any resolution at all.
Paul Schaffner responding to Martin Mueller:
Many examples of 17th-century printing challenge the
very distinction between 'main text' and 'margin'.
The one I was editing this very minute, for example,
is far from exceptional:
But are things any different nowadays? Magazine and
web page layout, for example.
If you're looking for long, elaborate notes, the
18th century philosophical novel The Life of John
Buncle springs to mind. On this random page, for
you can see just two lines of 'main' text at the top
of the page; the rest of the page is occupied by
the conclusion of footnote 25 (which occupied most
of the preceding three pages); the beginning of
footnote 26; and an asterisk-flagged footnote
that nests within note 25. Notes in this book
routinely themselves have notes; routinely contain
poems, chunks of plays, multiple stanzas and
paragraphs, block quotations, etc. etc. But again,
this is not that unusual in modern printing either,
especially of the academic kind. (One of the
chief advantages of my old Nota Bene word processor
was that it supported three simultaneous *series*
of footnotes, all of which could display on the
same page; and I was not the only one who could
foresee a use for such a feature -- reflected in the TCP
texts through the use of values like @place="marg1"
None of which helps Sebastian; if anything, the reverse.
Probably my favorite online note display is that
used by the CCEL, e.g.
in which users may select (go to the little gear at top
right) to see notes displayed in
the margins, at the foot of the page, or (the default)
suppressed altogether till clicked on. If I remember rightly,
CCEL uses html:span for all the notes, and flattens markup
within them. But that may be wrong.
Stuart Yeates responding to Louis-Dominique Dubeu's warning about
"undefined behaviour" with option a):
of my suggestion.
Elisa responding to Sebastian Rahtz' discovery of 28504 structured
marginal notes in 2448 texts:
Having spent much time with heavily annotated long poems of the 18th- and
19th-c. by the likes of Erasmus Darwin and Robert Southey (whose very
annotations I was just talking about last week at our conference in
Evanston), I am aware of how long and complicated these can
become--Sometimes whole poems are written out in long footnotes, and quite
frequently we see block-level structures, yes. I am not really happy with
the common tendency to push annotations to the ends of documents,
particularly when they were originally presented so the eye would move
across or down a page to a layer of paratext. This may sound awfully
unsightly to the e-reader aesthetic, but there is something to be said for
having the web interface preserve the positioning of notes embedded within
and immediately accessible from the lines of poetry or chunks of prose
text in which they're signaled. I don't much like the idea of HTML's
losing this simple association of proximity--it seems like caving to
convenience and worse, pushing a layer of paratext away from its point of
association. But I may just be obsessed with note-heavy Bob Southey.
Paul Schaffner responding to Elisa
I agree: the demotion of marginalia (and the other things that
people call 'paratext' these days) is to be resisted if at all possible.
Perhaps it's time to revive html:frameset ! (one frame for text,
one for notes ...) :)
Peter Robinson responding to Paul Schaffner
Oh no! not frame sets!
It is a perfectly straightforward process to push your notes out into a
html div, and the text into another html div, and then use the 'float'
style attribute on the two divs so that your notes appear to the left, or
right, or both, of the text they annotate. Simple css, indeed (google
"floating div css" for lots of examples). You can go further, and use
browser window, and then place the annotation at the appropriate height to
the left or right. And much more.
Stuart Yeates adding to Peter Robinson
If nothing else, framesets can't be used in ePub.
Sebastian Rahtz responding to Peter Robinson:
i might contest that. notes inside notes inside notes are not so very easy.
Stuart Yeates adding to Sebastian Rahtz:
In our experience it's relatively straightforward until you start
wanting page-break paraphernalia (anchors, page images, navigation,
etc) at each of the nested levels of notes. Users have an expectation
that they can link to 'page 123' of a document; and that following
that link will take them to a representation of the intellectual
content on that page in the print book. Making that happen reliably is
The NZETC has certain features that can't be used reliably in
combination, for example nested footnotes and works with back matter
printed backwards (i.e. to be read from the back page forwards towards
the first page). Fortunately those aren't widely used in combination.
Conal Tuohy commenting on Sebastian Rahtz' original question:
My vote would be for (b) - to use <div> for pretty much everything, and to
deal with typographical issues in CSS.
I do sympathize with the desire to maximize the retention of TEI-encoded
semantics, but these days I am less inclined to believe there is any
significant payoff in doing so, and I think the additional complexity of
the stylesheets is a barrier to modular reuse and a prohibitive cognitive
burden for many people who might otherwise contribute to the stylesheets.
Stuart Yeates responding to Conal Tuohy:
Is that a viable solution for ePubs read on epaper devices? I thought
there were pretty strong limits on what you could get away with in CSS
on such devices.
If we're doing everything in CSS, why go to HTML at all, why not
follow TEI Boilerplate
(http://dcl.slis.indiana.edu/teibp/content/demo.xml) and use TEI+CSS?
Peter Flynn responding to Sebastian Rahtz' original question
I once played around with making the notes into divs, but outputting
them after the end-tag of the paragraph or other
mixed-content-containing element in which they occurred. But I haven't
checked to see if this is valid XHTML/EPUB3 because I haven't had to do
it for some while.
The problem I had was aligning them with the point of reference, if you
are inserting something for the user to click on. Or were you intending
them just to be there of their own accord as the source paragraph
scrolls into view (like paper marginal notes)?
Sebastian Rahtz responding to Peter Flynn:
I have indeed bunged in notes as floating <aside>s after the end
of the containing paragraphs, and then jerked them up to align with the
of insertion with position: absolute. Now, of course, the wretched things
Andreas Wagner commenting on Sebastian Rahtz' option f):
Taking the risk of making a fool of myself: What is the argument against
(f) again? Obviously the nesting of spans makes it a complex thing and I admit
being ignorant of exactly how complex this can get, but are you all
sidestepping (f) because of this or because of a different and more
important, yet unmentioned problem?
Stuart Yeates responding to Andreas Wagner:
My argument against (f) is that it splits single logical entities
into sequences of two or more XML elements in such a way that breaks
everything that expects logical entities to be contiguous or
referenceable by a single ID-REF. This breakage ranges from things as
esoteric XML and web infrastructure.
The real answer is that this is a symptom of TEI being more expressive
than HTML. There is no 'best' solution, merely a number of potential
tradeoffs whose relative merits are dependent on the kinds of
documents one has and what you're trying to do with them.
Sebastian Rahtz adding to Stuart Yeates
apart from the very real semantic problems which Stuart gives (which I
think are acceptable if processing a <pb/> like this,
but not otherwise), try this:
<p>I have endeavoured in this Ghostly little book, <span
style="font-style:italic">to raise the Ghost</span><div
style="float:right">a phantom</div> <span
style="font-style:italic">of an Idea</span>, which shall not put my
readers out of humour with themselves, with each other, with the
season, or with me. May it haunt their houses pleasantly, and no one
wish to lay it.</p>
you'll see that though the inner <div> floats OK, the italic span has a
in the middle. This renders the technique useless, sadly.