[recognition of patterns in scanned text]
* Is anyone aware of any other work on corpus compilation where
particular attention was paid to detailed and consistent markup of
layout characteristics such as headings, sections, lists, and tables?
Are there any such corpora (or DTDs) already available?
You might want to talk to Corel Corp about Envoy, which they took over
from WordPerfect. This was a print driver for Windows which camped on
your WP/DTP output and analysed the data stream looking for patterns; it
was claimed it could construct HTML from the logic "evident" behind the
layout (that claim comes from a user, not from Corel). I don't know of
anyone using it now, but I believe it's still available.
* For formatting and layout convenience, the output of the scanning is
currently converted into Word. The conversion to HTML is done via a tool
called RTFtoHTML which unfortunately throws away all Word formatting
such as lists, tables or spacing.
The problem with such "formatting" is that it does not identify
itself. You need to analyse the motion commands (left, right, up,
down, etc.) and measure the results, then infer what the original
author intended. For example, here's a translation into readable
English of the Word internals which represent a list scanned from a
piece of sales blurb here:
    ReturnToLeftMargin
    DownSingleLine
    DownUnits 6pt
    MoveRight 22pt
    Font Wingdings 14pt "r"
    MoveRight 5pt
    Font TimesNewRoman 12pt Better resolution than any competing
    ReturnToLeftMargin
    DownSingleLine
    MoveRight 42pt printer for the price.
    ReturnToLeftMargin
    DownSingleLine
    DownUnits 6pt
    MoveRight 22pt
    Font Wingdings 14pt "r"
    MoveRight 5pt
    Font TimesNewRoman 12pt Crisp, clean colors on any paper, with
and so on. Now if you can write something which will detect the
"list-ness" of the above, and output
    <UL><LI>Better resolution than any competing printer for the price.</LI>
    <LI>Crisp, clean colors on any paper, with optional gloss finish.</LI>
    </UL>
then I am sure it would have a ready market.
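To show what even the easy case involves, here is a minimal Python
sketch of such a detector. It is an illustration under stated
assumptions, not a working converter: it assumes the Word internals have
already been translated into the readable command stream shown above,
it ignores glyph widths (which the stream does not record), and it
treats a lone quoted glyph followed by text further to the right as a
bullet. The names parse, visual_lines, detect_list_items and to_html are
hypothetical, and the closing lines of the sample stream are inferred
from the desired HTML rather than taken from a real dump.

```python
import re
from html import escape

# Hypothetical input modelled on the listing in the post. The last three
# lines are assumed: they are inferred from the desired HTML output.
STREAM = """\
ReturnToLeftMargin
DownSingleLine
DownUnits 6pt
MoveRight 22pt
Font Wingdings 14pt "r"
MoveRight 5pt
Font TimesNewRoman 12pt Better resolution than any competing
ReturnToLeftMargin
DownSingleLine
MoveRight 42pt printer for the price.
ReturnToLeftMargin
DownSingleLine
DownUnits 6pt
MoveRight 22pt
Font Wingdings 14pt "r"
MoveRight 5pt
Font TimesNewRoman 12pt Crisp, clean colors on any paper, with
ReturnToLeftMargin
DownSingleLine
MoveRight 42pt optional gloss finish.
"""

MEASURE = re.compile(r'^(\d+(?:\.\d+)?)pt\b\s*(.*)$')  # "22pt trailing text"
GLYPH = re.compile(r'^"(.)"$')                         # a lone quoted glyph


def parse(stream):
    """Yield (command, points or None, trailing text) for each line."""
    for raw in stream.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        cmd, _, rest = raw.partition(" ")
        if cmd == "Font":
            # Font <name> <size>pt <text>: the font name is not needed here.
            _font_name, _, rest = rest.partition(" ")
        m = MEASURE.match(rest)
        if m:
            yield cmd, float(m.group(1)), m.group(2)
        else:
            yield cmd, None, rest


def visual_lines(stream):
    """Group commands into visual lines of (x offset, text) segments."""
    lines, current, x = [], None, 0.0
    for cmd, pts, text in parse(stream):
        if cmd == "ReturnToLeftMargin":
            current, x = [], 0.0
            lines.append(current)
            continue
        if current is None:  # stream did not begin at the left margin
            current = []
            lines.append(current)
        if cmd == "MoveRight":
            x += pts or 0.0
        if text:
            current.append((x, text))
    return lines


def detect_list_items(stream):
    """Return the item texts if the stream looks like a bulleted list."""
    items, bullet_indent = [], None
    for segs in visual_lines(stream):
        if not segs:
            continue
        first_x, first_text = segs[0]
        is_bullet = len(segs) >= 2 and GLYPH.match(first_text) is not None
        if is_bullet and bullet_indent in (None, first_x):
            # Glyph at a consistent indent, then text: a new item starts.
            bullet_indent = first_x
            items.append(" ".join(t for _, t in segs[1:]))
        elif items and not is_bullet and first_x > bullet_indent:
            # Hanging indent under the item text: a continuation line.
            items[-1] += " " + " ".join(t for _, t in segs)
        else:
            break  # anything else ends the run
    # A single bulleted line is weak evidence of "list-ness".
    return items if len(items) >= 2 else []


def to_html(items):
    """Emit the items in the same shape as the target markup."""
    body = "\n".join(f"<LI>{escape(item)}</LI>" for item in items)
    return f"<UL>{body}\n</UL>"


if __name__ == "__main__":
    print(to_html(detect_list_items(STREAM)))
```

Every rule in it is a guess about intent: that a lone glyph is a bullet,
that a deeper indent means continuation, that two matching lines make a
list. Real Word output will break each of those guesses sooner or later.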
* Can anyone suggest a Word-to-HTML
program which does a better job of preserving structures?
But the whole point is that the structures are not there to preserve
in the first place. Visual positioning (=Word) is not structure.
///Peter