[recognition of patterns in scanned text]
 
   * Is anyone aware of any other work on corpus compilation where
   particular attention was paid to detailed and consistent markup of
   layout characteristics such as headings, sections, lists, and tables?
   Are there any such corpora (or DTD's) already available?
 
You might want to talk to Corel Corp about Envoy, which they took over
from WordPerfect. This was a print driver for Windows which camped on
your WP/DTP output and analysed the data stream looking for patterns; it
was claimed it could construct HTML from the logic "evident" behind the
layout (that claim comes from a user, not from Corel). I don't know of
anyone using it now, but I believe it's still available.
 
   * For formatting and layout convenience, the output of the scanning is
   currently converted into Word. The conversion to HTML is done via a tool
   called RTFtoHTML which unfortunately throws away all word formatting
   such as lists, tables or spacing.
 
The problem with such "formatting" is that it does not identify
itself. You need to analyse the motion commands (left, right, up,
down, etc) and measure the results, then determine what the original
author's intent was. For example, here's a translation into readable
English of the Word internals which represent a list scanned from a
piece of sales blurb here:
 
   ReturnToLeftMargin
   DownSingleLine
   DownUnits 6pt
   MoveRight 22pt
   Font Wingdings 14pt "r"
   MoveRight 5pt
   Font TimesNewRoman 12pt Better resolution than any competing
   ReturnToLeftMargin
   DownSingleLine
   MoveRight 42pt printer for the price.
   ReturnToLeftMargin
   DownSingleLine
   DownUnits 6pt
   MoveRight 22pt
   Font Wingdings 14pt "r"
   MoveRight 5pt
   Font TimesNewRoman 12pt Crisp, clean colors on any paper, with
 
and so on. Now if you can write something which will detect the
"list-ness" of the above, and output
 
   <UL><LI>Better resolution than any competing printer for the price.</LI>
       <LI>Crisp, clean colors on any paper, with optional gloss finish.</LI>
   </UL>
 
then I am sure it would have a ready market.
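 
To give a feel for what such a detector involves, here is a rough sketch
in Python. It is only an illustration under several assumptions: the
command names are just my readable translation from the example above,
not a real Word format; the function names (parse_lines, lists_to_html)
are made up; and the rule it applies (a single glyph in a symbol font
followed by text starts an item, and an indented line with no extra
vertical gap continues it) is a guess at the author's intent which will
misfire on plenty of real documents.

```python
import re
from html import escape

# Assumption: a lone glyph in one of these fonts is a bullet.
# Both spellings of the font name are accepted.
SYMBOL_FONTS = {"wingdings", "winddings", "symbol"}

# The readable command stream from the example (hypothetical notation).
SAMPLE = """
ReturnToLeftMargin
DownSingleLine
DownUnits 6pt
MoveRight 22pt
Font Wingdings 14pt "r"
MoveRight 5pt
Font TimesNewRoman 12pt Better resolution than any competing
ReturnToLeftMargin
DownSingleLine
MoveRight 42pt printer for the price.
ReturnToLeftMargin
DownSingleLine
DownUnits 6pt
MoveRight 22pt
Font Wingdings 14pt "r"
MoveRight 5pt
Font TimesNewRoman 12pt Crisp, clean colors on any paper, with
"""

def parse_lines(stream):
    """Replay the motion commands and return the visual lines.

    Each line is a dict: "gap" is extra vertical space in points,
    "runs" is a list of (x, font, text). Glyph widths are unknown,
    so x only reflects explicit MoveRight commands.
    """
    lines, x, font = [], 0.0, None

    def current():
        if not lines:
            lines.append({"gap": 0.0, "runs": []})
        return lines[-1]

    for raw in stream.strip().splitlines():
        parts = raw.split(None, 1)
        if not parts:
            continue
        cmd = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        if cmd == "ReturnToLeftMargin":
            x = 0.0
        elif cmd == "DownSingleLine":
            lines.append({"gap": 0.0, "runs": []})
        elif cmd == "DownUnits":
            current()["gap"] += float(re.match(r"[\d.]+", rest).group())
        elif cmd == "MoveRight":
            m = re.match(r"([\d.]+)pt\s*(.*)", rest)
            x += float(m.group(1))
            if m.group(2):
                current()["runs"].append((x, font, m.group(2)))
        elif cmd == "Font":
            m = re.match(r"(\S+)\s+([\d.]+)pt\s*(.*)", rest)
            font = m.group(1)
            text = m.group(3).strip('"')
            if text:
                current()["runs"].append((x, font, text))
    return lines

def is_bullet(run):
    _, font, text = run
    return (font or "").lower() in SYMBOL_FONTS and len(text) == 1

def lists_to_html(lines):
    """Guess at list items and emit them as a single <UL>."""
    items = []
    for line in lines:
        runs = line["runs"]
        if not runs:
            continue
        if len(runs) >= 2 and is_bullet(runs[0]):
            # Bullet glyph followed by text: start a new item.
            items.append(" ".join(t for _, _, t in runs[1:]))
        elif items and line["gap"] == 0 and runs[0][0] > 0:
            # Indented line with no extra gap: continue the item.
            items[-1] += " " + " ".join(t for _, _, t in runs)
        # Anything else is ignored in this sketch.
    if not items:
        return ""
    body = "\n".join(f"<LI>{escape(item)}</LI>" for item in items)
    return f"<UL>\n{body}\n</UL>"

if __name__ == "__main__":
    print(lists_to_html(parse_lines(SAMPLE)))
```

Even on this tiny sample the sketch has to assume which fonts hold
bullets and how much vertical gap separates items, and neither is
recorded anywhere in the file. That guesswork is the hard part.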
 
   Can anyone suggest a Word-to-HTML
   program which does a better job of preserving structures?
 
But the whole point is that the structures are not there to preserve
in the first place. Visual positioning (=Word) is not structure.
 
///Peter