The work of Dan Connolly on "A Lexical Analyzer for HTML and
Basic SGML" may be of interest to those working in this area.
His proposal is at <url:http://www.w3.org/pub/WWW/TR/WD-sgml-lex>.
It attempts to document formally the subset of SGML used in
HTML. He claims that HTML, along with TEI, DocBook, and a
couple of others, are "basic SGML documents" as defined in
the standard, with few exceptions. He also offers an
implementation of a lexical analyzer compatible with this
subset.