* If you subscribe to TEI-L or SGML-L and have *
* previously contacted the SGML Project, you may *
* receive duplicate copies of this report. *
* We apologise for any inconvenience. *
UNIVERSITY OF EXETER COMPUTER UNIT SGML/R7
THE SGML PROJECT
APPLYING INFORMATION RETRIEVAL TO MARKED UP OR
STRUCTURED DOCUMENTS - NEW YORK 18TH OCTOBER 1991
4 December 1991
The promotional literature defined the subject of this one day
symposium as follows:
"More and more publishing activities (internal or external)
will generate structured documents and corresponding data bases.
Although this evolution is slow, real SGML applications are emerging and
ODA is being increasingly accepted.
As greater quantities of text are stored on line, information
retrieval will have to deal with these more complex and richer documents.
As time goes on, retrieval software should use structural
information to improve the performance of searching. Some companies are
already moving in that direction.
This problem is profoundly related to interaction between information
content, linguistic formulation, logical and physical structure of the
documents, presentation and typography."
The symposium was organised by the Centre for the Advanced Study
of Information Systems Inc (CASIS) - which, to cite their own
literature, "was created as a non-profit Association incorporated
in the State of New York, during the International Congress
organised by the Centre de Hautes Etudes Internationales
d'Informatique Documentaire (C.I.D.) .... in March 1988".
The symposium took place at the Rockefeller University, New York,
with about eighty attendees drawn from both the academic and
commercial worlds. The majority of attendees were from North
America. I attended in my capacity as a representative of the
SGML Project, based at Exeter University's Computer Unit.
1. Introduction - Prof F Seitz, Don Walker
The symposium opened with a brief introduction from Prof F Seitz,
(President Emeritus Rockefeller University), who gave a succinct
resume of the University's history. Prof F Seitz replaced Pierre
Aigrain (President of C.I.D.) who had been due to give
introductory remarks, but was, unfortunately, unable to attend.
Don Walker (Bellcore) described the purpose of the symposium as
being to look at and discuss the latest developments in
information storage and retrieval techniques. The particular
focus would be on applying markup concepts (especially SGML) to
enhance information retrieval.
2. From Overflow of Documents to Answering Impredictable (sic)
Questions - Norbert Paquel (CANOPE, France)
Paquel began by outlining what he called the "People's Dilemma",
namely the choice between developing new and powerful
database management systems, tools etc., or storing very
structured information, in order to make the process of
information storage/retrieval easier. Of course the more people
prevaricate, the worse the dilemma becomes.
On the subject of `impredictable' (sic; unpredictable?) questions
(or `fuzzy logic'), Paquel made two observations. He raised the
issue of whether there is, indeed, any such thing, or whether such
questions are simply the result of using a system of badly
structured information or storage. However, he also pointed out
that one would not wish to learn that there was such a thing as
an impredictable (sic) question at a crucial point in the live
running of a system controlling a nuclear power plant!
Paquel offered three general techniques for structuring information:
i) structure ideas and concepts (within the data);
ii) employ linguistic and textual structures;
iii) structure documents using SGML, ODA etc.
Paquel concluded by suggesting that structuring information
through the use of text embedded tags (eg using SGML etc),
currently seems to offer the best way to meet the problems posed
by having to find answers to impredictable questions.
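To illustrate the third technique, a minimal SGML-style fragment
(the element names are invented for illustration, not drawn from
any particular DTD) might look like:

```sgml
<report>
<title>Annual Safety Review</title>
<section><para>Embedded tags such as these identify the logical
role of each piece of text, so a retrieval tool can search titles,
sections or paragraphs separately.</para></section>
</report>
```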
3. The Text Encoding Initiative - Don Walker (Bellcore, U.S.A.)
Walker stated that the Text Encoding Initiative (TEI) had been
set up with the intention of `taking over' from SGML and extending
it into a more significant approach to tagging. The TEI was also
set up in recognition of the fact that too much of the effort to
devise ways of marking up texts was being expended
idiosyncratically, and so was not facilitating the exchange of
information. Walker, in his capacity as a member of the TEI
Steering Committee, then briefly discussed the main purpose and
goals of the TEI.
Walker described the history of the TEI, and outlined its current
structure, sponsoring bodies etc. He then went on to define the
functions of the TEI's four main committees, namely the Committee
for Text Documentation, the Committee for Text Representation,
the Committee for Text Analysis and Interpretation, and the
Metalanguage and Syntax Committee. He briefly outlined how the
work of these committees had led to the production of the report
"TEI P1: Guidelines for the Encoding and Interchange of
Machine-Readable Texts" in July 1990 (of which 1000+ copies have
since been distributed world-wide).
Walker outlined how the TEI had recognised early on that there
would be a need to have working groups who could concentrate on
specific areas and make recommendations to the relevant
controlling Committee. He then gave brief details of each of
the various working groups such as their area of concern,
chairman, and current state of progress.
Walker then listed the TEI-affiliated Projects, and pointed out
that the decisions taken by the TEI advisory board will thus have
some influence on the use of SGML and structured markup in the
work of approximately 100,000 people. He went on to discuss a
schematic view of the TEI-corpus (placing particular emphasis on
the importance of marking up header information), and also
presented the SGML software environment envisaged by the TEI.
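By way of illustration, header information of the kind the TEI
emphasises can be sketched as follows (the element names follow the
TEI's published guidelines, but the sample content is invented):

```sgml
<teiHeader>
<fileDesc>
<titleStmt><title>Macbeth: an electronic transcription</title></titleStmt>
<sourceDesc><p>Transcribed from a printed edition.</p></sourceDesc>
</fileDesc>
</teiHeader>
```

Because such a header travels with the text itself, retrieval
software can establish the provenance of a document without
consulting external records.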
Walker stated that the work of the TEI is now entering its final
stages. He described the procedure which will enable the
production of TEI-P2, the second version of the TEI's "Guidelines
...", and the process of review which will eventually lead to a
final version of the report being produced by July 1992. Walker
stressed that the final report will be intended to serve as a
reference document, and so will not be an `easy' read. It will
contain a number of discipline specific tutorials (although these
have yet to be chosen), as well as some case-book samples -
employing extended examples and supporting prose commentary.
Walker urged all interested listeners to ensure that they had
subscribed to the TEI's discussion list (TEI-L@UICVM) if they
wished to learn how to obtain a review copy of the report.
Walker closed with an outline of some of the TEI's plans for the
future, which included the following: Establish an institutional
base, get additional funding, elicit involvement from the
Japanese, continue development and maintenance of the "Guidelines
....", establish a systematic evaluation program, facilitate
relevant education and training, encourage software development
that supports the "Guidelines...", and explore the legal issues
surrounding the use and users of tagged texts. Lastly Walker
expressed regret at the lack of working groups to consider the
cases of newspapers and reference works - primarily because no-
one was prepared to take on the responsibility for these areas -
despite his personal belief that they would be influenced by the
recommendations given in the TEI's "Guidelines...."
In the time made available for questions and answers, Walker was
asked if the TEI restricted itself only to giving guidelines on
marking up textual representations. Walker replied that other
types of representation will be given little or no consideration
in the TEI's final report, but that many people involved with the
TEI are very interested in graphic/time-based/music/multi-media
etc. representations. Later, Walker mentioned that one of the
working groups under the Text Representation Committee is looking
at how best to handle ideographic representations for
languages such as Japanese.
It was pointed out that the level of detail given in references
and citations can vary dramatically from text to text, and that
this must surely cause problems for the TEI. Walker
acknowledged that this was a difficult area, but said that the
TEI were trying to establish a consensus on how to deal with
references/citations (and noted that there would be different
conventions for the different types of document).
Asked to comment on the metalanguage used by the TEI, Walker said
that the relevant committee had confirmed that SGML was the best
available option, but was now examining how SGML falls short of
TEI requirements (and how ISO 8879 [the SGML standard] should
best be developed).
4. Linguistic Analysis: exploiting structure and extracting
structure - Lisa Rau (General Electric Co., USA)
Rau is a member of General Electric Co.'s Artificial Intelligence
group, looking at natural language processing and linguistic
analysis. She began by identifying the problem which concerned
her most at work: how to exploit the information in on-line
text? Rau suggested two approaches that might offer a solution
content-based automatic text interpretation, and automatic
markup for indexing.
Rau went on to propose several `kinds' of structure which could
be marked up using SGML:
Document syntax - eg. paragraphs, sections, sentences etc.
Document semantics - topic, proper-names, indexing terms etc.,
at several levels:
   lexical level: word senses, part of speech etc.
   sentence level: type of utterance, purpose etc.
   semantic level: cultural/temporal orientation, theme,
   plot, style etc.
Meta-information - author, author information, ISBN, date etc.
Rau then described how linguistic analysis, using natural
language processing techniques, had enabled substantial
achievements in the automatic extraction of key
information from texts. She claimed that many current natural
language processing systems were not merely capable of
recognising and extracting information from a text, but were also
able to generate indexes, identify key-words, and insert some
tagging and markup. Rau gave an example of some software
produced at General Electric Co. that was able to read natural
language text supplied by a business newswire service, and
extract the key information and facts (ie. who did what, when,
where, and to whom).
However, Rau pointed out that current natural language processing
systems were not yet fully capable of extracting enough structure
from a text to create a complete automatic index. This would
involve determining syntactic tags (correctly identifying every
paragraph, sentence etc.), semantic tags (correctly deciding what
should be treated as an index-term and/or keyword) and extracting
important words and phrases (eg correctly recognising all proper
names, brand names etc.).
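As a very rough sketch of the kind of extraction involved (far
cruder than the systems Rau describes; the heuristics and names
below are my own, not General Electric's):

```python
import re

def extract_candidates(text):
    """Crude sketch: capitalised runs not at sentence start are treated
    as proper-name candidates, and long lower-case words as keyword
    candidates. Real NLP systems use parsing, lexicons and semantic
    tagging to do this reliably."""
    proper = {m.strip() for m in
              re.findall(r"(?<=[a-z,;] )((?:[A-Z][\w.]*\s?)+)", text)}
    keywords = set(re.findall(r"\b[a-z]{8,}\b", text))
    return proper, keywords
```

Even a toy heuristic like this finds unambiguous cases; the hard
part, as Rau noted, is recognising all proper names and index terms
correctly.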
Rau claimed that there are various ways to exploit the structure
of a text, such as using text segmentation to tell an application
where best to search for information, recognizing embedded
constructions as an aid to parsing, and using knowledge of the
structure to facilitate browsing (eg. linking together entries in
the table of contents and the appropriate section headers within
a hypertext system). Any knowledge about the topic of the text
can obviously help an application resolve any ambiguities that it
encounters.
Rau then discussed two examples, the first of which was a case
where natural language A.I. tools had helped in the automatic
processing of some electronic and on-line texts. The texts were
parsed to generate databases of the information contained in
each, which could then be browsed and the original documents
directly accessed on-line via a hypertext interface. The second
example involved a situation where the users wanted to use some
new natural language processing software, but with their existing
hardware and their current information base. In order to
replace the existing text databases (which had human indexing of
keywords) and the existing text retrieval system, A.I. and
natural language processing techniques were used to automatically
extract keywords from the texts and sort them into indexes.
Rau also made some observations on the subject of Information
Retrieval (IR) in general. In her opinion, much of the work
being done in the field of I.R. is obsessed with the issue of
getting the wrong answer in response to a specific query,
whereas in practice, Rau claimed, people will want to browse
large amounts of electronic text with only a vague query. Rau
also felt that when I.R. experts have experimented with natural
language text processing it has always been done on too small a
scale to either wholly demonstrate or refute the worth of natural
language-based retrieval products and/or techniques.
Rau summed-up by re-affirming her claim that natural language-
based text retrieval products are almost a reality, and added
that structured information and texts should aid their operation.
She closed by repeating her plea to workers in the I.R. field to
fully explore natural language based retrieval techniques.
5. Semantic Extension to Text Retrieval - James Driscoll
(University of Central Florida, USA)
Driscoll used the example of a project he had been working on for
NASA, as the basis for his presentation. NASA staff are
bombarded with questions (from journalists etc.) during the
period immediately surrounding a shuttle launch, and a great deal
of information has to be sifted through in order to find the
correct answers. Driscoll was asked to automate the process of
information retrieval (I.R.), with the performance of his
solution being measured in terms of the number of (redundant)
paragraphs and pages that users had to read before finding the
correct answer to a question.
Driscoll's approach was to devise an I.R. system that takes in a
natural language query and returns a list of the most likely
references which the user can then look up on-line.
Driscoll's presentation was quite detailed and complex, but as I
understood it his system relied on an approach built on
`themantic' (sic; semantic?) knowledge of text.
Driscoll then gave an example of how his system works, and
discussed the formula used to establish the `most likely' answer
to a given query. The formula relied upon a weighting system,
based on the use of codes to indicate the semantic role, and
attributes, of particular words. Driscoll gave some examples of
I.R. using a `paragraph+keyword' approach versus
`paragraph+themantic(semantic?)' approach; the latter performed
consistently better, and there was a slight improvement if work
was done at sentence rather than paragraph level.
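The flavour of such a weighting scheme can be sketched as follows
(the weights, role names and function here are my own invention,
not Driscoll's actual codes):

```python
# Hypothetical weights - Driscoll's real scheme used hand-crafted
# semantic-role and attribute codes, not these invented values.
KEYWORD_WEIGHT = 1.0
ROLE_BONUS = {"agent": 2.0, "location": 1.5, "time": 1.5}

def score_paragraph(paragraph_terms, query_terms, roles=None):
    """Plain keyword matching plus a bonus whenever a matched term
    also carries a semantic-role code in the lexicon."""
    roles = roles or {}
    total = 0.0
    for term in query_terms:
        if term in paragraph_terms:
            total += KEYWORD_WEIGHT + ROLE_BONUS.get(roles.get(term), 0.0)
    return total
```

Paragraphs (or, in the finer-grained variant, sentences) would then
be ranked by this score and the best candidates returned to the user.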
Driscoll had also hand-crafted some lists of `word-triggers' to
act as a guide for the I.R. software, by linking words with
details of their themantic roles, preposition roles, and
attributes. He then showed some sample tables which described
the various themantic role codes, and also attribute codes, and
lastly discussed how he had created a machine-readable lexicon by
coding entries in Roget's Thesaurus using his scheme of themantic
role and attribute codes.
In summarising, Driscoll felt that there were problems inherent
in the decision to perform experiments whilst development was
still on-going. He also foresaw difficulties in getting the
optimum `blend' of his work and traditional basic keyword
retrieval techniques. The lexicon would also require more
thought to decide how big it should be, what other themantic
triggers might be required, whether or not it should contain
prepositions, and so on. Driscoll also felt that there would
still be cases of ambiguities in natural language that his
approach would (currently) be unable to handle but which could
possibly be resolved by introducing such notions as attachment,
context or A.I. frames into his work. (It was pointed out that
some of his immediate problems might be resolved by concentrating
on the specific case of natural language use in NASA documents,
as opposed to use in any document, since he appeared to have
adopted quite a high-level view of the problem in linguistic
terms.)
6. The Utilization of a Text Algebra for Retrieval of
Information from a Hierarchical Database incorporating
Heterogeneous Structured Documents - F. Burkowski (University of ...)
This was quite a technical presentation which I would not pretend
to have understood. Anyone who is very interested in this area
of research should contact Burkowski directly, and request a copy
of the slides that he used. I shall attempt to summarise his
main points below.
Burkowski began by stating that he was hoping to create a text
algebra that has the same degree of rigour as the algebra that
typically underlies a relational database and which will,
therefore, enable similar resolution of users' queries. (NB.
This will not necessarily involve using a relational database).
Burkowski asserted that many operations in text algebra can be
visualized in terms of node selections from a (hierarchical)
tree. He then went on to demonstrate how a text such as
Shakespeare's "Macbeth" could be interpreted in such a way that
it could be broken down to the level of individual words which
could then be stored as terminal nodes of a straightforward tree.
Burkowski stated that in his text algebra each operator in a
query will take two concordance lists as its operands. The
algebra is based on set theory and is able to make use of such
operations as union, differences, adjacency and proximity. When
a text is loaded into the database, Burkowski proposed that the
loader and the parser should use text algebra to make a static
representation of the text within the database. When tackling a
query, the retrieval tool and parser use text algebra to generate
a dynamic concordance and/or representation of specific parts of
the text. Burkowski also suggested that there should be two main
types of filter on query retrievals: `select wide' and `select narrow'.
Burkowski felt that it was very important to isolate and protect
the user from the (horror) of the boolean text algebra into which
the I.R. interface turns the users' query, so that it may be
resolved by the retrieval engine. Burkowski suggested that both
full text searches and browsing of hierarchical databases could
be done using the same underlying text algebra to interpret
queries and access the database.
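To give a feel for how such operators act on concordance lists
(lists of word positions), here is a toy sketch; the operator names
and details are my own reading of the talk, not Burkowski's
definitions:

```python
def adjacency(a, b):
    """Positions in concordance list `a` immediately followed (at the
    next word position) by a position in list `b`."""
    b_set = set(b)
    return [p for p in a if p + 1 in b_set]

def proximity(a, b, window):
    """Positions in `a` lying within `window` word positions of some
    position in `b`."""
    b_set = set(b)
    return [p for p in a
            if any(p + d in b_set for d in range(-window, window + 1))]
```

Union and difference fall out of set theory directly, e.g.
sorted(set(a) | set(b)) and sorted(set(a) - set(b)), which is why a
set-theoretic algebra is a natural fit for this kind of retrieval.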
7. Incorporating human aspects in human computer interaction in
information retrieval - T Saracevic (Rutgers University, USA)
Saracevic pointed out that the notion of processing texts for
"information retrieval" actually represented a very broad topic
and that the situation was further complicated by trying to
use 1990's hardware technology in conjunction with 1950's
boolean-based software techniques! He briefly discussed several
different approaches used in research on processing texts, but
commented that most would still be unable to resolve a direct
common sense query (and he used as his example "What are the
arguments against building a space station?")
Saracevic expressed his belief that people have a profound
influence on the success (or otherwise) of I.R. tasks. He felt
there was a need for more research to look into such matters as
to how people perform I.R. tasks, why some people perform given
I.R. tasks better than others, and how changes in technology can
influence the success/failure of people performing I.R. tasks.
Saracevic also argued that there is a need to look at how the
performance of different user groups varies; for example, how
professional indexers approach a text as opposed to a group of
general readers. In other words, we need to identify those
techniques which enable a particular group to be successful.
Saracevic cited the results of some recent tests involving
performance-comparisons of `most successful' versus `least
successful' groups. Thus when attempting to find some specific
information in a text, the worst time was five times longer than
the best; to create an index of keywords, the worst time was
nine times longer than the best; to debug a program, the worst
time was twenty-two times longer than the best. However,
Saracevic noted that even the consistency between two
professional indexers working on the same text can be as low as
34% (agreement on relevance/irrelevance seems to depend on how
well the indexer knows the subject of the text).
Turning his attention to work done on On-line Public Access
Catalogues (OPACs), Saracevic said that studies showed that
people come up with very different possible headings for the same
subject, topic-heading, or even book! Overlap in the selection
of search terms used by a group of researchers looking for the
same subject and in the same texts, is also very low: less than
27% for any sort of overlap, and less than 1.5% of any two sets
of search terms overlapping completely (and all the participants
were professional/expert searchers!) Another test showed that
whilst searching for items on a given subject using an OPAC,
librarians had retrieved an average of 728 items, expert
end-users an average of 722 items, and novice end-users only 435 (?)
items; however, the overlap of items retrieved was only 3.5%.
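The overlap figures quoted can be read as something like the
following measure (my interpretation; the studies Saracevic cited
may have defined overlap differently):

```python
def term_overlap(terms_a, terms_b):
    """Shared fraction of two searchers' term sets: size of the
    intersection over size of the union (the Jaccard coefficient)."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

On this measure two searchers who share one of their three distinct
terms overlap by a third, which makes the quoted sub-27% figures
look strikingly low.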
Saracevic concluded by re-stating his belief that no matter how
sophisticated an I.R. engine and/or database management system,
the human factor will always cause the greatest differences in
the success of a search. Therefore, it would appear to be
vitally important that we should re-think the current designs of
I.R. tools and the interfaces to them.
9. Exploiting SGML Information for Document Retrieval - Yuri
Rubinsky (SoftQuad Inc., Canada)
The main thrust of Rubinsky's presentation was to show how the
different kinds of markup that SGML supports (and the
consequences of each) give us new kinds of power over texts.
Rubinsky then looked briefly at a number of SGML structural
constructs, these being: element structures, linking through
attributes, entities and entity references, commenting
procedures, marked section handling, and processing instructions.
With the help of a number of examples, Rubinsky outlined a
technique for handling the process of reviewing an electronic
document, making, collecting and co-ordinating reviewers' comments
etc. He also discussed ways of cross-linking several texts, and
how to create different views or versions of texts. Rubinsky
claimed that the use of SGML attributes has implications for the
storage of documents in a database, document retrieval, and so
on, since any software that understands document attributes is
able to control what readers can do with a document.
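For instance, a fragment along the following lines (the attribute
names here are hypothetical, not drawn from any real DTD) could let
retrieval or viewing software restrict what a given reader sees:

```sgml
<para security="restricted" reviewer="jsmith" revision="2">
Software that understands the security and revision attributes can
withhold this paragraph from general readers, or show only the
latest revision.
</para>
```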
Rubinsky then discussed some work that had been done for the US
Army Regulations database. Since the Army has four different
basic types of document, it requested that there should be four
corresponding SGML Document Type Definitions (DTDs) rather
than one DTD which encompassed all four types. Even though the
Army's requirements were met, their DTDs have very similar
structures because the documents being modelled have several
structural elements in common.
Rubinsky then went on to briefly discuss examples of some
commercial products that exploit the presence of SGML structured
markup. He looked first at Dynatext (from Electronic Book
Technologies), which is able to turn a minimal SGML document into
a hypertext database, but can also show the hierarchical
structure of the document, search the document and offer
different views, display graphics etc. Rubinsky then discussed
SGML Search (from AIS/Berger Levrault), and showed how queries
and searches could be formulated on the basis of document
attributes. He then went on to demonstrate how the OWL Guide
hypertext system (from Office Workstations Ltd.) is able to
create/use reference buttons on the basis of structural markup in
the underlying SGML document. Unfortunately, Rubinsky was then
obliged to cut short his presentation due to time restrictions.
10. Distinguishing Intelligence from Formatting - Sam Wilmott
(Exoterica Corporation, Canada)
Wilmott gave a succinct account of the history of text markup,
recounting the development from typesetting codes through to
SGML. Wilmott claimed that SGML seems to be used with two goals
in mind: text interchange and/or capturing information.
On the subject of text interchange, Wilmott made three
observations. The ability of SGML to facilitate text
interchange has attracted increasing attention with a growing
need to interchange documents between different computer systems.
Using SGML as a basis for document interchange is conceptually
simple and highly effective. The widespread availability of
WYSIWYG software has removed the need for markup languages to
handle the large class of small, straightforward,
visually-oriented documents.
However, Wilmott claimed that use of SGML is now moving more
towards marking up the information that texts contain. He argued
that this was closely related to the need to develop documents
for use by hypermedia systems, which require a new approach to
capturing textual information. SGML not only enables essential
information to be identified by tags, it also allows the
development of detailed subject-specific notations which
permit a great deal of additional information to be represented
in a fashion that is both useful and economical.
Wilmott then went on to suggest that within a text, certain
information is conveyed in the structural relationships that
exist between items, and that this information can be captured
using SGML-based markup. Moreover, Wilmott claimed that "..it
is the smallest details that have the most complex inter-
relationships with other information: chapters (in the absence
of references to them from inside other chapters) are typically
just chapters, but acronyms can refer to almost anything";
consider also the cases of detailed notations for things such as
part numbers and bibliographic citations.
Wilmott also discussed the concept of `intelligent documents',
which he defined as any document "...that knows more about itself
than can be seen or deduced by simply looking at the printed page
or its computer screen equivalent". Thus hypertext documents
are `intelligent' because they contain enough information coded
within themselves to allow users to freely browse the text.
Wilmott then discussed the notion of variant and invariant
information within a text. "Invariant information is that which
can be deduced by examining the context of a unit of
information"; thus, in formatting terms, the typeface used to
set the text of paragraphs on a printed page will probably be
invariant. Similarly, invariant information about a text whose
logical structure has been marked up with SGML, might simply be
that every chapter will begin with a tagged chapter title.
The variant information, if I understood Wilmott correctly, would
be the text of the title itself. Wilmott pointed out that the
correct identification of variant and invariant information is
the "key to economical capturing of information", which is vital
for the generation of intelligent documents.
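In SGML terms the distinction might look like this (a sketch of my
own, not one of Wilmott's examples): the tagging pattern is the
invariant part, and the title text the variant part:

```sgml
<chapter>
<title>Every chapter starts this way; only this text varies.</title>
...
</chapter>
```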
Lastly, Wilmott talked briefly about information structure and
representation structure. For example, the separation of
footnote text from its initial reference point is due to the
author's wish to simplify the representation structure of the
document. However, the fact that the footnote text is closely
related to a particular point in the main text is part of the
document's information structure. Being aware of this
distinction is another way to ensure the efficient production of
intelligent documents (eg. hypertext).
Wilmott concluded by stating that ".. the use of markup languages
is the major tool at our disposal for embedding intelligence in
documents. Text markup languages provide the tools for
economical capture of the highly detailed information required by
the new information storage and access technologies."
(NB. Most of the quotations in the above text were taken from
Exoterica Corporation's document "Distinguishing Intelligence
from Formatting", which formed the basis of Wilmott's
presentation.)
11. Interaction with ODA Structure - Mark Bramhall (Digital
Equipment Corporation, USA)
Bramhall was the only speaker to address the topic of the
symposium with ODA (Open Document Architecture) rather than with
SGML in mind. I have only a very limited experience of ODA, and
so any mistakes in this report are probably due to my own
misunderstanding of Bramhall's remarks.
Bramhall briefly outlined the ODA Document Model, and defined the
concepts of `generic logical structure' and `specific logical
structure', the former being a definition of the logical structure
of a generic type of document, the latter being a particular text
which is an instance of that generic type. Similarly, he
discussed `generic layout structure' and `specific layout
structure', which exist in the same relationship to one another
as the two types of logical structure but are concerned instead
with document formatting. Bramhall mentioned the existence of
rules to control the transformation between specific logical and
specific layout structures, and also talked about the way ODA
separates structuring architecture (eg notions of pages, frames
etc.) from content architecture (eg. graphics, text etc).
Bramhall identified three modern information retrieval
techniques: Content-Based Retrieval (CBR) (which uses
data-directed links between pieces of information), Hyperlinks
(which are human-directed links between pieces of information),
and Relational Queries (which are data- and metadata- (ie.
attribute-) directed links that can be manipulated with
relational operators).
Bramhall then described the Office Document Interchange Format
(ODIF) standard, which enables ODA document instances to be
successfully exchanged. Bramhall noted that ODIF requires the
hierarchical structure of an ODA document to be mapped into a
sequential format using a block-structured (ie. scoped)
approach to ensure that the logic of the hierarchical structure
is maintained. Bramhall saw the main drawback to this approach
as being the need for a receiving system to completely
internalize the logical hierarchical structure of an ODA
document, before it can directly access a piece of the text.
This complete internalization is the only way a system can work
out which attributes are in effect for any given block.
Moreover, the actual content of an ODA document is stored
separately from information about its structure, which only
contains forward references to where the appropriate content
fragments can be found.
Bramhall stated that it had been recognized that the run-time
representation of an ODA document instance must support both
traditional and modern retrieval methods, processing operations
etc. Thus Bramhall suggested that a document should be defined
as a node (and all its sub-nodes) in a `representational document
database', ie. it should be regarded as a tree. A document
node can be any size, complexity, or data type and so could
theoretically consist of anything from a single word or video
clip, right up to the entire contents of a national library.
This approach means a document can consist of any number of
different data types, should its author(s) so wish.
Bramhall then discussed a diagram of a simple representational
document database, and showed how pointers could be used to
access various parts in the document's structural hierarchy. He
also pointed out that it would be reasonable to allow subordinate
(child) nodes to be shared between two or more superordinate
(parent) nodes within the same database; for example, a document
database may contain several sub-document nodes (eg. reports)
which have a particular subordinate node (eg. report
introduction) in common. This would mean that although the
structure of the overall representational document database was
no longer that of a tree, any particular instance of a sub-
document, such as a single report, would still have its own tree
structure. It is also important to note that whilst every node
`knows' what it contains (ie. its immediate subnodes (children),
a pointer to a content fragment, or nothing), each has no
knowledge of its superordinate (parent) node(s). (However this
does not mean that a superordinate (parent) node cannot be
determined, only that such information is not stored in any given
node.)
Bramhall felt that the advantages of this approach were that by
treating all nodes as documents, a common set of document
operations can be applied to any node. Furthermore, a document
(node) will always `look' the same no matter how one arrives
there, and the tree structure of a document instance will allow
recursive techniques to be applied.
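Bramhall's model can be sketched in a few lines of code. The
following Python fragment is purely illustrative (the class and
method names are my own, not Bramhall's): each node records its
immediate subnodes or a content fragment, but nothing about its
parent(s); a subordinate node may be shared between two
superordinate nodes; and a recursive operation such as collecting
all text applies uniformly to any node, whether it is a single
paragraph or a whole sub-document.

```python
# Illustrative sketch of a "representational document database":
# each node knows its children (or its content fragment) but has
# no knowledge of its parent(s), and child nodes may be shared.

class Node:
    def __init__(self, node_type, content=None, children=None):
        self.node_type = node_type      # eg. "report", "section", "paragraph"
        self.content = content          # a content fragment, or None
        self.children = children or []  # immediate subnodes only

    def text(self):
        """Recursively collect content; the same operation works for
        any node, from a single paragraph to an entire document."""
        if self.content is not None:
            return self.content
        return " ".join(child.text() for child in self.children)

# A shared introduction node: two report sub-documents both point to it,
# so the database as a whole is no longer a tree, but each report
# instance still has its own tree structure.
intro = Node("introduction", content="Scope and purpose of this work.")
report_a = Node("report", children=[intro,
                Node("paragraph", content="Findings of report A.")])
report_b = Node("report", children=[intro,
                Node("paragraph", content="Findings of report B.")])
database = Node("database", children=[report_a, report_b])

print(report_a.text())
```

Because every node "looks" the same however one arrives at it, the
single `text` operation above serves equally for a paragraph, a
report, or the whole database, which is the uniformity Bramhall was
arguing for.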
Bramhall argued that a representational document database will
support modern information retrieval techniques. The fact that
each node `knows' about itself, means that relational queries can
constrain requested data in a number of ways (eg. extract only
section titles, authors' names etc.). Moreover, Content-Based
Retrieval (CBR) indexes can be created for a document whenever
convenient (ie. as a batch job after the document has been
created or modified), and hyperlinks can follow the connections
between nodes in the representational document database.
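The retrieval operations described here reduce to traversals of the
same node tree. The sketch below is again hypothetical (the function
names and node types are mine, not from the talk): a relational-style
query that constrains results by node type (eg. extracting only
titles), and a batch pass that builds a simple content index after a
document has been created or modified.

```python
# Hypothetical sketch: type-constrained queries and batch index
# construction over a node tree (names are illustrative only).

class Node:
    def __init__(self, node_type, content=None, children=()):
        self.node_type = node_type
        self.content = content
        self.children = list(children)

def select(node, wanted_type):
    """Yield every node of the requested type, eg. all titles."""
    if node.node_type == wanted_type:
        yield node
    for child in node.children:
        yield from select(child, wanted_type)

def build_index(node, index=None):
    """Batch job: map each word to the nodes whose content contains it."""
    if index is None:
        index = {}
    if node.content:
        for word in node.content.lower().split():
            index.setdefault(word, []).append(node)
    for child in node.children:
        build_index(child, index)
    return index

doc = Node("report", children=[
    Node("title", content="Retrieval and ODA"),
    Node("section", children=[
        Node("title", content="Introduction"),
        Node("paragraph", content="ODA documents.")])])

titles = [n.content for n in select(doc, "title")]
print(titles)  # ['Retrieval and ODA', 'Introduction']
index = build_index(doc)
```

Because each node "knows" its own type and content, the query never
needs to consult parent nodes, and the index can be rebuilt whenever
convenient without touching the tree structure itself.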
Bramhall concluded by stating that any ODA-compliant application
must be able to read and write ODIF for interchange, even though
ODIF is not a suitable run-time representation for a document
that needs to be
processed. He then restated his belief that a representational
document database seems the best way to facilitate CBR,
hyperlinks, and relational queries.
(NB. In order to write this section of the report, I drew
heavily on "ODA-Structured Information and Modern Retrieval
Techniques" by Mark Bramhall.)
12. Panel Session - Various Speakers (Chairman: Joshua
Lederberg, Past-President of Rockefeller University, USA).
The Chairman began by expressing his own concerns and opinions on
the topic for the panel session, which was "The practical
evolution of retrieval techniques in enterprise wide document
databases". He then called upon each of the panel members to
make their own comments on the topic, after which the discussion
was opened up to the floor. (Unfortunately, I was not able to
hear the names/positions of each of the panel members, and I hope
they will forgive me if their remarks remain unattributed).
Susan Hockey, the new Director of the Centre for Electronic Texts
in the Humanities (Rutgers University, USA), was the first
speaker. She questioned the use of the term "practical" in the
topic title, pointing out that users would not be concerned with
how the desired results are achieved, but whether they are
achieved. In order to be perceived as a successful I.R. system,
any search tool must be able to find the `right' information,
which in turn will depend upon the structure of the document(s)
being searched.
Hockey pointed out that in her previous role working with the
TEI, the decision to use SGML had been taken for a number of good
reasons. However, she was aware that SGML had certain inherent
problems, and that it worked best with hierarchically structured
texts, and texts which have only a single hierarchy. Taking the
example of Macbeth mentioned earlier in the day, and plays in
general, Hockey stated that any attempt to consider the physical
printed version of such a text necessarily involves imposing a
different structural hierarchy (vis a vis a consideration of the
content of the text) and some implementations of SGML cannot
handle multiple concurrent hierarchies of a text. This was one
area she would like to see changed in the SGML standard.
Hockey said that the TEI had also been faced with the
intellectual problem of what is meant by "emphasized text",
"proper noun" etc., or rather, what exactly should be encoded
within such tags. Their solution had been to propose the
inclusion of detailed header information, which allows users to
explicitly state the approach to marking up a text that they have
adopted.
Hockey stated that research in computing and humanities has
traditionally been based upon the use of corpora (marked up
collections of existing texts). However such research has been
unable to provide a means for answering typical humanities
questions, such as "How much was author X's work influenced by
author Y?" To even attempt to answer such a question, involves
areas such as natural language parsing and processing, developing
I.R. tools that can utilise textual markup, etc. The TEI has
been concerned with all these aspects, which have so far led to
the production of a very large Document Type Definition (DTD).
Hockey suggested that whilst structural markup using SGML tags
can enhance information retrieval (I.R), there is also a need for
lexical- and knowledge-based databases. She alluded to work in
the field of computational linguistics, to convert dictionaries
into machine-readable concept/knowledge-based forms which can
then be used by I.R. tools.
Hockey proposed that the user interfaces being developed must act
as users expect, and give them what they want. If possible, such
interfaces should model human cognitive concepts when approaching
texts. Hockey said that this anticipated the sort
of work she would like to see being carried out at the Centre for
Electronic Texts in the Humanities.
Another panel speaker commented that, traditionally, paper-based
documents had always been filed very carefully within an office
environment with the prospect of retrieval always in mind.
However, the current trend is towards ever larger quantities of
information now being stored electronically in on-line databases,
on CD-ROMs and so on. The sheer volume of such information makes
the practicalities of its sensible storage and easy retrieval
that much more complex. He felt that the future growth of
information should always be kept in mind by developers of new
retrieval techniques and document databases.
Bob Futrelle (Northeastern University, USA) was the next speaker.
He outlined work being carried out at the Biological Knowledge
Laboratory, where a team are experimenting with tagging
approximately 10,000 texts on biological subjects. Their
experience had been that about 30% of the biological `text'
actually consisted of non-textual data such as tables and
diagrams. Moreover, the team did not want to simply link-in
graphics with their texts, but were attempting to find some way
of parsing images to provide information that could be stored and
used as a basis for cross-referencing, user-defined searches, and
so on.
The last panel speaker was Don Walker (Bellcore, USA) who chose
to discuss some examples of current on-going research in the
areas of retrieval techniques and document databases. All the
examples he discussed were text-based, and had adopted an
approach using SGML (or similar markup), a database, and a
variety of I.R. tools. Walker spoke about the activities of the
ACL Data Collection Initiative in particular, and gave a brief
`plug' for a CD-ROM of text data that they had recently released
(although Walker regretted the current paucity of the SGML markup
used, and said this would be corrected in the next release).
Various points were raised in the open discussion which followed.
Accepting Hockey's plea for transparent user interfaces to
complex document databases and I.R. tools, one attendee suggested
that (more proficient) users ought to be able to ask an I.R.
system why it had retrieved certain pieces of information.
Another participant made a general plea for systems which could
automatically recognize the logical structure of a text by using
OCR techniques as a basis for interpreting the physical layout and
formatting of an existing document. Such a system could, for
example, recognize footnote text by its location on a printed page and
not only tag it as such, but ideally also insert the appropriate
cross-reference between the point in the main body text and the
related footnote. Other examples include the recognition of
boxed text in a magazine article as being in some way
significant, or the ability to automatically join two pieces of
text separated by unrelated material (eg. to understand the
meaning of such phrases as "continued on page 145"). The comment
was made that Avalanche's FastTag software (with its Visual
Recognition Engine) goes some way towards meeting such needs.
Other participants stated that their experiences with OCR
software had highlighted the difficulties of handling anything
other than plain monospaced type (such as diagrams, tables, fancy
proportionally spaced fonts, etc.).
Scanning and tagging existing texts, with the additional
complexity of automatically inserting hypertext links, was also
discussed. This led to a brief discussion about the
difficulties of automatic indexing and of indexing from different
perspectives.
Several participants expressed the opinion that this symposium
had been addressing very important issues that had barely been
touched on at the recent SIGIR meeting. There, focus had largely
been on how to achieve the successful resolution of queries, and
little attention was paid to how the nature of the markup used in
the search texts might affect this.
I found this symposium very rewarding. The day was well-
structured and a number of important issues were raised along
with a variety of suggestions and solutions for coping with some
fundamental problems. Everyone seemed to find the exchange of
experiences and comments very useful.
I think the success of the meeting can be judged by the fact that
all the attendees urged CASIS and/or CID to co-ordinate more such
gatherings in the near future.