TEI-L@LISTSERV.BROWN.EDU


TEI-L  April 2000


Subject:

ontology/lexicon/kr lists

From:

Patrick Cassidy <[log in to unmask]>

Reply-To:

[log in to unmask]

Date:

Sat, 22 Apr 2000 23:31:40 -0400

Content-Type:

text/plain


                                                        April 22, 2000

    The following note contains a follow-up to some discussions
held at the meeting of the Association for Computational
Linguistics (ACL) last year, and is now being brought to the
attention of a wider group.  This is being sent to a number
of different listservers, as well as the membership of the ACL,
and I apologize for what will inevitably be some duplication.
    Please send all comments directly to me.

    Best regards,
    Pat

=============================================
Patrick Cassidy
MICRA, Inc.                      || (908) 561-3416
735 Belvidere Ave.               || (908) 668-5252 (if no answer)
Plainfield, NJ 07062-2054        || (908) 668-5904 (fax)
internet:   [log in to unmask]
=============================================

To:    Members of the Association for Computational Linguistics
           and others with an interest in knowledge representation,
           lexicons, and lexical semantics
From:     Patrick Cassidy ([log in to unmask])
Subject:  A Request to Participate in a Study of the Utility of a
           Standard Ontology and Lexicon for Natural Language
           Understanding (NLU) and database interoperability

==============
Background
==============
    In recent years there has been a great deal of effort in
building lexicons, ontologies, and terminologies, both for the
purposes of basic research and for practical applications.  The
advantages of common formats and common content to allow reuse
of results between groups have been widely recognized, but the
practical funding situation has required in most cases that
individual groups focus on relatively narrow aspects of the
general problem.   Efforts have also been underway for years
within and between a number of groups to develop common
resources to promote interchange of data and to compare results,
and to reference and organize the results of the many groups
who have prepared valuable resources.  These very valuable
projects have helped mitigate the difficulty of preparing and
finding useful ontologies and lexical resources. However,
there is still little prospect that these multiple
projects will lead in the near future to a unified common
ontology and lexicon that has sufficient detail and
functionality to be adopted by a large number of groups as
a reference standard, and which can be used directly without
substantial modification for a variety of purposes
in research and practical applications.  Of special value
would be the development of a common defining vocabulary of
concepts and associated words and relations that would be
sufficient to define all of the specialized concepts and words
used in applications.  The ability to use a common vocabulary
to define the concepts and words in diverse applications will
provide a level of interoperability unavailable by any
other means, except for one-by-one coordination between
projects.   The question arises whether it is now possible
to build on the large body of existing data and experience,
to construct such a reference standard within a
tightly coordinated single project.  The goal will be to create
a database that is as inclusive as possible of all of the
results and intuitions resulting from previous research and
development efforts, and to include as many as possible of the
current practitioners within the project to build this resource.
    The main problem is that development of a basic but
realistically large ontology and lexicon for Computational
Linguistics research will require a project to coordinate a
group -- probably a consortium of dispersed academic and
industrial participants -- of a size that will require substantial
funding.  Though large by the standards of most NLP research
projects, such a coordinated effort would still be modest by
comparison with funding for important research tools in other
areas of science, such as space probes, particle accelerators,
or telescopes.  Skepticism about the possibility of congressional
funding for such a project is understandable, but there is ample
precedent for obtaining special congressional funding of tools
for research.  What is needed is to show that the costs will be
repaid by the usefulness of this database both for research
and for construction of advanced applications.  At a minimum
there should be a survey to identify the potential users of
a standard ontology and lexicon.  In the eventuality that
special congressional funding could not be obtained, this
will still be useful to help move toward building common
resources by other means.

    At the annual meeting of the ACL in Maryland in June 1999
I helped organize a "birds-of-a-feather" meeting to discuss whether
there is at present a need and an opportunity to build a large
but basic ontology and lexicon for use in NLU research and
applications.  Among the 23 who participated in the discussion, most
had expended some effort building lexicons and ontologies for
natural language understanding, but some members were present
who had not themselves participated directly in such efforts.  We
spent over an hour discussing mostly the technical question of what
kind of ontology could be useful for natural language understanding,
and the political questions of whether it would be practical to
attempt to get agreement at this time among ontology developers
with different views of how to proceed.  The view was almost
unanimous that such a project should be attempted, though it was
recognized as technically and organizationally complex.  There was
also a large degree of skepticism as to whether we could convince
Congress to fund such a large project.  We had hoped to be able to
have a wider discussion among the general membership of the ACL,
but as it turned out the general business meeting ran well over its
allotted time, and when I raised the issue there was no time for
discussion, so a motion was made and passed that I should form a
committee to study the question and report back to a future meeting.
This note is the first request for participation in such a committee.
    The question of construction of a reference ontology for
Computational Linguistics and for database interoperability has
already been discussed over several years within the ANSI T2 ad
hoc committee on ontologies.  That ad hoc committee is no longer
actively meeting, and this note and its suggested formation of
a study committee is in part an attempt to fill the void left
by discontinuation of those discussions.   One of the conclusions
of those discussions was that substantially increased funding
would be needed for a coordinated effort, in order to move the
development of useful ontologies beyond the current stage in which
each isolated group pursues its own ideas, which are generally
incompatible with or very difficult to merge with those of other
groups.  The present note is intended to bring the issues
addressed by the T2 committee to a wider group, and to form
a committee that can develop objective information that would
provide justification for the substantial funding needed for a
unified project.

   As mentioned, the complexity and size of such a project, which
would require a tightly coordinated effort with funding substantially
larger than a typical CL research project, makes it likely that
special funding would have to be obtained directly from Congress.
To obtain such funding it will be necessary to show that there is
a significant group of established researchers who have been active
in building lexicons and ontologies, and who believe that building
a standard reference is technically feasible at present, and that
such a reference would be used widely enough to justify the expense.
One can find expressions of such a belief in private conversations
and in published papers, as well as in the existence of research
efforts to build common lexical and ontological resources.  To begin
the process of developing a well-organized proposal that can be
considered seriously by Congress, what is needed is a more formal
study to present the findings of a broadly representative group
rather than of an individual or single research group.  This request
for participation in this study is only a first step in developing
such a proposal.
    The specific purposes for organizing this committee and the
subjects for discussion are:
(1) to determine the general characteristics of an ontology and
lexicon that would incorporate as much as possible of the results
and insights of those who have already spent many years doing research
on lexicons, ontologies, knowledge representation, terminologies,
and lexical semantics, and would be broadly useful for both research
and applications; and
(2) to estimate where and to what extent such a database, if built,
would in fact be used.  Quantitative data about potential
areas of use would be especially valuable, to demonstrate that
construction of such a database would be worth the cost.

    The structure of this committee is open to discussion.  I would
suggest that anyone with experience in any of the relevant fields
should be able to vote on any proposals for which a measurement of
opinion is needed, and those individuals wishing to participate as
voting members should inform me of that before the end of May.
Discussions will be conducted by e-mail (I will forward comments to
a list of interested persons), unless someone is willing to set up
a listserver for this purpose (perhaps an existing listserver should
be used?).  Individuals willing to prepare a report of the potential
uses of a defining ontology/lexicon in specific areas of research
or in applications would receive and summarize copies of any data or
suggestions relevant to their area, sent from any interested person.
The number of possible summaries is not limited, but will probably be
small.  Any individual is free to make any comments, and all comments
received will be forwarded to anyone wishing to receive them, unless
they are specifically intended not for distribution.  I do not
anticipate that at this stage any degree of agreement could
be reached about any details of the structure of a common ontology
or lexicon, but some summary could be prepared of the various
alternatives that might be suggested.  I hope that at the NAACL-2000
meeting in Seattle in the first week of May, some preliminary
indication could be obtained about how many individuals would be
willing to participate as voting members and/or report writers.
I do not have a fixed timetable in mind, but probably three months
will be sufficient time for interested parties to determine
potential uses and send in comments.  The timing of subsequent
actions will depend on the wishes of the voting members of the
committee.  All persons interested in this project in any way
should contact me by e-mail ([log in to unmask]) or telephone
(908-561-3416).  Suggestions about how to organize an
informal study of this type would also be welcome, but need to
be sent soon to be useful.

    It will be worthwhile to include in this study a summary of
all ontological and lexical resources currently available, and
I hope that some representative of every group that has built
any form of ontology, terminology, or other lexical resource,
which is now available to the public or might become part of a
common reference ontology/lexicon, would send me a brief summary
of their projects and a reference to the location of any existing
data available publicly.  There are already several web sites
on which pointers to the locations of such resources are listed,
and the owners of those sites and those who have prepared other
lists of available resources are encouraged to send a copy of
the lists they have already prepared.  The complete summary of
references to such resources submitted will be published as
part of the report of the committee.

    The data that are most needed to determine potential utility
of a reference database will be estimates of how much such a
common ontology or lexicon would be used.  For this purpose, anyone
who would be likely to even try using it should send a note indicating
the type of system in which it would be used and how it would be
used, and how much more efficiently the system might function.
I would expect that anyone currently using an ontology or semantic
network would want to try such an ontological lexicon, and if there
are those who would not try it, the reasons for this skepticism
will probably serve as useful input.
   One of the important questions to be answered is whether one can
estimate potential utility in quantitative terms, and if so, how.
The likelihood of the ontology being used in one's own system
may be expressed in any way, but at least three levels can be
distinguished: (1) those who would be willing to participate in
construction of such an ontological lexicon; (2) those who would
be likely to adopt a standard ontology or lexicon, if it existed;
and (3) those who would try using a standard ontology or lexicon,
to test its utility.

    Descriptions of potential commercial uses would be especially
valuable for convincing Congress that funding is justified.
For example, estimates have been made that electronic commerce
over the internet will amount to 425 billion dollars by 2001 (IEEE
Intelligent Systems, Jan/Feb 1999 "Let's Go Shopping" by Michael
McCandless, pp. 2-4).  Labor costs in sales transactions tend to run
about 10%, so the costs of executing those transactions would be
about 40 billion dollars.  If these costs could be reduced by 1% due
to efficiencies generated by the use of a standard knowledge
representation scheme, those cost savings would amount to 400
million dollars per year. The total cost of the development of such
an ontology would then be paid back in less than 6 months.  One can
make similar estimates for other activities which use advanced
computer programs, and find similar likely savings.  Thus even a
minuscule improvement in the efficiency of computer programming
or the use of computer programs would appear to make this project
cost-effective.  However, estimates of this type will be far more
convincing if there are those involved in the development or use of
programs which have or should have semantic elements, and who
could provide more accurate and objectively-based estimates for
specific examples.
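
    The payback estimate above can be checked with a few lines of
arithmetic.  The sketch below is only illustrative: the 425 billion
dollar volume, the 10% labor share, and the 1% saving come from the
text, while the project cost of 150 million dollars is a hypothetical
placeholder, since no cost figure is stated here.  It also shows that
a payback time under 6 months implies a total cost below roughly
200 million dollars.

```python
# Figures quoted in the text (US dollars, integers to avoid rounding noise).
ECOMMERCE_VOLUME = 425_000_000_000   # projected internet commerce by 2001
LABOR_PERCENT = 10                   # labor cost as a percentage of sales
SAVINGS_PERCENT = 1                  # assumed reduction in those labor costs

# 10% of 425 billion is 42.5 billion; the text rounds this to "about 40 billion".
labor_cost_exact = ECOMMERCE_VOLUME * LABOR_PERCENT // 100
LABOR_COST_ROUNDED = 40_000_000_000

annual_savings = LABOR_COST_ROUNDED * SAVINGS_PERCENT // 100   # 400 million


def payback_months(project_cost, savings_per_year):
    """Months needed for yearly savings to repay a one-time cost."""
    return 12 * project_cost / savings_per_year


def max_cost_for_payback(months, savings_per_year):
    """Largest one-time cost that is repaid within the given number of months."""
    return savings_per_year * months // 12


# Hypothetical project cost; the text does not state one.
ASSUMED_PROJECT_COST = 150_000_000

print(labor_cost_exact)                                      # 42500000000
print(annual_savings)                                        # 400000000
print(payback_months(ASSUMED_PROJECT_COST, annual_savings))  # 4.5
print(max_cost_for_payback(6, annual_savings))               # 200000000
```

Using the unrounded 42.5 billion dollar labor cost instead would raise
the annual savings to 425 million dollars and shorten the payback time
slightly.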
    In the best case, an industrial group that maintains a database
that already uses an ontology to enhance its functionality
might estimate, for example, that an ontology of the type
described would likely improve the efficiency of the program
by, say, 5%.  This number, multiplied by annual sales of the
program, could provide a crude estimate of economic benefit.
There are several obvious difficulties in making such estimates,
starting with the fact that we don't know what the final
database will look like.  But even very crude estimates from
people familiar with a potential use will be better than wild
guesses from those with little familiarity.  Groups which
have already built an ontology or a semantic lexicon can review
the costs of development of their own system and determine, if
a common ontology would be useful, the direct cost savings that
would occur in adopting a standard ontology rather than constructing
an enhanced version of their own system.

    Even without an economic justification of that type, building
this database should be justifiable even if it is used primarily as a
research tool.  Accordingly, I hope that we can obtain comments
from all individuals who would be likely to use such a tool in their
research or in building applications, as well as those who wish to
comment on the desirable structure of such a database.
    I plan to organize a birds-of-a-feather meeting at the
upcoming NAACL-2000 conference in Seattle (April 29-May 3) where
those who are willing to consider serving on this committee can meet,
and discuss questions of form and substance of a study such as this,
as well as any comments that have been received at that point.
Accordingly, responses should be sent to me by e-mail if possible
before the 27th of April, or they can be presented and discussed
at the meeting in Seattle.  This study will continue for at least
three months, so comments will be welcome and are likely to be
valuable after the meeting as well.
    In the discussions I had concerning this topic with other
attendees at the 1999 ACL meeting, the first question was of course
what type of ontology is being proposed.  The general structure as
well as detailed technical questions can only be resolved in the
course of preliminary discussions among those who will participate
in the construction of the database, as well as in the construction
phase.  But for the sake of discussion, I have described below some
characteristics that will likely need to be included in such a
database.  The final form of the ontology, if it is to be useful for
Computational Linguistics, will have to include substantial lexical
knowledge, or will have to be tightly integrated with lexicons built
separately.  Rather than call it an "ontology" it might better be
referred to as an "ontological lexicon," although there should be a
core conceptual component in the ontology which will be language-
neutral.  One of the purposes of formation of this committee is to
obtain a wider range of comments concerning desiderata for the
structure of such a database.
    In addition to questions about how such an ontological
lexicon would be structured, many at the ACL meeting had other
questions.  I have reproduced below most of the questions that were
asked, and indicated some potential answers.  It may well be that
nothing suggested here will ultimately find itself accepted
unchanged in the final result of construction of this database, but the
important issue is that construction of some such database will be
essential to provide a common tool that will permit more effective
widespread collaboration in research toward human-level
understanding and generation of language.

========================================
What Kind of Ontology is Being Proposed?
========================================
    What is being discussed here is the need for a database having
two main components: (1) an upper ontology of fundamental concepts,
represented in logical format, which are sufficient to serve as the
building blocks for construction of all of the more complex concepts
that are used in any given field; and (2) a basic lexicon of defining
words, in which the word meanings are represented using the same
set of fundamental concepts, and which are sufficient to define
all of the words of the language.  Each word in the lexicon will
also have an associated definition using the defining vocabulary,
which will in some cases look like an ordinary dictionary definition.
Over time, both the ontology and lexicon can be expanded to
include more specialized or less common concepts, but the main
goal for the initial phase should be to specify the minimum set
of defining concepts, semantic relations, and axioms for the
ontology, and the minimum set of defining words for the
associated lexicon.
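
    As a rough illustration of the two components described above,
the sketch below pairs a tiny ontology with a tiny lexicon and checks
whether each definition stays within the defining vocabulary.  All
names here (Concept, LexicalEntry, words_outside_vocabulary, and the
sample words) are hypothetical, and real definitions would need proper
tokenization and word-sense handling rather than a whitespace split.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A node in the upper ontology (hypothetical structure)."""
    name: str
    parents: list = field(default_factory=list)  # names of more general concepts


@dataclass
class LexicalEntry:
    """A word linked to a concept, with a definition in plain words."""
    word: str
    concept: str      # name of the concept this word sense labels
    definition: str   # should use only words from the defining vocabulary


def words_outside_vocabulary(entry, defining_vocabulary):
    """Return the definition words that are not in the defining vocabulary."""
    tokens = entry.definition.lower().split()
    return sorted({t for t in tokens if t not in defining_vocabulary})


# Toy data, invented for illustration only.
ontology = {
    "Animal": Concept("Animal"),
    "Dog": Concept("Dog", parents=["Animal"]),
    "Puppy": Concept("Puppy", parents=["Dog"]),
}
defining_vocabulary = {"a", "young", "dog", "animal"}

puppy = LexicalEntry("puppy", "Puppy", "a young dog")
kennel = LexicalEntry("kennel", "Kennel", "a shelter for a dog")

print(words_outside_vocabulary(puppy, defining_vocabulary))   # []
print(words_outside_vocabulary(kennel, defining_vocabulary))  # ['for', 'shelter']
print(kennel.concept in ontology)                             # False
```

A check of this kind would flag "kennel" twice: its definition uses
words outside the defining vocabulary, and its concept has not yet
been added to the ontology.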
    This description evades some controversial issues regarding
what constitutes "words" and "definitions".  It is understood that
many polysemous words have vague or plastic meanings, dependent
on context, and for such words an exhaustive list of meanings
cannot be specified; and many words cannot be defined by necessary
and sufficient conditions.  All that can be recorded in a
database of this kind are the necessary characteristics of
word meanings, and perhaps some markers indicating when variations
in meaning can be expected in linguistic usage.  This will be
an attempt to record as much as can be agreed on about basic words
and concepts at the present state of the field.  Applications that
need to handle ill-defined words will need additional structure
beyond what can be included in a standardized lexicon.
    The conceptual component of this database would be equivalent
to an "upper ontology" or "top ontology" (although this term is
used by different people to indicate ontologies of somewhat
different sizes).  Specifying the meanings of words using a basic
ontology of this type constitutes in effect a theory of the
meanings of the words.  A realistic lexicon will need to include
not only single words, but fixed collocations and probably also
word combinations that are not normally considered idioms but
have some non-compositional character.  The lexicon can include
not only the word meanings in logical format, but any other
data associated with word meaning or usage which is useful
for applications.  For example, in addition to part-of-speech
or etymological data, the lexicon could include verb case frames
which would be duplicative to some extent of data in the verb
definitions, but in a different format, perhaps easier to use for
some purposes.  Statistical data on word associations would be
another useful component.  Though not essential, it could be
easily included when available.

    Specifics of what will be included and how the data will be
structured can only be decided by those participating in the
construction of the database; the remaining comments in this
section are personal suggestions, which may not be adopted by
the project participants.

   The conceptual elements in the ontology will be defined in a
logical format, but there are two principles which could make
the database more widely acceptable and easier to use:
(1) concepts which are not lexicalized in any language as
single words or fixed collocations can be included in the
ontology, but should be used only where there is some cogent
need; and all concepts in the ontology will have an associated
definition in some language (usually English).  (2) Ideally
there will be a "definition parser" that can take such a defining
string and produce the logical structure that it is intended
to define.

    The emphasis in this project is on the most general words and
concepts, so that a common defining vocabulary of concepts can
be developed which, if used for defining terms in specific
applications, will allow some significant level of conceptual
communication between applications developed by independent
groups.  Applications that process complex information but
are not required to understand linguistic phrases, such as
database applications or electronic commerce, can use the
ontology, and in theory could ignore the lexicon.  Linguistic
applications would use the lexicon, and, if any level of
conceptual understanding is required, would also
use the word definitions in logical format, which will usually
also require the use of the basic ontology.  (In some cases
a linguistic application may use the lexicon and associated
definitions with minimal reasoning, and the lexicon would
function in such cases as a thesaurus or simple semantic
network, such as WordNet).


   Different ontologies have already been developed by a number
of different groups for various purposes, but in general their
structures are so different that transferring information from
one system to another is very time-consuming or error-prone.
The difference between this ontological theory and others which
have been proposed thus far lies mostly in the size of the database
and the extent to which it will both include and represent a
consensus of the different theories (i.e., ontologies and lexical
semantic representations) that have been developed thus far by
independent groups.  What would be very useful for both research
and applications development is to have at least one well-developed
defining vocabulary freely available to all potential users,
constructed by representatives of most or all of the existing ontology
and lexicon groups and containing as much as possible of the
compatible information which each of these groups could contribute
to a common effort.  In addition to the core database, user interfaces
and applications programming interfaces should be developed, as an
integral part of the project, to make the database as easy as possible
to learn and use.
    The representations of the concepts, and through them the
meanings of words, will need to be specified ultimately at a logical
level that will allow automatic reasoning.  The existing Knowledge
Interchange Format (KIF) and Conceptual Graphs (CG) standards could
serve as well-defined theory-neutral formats for storing the meaning
representations.  To be useful for computational linguistics, a
considerable amount of lexical information should also be included.
This distinguishes the proposed database from that of CYC, which
placed primary emphasis on utility in reasoning.  Another important
distinction is that the database must be public domain or at least
freely and easily available over the internet for research, such
as is the WordNet system.  Without the free availability to any
potential research or applications group, developing the
necessary agreements between groups may be impossible, and most of
the utility will be lost.
    The ontology that will emerge from such a project will most
likely have some variant of the typical structure of a set of entities
connected by relations, since this is the basic model of meaning
representation which has been universally adopted, though with
some significant differences between implementations.  The
relationships may be thought of as semantic relations or as axioms
of the ontology, but it is understood that to be useful for reasoning
the semantic relations must be defined with sufficient precision that
the logical implications of one entity having a specific relation to
another can be calculated unambiguously.  Although in many ontologies
the hierarchy has received the most attention, it is equally important
that the semantic relations be fully agreed upon and well-defined.
The set of basic concepts and semantic relations needed will be
those which are necessary and sufficient to provide logical
definitions of any of the concepts, and by extension, words,
which will be used in applications.  In effect, what is needed is
to create a dictionary with definitions of the words, and a parallel
ontology with the same definitions expressed in a logical format
suitable for automatic reasoning.  The lexicon that labels the
concepts of the ontology should include all of the basic words that
are needed to define all of the other words of the language; the
"words" of the language must eventually include all collocations
which are to any degree non-compositional, that is, whose meanings
cannot be deduced as a predictable combination of the meanings of the
individual component lexical strings.
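
    To make concrete what it means for the implications of a relation
to be calculated unambiguously, here is a minimal sketch that stores
facts as (entity, relation, entity) triples and derives new facts only
for relations declared transitive.  The relation names, the sample
facts, and the choice to treat only is_a as transitive are assumptions
made for illustration, not part of any existing ontology.

```python
# Relations whose meaning licenses chaining: (a r b) and (b r c) imply (a r c).
# Declaring only "is_a" as transitive is an assumption for this sketch.
TRANSITIVE = {"is_a"}


def infer(triples):
    """Return the triples closed under transitivity for TRANSITIVE relations."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, rel, b) in list(facts):
            if rel not in TRANSITIVE:
                continue
            for (c, rel2, d) in list(facts):
                if rel2 == rel and c == b and (a, rel, d) not in facts:
                    facts.add((a, rel, d))
                    changed = True
    return facts


# Toy facts, invented for illustration only.
facts = {
    ("puppy", "is_a", "dog"),
    ("dog", "is_a", "animal"),
    ("dog", "likes", "bone"),
}
closed = infer(facts)

print(("puppy", "is_a", "animal") in closed)   # True
print(("puppy", "likes", "bone") in closed)    # False
print(len(closed))                             # 4
```

Because "likes" carries no declared inference rule, nothing is derived
from it; two systems sharing the same relation definitions would reach
the same four facts from the same three inputs.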
   The lexicon cannot at the initial stage be comprehensive, but it
should also contain those common collocations, such as those which
are produced by the lexical functions of Mel'cuk, which are either
essential for generation of fluent colloquial language, or so
commonly used that their inclusion will improve the speed or
accuracy of the language understanding process.
    As a practical matter, to demonstrate the potential uses of
such an ontological lexicon and to facilitate development of a user
interface that will permit widespread use, there should be a detailed
implementation of this basic defining vocabulary to define
specialized concepts in at least two different areas.  Two that come
to mind are, for example, the medical area, where the basic defining
vocabulary could be integrated with the UMLS system and its
metathesaurus; and the military area, where significant effort has
already been expended to apply the CYC ontology.  These two are
by no coincidence areas of interest to governmental agencies.
Integration with other specialized ontologies or lexicons might be
proposed and performed by individual groups as part of the project.
Enterprise models, manufacturing, electronic commerce or planning
ontologies would be additional candidates.
    The primary motivation for developing a common theory of
meaning is to allow a greater degree of re-use of research results in
computational linguistics, as well as more direct communication
between different implemented systems which have a linguistic or
conceptual component.


============================================
Why do we need a common defining vocabulary?
============================================
   Any difference between two systems in the internal representation
of words or concepts must inevitably lead to some difference in the
inferences that the two systems make from the same data.  Thus
without some common basis for defining the meanings of the different
concepts used in different systems, the transfer of knowledge
between systems will be impossible, time-consuming, or
highly error-prone.  The need for a common vocabulary of defining
concepts is felt not only in the field of natural language
understanding, where communication is the primary goal, but also in
other fields of Artificial Intelligence, wherever conceptual
information painstakingly entered into one system could be useful
in another system.

    It is clear that in some areas of research in Natural Language,
semantic representation of word meanings is less important than in
others.  Research in speech-to-text conversion, for example, and in
parsing methodologies, has progressed without the use of semantics.
Statistical methods have also been shown to be useful for some
practical purposes, though the extraction of the meanings of texts is
beyond the capabilities of such a methodology by itself.  It is also
true that groups doing research with systems which will not interact
at a conceptual level with other systems have a great degree of
freedom in choosing representations of meaning which may be
suitable for their purposes even if not usable in other systems.  We
would hope that groups whose research does not immediately
require detailed semantic representation of meanings will
nevertheless recognize its importance for the progress of research in
language understanding, and not raise objections to this project
unless the objections address the feasibility of the goal.
    The developers of an ontological lexicon will be those
groups working specifically on methods to represent word
meanings, but the need for a common representation of meanings of
words and texts is felt directly also by those whose research involves
some level of understanding, such as in information extraction,
message understanding, word sense disambiguation, text categorization,
machine translation, and database interoperability.
    The difficulties caused by a lack of common conceptual
representations impact not only NLU and the database and expert
systems that CYC has been applied to; they affect many areas of AI.
In a recent issue of the IEEE Intelligent Systems (January/February
2000) several commentators discussed the state of AI and some of
those comments reflect this problem indirectly:
Nils Nilsson commented that "AI shows all the signs of being in
what the late Thomas Kuhn called a pre-paradigmatic, pre-normal-
science stage.   It has many ardent investigators, arrayed in several
camps, each claiming to have the essential approach to intelligence
in machines. . . .  It might be that intelligence is the kind of
multiplex for which no single science or paradigm will ever emerge."
Donald Michie stated: "The most notable nontrend [in AI] has
resulted from consistent disregard of the closing section, Learning
Machines, of Turing's 1950 paper. A two-stage approach is there
proposed:
1.  Construct a teachable machine.
2.  Subject it to a course of education.
   Far from incorporating Turing's incremental principle, even the
most intelligent of today's knowledge-acquisition systems forget
almost everything they ever learned every time their AI masters turn
to the next small corner of this large world."
A common basis for representation of knowledge will help to
overcome these problems, and help to move more toward the normal
scientific paradigm, enabling more rapid advances by allowing
investigators to investigate the same phenomenon and compare
details of results more directly.  In computational linguistics
research, having at least one common detailed theory of word
meanings for the defining vocabulary will provide a powerful tool
for progress toward the ultimate goal of human-level language
understanding.

===============================================================
Wouldn't it be better to develop a common ontology cumulatively
by contributions from existing research groups rather than try
to build a larger unified project?
===============================================================
   The construction of an ontological lexicon for natural
language understanding is different in several important ways from
most areas of scientific research, where ideas and results from small
independent groups provide the bulk of the individual contributions
to evaluate or elaborate the theories of each field.  The
predominance of original contributions from small groups is true in
most areas of natural language research as well, but for construction
of a large ontology and lexicon for use as a tool in research, the
usual research process is less effective.  The main problem is the size
and complexity of a realistic ontology, and the intimate and multiple
interrelations of its component parts.  To specify the meanings of
the defining vocabulary, one must build a fundamental ontology of
concepts and then construct a theory of the meanings of words using
those concepts.  This endeavor has more of the character of an engineering
project than of a research project, in that it is the construction of
an artifact which has many complex interacting parts.  It may be in
theory possible to achieve the same result eventually through small
independent contributions of ideas and elements, but such a process
is likely to be much slower than a coordinated project, and will be
less likely to achieve the goal of a widely accepted reference
standard within any foreseeable time frame.  In addition, the time
lost in pursuing the development of a common ontology through
uncoordinated effort may well prove eventually much more
expensive, through the lower efficiency both of research and of
implemented programs developed in the interim, than would the
development of the same database by a single adequately funded
coordinated effort.  Furthermore, the problems of coordination of
groups with different approaches to ontology development,
admittedly difficult even in a single properly funded project, might
well be insurmountable without the impetus of deadlines for
agreement on specific subproblems within an overall plan of
development.
   One possible alternative is the elaboration of an existing
ontology, such as WordNet, by the cumulative addition of new
functions or data.  This will, one may hope, proceed in any case
until a coordinated project is funded.  But in order to accumulate
into a unified system, there would still need to be a prime
coordinator - in this case presumably the WordNet group.  Their
own views would then necessarily predominate, and since these
have been driven by specific goals and objectives, which are
different from the goals of other groups, the resulting database
would not represent the best common approach to the varied
problems, as would a project initiated de novo for the specific
purpose of answering a wide range of research and practical goals.
It is also difficult to imagine that the total cost of proceeding
in that fashion would in the end be any less than a single
coordinated project, which would also contain input from WordNet
as well as from other existing systems.
    The worst-case scenario is one in which several commercial
concerns develop proprietary versions of a natural-language
ontology, of which the largest part is not publicly available.  That
is currently the case with the CYC project, and it appears to be the
direction in which Microsoft's "MindNet" project is heading.  If
such a situation develops, there will not be one but several
competing "standards", none of which will be easily available to
researchers, and which, even if available to some degree, cannot
be enhanced and redistributed by most of those who could improve
such a system.  Such systems will not serve the purpose of providing
a common test bed in which new ideas for representing word
meanings can be tried by many research groups in realistically large
systems, with results distributed to the research community at large.
Proprietary systems are also likely to be less reliable than a public
one and their behavior unpredictable to anyone outside the
development group.

=================================================================
Would non-U.S. groups be eligible to participate in this project?
=================================================================
    Much important work on ontologies has been performed
outside of the U.S., and I would expect that participation by non-
U.S. groups would be welcomed, indeed would be essential if the
resulting ontology, which should be language-neutral, is intended
to serve as a standard throughout the scientific community.  Since
the emphasis would be on creating a defining vocabulary of
general concepts sufficient to define all specialized concepts,
the experience of those whose native language is other than
English will be particularly valuable to recognize when
useful basic concepts are lexicalized in one language and
not in others.  There are already several European projects
which are aimed at the construction of common ontological and
lexical resources, and it would be a great loss if those groups
did not participate in an inclusive effort.

    The language-specific elements of the lexicon will of necessity
concentrate first on English, since creating a computational lexicon
even of one language is already a very large task.  Groups from the
UK could of course work on the English lexicon.  But if at all
possible, groups with experience in automatic translation or other
multilingual applications should be requested to participate, since
some of the more subtle and difficult problems in knowledge
representation may be highlighted by the difficulties found in
accurate translation.
     It is difficult to predict to what extent the inclusion of
lexicons for other languages will be feasible; groups which
presently concentrate on translation will presumably want to
include their parallel lexicons for languages other than English.
Ideally, the European research funding agencies might fund European
groups willing to coordinate their work with this project, who
could concentrate on non-English languages.

================================================================
My notions of how to represent concepts change every few weeks.
How can we fix on a single representation at this time?  Do we
know enough at present to justify a major project?
================================================================
   It goes without saying that an ontological lexicon, like the
language it represents, will change over time, but a legitimate
question is at what point it is appropriate to undertake a first
effort to construct a standard tool that can be used and tested
by the entire research community.  There have not been any major
fundamental changes in the prevailing entity-relationship paradigm
for representing knowledge over the past ten years, and the paradigm
has been sufficiently well investigated at a fundamental level that
there seems to be no reason to delay trying to build a consensus
ontological lexicon based on the best knowledge now available.
This will provide a research tool that can help to discover the
strengths and weaknesses of different aspects of this paradigm,
and it can include all the elements deemed important by those who
have been studying meaning representation for some time.  The
database can then be thoroughly and widely tested for conformity
to the realities of language use, and for utility in reasoning
about data.  The main motive for this project is the observation,
from prior experience, that the fundamental concepts of any language
are so intimately connected with each other that no theory of the
meaning of any of its component concepts can be tested in a realistic
setting unless some consistent representation of the entire
fundamental vocabulary is available.  We therefore need some
starting point with a realistically large database representing most of
the fundamental concepts of a language, in order to make effective
tests of whether any specific individual components conform to the
way people actually use words and concepts.

================================================================
For how long will the ontology constructed be useful?  Isn't it
likely to change and need modification or replacement?
================================================================
    Based on the lifetimes of existing ontologies, we can expect
that a major effort at developing a standard ontology will result in a
database that will be useful for research and practical purposes for at
least ten years.  To avoid becoming outdated, the ontological lexicon
will need a core group to provide continuing effort at maintenance,
at a minimum level of effort possibly one-fifth as intense
as for the initial development.  It is conceivable that eventually
some fundamentally different structure for meaning representation
will be proposed and widely accepted, in which case it would be
difficult to predict how much of the structure of this proposed
ontology would be reusable.  But more likely the ontology will
continue to be useful for decades by modification, replacement, or
addition of new components, with most of the structure remaining
stable for years.  It is also unlikely that any new meaning
representation paradigm could gain wide acceptance unless some
substantial effort such as this provides a basis for thorough testing
of the entity-relationship model on a realistic scale.
    As a theory of the meaning of words, this database will
doubtless be modified and elaborated, as are most scientific theories.
Theories in general are tools for organizing research; they provide a
framework in which to formulate tests to confirm or refute aspects
of the theory.  They are useful for a time to make collaborative
research on a topic possible, after which they may be modified or
abandoned.  In a theory with as many individual parts as an upper
ontology, we can assume that some parts will be found inadequate
for some purposes, while others may remain unmodified for a long
time.  The core maintenance group, or perhaps a committee with
broad representation, would be responsible for making and
publicizing the changes in each new revision.  Having this theory
easily available to the entire research community will maximize the
likelihood of finding and addressing inadequacies in its structure.

=============================================================
Ontologies have not been shown to be notably useful for NLU.
Why spend resources building a bigger one?
=============================================================
    There is apparently a widespread notion that ontologies, and
specifically the CYC ontology, have been tested for utility in
Natural Language Understanding and have not proved useful.  It is
important to address this perception.  In fact, attempts to use CYC
in natural language have been very modest in terms of time spent, and
the main virtue of CYC, its logical structure, has scarcely been
tested at all in NLU applications.  It is also important to recall
that CYC was not designed with use in NLU as a primary objective
(as the ontological lexicon suggested here would be), although Lenat
had expected it would be useful for that purpose.  CYC has two
other important flaws which would not apply to an ontology built
as suggested here -- (1) CYC was built by a single group with
a specific viewpoint, and did not include input from many other
practitioners of diverse schools of knowledge representation,
ontology and lexical semantics.  Regardless of its internal
consistency, it cannot serve as a focus to bring together a large
number of groups to use it as a common reference standard; and
(2) most of CYC is not publicly available, and use of CYC
presents difficult legal issues.  Although it can be useful
for specific industrial contractors, its lack of public
availability makes it unsuitable for use as a research tool;
even when made available to academic groups, detailed results
of research cannot be freely described, nor modified versions
redistributed to other groups.
    The study that may most directly account for the perception
of CYC's inadequacy was performed in 1996 by Nirenburg's group
at NMSU ("An assessment of Cyc for Natural Language Processing",
MCCS-96-302, available on the Web at:
http://crl.nmsu.edu/Research/Pubs/MCCS/Abstracts/mccs-96-302.htm).
This study of the utility of CYC for Natural Language research
found that several desirable features were absent.  It did
not, however, suggest that the existing structure could not be used,
rather that it needed additional components or structures to be more
useful.  It did not draw any negative conclusions about ontologies
generally, and indeed that study group has its own ontology which
it finds more directly useful for its purposes.
    Perhaps of greater relevance is the widespread use of
WordNet and EuroWordNet.  Although this semantic network does
not qualify as a logic-based upper ontology as would the basic
ontology which would be constructed as suggested here, it does
contain many conceptual relations which would probably be
widely accepted as part of the larger ontological lexicon
which could be constructed if adequate funding were available.
The wide use of WordNet does provide strong evidence that
when well-structured and easily usable resources are publicly
available, they will prove to be valuable tools for research.
This is scarcely surprising, as progress in many types of
research is limited by the tools available.
     Since there has not yet been an ontology constructed with
even close to the amount of detail that is needed for understanding
of language, it is far too early to draw conclusions as to how
useful a fully developed and publicly available ontology would be.
One of the purposes of developing a comprehensive ontological
lexicon would be to discover how useful the present ideas about
knowledge representation really are, without the impediments of having
multiple small and incompatible sets of data on word meanings.
Smaller ontologies have in fact been shown to be useful to some
extent in language-understanding tasks, such as disambiguation, but
thus far those available have not been shown to dramatically
improve performance.  Nor should they necessarily.  As mentioned,
a comprehensive ontology does not by itself constitute a
language-understanding system; there are many additional aspects of
language understanding systems that must be developed as well.
     Although an ontology is not the only component of a
language understanding system, or even the main one, and its
usefulness depends directly on the systems in which it is used,
some form of common ontology is a necessary prerequisite for sharing
research results in language understanding, wherever the actual
meanings of linguistic expressions need to be represented.  Many
specialized ontologies have been constructed which are not
designed to be used in language understanding.  But until a common
representation of word meanings is used by more than one or two groups,
advancement toward human-level understanding of language will be very
difficult and is likely to be slow and inefficient.  The proposed
ontology will be one intended to be useful for NLU as well as for
other purposes, such as database interoperability.  It will therefore
need to be connected intimately with the lexicon, and as much as
possible of the type of detailed lexical information that is found
in Mel'čuk's Explanatory Combinatorial Dictionary will have to be
included.  As mentioned above, what is needed is better thought
of as an ontological lexicon.

====================================================
Would there be any images or graphical information
representation in the ontology?
====================================================
    It may be true that some degree of imagery or graphical
representation may be required to adequately represent certain
concepts or word meanings.  Whether it will be feasible to include
such data in the first version of an ontological lexicon will have
to be decided by those participating in the organization of the
effort.  It would be helpful if individuals who have worked on
graphical information representation were to participate in this
study.

==============================================================
Different people use different internal ontologies, and
to some extent different lexicons.  How can we include
all of those differences in a single consistent database?
==============================================================
   In order to serve as a completely accurate medium of
communication between agents, the word senses of a language must
be identical between speaker and listener, or some degree of
miscommunication or ambiguity will result.  It happens in human-
to-human communication that use of words in different senses by
different people causes errors in the communication process.  It will
also be true that in human-to-computer communication similar
differences in internal representation will lead to some
miscommunication, though this can be eliminated in computer-to-
computer communication.  Special procedures for recognizing when
variants of meaning are being used will probably have to be part of
the implementing systems, and may not be includable in the
ontological lexicon itself.  Words that are commonly used in variant
senses, or have productive polysemous meanings, can be marked as
such, and the broadest senses can be included, even though the
procedures for recognizing variants of meaning may not be
contained within the lexicon.  These are the cases where recording
collocational use may be especially helpful to disambiguate the
sense.
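
     The division of labor described above can be illustrated with a
minimal sketch in Python.  It is not a proposed design: the names
Sense, LexEntry, has_variant_senses and pick_sense, and the sample
entry for "bank", are all hypothetical and chosen only for
illustration.  The entry (the lexicon side) marks the word as
polysemous and records collocates for each sense; the procedure that
chooses among senses lives outside the entry, as an implementing
system would.

```python
from dataclasses import dataclass, field

@dataclass
class Sense:
    gloss: str
    # Words recorded as commonly co-occurring with this sense (hypothetical data).
    collocates: frozenset = frozenset()

@dataclass
class LexEntry:
    lemma: str
    # Senses are listed broadest first; the first one is the fallback.
    senses: list = field(default_factory=list)
    # Marker for words commonly used in variant or polysemous senses.
    has_variant_senses: bool = False

def pick_sense(entry, context_words):
    """Choose the sense whose collocates overlap most with the context.

    This stands in for a procedure in the implementing system; it is
    deliberately kept outside the lexicon entry itself.
    """
    context = {w.lower() for w in context_words}
    best = max(entry.senses, key=lambda s: len(s.collocates & context))
    if not (best.collocates & context):
        return entry.senses[0]  # no evidence: fall back to the broadest sense
    return best

# Illustrative entry only; glosses and collocates are invented.
bank = LexEntry(
    lemma="bank",
    senses=[
        Sense("financial institution", frozenset({"money", "loan", "account"})),
        Sense("sloping land beside a river", frozenset({"river", "water", "fishing"})),
    ],
    has_variant_senses=True,
)

money_sense = pick_sense(bank, "She opened an account at the bank".split())
river_sense = pick_sense(bank, "fishing from the river bank".split())
default_sense = pick_sense(bank, ["bank"])
```

     Simple overlap counting is only a placeholder for whatever
disambiguation method an implementing system would actually use; the
point of the sketch is that the collocational data can be stored with
the entry while the recognition procedure is supplied separately.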
     It is necessary to build at first a basic lexicon and ontology
of words which identifies the most common senses that are used by
almost all native speakers of a language, and from that subsequently
to build up and include less common or idiosyncratic variants,
wherever such variants have some significant level of usage.  The
differences in their internal lexical representation that people
have, if they are sufficiently widespread, may have to be treated
similarly to multiple discrete senses of words, or the semantic
plasticity of polysemous words.  In the real world, of course, widely
variant use of language can be observed; any idiot or psychotic
individual may produce a string of seemingly linguistic utterances
that are completely uninterpretable by any other person, however
skilled in the language used.  The project is intended to produce
only a basic reference vocabulary, and the recording of highly
individualistic, poetic, and idiosyncratic usage of words will be
beyond its scope.  Most specialized uses will have to be dealt with
by specialized systems built to handle such variation in usage.
It is the common defining vocabulary which would be the main concern,
though the inclusion of some standardized or common uses of specialized
technical words will be valuable, limited only by the time and
resources available for extension of the database core.

=================================================================
Will funding for construction of such an ontology reduce funding
for other areas of Computational Linguistics?
=================================================================
    In any recommendation made to Congress for funding of this
project, it must be strongly emphasized that the creation of a
standard ontology/lexicon will not substitute for other aspects of
computational linguistic research, but is only a tool for such
research.  The reduction of funding for other aspects of CL research
would be counter to the purpose of building the ontology, and would
squander the resource that would be built at significant expense.
Those who contact funding agencies or members of Congress to
recommend this project need to be sure to emphasize this point.

======================================================================
Will recommendations by an ACL committee for congressional funding
constitute lobbying and jeopardize the tax-exempt status of the ACL?
======================================================================
     A study of public issues which includes comments on the
need for and effects of government action does not constitute
lobbying, and is performed routinely by institutions and think tanks,
such as ECRI, without affecting their tax-exempt status.  The ACL
will not as an institution make recommendations directly to
members of Congress.  Individuals who are interested in the subject
may cite an ACL study to support the need for funding.  An
unfunded and relatively informal study of this type is unlikely
by itself to carry sufficient weight to move Congress to action,
but ideally it could prompt the organization of a more formal study
of the need for funding of a standard ontology, for example by the
National Academy of Sciences, or by think tanks concerned with
technical issues, whose opinions are valued by members of
Congress.

=======================================================================
How can we expect that ontologists and lexical semanticists with
different viewpoints could ever be induced to agree on a common
approach?
=======================================================================
     It will indeed likely be difficult to forge agreements on
specific issues, but where there is a recognition of the need for
compromise, it can be accomplished.  Building research resources is
in many respects an engineering rather than a research activity, and
the mindset required for such a task is quite different from the
attitudes which are successful for basic research.  One example of
this difference was eloquently narrated in Kip Thorne's book "Black
Holes and Time Warps" in which he described the analogous
difficulty in coordinating several teams, each accustomed to basic
theoretical research, in a new effort to design and build an expensive
interferometric detector for gravity waves:

"Within each team the individual scientists had free rein to invent
new ideas and pursue them as they wished for as long as they
wished; coordination was very loose.  This is just the kind of culture
that inventive scientists love and thrive on, the culture that
Braginsky craves, a culture in which loners like me are happiest.
But it is not a culture capable of designing, constructing, debugging,
and operating large, complex scientific instruments like the several-
kilometer long interferometers required for success.
  To design in detail the many complex pieces of such
interferometers, to make them all fit together and work together
properly, and to keep costs under control and bring the
interferometers to completion within a reasonable time requires a
different culture: a culture of tight coordination, with subgroups of
each team focusing on well-defined tasks and a single director
making decisions about what tasks will be done when and by whom.
  The road from freewheeling independence to tight
coordination is a painful one.  . . ."

   He continues that with reluctance, and prodding from the
funding agency, the freewheeling and independent scientists made
the necessary adjustments.  An ontological lexicon for
Computational Linguistics is of course a different type of research
tool from a gravity-wave detector (and probably of much more
immediate practical utility), but the need to build a unified structure
which is tightly coordinated and internally consistent may be even
greater than that for building physical measuring instruments,
because of the likely sensitivity in an ontology to inconsistencies
between even widely separated parts.  Given the imperative for close
coordination in ontology construction, is there a plausible way to
achieve the necessary cooperation of groups with disparate
viewpoints?  I will suggest one possible scenario.
    If the prospect of organizing development of a standard
ontology, as suggested here, reaches the stage where funding looks
like a realistic possibility, discussions or a conference should be
organized among those who would want to participate in its
construction, to determine how many of the disparate systems could
be integrated into a single consistent system.  In such discussions,
the teams will develop some appreciation of the likelihood that their
own views may or may not be adopted, intact, or in modified form.
Since the most important goal will be to create a database that will
be used by the largest number of research teams, at some point
disagreements about what formats or approaches to adopt will
probably have to be resolved by some form of voting among
participating groups, and the project director will need to
be able to resolve any issues not amenable to the voting approach.
Any group which recognizes that its own approach is incompatible
with the majority and is likely not to be adopted, can try to argue for
its technical superiority, but if the arguments are not accepted, such
a group will face the choice of participating and adapting its own
system to the dominant approach, or not participating, and
continuing its own independent line of research.  There will
presumably be some groups interested in exploring novel
approaches to knowledge representation that will want to continue
along lines different from that adopted by the majority.  However,
from discussions I have held with people involved in investigation
of word meanings, there appears to be a wide recognition of the
need for some common database, and many or most are likely to
participate in such a project.
    By the time that project proposals need to be submitted, there
should be some preliminary agreement as to the likely outline of the
general structure of the database that will be developed.  The
disagreements over details will need to be resolved in the course of
actual funded development, but there will need to be some mechanism,
whether by voting of an executive committee or decision of a project
chairperson, to resolve residual disagreements by fiat.  The manner
of selection of the project chairperson would ideally include
substantial input from the likely participants in the project.
    It is likely that to accommodate input from as many as
possible of existing groups, the number of persons funded for this
project will approach or exceed two hundred over an initial
development stage of three to five years.  The required funding for a
project of that size will be close to two hundred million dollars
($200,000,000) over the five years.  This will almost certainly
require a special appropriation from Congress.  Other areas of
science, including highly theoretical fields with few immediate
practical applications, have succeeded in obtaining funding for
projects comparable to and often much larger than this (the
*annual* maintenance budget of the Hubble telescope is about
$200 million).  The possibility of congressional funding is
realistic, provided that an adequate justification can be agreed
upon among practitioners in the field.  That is the purpose of
forming this committee, and I hope that all of those who may have
some use for an ontological lexicon will respond with information
about potential uses that will allow us to demonstrate the
cost-effectiveness of such a project.
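
    For readers who want to check the scale of these figures, the
arithmetic can be worked through in a few lines of Python.  This is
only a rough sketch: it uses the numbers quoted above and assumes,
as a simplification that is not stated in the text, that the full
staff of two hundred is funded for all five years.  The variable and
function names are illustrative.

```python
# Figures quoted in the text above.
TOTAL_FUNDING_USD = 200_000_000   # "close to two hundred million dollars"
DURATION_YEARS = 5                # "over the five years"
STAFF = 200                       # "approach or exceed two hundred"
HUBBLE_ANNUAL_USD = 200_000_000   # the note's own figure for Hubble's annual budget

def annual_cost(total_usd, years):
    """Average cost per year of the project."""
    return total_usd / years

def cost_per_person_year(total_usd, years, staff):
    """Average cost per funded person per year (assumes constant staffing)."""
    return total_usd / (years * staff)

project_annual = annual_cost(TOTAL_FUNDING_USD, DURATION_YEARS)
per_person_year = cost_per_person_year(TOTAL_FUNDING_USD, DURATION_YEARS, STAFF)
hubble_ratio = project_annual / HUBBLE_ANNUAL_USD

print(f"annual project cost:        ${project_annual:,.0f}")
print(f"cost per person-year:       ${per_person_year:,.0f}")
print(f"fraction of Hubble's year:  {hubble_ratio:.0%}")
```

    Under these assumptions the project would cost about $40 million
per year, or roughly $200,000 per person-year, which is about
one-fifth of the annual Hubble figure cited above.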
