On 4/22/07, Henrik Theiling <[log in to unmask]> wrote:

> So your main criterion would be predictability of semantics?  If
> predictable => no new word, if not predictable => new word.  This
> seems, well, very reasonable for composing a lexicon.  Of course there
> will be difficult cases, but let's ignore them for now.
> This means that for counting a conlang's words, we probably should:
>   - also count phrases ('bubble sort algorithm') and idioms
>   - not count lexicon entries that are due to irregular forms
>     ('saw' cf. 'see')

I can see an argument for counting irregular and
especially suppletive forms as separate words --
from the POV of the learner they increase the
amount of vocabulary one has to learn.  But in
gauging a conlang's completeness and expressivity
by its lexicon size, one would of course not count
irregular and suppletive forms separately (unless
perhaps some of them have a special sense not
shared by other forms of the "same" word?).

>   - count polysynthetically constructed words several times,
>     excluding structures that are semantically clear operations,
>     but counting all irregularly derived concepts

I don't think we need to treat polysynthetic words
specially, as such.  Would it make sense just to
count the number of morpheme boundaries in a
word and see how many of them result in a
semantically opaque compound?
Even with that qualification I'm not sure I agree
with you -- it seems to me that a semantically
opaque compound built of 5 morphemes and
another one built of 2 morphemes should both
count as one word in the lexicon.

> This seems quite reasonable.  Do you also think it's a good way of
> counting?  It also looks undoable since the lexicons are generally not
> structured like this.

Of course in starting a new lexicon for a new language one
could easily have a field for "semantic transparency",
or perhaps an integer field indicating how many words
(or "lexical items") each entry counts for (1 for root words
and opaque compounds, 0 for irregular forms and transparent compounds;
1 for idioms and stock phrases?).

On the other hand, transparency/opacity is a
continuous rather than a boolean quality.  Some
"transparent" compounds are more tranparent
than others, some "opaque" compounds are more
opaque than others; and the same is true of
idiomatic phrases.  So maybe the semantic transparency
field gets real numbers ranging from 0.0 to 1.0, and
the overall word count for the language would probably
be non-integral.
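The tally could then be automated.  Here's a minimal
sketch, in Python, of summing per-entry opacity scores
into a (generally non-integral) word count; the entries
and scores are invented for illustration:

```python
# Each entry carries an opacity score in [0.0, 1.0]:
# 1.0 for root words, 0.0 for fully transparent compounds
# and irregular forms, something in between otherwise.
# These entries and scores are made-up examples.
lexicon = {
    "hundo":     1.0,   # root word: counts as a full lexical item
    "vidis":     0.0,   # regular inflected form: counts for nothing
    "eldoni":    0.95,  # nearly opaque compound
    "akvobirdo": 0.3,   # fairly transparent compound
}

def word_count(lex):
    """Sum each entry's opacity; the total is generally non-integral."""
    return sum(lex.values())

print(round(word_count(lexicon), 2))  # 2.25
```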

On the gripping hand, maybe the "semantic transparency"
needs to be applied at the morpheme boundary level
rather than the word level.  For instance, in E-o
"el-don-ej-o" there are three morpheme boundaries,
one perfectly transparent (ej-o), one somewhat
transparent (between el-don and -ej), and one
almost completely opaque (el-don).  We might
assign them transparency (or rather opacity)
scores of

el-don: 0.95,  don-ej: 0.20,  ej-o: 0.0

or thereabouts.  How would we combine these to
get an overall opacity score for the word?
Not by simply averaging them: the average would
actually make "eldonejo" score *lower* than
"eldoni", when in fact it's slightly more opaque.
Nor by adding, because we don't want a score
over 1.0.  Another complicating factor is that
we don't want the presence of both "eldoni" and
"eldonejo" in the lexicon to inflate the count
too much, since the latter builds on the former
and is almost transparent if you already know
"eldoni".
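One combining rule that satisfies both constraints (the
score never exceeds 1.0, and adding a boundary can only
raise it) would be to multiply the boundaries'
*transparencies* and take the complement.  That's just my
suggestion, not anything established; a sketch in Python:

```python
from functools import reduce

def word_opacity(boundary_opacities):
    """Combine per-boundary opacity scores into a word-level score.

    The word's opacity is 1 minus the product of the boundaries'
    transparencies (1 - opacity each), so the result stays in
    [0.0, 1.0] and each extra boundary can only raise it.
    """
    return 1.0 - reduce(lambda acc, o: acc * (1.0 - o),
                        boundary_opacities, 1.0)

# "el-don-i": boundaries el|don (0.95) and don|i (0.0)
print(round(word_opacity([0.95, 0.0]), 2))        # 0.95
# "el-don-ej-o": el|don (0.95), don|ej (0.20), ej|o (0.0)
print(round(word_opacity([0.95, 0.20, 0.0]), 2))  # 0.96
```

With the scores above, "eldonejo" comes out at 0.96, just
above "eldoni" at 0.95.  It still doesn't address the last
complication, though: ideally the el-don boundary in
"eldonejo" would be discounted when "eldoni" is already a
lexicon entry.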

Jim Henry