On Sun, 22 Apr 2007 18:47:58 -0400, Jim Henry <[log in to unmask]> wrote:

>On 4/22/07, Henrik Theiling <[log in to unmask]> wrote:
>> So your main criterion would be predictability of semantics?  If
>> predictable => no new word, if not predictable => new word.  This
>> seems, well, very reasonable for composing a lexicon.  Of course there
>> will be difficult cases, but let's ignore them for now.
>> This means that for counting a conlang's words, we probably should:
>>   - also count phrases ('bubble sort algorithm') and idioms
>>   - not count lexicon entries that are due to irregular forms
>>     ('saw' cf. 'see')
>>   - count polysynthetically constructed words several times,
>>     excluding structures that are semantically clear operations,
>>     but counting all irregularly derived concepts

What you're proposing to count there seem to be essentially _listemes_
[Wiktionary def: (linguistics) An item that is memorized as part of a list,
as opposed to being generated by a rule.], except that suppletive and
'irregular' forms do count as listemes.  But, afaik, it's debatable whether
strong verbs are really irregular in the relevant sense.  

That's a perfectly reasonable criterion for counting, as I see it.  In
particular it correlates pretty closely to the amount of work the conlanger
will have had to put in to designing the lexicon: each listeme requires
specification somewhere, but regularly rule-derived forms don't need to be
listed separately.

>Of course in starting a new lexicon for a new language one
>could easily have a field for "semantic transparency",
>or perhaps an integral field indicating how many words
>(or "lexical items") each entry counts for (1 for root words
>and opaque compounds, 0 for irregular forms and transparent compounds;
>1 for idioms and stock phrases?).
>On the other hand, transparency/opacity is a
>continuous rather than a boolean quality.  Some
>"transparent" compounds are more transparent
>than others, some "opaque" compounds are more
>opaque than others; and the same is true of
>idiomatic phrases.  So maybe the semantic transparency
>field gets real numbers ranging from 0.0 to 1.0, and
>the overall word count for the language would probably
>be non-integral.
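That weighted scheme can be sketched in a few lines of Python (the entries
and scores below are made up purely for illustration, not a real lexicon):

```python
# Toy lexicon: each entry carries an opacity score in [0.0, 1.0]
# (1.0 = root word or fully opaque compound; 0.0 = irregular form
# or fully transparent compound).  All entries and scores here are
# illustrative only.
lexicon = {
    "hundo":   1.0,   # root word: counts as a whole lexical item
    "saw":     0.0,   # irregular form of 'see': no new listeme
    "eldoni":  0.95,  # almost completely opaque compound
    "hundejo": 0.20,  # fairly transparent derivation
}

# The overall word count is the sum of the opacities, and is
# generally non-integral, as suggested above.
word_count = round(sum(lexicon.values()), 2)
print(word_count)   # -> 2.15
```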
>On the gripping hand, maybe the "semantic transparency"
>needs to be applied at the morpheme boundary level
>rather than the word level.  For instance, in E-o
>"el-don-ej-o" there are three morpheme boundaries,
>one perfectly transparent (ej-o), one somewhat
>transparent (between el-don and -ej), and one
>almost completely opaque (el-don).  We might
>assign them transparency (or rather opacity)
>scores of
>  el|don: 0.95,  eldon|ej: 0.20,  ej|o: 0.0
>or thereabouts.  How would we combine these to
>get an overall opacity score for the word?
>Not by simply averaging them; "eldonejo"
>is slightly more opaque than "eldoni".  Nor
>by adding, because we don't want a score
>over 1.0.  Another complicating factor is that
>we don't want the presence of both
>"eldoni" and "eldonejo" in the lexicon to inflate
>the count too much since the latter builds on
>the former and is almost transparent if you already
>know "eldoni".

What's the problem here?  Only the outermost opacity should count, if you
assume the branching is binary so that there is an outermost derivational
operation.  In this case I gather the base of <eldonejo> is <eldoni>; so
<eldon-> counts for 0.95 of a lexical item, <eldonej-> for 0.2, and
<eldonejo> for none (if you reckon it in your count at all, which is a moot
point).
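For concreteness, that "count only the outermost boundary" rule can be
sketched as follows (the scores are the illustrative ones from the
<eldonejo> example above; the data layout is my own assumption):

```python
# Each derived entry records its immediate base and the opacity of
# its *outermost* derivational boundary only; the scores are the
# illustrative ones from the <eldonejo> example.
entries = {
    # entry: (immediate base, opacity of outermost derivation)
    "eldon-":   (None,       0.95),  # el|don: almost fully opaque
    "eldonej-": ("eldon-",   0.20),  # eldon|ej: somewhat transparent
    "eldonejo": ("eldonej-", 0.0),   # ej|o: perfectly transparent
}

# Each entry contributes only its outermost opacity, so listing both
# <eldoni> and <eldonejo> does not double-count the opaque el|don
# boundary.
word_count = round(sum(op for _base, op in entries.values()), 2)
print(word_count)   # -> 1.15
```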

Overall, though, I like this idea of non-integral counting, which makes the
opacity (non-compositionality) of a derivation, or the listemicity of an item, a fuzzy
concept.  Now if only there were some way to systematically make statements
like "the opacity of the derivation 'speak' > '(loud)speaker' is 0.6931"...