On Thu, Aug 31, 2006 at 02:20:49PM -0700, Tasci wrote:
> Suppose you have a language composed of a discrete, finite set of
> syllables. I was considering the ideal way to construct vocabulary for
> that language.

It depends on your goals. Taxonomic lexicons are not very realistic, and
suffer from the overspecification syndrome (see below). But if you're
specifically looking to construct a taxonomic system, then you pretty
much could do whatever you want.

> My idea was to divide all concepts into separate categories, one for
> each syllable. Then subcategories would be equally subdivided, and
> subsubcategories and so forth. To identify any word in this language,
> it would only be a search on an O(k * log(k)(n)) where log(k) is log
> base k.  That is, you have to know what each letter means, then you
> automatically narrow down the word lookup exponentially. It would be
> like as if every letter beginning with 'a' were all related somehow,
> in a way that all other words are not.

Again, it depends on your goals. Associating every letter/syllable with
a specific meaning, while superficially attractive, isn't very
realistic, and suffers from overspecification.
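That said, the lookup property you describe is real: if each syllable selects one of k branches, a d-syllable word is located in d steps, i.e. O(log_k n) for n leaf concepts. A minimal sketch in Python (the syllables and category labels here are invented purely for illustration):

```python
# A toy taxonomic lexicon: each syllable picks one of k branches,
# so looking up a d-syllable word takes d steps (d = log_k(n) for
# n leaf concepts). All syllables and categories below are made up.
lexicon = {
    "ka": {                      # living things
        "ti": {                  # animals
            "mo": "horse",       # ka-ti-mo
            "ru": "zebra",       # ka-ti-ru
        },
        "su": {                  # plants
            "mo": "oak",         # ka-su-mo
        },
    },
}

def lookup(syllables):
    """Follow one branch per syllable; cost is one step per syllable."""
    node = lexicon
    for syl in syllables:
        node = node[syl]         # each step narrows the search k-fold
    return node

print(lookup(["ka", "ti", "mo"]))   # -> horse
```

This is exactly a trie keyed on syllables, which is why the lookup is cheap; the problems with the scheme are cognitive, not computational, as discussed below.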

> It sounds like a great strategy, but I've been having problems with
> the fact that many concepts we think up are very specific.  Horse for
> instance. It's a four legged ungulate equiid, an animal mammal that
> eats hay, carries people, has a large bottom, its coat is referred to
> as hide not fur, it has a mane referred to as hair, as in 'horsehair'
> etc etc etc.  Just to call a horse a living organism that's a animal
> chordate mammal ungulate equiid Equus equs alone would take 7
> syllables.  How would I differentiate the horse from the zebra, from
> the weasel, from the sea squirt, if I tried to limit it to 4 syllables
> of specification?  That is, a 4-syllable word for living organism
> animal chordate, which is already pretty darn long compared to the 1
> syllable 'horse'.

This is what I meant by the overspecification syndrome above. Basically,
the reason natural languages are the way they are (seemingly arbitrary
assignments of meaning to words), is because our brains filter out white
noise, or redundant information. Contrary to initial appearances, terms
that refer to similar things are better if they're more *different*.
Why? Because a taxonomic scheme makes words that refer to similar things
very similar, and you eventually get to the point where your brain is
very confused by the minute differences between words.

Also, contrary to initial appearances, our brains have no problems
picking out the difference between two very similar (or even identical!)
words that have different meanings in different contexts. The difference
in context makes it very clear which meaning is intended, and not the
other.

For example, if you were studying anatomy, then you'd want very specific
terms to refer to each part of the body. The temptation is to make these
terms taxonomic so that they're easily derivable: say we make 'hujabalu'
to mean 'index finger', 'hujabala' to mean 'ring finger', and 'hujabalo'
to mean little finger. Furthermore, it happens that 'hujabolu' means
'big toe', etc.. Now imagine a professor lecturing on the anatomy of the
hand. 90% of the words in the lecture would be 'hujabal<something>', and
after a while, you start to wonder, did he just say 'hujabolu' or was it
'hujabalu', or was it 'hujabalo'? Eventually, the professor himself will
probably tire of repeating the same prefix over and over.

Now, a computer would have absolutely no trouble picking out the right
words, but the human brain doesn't work that way. After a while, you
just completely tune out because the words are too similar. Our brains
function best when words referring to similar things are overtly
different. For example, 'thumb' and 'finger': there's no way to confuse
them because they're so different.
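The confusability argument can be made concrete with edit distance: the invented taxonomic coinages above differ by a single letter, while 'thumb' and 'finger' share nothing at all. A quick sketch using the standard Levenshtein algorithm (the word forms are the ones invented above):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Taxonomic coinages: one letter apart, easy to mishear in a lecture.
print(levenshtein("hujabalu", "hujabolu"))   # -> 1
# Natural terms for adjacent concepts: far apart, hard to confuse.
print(levenshtein("thumb", "finger"))        # -> 6
```

A fully taxonomic vocabulary forces minimal distances everywhere within a category, which is precisely where human listeners need the distances to be large.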

> What I end up with is an extremely deep and sparse distribution, very
> frustrating because a lot of concepts like other non-horse members of
> genus Equus, do not even exist! Certainly they're not found in common
> conversation.

Exactly. This is why, if your goal is to make taxonomic vocabulary,
you'll have to be a lot more creative than just making a tree structure
out of your syllables. (Or, if your goal is to make a realistic
language, you'd probably want a non-taxonomic approach.)

> Should I just randomly determine vocabulary? It'd be an even spread,
> but it would be a lot harder to remember if xrbtsx is horse and xrblsx
> is desk lamp for instance.

This is exactly the problem I described above.

Now, this doesn't mean taxonomic vocabulary is completely useless,
though. It *is* useful sometimes to have *some* semblance of taxonomy,
for example, canonical names in biology & chemistry. The trick is to
find a balance between having too many words that differ only by a
single letter, and having no structure at all.
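One way to strike that balance, sketched here as an illustration rather than a recommendation, is to borrow an idea from error-correcting codes: coin words freely, but reject any new form that falls too close (in edit distance) to an existing word. The syllable inventory and parameters below are invented for the example:

```python
import random

def levenshtein(a, b):
    """Edit distance via the usual dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def coin_words(n, syllables, length, min_dist=2, seed=0):
    """Draw random syllable strings, rejecting any new word closer
    than min_dist edits to an already accepted word. This trades the
    full tree structure for guaranteed separation between forms."""
    rng = random.Random(seed)
    words = []
    while len(words) < n:
        w = "".join(rng.choice(syllables) for _ in range(length))
        if all(levenshtein(w, old) >= min_dist for old in words):
            words.append(w)
    return words

print(coin_words(5, ["ka", "ti", "mo", "su", "ru"], 3))
```

You could still reserve an initial syllable per broad category for a weak taxonomy, while letting the rejection rule keep siblings audibly distinct.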

> I had one more idea: that instead of starting with general categories,
> I start with specific terms, then generalize.  So I could have 'to'
> mean horse, and 'tobu' be anything in Equus equs, and 'tobuba' be
> anything in the Equiid family, and so forth.  Trouble with that is,
> which specific concepts get to be the root of all language? Wouldn't
> they have to be generalized, by necessity?

You're assuming that roots must necessarily be completely general. This
doesn't have to be the case, if you have derivational affixes that can
suitably modify your roots to whatever concept you may want to signify.

One thing to keep in mind, though, is that language is molded to a large
extent by utility. A completely general language that has a perfect
structure may be very pleasing, aesthetically, but it would be very
difficult to use, because you must express everything in terms of the
general, abstract structure. In human language, what often happens is
that cumbersome expressions that are used frequently get abbreviated
over time, and eventually calcify to become a new "root word". Over
time, this produces a language that best suits what its speakers express
most frequently.  The resulting structure may not be the most ideal (you
may have many historical cognates whose meanings are no longer related),
but it is the most practical.

Of course, this doesn't mean language has to be completely arbitrary by
necessity. You just need to recognize that our brains operate according
to certain principles, and you have to work with that. To summarize,
some of the principles are:

1) Frequency of usage is more important than beauty of internal
   structure. The most frequently used words should be the most
   economical, even if the concepts themselves are
   very complicated and require a lot of specification in a taxonomic
   system. What exactly constitutes 'frequent' depends on what your
   target audience is. The language of a farmer is very different from
   the language of an academic researcher, even though they share a
   common subset through which they can understand each other (e.g.,
   they both speak English, even though the kind of English the
   researcher speaks uses many words that the farmer doesn't use from
   day to day, and vice versa).

2) Words that refer to similar things in the same context preferably
   should be as different as possible. Our brains work largely by
   context; therefore, given the same context (e.g. a particular
   category of things), you want to make the words as different as
   possible.

3) Words that refer to different things in different contexts don't have
   to be very different from each other. Our brains can easily tell from
   context which meaning is intended, so there's no need to split hairs
   in this area.

These principles may sound counterintuitive at first glance, but if you
think about it, this really makes more sense than a naive, top-to-bottom
taxonomic system. However, it may still be possible to have a taxonomic
system that is compatible with these principles; you just need to be
creative about how exactly you represent the taxonomic structure. My
advice is, a simple mapping from taxonomic structure to syllables is
probably not the way to go.


People tell me that I'm skeptical, but I don't believe it.