On Thu, Aug 31, 2006 at 02:20:49PM -0700, Tasci wrote:
> Suppose you have a language composed of a discrete, finite set of
> syllables. I was considering the ideal way to construct vocabulary for
> that language.

It depends on your goals. Taxonomic lexicons are not very realistic,
and suffer from the overspecification syndrome (see below). But if
you're specifically looking to construct a taxonomic system, then you
can pretty much do whatever you want.

> My idea was to divide all concepts into separate categories, one for
> each syllable. Then subcategories would be equally subdivided, and
> subsubcategories and so forth. To identify any word in this language,
> it would only be a search on an O(k * log_k(n)), where log_k is log
> base k. That is, you have to know what each letter means, then you
> automatically narrow down the word lookup exponentially. It would be
> as if every word beginning with 'a' were related somehow, in a way
> that all other words are not.

Again, it depends on your goals. Associating every letter/syllable with
a specific meaning, while superficially attractive, isn't very
realistic, and suffers from overspecification.

> It sounds like a great strategy, but I've been having problems with
> the fact that many concepts we think up are very specific. Horse, for
> instance. It's a four-legged ungulate equid, an animal mammal that
> eats hay, carries people, has a large bottom, its coat is referred to
> as hide not fur, it has a mane referred to as hair, as in 'horsehair',
> etc. Just to call a horse a living organism that's an animal chordate
> mammal ungulate equid Equus caballus alone would take 7 syllables.
> How would I differentiate the horse from the zebra, from the weasel,
> from the sea squirt, if I tried to limit it to 4 syllables of
> specification? That is, a 4-syllable word for living organism animal
> chordate, which is already pretty darn long compared to the
> 1-syllable 'horse'.
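(As an aside, the lookup scheme Tasci describes is essentially a k-ary
tree keyed on syllables. A minimal sketch, with invented syllables and
concepts purely for illustration: each syllable selects one branch, so
a word of length L resolves in O(L) = O(log_k n) steps for n leaf
concepts.)

```python
# Hypothetical taxonomic lexicon: each syllable narrows the category.
# The syllables and meanings here are invented for this sketch.
taxonomy = {
    "ka": {                      # living organisms
        "to": {                  # ... animals
            "mi": "horse",       # ... one leaf per fully specified concept
            "ru": "zebra",
        },
        "pe": {"li": "oak"},     # ... plants
    },
}

def lookup(word, syllable_len=2):
    """Walk the tree one syllable at a time."""
    node = taxonomy
    for i in range(0, len(word), syllable_len):
        node = node[word[i:i + syllable_len]]
    return node

print(lookup("katomi"))  # -> horse
print(lookup("katoru"))  # -> zebra
```

This is exactly the structure that produces the problems below: sibling
leaves ('katomi', 'katoru') differ only in their final syllable.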
This is what I meant by the overspecification syndrome above.
Basically, the reason natural languages are the way they are (seemingly
arbitrary assignments of meaning to words) is that our brains filter
out white noise, i.e. redundant information. Contrary to initial
appearances, terms that refer to similar things are better if they're
more *different*. Why? Because a taxonomic scheme makes words that
refer to similar things very similar, and you eventually get to the
point where your brain is confused by the minute differences between
words.

Also, contrary to initial appearances, our brains have no problem
picking out the difference between two very similar (or even
identical!) words that have different meanings in different contexts.
The difference in context makes it very clear which meaning is meant.

For example, if you were studying anatomy, you'd want very specific
terms to refer to each part of the body. The temptation is to make
these terms taxonomic so that they're easily derivable: say we make
'hujabalu' mean 'index finger', 'hujabala' mean 'ring finger', and
'hujabalo' mean 'little finger'. Furthermore, it happens that
'hujabolu' means 'big toe', etc.. Now imagine a professor lecturing on
the anatomy of the hand. 90% of the words in the lecture would be
'hujabal<something>', and after a while you start to wonder: did he
just say 'hujabolu', or was it 'hujabalu', or 'hujabalo'? Eventually,
the professor himself will probably tire of repeating the same prefix
over and over.

Now, a computer would have absolutely no trouble picking out the right
words, but the human brain doesn't work that way. After a while, you
just tune out completely because the words are too similar. Our brains
function best when words referring to similar things are overtly
different. For example, 'thumb' and 'finger': there's no way to confuse
them, because they're so different.
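(You can make the confusability point concrete by counting letter
differences. A toy sketch, reusing the invented anatomy words above:
the taxonomic terms are minimal pairs, one letter apart, even across
wildly different meanings.)

```python
def hamming(a, b):
    """Number of differing positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

# Taxonomic scheme: 'index finger' vs 'big toe' differ by ONE letter...
print(hamming("hujabalu", "hujabolu"))  # -> 1

# ...and so do 'index finger' vs 'ring finger'.
print(hamming("hujabalu", "hujabala"))  # -> 1
```

Compare the natural-language terms 'index finger' and 'big toe', which
share almost nothing: the ear has far more to work with.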
> What I end up with is an extremely deep and sparse distribution, very
> frustrating because a lot of concepts, like other non-horse members
> of genus Equus, do not even exist! Certainly they're not found in
> common conversation.

Exactly. This is why, if your goal is to make taxonomic vocabulary,
you'll have to be a lot more creative than just making a tree structure
out of your syllables. (Or, if your goal is to make a realistic
language, you'd probably want a non-taxonomic approach.)

> Should I just randomly determine vocabulary? It'd be an even spread,
> but it would be a lot harder to remember if xrbtsx is horse and
> xrblsx is desk lamp, for instance.

This is exactly the problem I described above. This doesn't mean
taxonomic vocabulary is completely useless, though. It *is* useful
sometimes to have *some* semblance of taxonomy, for example, canonical
names in biology and chemistry. The trick is to find a balance between
having too many words that differ only by a single letter and having no
structure at all.

> I had one more idea: that instead of starting with general
> categories, I start with specific terms, then generalize. So I could
> have 'to' mean horse, 'tobu' be anything in Equus caballus, and
> 'tobuba' be anything in the family Equidae, and so forth. Trouble
> with that is, which specific concepts get to be the root of all
> language? Wouldn't they have to be generalized, by necessity? [...]

You're assuming that roots must necessarily be completely general. This
doesn't have to be the case if you have derivational affixes that can
suitably modify your roots into whatever concept you want to signify.

One thing to keep in mind, though, is that language is molded to a
large extent by utility. A completely general language with a perfect
structure may be very pleasing aesthetically, but it would be very
difficult to use, because you would have to express everything in terms
of the general, abstract structure.
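(Tasci's "specific roots, widening affixes" idea can be sketched
mechanically; the suffix inventory here is invented for illustration.
Each stacked suffix widens the category one taxonomic level instead of
narrowing it.)

```python
def widen(root, levels):
    """Derive a more general term by stacking widening suffixes.

    Invented suffix inventory: '-bu' then '-ba', alternating.
    """
    suffixes = ["bu", "ba"]
    word = root
    for i in range(levels):
        word += suffixes[i % len(suffixes)]
    return word

print(widen("to", 1))  # -> tobu   ('anything in the species')
print(widen("to", 2))  # -> tobuba ('anything in the family')
```

Note that the frequent, concrete word ('to', horse) is the short one,
and the rare, abstract words are long: exactly the economy that
principle 1) below asks for.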
In human language, what often happens is that cumbersome expressions
that are used frequently get abbreviated over time, and eventually
calcify into new "root words". Over time, this produces a language that
best suits what its speakers express most frequently. The resulting
structure may not be the most ideal (you may have many historical
cognates whose meanings are no longer related), but it is the most
practical.

Of course, this doesn't mean language has to be completely arbitrary by
necessity. You just need to recognize that our brains operate according
to certain principles, and you have to work with that. To summarize,
some of the principles are:

1) Frequency of usage is more important than beauty of internal
structure. You should cater to the fact that the most frequently used
words should be the most economical, even if the concepts themselves
are very complicated and would require a lot of specification in a
taxonomic system. What exactly constitutes 'frequent' depends on your
target audience. The language of a farmer is very different from the
language of an academic researcher, even though they share a common
subset through which they can understand each other (e.g., they both
speak English---even though the English the researcher speaks uses a
lot of words that the farmer doesn't use from day to day, and vice
versa).

2) Words that refer to similar things in the same context should
preferably be as different as possible. Our brains work largely by
context; therefore, given the same context (e.g. a particular category
of things), you want to make the words as different as possible.

3) Words that refer to different things in different contexts don't
have to be very different from each other. Our brains can easily tell
from context which meaning is intended, so there's no need to split
hairs in this area.
These principles may sound counterintuitive at first glance, but if you
think about it, this really makes more sense than a naive, top-down
taxonomic system. However, it may still be possible to have a taxonomic
system that is compatible with these principles; you just need to be
creative about how exactly you represent the taxonomic structure. My
advice is that a simple, direct mapping from taxonomic structure to
syllables is impractical.


T

--
People tell me that I'm skeptical, but I don't believe it.