On Tue, 10 Nov 2015 10:09:13 -0700, Logan Kearsley <[log in to unmask]> wrote:

>For a good general-purpose, re-usable word generation tool, we should
>really be using conditional distribution models, as are used in
>real-world NLP systems. Basically, you specify not just what total
>frequency of each phoneme you want, but what frequency each phoneme
>should have *in a certain environment*, thus simultaneously encoding
>both phonotactics and frequency information.

Interesting point, I hadn't put those thoughts together before.

Conlangers are used to thinking about the difference between _zero_ and _nonzero_ probabilities of certain sequences, for those are phonotactic restrictions.  As such, tools like Jim et alius' Boris, Rosenfelder's gen, and William's recent lexifer have capabilities for rewrite rules, e.g. "change every velar before [i] to a palatal".  Maybe the easiest way to build on this would therefore be to probabilise these rules: just let the user stick a probability on them, so that they can do e.g. "change velars before [i] to palatals 40% of the time".
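In case it helps make the proposal concrete, here's a minimal sketch of such a probabilised rule in Python (the function name and argument layout are my invention, not any of those tools' actual formats):

```python
import random

def apply_prob_rule(word, targets, replacements, following, prob, rng=random):
    """Apply 'targets[i] > replacements[i] / _ [following]' to each
    eligible segment independently with probability `prob`."""
    out = []
    for i, seg in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else None
        if (seg in targets and nxt is not None and nxt in following
                and rng.random() < prob):
            out.append(replacements[targets.index(seg)])
        else:
            out.append(seg)
    return "".join(out)

# "change velars before [i] to palatals 40% of the time":
apply_prob_rule("kiki", "kg", "cɟ", "i", 0.4)
```

With prob = 1.0 this is the familiar categorical rewrite rule; with prob = 0.0 it's a no-op; anything in between gives a conditional distribution over outputs.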

Incidentally, my phonology generator Gleb was actually *almost* able to end up with nontrivial conditional distributions in its wordform generators in a sort of emergent way, even though it generates phonemes unconditionally.  The mechanism would have been this.  Suppose *A, *B, *C are underlying feature-bundles.  It is within Gleb's power to do the following: it might not generate any unconditional rules changing *A or *B (so that they become licit surface phonemes), but for *C might generate an ordered list of rules with first
  *C > *A in context X
and later
  *C > *B unconditionally.
This would result in more *As and fewer *Bs in context X than elsewhere (but no absolute cooccurrence restrictions, since underlying *As and *Bs could also be generated in each place).  In fact, returning to the realis mood, I coded it so that such orderings didn't happen, basically as a side effect of arriving at a canonical phonemisation: putting all the conditional rules before all the unconditional ones means that you can run the ordered list of rules partway, spit out the result between //, run it the rest of the way, and then spit out the result between [].  But now that you bring this up, I see that was actually an unfortunate design choice in a way!
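A toy rendering of that rule ordering, in case the mechanism is unclear (I've arbitrarily made "context X" mean "before i"; the single-letter segments are placeholders, not Gleb's actual representation):

```python
def surface(underlying):
    """Run an ordered rule list, conditional rule before unconditional:
       1. *C > *A / _ i    (stands in for 'in context X')
       2. *C > *B          (unconditionally, afterwards)"""
    segs = list(underlying)
    # rule 1: conditional, feeds on *C before it's swept up by rule 2
    for i, s in enumerate(segs):
        if s == "C" and i + 1 < len(segs) and segs[i + 1] == "i":
            segs[i] = "A"
    # rule 2: any *C still standing becomes *B
    segs = ["B" if s == "C" else s for s in segs]
    return "".join(segs)
```

Underlying *A and *B pass through untouched in every position, so nothing is categorically banned; but every underlying *C before i surfaces as A, so A is statistically commoner there.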

>I see two ways to fix that [combinatorial explosion]:
>First, you provide reasonable defaults, so that not everything has to
>be explicitly written down. E.g., "anything that isn't explicitly
>written down has zero probability" might be a good default rule.

I have my doubts that it would be good.  People wouldn't write nearly enough down, so I'd expect the results to be like a Markov model trained on too small a data set: the only outputs it'll produce are its inputs, or occasionally blends of two inputs which happen to mesh over a shared substring.
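To make that failure mode concrete, here's a toy character-bigram model (the two-word "corpus" is invented for illustration).  Every word it can emit is stitched entirely out of bigrams it has seen, so with this little training data it can only parrot the inputs or splice two of them where they overlap:

```python
import random
from collections import defaultdict

def train_bigrams(corpus):
    """Character-bigram model; ^ and $ mark word boundaries."""
    model = defaultdict(list)
    for w in corpus:
        padded = "^" + w + "$"
        for a, b in zip(padded, padded[1:]):
            model[a].append(b)
    return model

def generate(model, rng=random):
    """Walk the chain from ^ until $ is drawn.
    (Could loop forever on a corpus with cycles; fine for this demo.)"""
    out, cur = [], "^"
    while True:
        cur = rng.choice(model[cur])
        if cur == "$":
            return "".join(out)
        out.append(cur)

model = train_bigrams(["taki", "tona"])
```

From that corpus the generator can produce "taki" and "tona" and blends like "tonaki" or just "ta", but never anything whose adjacent pairs weren't all in the input -- exactly the "outputs are inputs or blends" pathology.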

>Second, you take into account explicit, generalized phonotactic rules
>*first*, and then apply the statistical model within those
>constraints. Which is what I'm doing with Canyonese/Amalishke.

Another way would be to try to do it interactively, with a larger or smaller dollop of machine learning.  Give the user a bunch of output words and let them say if there are any they don't like (or especially like), and perhaps what they don't (or do) like about them -- perhaps they can highlight a substring of the word that's the unpleasant (or especially pleasant) part.  Take that information and backpropagate, de(/in)creasing the conditional probabilities of sequences "like" the highlighted one, and iterate with another presentation of a list of words.  The trouble, of course, is deciding what "like" means -- which of the features of the sounds in question are the relevant ones.
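A rough sketch of the update step, keyed on literal bigrams only -- which, to be clear, dodges exactly the hard part I just named, since it treats no two distinct sequences as "alike" at all (names and the multiplicative-update scheme are my invention):

```python
import random
from collections import defaultdict

def make_weights():
    # every conditional transition starts equally weighted
    return defaultdict(lambda: 1.0)

def feedback(weights, substring, factor):
    """User highlighted `substring` in a generated word; scale the
    weights of its bigrams.  factor < 1 = disliked, > 1 = liked."""
    for pair in zip(substring, substring[1:]):
        weights[pair] *= factor

def pick_next(weights, alphabet, prev, rng=random):
    """Weighted draw of the next segment given the previous one."""
    ws = [weights[(prev, c)] for c in alphabet]
    return rng.choices(alphabet, weights=ws, k=1)[0]
```

Iterating feedback() over several rounds of presented words would push the conditional distribution toward the user's taste; generalising from the literal highlighted bigrams to featurally similar ones is the open problem.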

>If there's a lot of interest, I might actually go ahead and try to
>build something like this. Other people's input on what would make for
>a good, easy-to-use configuration format would also be welcome.