On Sat, 21 Nov 2015 11:17:17 -0700, Logan Kearsley <[log in to unmask]> wrote:

>On 20 November 2015 at 18:57, Alex Fink <[log in to unmask]> wrote:
>> A couple recent thoughts.
>> (0) The part of And's spec that had troubled me the most, in terms of it not being clear straightaway what the Correct thing to do was, was how to fuzz the distributions, i.e. how to do what And specifies above in terms of peak and inflection point.  Well, probably we won't know how to do this in a nature-approximating way until the Gusein and Zade of this research area come along.  But implementationally, it's occurred to me that all the fuzzing can be built into the word grammar, so the program doesn't have to do it separately.
>> For instance, to take a simple example, if one wants to be able to ask for a word of given maximum complexity but accept a word of up to say two complexity points less, with no other biassing, one can do this by including a new terminal in the word grammar, appearing exactly once in each word, which expands equiprobably to segmental zero with complexity zero, segmental zero with complexity one, and segmental zero with complexity two.  Giving a different probability distribution there clearly lets you make the fuzzing fall off at different rates.  Or if you wanted the amount of fuzz to increase proportionally to the word length: no problem, don't just make this terminal appear once in each word, but let there be a copy of it as sister to every terminal with segmental substance.  And so on.
>> This works cleanly in And's case of discrete complexity scores.  To do the Correct thing in Jim's case of basically-continuous complexity scores you'd need to be able to specify non-discrete distributions.
>That seems inelegant to me. It would result in the same word appearing
>multiple times in the output with different complexity scores, unless
>there were additional post-processing to remove duplicates and keep
>the minimum or something like that.

But the same word already has the potential to "appear multiple times in the output" if the word grammar is ambiguous, and a priori we certainly shouldn't assume it's unambiguous ('cause converting a grammar to an unambiguous one is something we shouldn't force on the user).  Is that a problem?  As regards different complexity scores, well, I suppose I was already thinking in the mindset of my (1), so that a word just has one complexity polynomial rather than several complexity scores.

At any rate, having to hardcode in the sorts of distributions one wants complexity to fall off along seems at least as inelegant, to me.
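
To make the fuzz terminal from (0) concrete, here is a minimal sketch. The three-segment inventory, its probabilities and complexities, and the fixed word shape SEG SEG SEG FUZZ are all made up for illustration; only the FUZZ terminal itself (segmental zero with complexity 0, 1 or 2, equiprobable) comes from the description above.

```python
from itertools import product
from math import prod

# Hypothetical toy inventory: (segment, probability, complexity).
SEGMENTS = [("a", 0.5, 1), ("t", 0.3, 2), ("q", 0.2, 3)]
# The fuzz terminal: segmental zero carrying complexity 0, 1 or 2.
FUZZ = [("", 1 / 3, 0), ("", 1 / 3, 1), ("", 1 / 3, 2)]


def words_with_total(target, length=3):
    """List (word, segmental complexity, probability) for every expansion
    of SEG^length FUZZ whose total complexity, fuzz included, is target."""
    results = []
    for choice in product(*([SEGMENTS] * length), FUZZ):
        total = sum(c for _, _, c in choice)
        if total != target:
            continue
        word = "".join(s for s, _, _ in choice)
        fuzz = choice[-1][2]  # complexity contributed by the fuzz terminal
        results.append((word, total - fuzz, prod(p for _, p, _ in choice)))
    return results
```

Asking for a total of 6 here returns words whose segmental complexity is 4, 5 or 6, with no separate fuzzing step in the program; putting a different distribution in FUZZ changes how fast the fuzz falls off.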

>I'd simply do an incremental greedy search. [...]

Hm.  In Jundian the complexities are discrete and small (each phoneme scores from 1 to 5, inclusive?) and the desired behaviour is to pick one word from the large collection which have complexity (say) 13, so there'll be a whole lot which are "closest", and we don't just want to get the same one every time.  I guess that once you have the list you can pick from them, but I don't see right away how to mesh your description with allowing explicit probabilities for the elements, and in general straightforward greedy search like this is hard to tailor to a given probability distribution.

What I'd do for this in practice is first precompute a table listing, for each category in the word grammar and each complexity score up to whatever limit, the total probability mass of expansions of that category with that complexity (easy to do by a memoised recurrence).  Then random expansions conditioned on the complexity are easy to do: for each replacement made, apportion complexity points between the symbols in the right hand side of the rewrite with probabilities drawn from the table, and recurse down.
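
A minimal sketch of that table-then-sample scheme, under some loudly stated assumptions: the toy grammar below is hypothetical, every terminal has complexity at least 1, and the grammar is not left-recursive (so the memoised recurrence terminates). The names `mass`, `seq_mass` and `sample` are mine, not from any existing generator.

```python
import random
from functools import lru_cache

# Hypothetical toy grammar. Terminals map to a complexity >= 1;
# categories map to a list of (probability, right-hand side).
TERMINALS = {"t": 1, "q": 3, "a": 1, "y": 2}
RULES = {
    "Word": [(0.5, ("Syl",)), (0.5, ("Syl", "Word"))],
    "Syl": [(0.7, ("C", "V")), (0.3, ("V",))],
    "C": [(0.6, ("t",)), (0.4, ("q",))],
    "V": [(0.7, ("a",)), (0.3, ("y",))],
}


@lru_cache(maxsize=None)
def mass(sym, n):
    """Total probability of expansions of sym with complexity exactly n."""
    if sym in TERMINALS:
        return 1.0 if TERMINALS[sym] == n else 0.0
    return sum(p * seq_mass(rhs, n) for p, rhs in RULES[sym])


@lru_cache(maxsize=None)
def seq_mass(rhs, n):
    """The same quantity for a sequence of symbols expanded independently."""
    if not rhs:
        return 1.0 if n == 0 else 0.0
    total = 0.0
    for k in range(n + 1):
        head = mass(rhs[0], k)
        if head:  # skipping zero terms also keeps the recursion well-founded
            total += head * seq_mass(rhs[1:], n - k)
    return total


def sample(sym, n, rng=random):
    """Draw one expansion of sym conditioned on complexity exactly n."""
    if mass(sym, n) == 0.0:
        raise ValueError(f"{sym} has no expansion of complexity {n}")
    if sym in TERMINALS:
        return sym
    rules = RULES[sym]
    weights = [p * seq_mass(rhs, n) for p, rhs in rules]
    _, rhs = rng.choices(rules, weights=weights)[0]
    out, left = [], n
    for i, child in enumerate(rhs):
        rest = rhs[i + 1:]
        ks = range(left + 1)
        # Apportion k points to this child with probability proportional to
        # (mass of child at k) * (mass of the remaining symbols at left - k).
        w = [mass(child, k) * seq_mass(rest, left - k) for k in ks]
        k = rng.choices(ks, weights=w)[0]
        out.append(sample(child, k, rng))
        left -= k
    return "".join(out)
```

Calling `sample("Word", 13)` repeatedly gives different words of complexity 13, each drawn with its probability under the grammar conditioned on that complexity, so the explicit probabilities on the rules are respected rather than overridden by a search order.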

>> (1) This is more of a theoretical unification.  Take an extant word generator which allows specification of probability distributions, and replace the real numbers in which the probabilities live with the univariate polynomial ring R[t], where the "probability" p*t^n is to be understood as meaning that this option has probability p and incurs complexity score n.  The effect of this replacement is basically to piggyback the complexity-score computations on top of the probability computations that the word generator is implicitly already doing; the complexity scores will behave the right way via this piggybacking.
>> (This has an indiscrete analogue too: replace polynomials with distributions on the real line and product with convolution.)
>> I don't imagine this idea would be plug-and-play into any of the extant word generators: even if you'd had the inexplicable foresight to let probabilities be of a generic data-type, you'd still need some special stuff to generate words of a complexity the user selects (namely, some kind of coefficient extraction).  But a tool that can explicitly do the probabilistic analogue of what does is very nearly there.
>Not quite plug-and-play, but it would be fairly simple to add that
>capacity to Logopoeist. I don't currently plan on pursuing that
>feature myself, but it is open source and I'll accept pull

That's another one for the "copious free time" files (I'm in awe of the amount you get done so quickly!)  And I'd have to have a good hard think how to make it mesh with your whole conditioning perspective first.
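
For concreteness, a minimal sketch of what the R[t] replacement in (1) amounts to, with polynomials stored as dicts from exponent (complexity) to coefficient (probability). The C and V inventories and the syllable rule are hypothetical; the helper names are mine. Alternation between rules becomes a weighted sum, concatenation becomes a product, and "coefficient extraction" is just a dict lookup.

```python
from collections import defaultdict


def p_add(f, g):
    """Sum of two polynomials given as {exponent: coefficient} dicts."""
    h = defaultdict(float)
    for poly in (f, g):
        for n, c in poly.items():
            h[n] += c
    return dict(h)


def p_mul(f, g):
    """Product of two polynomials: exponents (complexities) add,
    coefficients (probabilities) multiply."""
    h = defaultdict(float)
    for m, a in f.items():
        for n, b in g.items():
            h[m + n] += a * b
    return dict(h)


def p_scale(p, f):
    """Multiply a polynomial by a plain probability p."""
    return {n: p * c for n, c in f.items()}


# Hypothetical inventories: 0.6*t^1 + 0.4*t^3 and 0.7*t^1 + 0.3*t^2.
C = {1: 0.6, 3: 0.4}
V = {1: 0.7, 2: 0.3}
# Syl -> C V with probability 0.7, or V alone with probability 0.3.
SYL = p_add(p_scale(0.7, p_mul(C, V)), p_scale(0.3, V))
```

The coefficient `SYL[2]` is the total probability of a syllable of complexity 2, and summing all coefficients (setting t = 1) recovers the ordinary probability total of 1, which is the sense in which the complexity bookkeeping piggybacks on the probability bookkeeping.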