On 13 November 2015 at 02:49, Shreyas Sampat <[log in to unmask]> wrote:
> I've been playing with this tool for a couple of hours now and it's pretty
> great, so thanks for developing and sharing. If I may make a suggestion?

Absolutely!
Glad to hear you like it; I do plan to keep trying to make it better,
so stay tuned for further updates!

> Right now Logopoeist has a strange behavior where it will cluster output
> lengths and at times it will generate extremely long outputs, like the
> following:
>
> Run 1:
[snipped]

> Ok good so far, right? It's a little odd that we only got exactly
> one-syllable and exactly three-syllable outputs, but maybe that's just a
> statistical oddity.
>
>
> Run 2 (same grammar):
>
> blonékukhgyuklatapujhkhrutphlotaudrophjratyozrájidokúkhyasusyatusaphdnechmechpyéeonuvrosípyichyú
[snipped]

> What happened here? Now I'm not actually going to count vowel symbols but
> I'd bet my buttons that we are looking at 6 output words of the same
> syllable length again, only this time they're crazy long. Also note that
> all of the monosyllable outputs are of the form CCV. If we go back to the
> first output set, you see the same behavior (two word templates are being
> alternated between)
>
> Anyway this behavior struck me as curious and it made me think, perhaps
> this tool would benefit from an output length limiter, and maybe whatever
> mechanism is selecting templates to populate needs a bit of examining
> because it seems like it's getting stuck on a very small number of
> templates now.

Indeed! I noticed some similar behavior, but nothing quite as extreme
as that! It could just be bad luck with what comes out of the random
number generator, but that does seem unlikely. I think I shall try
generating some really long output sets and running some statistical
tests on them to see whether something is wrong with the distribution
selection algorithm.
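
One way such a test might look is a chi-square goodness-of-fit statistic
over a long run of outputs. This is only a sketch: the function name and the
idea of tallying which template produced each word are assumptions, not
anything Logopoeist currently exposes.

```python
from collections import Counter


def chi_square_statistic(samples, expected_probs):
    """Chi-square goodness-of-fit statistic for observed outcomes.

    `samples` is a list of observed outcomes (for example, the template
    used for each generated word). `expected_probs` maps every possible
    outcome to its intended, nonzero probability.
    """
    counts = Counter(samples)
    unexpected = set(counts) - set(expected_probs)
    if unexpected:
        raise ValueError(f"outcomes with no expected probability: {unexpected}")
    n = len(samples)
    # Sum of (observed - expected)^2 / expected over all possible outcomes.
    return sum(
        (counts[outcome] - n * p) ** 2 / (n * p)
        for outcome, p in expected_probs.items()
    )
```

The result would be compared against a chi-square critical value with
(number of outcomes - 1) degrees of freedom; a value far above it suggests
the selection is not following the intended distribution.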

I am pondering adding commands for minimum and maximum output lengths,
but there are some annoying problems to be solved before that is
practical.
The most significant problem is ensuring that limits are satisfiable,
i.e., if you specify a maximum length, can the software ensure that
the grammar actually permits words no longer than that maximum? Most of
the time, that really shouldn't be an issue, but if you end up making
a mistake (and everybody makes mistakes, even if they're just typos),
it would be really bad for the program to just sit there spinning
forever, outputting nothing, while trying to satisfy impossible
demands, rather than telling you that something is wrong and needs to
be fixed.
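
For a plain context-free grammar, one possible up-front check is to compute
the shortest word each rule can produce and compare it with the requested
maximum. The sketch below assumes a hypothetical grammar format (a dict
mapping each nonterminal to a list of productions); it is not Logopoeist's
actual format, and it ignores conditional distributions entirely.

```python
import math

# Hypothetical example grammar: any symbol that is not a key is a terminal.
EXAMPLE_GRAMMAR = {
    "Word": [["Syl"], ["Syl", "Word"]],
    "Syl": [["C", "V"], ["C", "V", "C"]],
    "C": [["p"], ["t"], ["k"]],
    "V": [["a"], ["i"]],
}


def min_lengths(grammar):
    """Return the shortest derivable length (in segments) per nonterminal.

    Each terminal counts as one segment. A nonterminal that can never
    finish expanding keeps a length of infinity.
    """
    best = {nt: math.inf for nt in grammar}
    changed = True
    while changed:  # iterate until no estimate can be lowered further
        changed = False
        for nt, productions in grammar.items():
            for production in productions:
                length = sum(best.get(symbol, 1) for symbol in production)
                if length < best[nt]:
                    best[nt] = length
                    changed = True
    return best


def max_is_satisfiable(grammar, start, max_len):
    """True if `start` can derive at least one word of length <= max_len."""
    return min_lengths(grammar)[start] <= max_len
```

With a check like this, an impossible maximum could be reported as an error
before any generation starts, rather than leaving the program spinning.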

I could implement min and max lengths and just add a warning to the
documentation that it may lock up the program, so use at your own
risk... but then there're still *efficiency* concerns. The obvious way
to do it is to just generate lots of words, but discard any that don't
meet the limits and don't bother showing them, but that's gonna eat up
a lot of CPU time, which just makes me feel gross. There are ways to
optimize it, but I really can't think of anything that doesn't
involve, at some point, generating random stuff that you just throw
away, which means there's always a small chance that the computer sits
there for an arbitrarily long time before outputting anything.
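
The generate-and-discard approach can at least be kept from hanging by
capping the number of attempts. A minimal sketch, assuming a hypothetical
`generate` callable that returns one candidate word per call; the wasted
work is still there, but an unsatisfiable limit becomes an error instead of
an endless wait:

```python
def generate_bounded(generate, min_len, max_len, max_attempts=10_000):
    """Call `generate` until a word fits the length limits, or give up.

    `generate` is any zero-argument function returning a word (assumed
    here; the real generator interface may differ). Raises RuntimeError
    after `max_attempts` rejected candidates.
    """
    for _ in range(max_attempts):
        word = generate()
        if min_len <= len(word) <= max_len:
            return word
    raise RuntimeError(
        f"no word of length {min_len}..{max_len} found in "
        f"{max_attempts} attempts; the limits may be unsatisfiable"
    )
```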

(If anyone else has ideas about how to resolve those problems, though,
I would be happy to hear them!)

To get around all that, however, it is possible to define grammars
that have maximum lengths. The downside of that approach is that it is
really painful to do so, but I've been thinking of some modifications
that would make it a lot easier.

First off, I'm thinking of extending the syntax for describing
grammars so that it uses some ideas from regular expressions and looks
a little more like what most people are used to for describing
phonotactics, i.e., surrounding things with parentheses to indicate an
optional element, like in "CV(C)" (a fairly common syllable structure
template), and using regex symbols like the Kleene star, +, and {curly
braces} to indicate that certain elements can be repeated (zero or
more times, one or more times, or between arbitrary fixed upper and
lower bounds, respectively). I still need to figure out a
good way to specify distributions for those kinds of options, though.
Perhaps a first pass will just use a built-in distribution, and more
user-accessible configuration options can be added later.
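
To make the idea concrete, here is a rough sketch of expanding templates
such as "CV(C)" or "C{1,2}V". Everything in it is an assumption made for
illustration: the function name, single uppercase letters as class names,
the `classes` mapping, and a fixed 50% chance for optional elements standing
in for the "built-in distribution". The unbounded star and + are left out.

```python
import random
import re

# One token is either "(XY)" (optional group) or "X" with optional "{m,n}".
TOKEN = re.compile(
    r"\((?P<opt>[A-Z]+)\)|(?P<sym>[A-Z])(?:\{(?P<lo>\d+),(?P<hi>\d+)\})?"
)


def expand(template, classes, rng=random, p_optional=0.5):
    """Expand a template into one random word.

    `classes` maps a class letter to a list of segments, for example
    {"C": ["p", "t"], "V": ["a"]}. `rng` may be the random module or a
    random.Random instance.
    """
    out = []
    pos = 0
    while pos < len(template):
        match = TOKEN.match(template, pos)
        if match is None:
            raise ValueError(f"unexpected {template[pos]!r} at position {pos}")
        pos = match.end()
        if match.group("opt") is not None:
            # Parenthesized group: include all of it or none of it.
            if rng.random() < p_optional:
                out.extend(rng.choice(classes[s]) for s in match.group("opt"))
        else:
            # Plain symbol, repeated between lo and hi times (default once).
            lo = int(match.group("lo")) if match.group("lo") else 1
            hi = int(match.group("hi")) if match.group("hi") else 1
            for _ in range(rng.randint(lo, hi)):
                out.append(rng.choice(classes[match.group("sym")]))
    return "".join(out)
```

Because every construct here has a fixed upper bound, the maximum word
length follows directly from the template, with nothing to discard.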

Second, I was thinking you could specify upper and lower bounds for
the counts of specific word-syntax elements, so that you could, for
example, specify that you want words whose syllable count falls within
a specific range. That feels simpler, and closer to what the average
user would probably want, than just limiting raw character counts, but
unfortunately it does still have a lot of the same verifiability
problems as maximum and minimum word lengths.
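
In the simplest case, where a word is just a sequence of independently
generated syllables, the count can be chosen first so that nothing is ever
thrown away. This is a sketch under that assumption only; `make_syllable`
is a hypothetical callable, and grammars where later syllables depend on
earlier ones would need more than this.

```python
import random


def word_with_syllable_bounds(make_syllable, min_syl, max_syl, rng=random):
    """Return a list of between min_syl and max_syl syllables.

    The syllable count is drawn uniformly from the allowed range, then
    `make_syllable` (a zero-argument function, assumed here) is called
    that many times.
    """
    if not 0 <= min_syl <= max_syl:
        raise ValueError("need 0 <= min_syl <= max_syl")
    count = rng.randint(min_syl, max_syl)
    return [make_syllable() for _ in range(count)]
```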

-l.