On 2015-11-20 at 01:35, Jim Henry wrote:
>> Everyword:
>> >Offline (Perl), command-line / DSL-based. Unicode support is unknown.
>> >Not technically a random word generator. Everyword generates all
>> >possible words for a finite, fixed-length phonology (no repetitions or

> Benct has an improved version of this called wordgen.  I'm not sure if
> it's available from his website or if he emailed it to me.

I emailed an earlier version to you. It is not really an improved
version of everyword since unlike everyword it doesn't generate
every possible word from a set of syllable structures. The DSL of
the first version was partly inspired by that of everyword but
that has changed in later versions. The recent word generator
discussion has prompted me to do a long overdue rewrite which
removes the requirement to have whitespace around all tokens
including parentheses and brackets. I believe the version you saw
didn't have the weighted list shortcut syntax yet. Recent
versions also have conditionals by regular expression (abort the
current 'word' unless it matches, or does not match, a pattern;
or execute one subrule if a pattern does [or doesn't] match and
another otherwise), and fixup by regular expression
substitution. The rewrite will also remove
the use of string eval (a security risk) to handle escapes and
backreferences in double-quoted strings and substitution
replacement strings -- in favor of String::Formatter
(https://metacpan.org/pod/String::Formatter) format strings -- so
that I dare publicize it. A side effect of the use of string eval
was support for entering Unicode characters by codepoint, name or
user-defined alias. I'm keeping a different mechanism for that
even though I don't need it as much anymore. The price of course
is a heavier program with dependencies. It will require perl 5.14
but that's less of a problem nowadays -- certainly in itself no
reason to not publicize it. A real reason to hold back may be its
primitive mechanism compared to randomization according to Zipf's
law and mathemagical refinements. I do believe however that it
will -- apart from the regular expression stuff which isn't
essential -- be easy to use for non-programmers. The DSL is
declarative and was designed to resemble a syllable structure
formula. A program may look like this:

     == Options
     # Defaults
     --toprule       word
     --count         7
     --unique
     --interactive
     --normalize     nfd

     == Aliases

     umlaut = 'COMBINING DIAERESIS BELOW'
     ring   = 'COMBINING RING BELOW'

     == Rules

     &onset      := [    # .2 etc. denote weight
                     [ p.3 t.4 tz tr c k.2 ].4
                     [ ph.3 th.4 ts thr ch kh.2 ]
                     [ b.3 d.4 dz dr j g.2 ].3
                     [ m.3 n.4 ñ ng.2 ].2
                     [ s.4 shr sh.2 h.3 ]
                     [ r y w ]
                     ()      # null
                 ]
     # A consonant or nothing

     &medial     := ( [ r y w () ]
                     # y, w or nothing, with nothing rather likely.
                     !m{ [pbm] h? w | ([td][zs]? | [cs]h? |n|ñ|ng|r|y) y }x )
                     # No labial + labial, no y after dentals/retros/palatals

     &nucleus    := (
                [ ([ ( [a e i o u] <<@umlaut>> ).2 ḁ ].2) : [r z] ]
     # a aa ḁ ḁḁ a̤ a̤a̤ e ee i ii o oo o̤ o̤o̤ u uu ṳ ṳṳ r z,
     # with less ḁ a̤ o̤ ṳ r z,
                     !m/[ei]%{N:umlaut}|\S(yi|wu)|rr|zz/
                     # no e̤ or i̤, no Cyi/Cwu/rr/zz
                 )

     &tone     := [ s/[aeiourz][%{N:umlaut,ring}]?\K$/%{N:acute}/ ]
     # optional high tone with predefined combining diacritic alias

     &glide      := { i o s/oo/ou/ !m/[ou]%{N:umlaut}o/ }
     # falling di/triphthongs exist but ii oeo ueo are out

     &coda       := { [p t k r] m n.2 ng }
     # optional coda more often nasal than stop/r and most often n

     &syllable   := ( &onset &medial &nucleus &tone &glide &coda )
     # tone optional!

     &word       := ( &syllable+3
     # A word has one to three syllables
                         s/p([bdjg])/b%{s:1}/g
                         s/t(?<onset>[bdjg])/d%{s:onset}/g
                         s/k(?=[bdjg])/g/g   # best method
                         # codas are voiced before voiced stops/affricates
                         !m/([^aeiou])\1|dj|tc|tsh/
                         # there are no geminates, no long syllabic r or z
                     )
     &name       := ( ~&word " " ~&word )
     # two capitalized words separated by a space
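
As an illustration of the weighted lists and the `!m/.../` tests in
the program above, here is a rough Python sketch. It is not how
wordgen works internally (that is Perl). The names `parse_item`,
`pick`, `generate` and `make_word` are made up for this example, and
it assumes that `p.3` means weight 3, that an item without a weight
has weight 1, and that a failed test means the whole word is thrown
away and generated again. Weights on nested groups such as
`[ ... ].4` are left out.

```python
import random
import re


def parse_item(token):
    """Split 'p.3' into ('p', 3); no weight means weight 1 (an assumption)."""
    text, dot, weight = token.partition(".")
    return (text, int(weight)) if dot else (text, 1)


def pick(tokens, rng=random):
    """Pick one item from a weighted list such as ['p.3', 't.4', 'tz']."""
    items = [parse_item(t) for t in tokens]
    texts = [text for text, _ in items]
    weights = [weight for _, weight in items]
    return rng.choices(texts, weights=weights, k=1)[0]


def generate(make_word, reject, max_tries=1000):
    """Regenerate until the candidate does not match `reject` (like !m/.../)."""
    for _ in range(max_tries):
        word = make_word()
        if not reject.search(word):
            return word
    raise RuntimeError("no acceptable word found")


# Hypothetical mini-phonology, loosely based on the rules above.
onsets = ["p.3", "t.4", "tz", "tr", "c", "k.2"]
codas = ["p", "t", "k", "n.2"]


def make_word():
    # Two syllables: onset + a + coda, then onset + a.
    return pick(onsets) + "a" + pick(codas) + pick(onsets) + "a"


# Reject words containing a doubled consonant, e.g. "tatta".
print(generate(make_word, re.compile(r"([^aeiou])\1")))
```

The retry loop is the simplest way to honour a "no geminates" kind of
constraint; it gets slow only if most candidates are rejected.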

This doesn't illustrate branching conditionals because I couldn't 
come up with anything non-contrived. The syntax is `( m/.../ ? 
(TRUE) | (FALSE) )`, where the condition may be negated with `!` 
and the use of `|` rather than `:` is because `a:` means "`a` or 
`aa`".
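
For those who think more easily in code, here is a minimal Python
sketch of what such a branching conditional might do. It is only an
illustration: the function name `branch`, the pattern and the
suffixes are invented here, and it assumes that a subrule simply
appends its output to the word generated so far.

```python
import re


def branch(word, pattern, if_true, if_false, negate=False):
    """Append one of two suffixes to `word`, chosen by a regex test.

    Mirrors ( m/.../ ? (TRUE) | (FALSE) ); negate=True mirrors the `!` form.
    """
    matched = re.search(pattern, word) is not None
    if negate:
        matched = not matched
    return word + (if_true if matched else if_false)


# Contrived example: add "n" after a vowel-final word, "a" otherwise.
print(branch("kato", r"[aeiou]$", "n", "a"))
print(branch("kat", r"[aeiou]$", "n", "a"))
```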

Currently the meaning of the bracketed (sub)rules is like this:

                 plain   optional
     list/group  (...)   <...>
     alternative [...]   {...}

Note that `<...>` could be written `[ (...) () ]` or `(...)*` and 
`{...}` could be written `[ [...] () ]` or `[...]*`. Perhaps I 
should drop the optionality brackets (right column) in favor of 
the other syntaxes but then I would lose the possibility to 
express "very optional" with `<<...>>` or `<{...}>` which looks 
kind of elegant. Or I could have only three kinds of 
brackets/(sub)rules, viz.

     list/group  (...)
     alternative [...]
     optional    {...}

That would free `< >` for conditional tests checking the length of 
the generated string, which would be useful.
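
To make the "very optional" point concrete, here is a small Python
sketch that computes exact output probabilities for the bracket
types. It assumes that `<...>` really behaves like `[ (...) () ]`
and that unweighted alternatives are equally likely. The helper
names `lit`, `alt`, `seq` and `opt` are invented for the example and
are not part of wordgen.

```python
from fractions import Fraction


def lit(text):
    """A literal: always produces `text`."""
    return {text: Fraction(1)}


def alt(*rules):
    """[...]: choose one sub-rule, each equally likely (assumed)."""
    out = {}
    share = Fraction(1, len(rules))
    for rule in rules:
        for text, p in rule.items():
            out[text] = out.get(text, Fraction(0)) + share * p
    return out


def seq(*rules):
    """(...): concatenate the outputs of the sub-rules in order."""
    out = {"": Fraction(1)}
    for rule in rules:
        nxt = {}
        for left, p in out.items():
            for right, q in rule.items():
                key = left + right
                nxt[key] = nxt.get(key, Fraction(0)) + p * q
        out = nxt
    return out


def opt(rule):
    """<...> or {...}: the rule or nothing, i.e. [ rule () ]."""
    return alt(rule, lit(""))


# ( k <<y>> a ): the doubled brackets make the y rarer.
for word, p in seq(lit("k"), opt(opt(lit("y"))), lit("a")).items():
    print(word, p)
```

Under these assumptions a single `<y>` yields y half the time and a
doubled `<<y>>` one time in four, which is what the shorter
three-bracket scheme would have to express with explicit weights.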

Thoughts, anybody?

/bpj