Jyri, what were the noun and verb with the most definitions?

stevo

On Sat, Feb 14, 2015 at 10:17 PM, Jyri Lehtinen <[log in to unmask]>
wrote:

> 2015-02-13 1:51 GMT+02:00 Siva Kalyan <[log in to unmask]>:
>
> > Interesting idea—and it should be easy to do using (English) WordNet.
> >
> > I suspect polysemy would follow a power law (i.e., lots and lots of words
> > with just one sense, very few with lots of senses).
> >
>
> I went and did exactly that. I took the WordNet 3.1 database (
> http://wordnet.princeton.edu/wordnet/download/current-version/) and
> plotted, separately for its lists of nouns, adjectives, adverbs and verbs,
> the distribution of the number of definitions given per word. It turns out
> that the distributions do indeed follow power laws. I then did quick fits
> to the dictionary data to estimate the power law exponent (alpha) and back
> up the claim of power law behaviour. The results and plots for the
> different word classes are:
>
> Nouns
> 117953 words
> alpha = -3.81 +- 0.02
> http://kirnis.kapsi.fi/kamaa/distr-noun.png
>
> Adjectives
> 21499 words
> alpha = -3.40 +- 0.02
> http://kirnis.kapsi.fi/kamaa/distr-adjective.png
>
> Adverbs
> 4475 words
> alpha = -3.50 +- 0.03
> http://kirnis.kapsi.fi/kamaa/distr-adverb.png
>
> Verbs
> 11540 words
> alpha = -2.72 +- 0.03
> http://kirnis.kapsi.fi/kamaa/distr-verb.png
>
> The fits are just simple curve fits to the histograms, and to me it looks
> like they underestimate the error bars of the exponents. I also tried a
> more refined Bayesian regression but couldn't get the sampling to behave
> quickly enough for a short play with the data. Still, it seems quite safe
> to say that the distribution of definitions per word is flatter (less
> negative alpha) for verbs than for the other word classes, i.e. verbs have
> more polysemy. The most definitions given for a noun, for example, is 33,
> while for verbs, with about ten times fewer words in the dataset, it's an
> incredible 59. I also think that the power law exponent is a good
> contender for a polysemy index in itself, though you could of course also
> derive the mean number of definitions per word from it if you so wish.
>
> This part was quick enough. The hard part, as was already discussed, would
> be to gather more dictionary data and ensure that their editing follows
> comparable standards. You don't want to be comparing compact travel
> dictionaries with thoroughly researched academic ones. It's clear at a
> glance that they will show very different levels of polysemy for the same
> language. I don't think there would be a full PhD in this, but with a bit
> more time spent on the statistics and a good discussion of how to compare
> different datasets you should be able to pull off a good paper.
>
>    -Jyri
>
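
For anyone who wants to try the two calculations mentioned above, here is
a minimal sketch in Python. It is not Jyri's actual code: it assumes that
"simple curve fitting" means a least-squares line through the log-log
histogram, and that the mean is taken over a pure power law p(k) ~ k^alpha
starting at k = 1. The function names and the synthetic histogram are
invented for illustration and are not WordNet data.

```python
import math


def fit_power_law_exponent(hist):
    """Least-squares slope of log(count) against log(k).

    hist maps k (definitions per word, k >= 1) to the number of words
    that have k definitions. Returns (alpha, intercept) for the model
    count ~ exp(intercept) * k**alpha. Empty bins are skipped.
    """
    points = [(math.log(k), math.log(n))
              for k, n in hist.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two non-empty bins")
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        raise ValueError("all bins have the same k")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    alpha = sxy / sxx
    return alpha, mean_y - alpha * mean_x


def mean_senses(alpha, k_max=100000):
    """Mean of a discrete power law p(k) ~ k**alpha on k = 1..k_max.

    Assumption: the law holds all the way down to k = 1. The mean of the
    untruncated law is only finite for alpha < -2, so other values are
    rejected.
    """
    if alpha >= -2:
        raise ValueError("mean diverges unless alpha < -2")
    norm = sum(k ** alpha for k in range(1, k_max + 1))
    first_moment = sum(k ** (alpha + 1) for k in range(1, k_max + 1))
    return first_moment / norm


if __name__ == "__main__":
    # Synthetic histogram that follows count = 1000 * k**-3 exactly.
    synthetic = {k: 1000.0 * k ** -3.0 for k in range(1, 21)}
    alpha, intercept = fit_power_law_exponent(synthetic)
    print("fitted alpha:", alpha)
    print("implied mean definitions per word:", mean_senses(alpha))
```

A straight-line fit in log-log space is the quickest option, but it gives
every bin equal weight, including the sparse bins in the tail, so its error
bars are likely to be too optimistic. A maximum-likelihood fit would be the
more careful choice for a paper.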