Is there a corpus of phoneticized English (not necessarily spoken) similar
to Brown's?

stevo


On Sun, Dec 22, 2013 at 3:10 PM, Jim Henry <[log in to unmask]> wrote:

> On Sun, Dec 22, 2013 at 8:40 AM, Tristan <[log in to unmask]> wrote:
> >> Would English be more difficult to decipher (in a cryptogram, e.g.) if
> >> it were originally enciphered with the articles prefixed to the
>
> > Yes, but not very much. If the decipherer didn't know it was the case, it
> > would be slightly more effective. You can remove every space and not
>
> As an experiment, I stripped every instance of "the", "a" and "an"
> from a million-word etext (Macaulay's History of England) and computed
> character frequencies before and after.  (These are frequencies of all
> characters including space and punctuation, but only spaces and
> letters show up in the top 10.)
>
> ==> with articles <==
>  984592        15.80%        (space)
>  655099        10.51%        e
>  473350        7.59%        t
>  391179        6.28%        a
>  379825        6.09%        o
>  358633        5.75%        n
>  336552        5.40%        i
>  321708        5.16%        h
>  319893        5.13%        s
>  303685        4.87%        r
>
> ==> without articles <==
>  984592        16.57%        (space)
>  568683        9.57%        e
>  386934        6.51%        t
>  379825        6.39%        o
>  364927        6.14%        a
>  354965        5.97%        n
>  336552        5.66%        i
>  319893        5.38%        s
>  303685        5.11%        r
>  235292        3.96%        h
>
> The ranks of 'a' and 'o' swap, but their frequencies are very similar
> either way.  'h' drops a few ranks.  The ranks of the big 'e' and 't'
> are unchanged.
>
> Replacing all instances of 'th' (in 'these', 'bath' etc.) with 'z' has
> a bigger impact on relative frequencies, dropping 't' and 'h' by
> several ranks:
>
>  984592        16.21%        (space)
>  655099        10.79%        e
>  391179        6.44%        a
>  379825        6.25%        o
>  358633        5.90%        n
>  336552        5.54%        i
>  319893        5.27%        s
>  314159        5.17%        t
>  303685        5.00%        r
>  222909        3.67%        d
>
> Still, not a huge impact on other relative letter frequencies or the
> difficulty of simple cryptograms.  And if you're using a cipher that's
> vulnerable to that level of attack for anything more serious than an
> espionage RPG, you're in trouble anyway.
>
> --
> Jim Henry
> http://www.pobox.com/~jimhenry/
> http://www.jimhenrymedicaltrust.org
>
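
For anyone who wants to repeat the experiment quoted above, here is a minimal Python sketch. It is not Jim's actual script. The function names are made up for illustration, and it assumes letters are folded to lowercase before counting and that the articles are deleted while the spaces around them are left alone (which would match the unchanged space count in the quoted tables).

```python
import re
from collections import Counter

# Assumption: articles are matched as whole words, case-insensitively.
ARTICLE_RE = re.compile(r"\b(?:the|an|a)\b", re.IGNORECASE)
TH_RE = re.compile(r"th", re.IGNORECASE)


def top_chars(text, n=10):
    """Return the n most common characters as (char, count, percent) tuples."""
    counts = Counter(text.lower())  # assumption: case is folded before counting
    total = sum(counts.values())
    return [(ch, c, 100.0 * c / total) for ch, c in counts.most_common(n)]


def strip_articles(text):
    """Delete 'the', 'a' and 'an', leaving the surrounding spaces in place."""
    return ARTICLE_RE.sub("", text)


def replace_th(text):
    """Replace every 'th' digraph (any case) with 'z'."""
    return TH_RE.sub("z", text)


def report(label, text):
    """Print a table in the same layout as the quoted message."""
    print(f"==> {label} <==")
    for ch, count, pct in top_chars(text):
        print(f"{count:>8}        {pct:.2f}%        {ch}")
```

To use it, read the etext into a string (for example from a hypothetical local file `history.txt`) and call `report("with articles", text)`, `report("without articles", strip_articles(text))` and `report("th -> z", replace_th(text))`. The exact counts will depend on the edition of the text and on how punctuation and case are handled.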