On Wed, Oct 27, 2010 at 6:42 PM, Arthaey Angosii wrote:
> Now, who has a good source of several different natlang wordlists? :)

This script, fed an etext in the target language, will produce a list
of unique words occurring in it.  It needs work to handle etexts in
UTF-8, but seems to handle Latin-1 fine.

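#!/usr/bin/perl -w

use strict;

my %words;    # keys are the unique words seen so far

while ( <> ) {
    # Split on runs of anything that is not an ASCII letter, a Latin-1
    # high-bit byte, an apostrophe, or a hyphen.
    my @words_this_line = split m/[^a-zA-Z\x80-\xFF'-]+/;
    foreach ( @words_this_line ) {
	# Trim apostrophes and hyphens from both ends (quotes, dashes).
	s/^['-]+//;
	s/['-]+$//;
	next unless length;    # skip empty fields and bare punctuation
	$words{ $_ } = 1;
    }
}

# Print one word per line, in sorted order.
foreach ( sort keys %words ) {
    print $_ . "\n";
}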
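
For UTF-8 etexts, here is a minimal sketch of one possible adaptation, not tested against real corpora. It decodes each line with the core Encode module and matches letters with the Unicode property \p{L} instead of the byte range \x80-\xFF. The helper name words_in_line is made up for illustration, and the input is assumed to be valid UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the words found in one line of
# already-decoded (character, not byte) text.
sub words_in_line {
    my ($line) = @_;
    my @found;
    # \p{L} is any Unicode letter; \p{M} keeps combining marks attached.
    foreach my $word ( split /[^\p{L}\p{M}'-]+/, $line ) {
        # Trim apostrophes and hyphens from both ends.
        $word =~ s/^['-]+//;
        $word =~ s/['-]+$//;
        push @found, $word if length $word;
    }
    return @found;
}

# Encode the output as UTF-8 so the decoded words print correctly.
binmode STDOUT, ':encoding(UTF-8)';

my %words;
while ( my $raw = <> ) {
    # Assumption: the input is UTF-8. Malformed bytes are replaced
    # with U+FFFD, which is Encode's default behaviour.
    my $line = decode( 'UTF-8', $raw );
    $words{$_} = 1 for words_in_line($line);
}

# Sorted by code point, not by any language-specific collation.
print "$_\n" for sort keys %words;
```

Run it the same way as the original, with the etext named on the command line or piped to standard input. The sort is by code point; language-aware ordering would need something like Unicode::Collate.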

-- 
Jim Henry
http://www.pobox.com/~jimhenry/