On 21 December 2012 08:03, Michael Everson <[log in to unmask]> wrote:
> On 21 Dec 2012, at 14:26, MorphemeAddict <[log in to unmask]> wrote:
>> Given abundant written text in a language you don't know, how much can you determine about that language, using only the texts themselves (I.e., no reference grammars, no dictionaries, etc.)? Can its grammar be deduced? The meaning of any of its words? How would you go about determining them?
> Unless there are illustrations, pretty much all you can work out is the number system and maybe some of the entities which have been counted.
> Cf. Linear A.

Yup. This sort of thing has been tried before. There's also, e.g., the
Voynich manuscript (though we don't actually know for sure whether that
even is a language, as opposed to a code or nonsense), proto-Elamite
(in which there was recently a significant breakthrough), and Egyptian
prior to the Rosetta Stone (though we could probably get farther than
was gotten on Egyptian with modern methods).

You can get farther if you're lucky enough to be able to bring more
external knowledge to bear. E.g., if we found an unknown Romance
language, it would probably be pretty easy to recognize its strong
similarity to other Romance languages and work from there.

The nature of the script makes a big difference, too. If it's a
writing system we already know, that obviously helps; an unknown
writing system itself may make the language inaccessible if it
obscures important features, and in any case that's just one more
level of complexity to deal with. If the writing is unbroken (like
Chinese, Japanese, Ancient Greek, and my handwriting), then you might
not even be able to identify what a word is (Egyptian provided some
help in the form of cartouches), in which case you're stuck with
character-level analyses and you can at least hope that numerals have
a significantly different distribution than other characters so you
can figure out the number system.
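A minimal sketch of that character-level idea in Python: count each sign's immediate neighbors and check whether a hypothesized set of signs clusters with itself, as numerals tend to do. The corpus and the candidate set here are invented for illustration (ASCII digits stand in for unknown numeral signs); a real decipherment attempt would have to generate and test candidate sets rather than assume one.

```python
from collections import Counter, defaultdict

def neighbor_profiles(text):
    """For each character, count which characters appear immediately
    adjacent to it. Signs with unusual neighbor profiles (numerals,
    which tend to cluster with each other) should stand out."""
    profiles = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        profiles[a][b] += 1
        profiles[b][a] += 1
    return profiles

# Toy corpus standing in for an undeciphered text; the digits play
# the role of the unknown numeral signs.
corpus = "abc 12 def ghi 345 abc def 67 ghi abc 89 def"
profiles = neighbor_profiles(corpus)

def numeral_likeness(ch, candidates="0123456789"):
    """Fraction of a character's neighbors drawn from a hypothesized
    numeral set -- numerals neighbor each other far more than other
    signs do."""
    prof = profiles[ch]
    total = sum(prof.values())
    if total == 0:
        return 0.0
    return sum(n for c, n in prof.items() if c in candidates) / total
```

On this toy data, `numeral_likeness('4')` comes out at 1.0 (both its neighbors are digits) while `numeral_likeness('a')` is 0.0, which is the kind of distributional gap you'd hope to see.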

If you can identify words, then you can potentially get a lot farther:
any of the bag-of-words methods from statistical natural language
processing are then at your disposal, along with collocation / n-gram
analyses. That means you could do document clustering (i.e., "we don't
know what the topics actually are, but these two documents are probably
about the same topic"), and maybe some word sense disambiguation (i.e.,
"the distribution of this word is highly bimodal, so it probably has
two meanings," though I wouldn't put much faith in its accuracy). You
could then guess at what possible topics might be based on external
information like where the texts were found. You might be able to
extract some syntax rules by grouping words into classes based on
similar distributional properties (this would take a truly enormous
amount of data to do well, and involves an iterative clustering
process, since the environment that determines one word's class
depends on the classes assumed for the other words that occur near
it). That won't get you any closer to deciphering meaning, though. If
that works, you might hope to notice similarities with other languages
you know about, and try to apply comparative methods.