On 21 December 2012 08:03, Michael Everson <[log in to unmask]> wrote:
> On 21 Dec 2012, at 14:26, MorphemeAddict <[log in to unmask]> wrote:
>
>> Given abundant written text in a language you don't know, how much can you determine about that language, using only the texts themselves (i.e., no reference grammars, no dictionaries, etc.)? Can its grammar be deduced? The meaning of any of its words? How would you go about determining them?
>
> Unless there are illustrations, pretty much all you can work out is the number system and maybe some of the entities which have been counted.
>
> Cf. Linear A.

Yup. This sort of thing has been tried before. There's also, e.g., the Voynich manuscript (though we don't actually know for sure whether that even is a language, as opposed to a code or nonsense), proto-Elamite (in which there was recently a significant breakthrough: http://www.bbc.co.uk/news/business-19964786), and Egyptian prior to the Rosetta stone (though with modern methods we could probably get farther than was gotten on Egyptian).

You can get farther if you're lucky enough to be able to bring more external knowledge to bear. E.g., if we found an unknown Romance language, it would probably be pretty easy to recognize its strong similarity to other Romance languages and work from there. The nature of the script makes a big difference, too. If it's a writing system we already know, that obviously helps; an unknown writing system may itself make the language inaccessible if it obscures important features, and in any case it's one more level of complexity to deal with.
If the writing is unbroken (like Chinese, Japanese, Ancient Greek, and my handwriting), then you might not even be able to identify what a word is (Egyptian provided some help in the form of cartouches), in which case you're stuck with character-level analyses, and you can at least hope that numerals have a significantly different distribution from other characters so you can figure out the number system.

If you can identify words, then you can potentially get a lot farther: any of the bag-of-words methods from statistical natural language processing are then at your disposal, along with collocation / n-gram analyses. That means you could do document clustering (i.e., "we don't know what the topics actually are, but these two documents are probably about the same topic") and maybe some word sense disambiguation (i.e., "the distribution of this word is highly bimodal, so it probably has two meanings", though I wouldn't put much faith in its accuracy). You could then guess at what the possible topics might be based on external information, like where the texts were found. You might be able to extract some syntax rules by grouping words into classes based on similar distributional properties (this would take a truly enormous amount of data to do well, and involves an iterative clustering process, since the observed environment of one word that determines its class depends on the classes assumed for the other words that occur near it). That won't get you any closer to deciphering meaning, though. If that works, you might hope to notice similarities with other languages you know about, and try to apply comparative methods.

-l.
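To make the distributional-grouping idea concrete, here's a toy sketch in Python. The corpus, window size, and all function names are invented for illustration; a real attempt would need vastly more data and the iterative re-clustering step described above. The sketch just builds a context-count vector for each word and compares vectors by cosine similarity: words that occur in similar environments (here, "cat" and "dog") come out more alike than words that don't (here, "cat" and "sat").

```python
# Toy sketch of distributional word-class grouping. The corpus and all
# names here are invented for illustration, not taken from any real
# decipherment project.
from collections import Counter
from math import sqrt

corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "a cat ran on a mat . a dog ran on a rug .").split()

def context_vector(word, window=1):
    """Count the words appearing within `window` positions of `word`."""
    ctx = Counter()
    for i, w in enumerate(corpus):
        if w == word:
            lo = max(0, i - window)
            hi = min(len(corpus), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[corpus[j]] += 1
    return ctx

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "cat" and "dog" share all their contexts (they follow "the"/"a" and
# precede "sat"/"ran"), while "cat" and "sat" share none.
print(cosine(context_vector("cat"), context_vector("dog")))  # → 1.0
print(cosine(context_vector("cat"), context_vector("sat")))  # → 0.0
```

On a real undeciphered corpus you'd cluster the vectors rather than eyeball pairwise scores, and, as noted above, you'd have to iterate: once words are grouped into tentative classes, the contexts should be recomputed over class labels instead of raw words.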