My master's research is about the application of computer science to the semantic analysis of natural languages. Natural languages are the languages humans created to communicate among themselves, very different from programming languages (used to instruct computers) and from the formal notations of mathematics and logic. Nowadays, with the Internet and the plethora of information available to everyone, natural language processing has become an important and widely researched problem. Search engines like Google tackle some aspects of it, spellcheckers like Microsoft Word's tackle others, but there is still no computational system able to deal with the problem as a whole. My master's dissertation was a modest attempt to improve my understanding of it.

The whole work is based on Solomonoff's Theory of Prediction, a very promising model of learning, along with LSA (Latent Semantic Analysis), a mathematical method for establishing correlations between text excerpts. Two systems were implemented: a restricted prototype of the Solomonoff predictor and a search engine based on LSA. These programs were put through large batches of tests, far too large to be presented in full in the dissertation. Thankfully, the cost of keeping it all online is negligible, and almost everything I've got is available on this website:
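To give a flavor of how LSA correlates text excerpts, here is a minimal sketch (my own illustration, not the dissertation's code): build a term-document count matrix, take a truncated SVD, and compare documents by cosine similarity in the reduced latent space. The toy documents and the choice of two latent dimensions are assumptions for the example.

```python
import numpy as np

# Toy corpus: two documents about computers, two about graphs.
docs = [
    "human computer interaction",
    "user interface for computer systems",
    "graph theory and trees",
    "trees and graph minors",
]

# Term-document count matrix: rows are words, columns are documents.
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# Truncated SVD: keep only k latent dimensions (the "latent semantics").
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # each row is a document in latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two graph documents end up close together in latent space,
# while a computer document and a graph document do not.
print(round(cosine(doc_vecs[2], doc_vecs[3]), 3))
print(round(cosine(doc_vecs[0], doc_vecs[2]), 3))
```

The key point is that similarity is computed in the k-dimensional latent space rather than over raw word counts, which is what lets LSA relate excerpts that share meaning but not exact vocabulary.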

Every program made for this work is released under the GPL license. That means you can grab them, use them, and change them, as long as any modified versions you distribute remain under the GPL; you cannot include them, or parts of them, in proprietary programs. I hope they help other computer science students. I'm always open for contributions at