After much hunting for a library that I can use to implement an LSI search engine, I have had little luck. The library that seems right for the job is called SVDPACK. It is written in Fortran and has been ported to C++. However, it has not been ported to the humble x86 architecture. It looks like I will have to go with writing the Vector Space search engine instead.
I managed to write a C++ routine to read the output of the Perl Term Document parser. This is a very simple parser that splits each document into words on whitespace. I know there are reasons for not doing it like this, so if I get time I will come up with a better method later, but for now it will do.
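To make the whitespace approach concrete, here is a minimal sketch of the same idea in C++. The real parser is written in Perl and is not shown here, so the function name `split_on_whitespace` is purely illustrative and not taken from my code.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a document into terms on runs of whitespace
// (spaces, tabs and newlines). Punctuation and case are left untouched.
std::vector<std::string> split_on_whitespace(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> terms;
    std::string term;
    // operator>> skips leading whitespace and reads up to the next whitespace.
    while (in >> term) {
        terms.push_back(term);
    }
    return terms;
}
```

This also shows one of the reasons not to do it this way: "Search," and "search" come out as two different terms, because punctuation and capitalisation are kept as they are.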
My next task is to take the input of the C++ program and create a Term Document Matrix from it that I can manipulate easily. I need to be able to carry out the following actions, and quite a few more.
1. Count all occurrences of each word in the entire document set.
2. Calculate mean values for each word.
3. Come up with some method to rank words in the document matrix. This is to avoid the typical abuse you see where websites saturate pages with keywords to try to manipulate the results of a search engine.
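The three actions above could be sketched roughly as follows. This is not a finished design: the class name `TermDocumentMatrix` and all of its methods are hypothetical, the "mean" is assumed to be the average number of occurrences per document, and the logarithmic damping in `damped_weight` is just one common way to blunt keyword stuffing, not necessarily the ranking method I will end up with. The matrix is stored sparsely as nested maps, so only non-zero counts take up memory.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sparse term-document matrix: rows are terms, columns are
// documents, and only non-zero counts are stored.
class TermDocumentMatrix {
public:
    // Add one document (already split into terms); returns its column index.
    std::size_t add_document(const std::vector<std::string>& terms) {
        const std::size_t doc = doc_count_++;
        for (const std::string& term : terms) {
            ++counts_[term][doc];
        }
        return doc;
    }

    // Raw count of `term` in document `doc` (0 if the term is absent).
    std::size_t count(const std::string& term, std::size_t doc) const {
        auto row = counts_.find(term);
        if (row == counts_.end()) return 0;
        auto cell = row->second.find(doc);
        return cell == row->second.end() ? 0 : cell->second;
    }

    // Action 1: total occurrences of `term` across the whole document set.
    std::size_t total_count(const std::string& term) const {
        auto row = counts_.find(term);
        if (row == counts_.end()) return 0;
        std::size_t total = 0;
        for (const auto& cell : row->second) total += cell.second;
        return total;
    }

    // Action 2: mean occurrences per document (assumed meaning of "mean").
    double mean_count(const std::string& term) const {
        if (doc_count_ == 0) return 0.0;
        return static_cast<double>(total_count(term)) / doc_count_;
    }

    // Action 3 (one possible approach): damp the raw count with a logarithm,
    // so repeating a keyword many times gains progressively less weight.
    double damped_weight(const std::string& term, std::size_t doc) const {
        const std::size_t c = count(term, doc);
        return c == 0 ? 0.0 : 1.0 + std::log(static_cast<double>(c));
    }

private:
    std::map<std::string, std::map<std::size_t, std::size_t>> counts_;
    std::size_t doc_count_ = 0;
};
```

With this weighting, a page that repeats a word four times gets a weight of 1 + ln 4 for it, roughly 2.39, rather than four times the weight of a page that uses the word once.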