The last couple of days have been spent sorting out some of the Perl and C++. I have also expanded the stop list quite a bit. The Perl script that I was using to produce the file from which the term-document matrix is built also got a bit of a working over.
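For readers unfamiliar with the idea, here is a minimal sketch of how a stop list feeds into a term-document matrix. It is written in Python for brevity, not the Perl and C++ actually used, and everything in it is an assumption made for illustration: the tiny `STOP_WORDS` set, the letters-only tokenizer, and the function names `tokenize` and `term_document_matrix`.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration; a real one is far larger.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (terms, matrix), where matrix[i][j] counts terms[i] in docs[j]."""
    counts = [Counter(tokenize(d)) for d in docs]
    # The vocabulary is every non-stop word seen in any document, sorted.
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for a term the document does not contain.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


docs = ["The cat sat on the mat", "The dog chased the cat"]
terms, matrix = term_document_matrix(docs)
```

Each row of `matrix` corresponds to one term and each column to one document, so a larger stop list directly shrinks the number of rows that have to be stored and searched.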
I have increased the document list to 15,700, which is still relatively small for an internet search engine, but it is now a respectable amount of text to search for a small intranet site, like that of a small law firm. I will gradually increase this as I go along and test to see what kind of results I get.
I have decided to write up what I have done with example code and put it on another few pages. Hopefully someone will be able to make some use of it.
Please see my Vector Space Search Engine page for more details of what I have done to get this working. I am using the term "working" in its weakest sense here, since I have not yet been able to test it properly.
83.1 Million links found
10.9 Million unique links found