I have spent the last few days trying to get the Vector Space Search engine running. The code is a bit of a mess at the moment, but it's coming along. All I can say is thank god for the STL; without it I would have been in for a hell of a job. I have now managed to build a sparse vector space matrix from 2,397 documents. This needs to be increased before I can really start testing any weighting algorithms.
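For the curious, the representation is roughly what you would expect from the STL: each document keeps only its non-zero term weights. This is just a minimal sketch of that idea, assuming integer term ids and double weights; the names are illustrative, not the actual code:

```cpp
#include <map>
#include <vector>

// One document's sparse vector: term id -> weight.
// Only non-zero entries are stored, so the whole matrix
// is simply a vector of these, one per document.
using SparseVector = std::map<unsigned int, double>;
using SparseMatrix = std::vector<SparseVector>;
```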
At the moment it is using 30 MB of memory, which is the maximum used during the entire process. It did run at 256 MB, but that was my first round at designing the matrix. I then showed my program a copy of Knuth volume 3; it cowered in fear and its shoe size quickly dropped to something more respectable. I am pretty sure I could drop this even further by writing my own data structure instead of using the STL, but I am happy with it for now.
I am not entirely happy with the output of the program yet because the inner product routine is not producing the correct results, but this should be relatively easy to fix. I also need to do a review of my methodology to make sure I am not missing anything.
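For reference, the inner product over two ordered sparse vectors can be done with a simple merge walk, which is exactly the kind of place such a bug tends to hide. A rough sketch, reusing the illustrative SparseVector type from above rather than my actual routine:

```cpp
#include <map>

using SparseVector = std::map<unsigned int, double>;

// Inner product of two sparse vectors: walk both ordered maps in
// step and only multiply where a term id appears in both documents.
double innerProduct(const SparseVector& a, const SparseVector& b)
{
    double sum = 0.0;
    auto ia = a.begin();
    auto ib = b.begin();
    while (ia != a.end() && ib != b.end()) {
        if (ia->first < ib->first)      ++ia;
        else if (ib->first < ia->first) ++ib;
        else {                          // same term in both vectors
            sum += ia->second * ib->second;
            ++ia;
            ++ib;
        }
    }
    return sum;
}
```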
I also need to compile a better stop list; the one I am using is not particularly good. That is a surefire way of reducing the RAM footprint.
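Loading the stop list into an STL set and checking terms against it before they ever reach the matrix is the obvious approach. A hedged sketch, assuming the stop list file is plain text with one word per line (the function names are made up for illustration):

```cpp
#include <fstream>
#include <set>
#include <string>

// Read the stop list, one word per line, into an ordered set.
std::set<std::string> loadStopList(const std::string& path)
{
    std::set<std::string> stops;
    std::ifstream in(path);
    std::string word;
    while (in >> word)
        stops.insert(word);
    return stops;
}

// Terms that pass this check never get inserted into the matrix,
// which is where the RAM saving comes from.
bool isStopWord(const std::set<std::string>& stops, const std::string& term)
{
    return stops.count(term) > 0;
}
```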
83.1 Million links found
10.9 Million unique links found