I have started on the Semantic Search Engine. I have downloaded 250 MB of pages to test with, and constructed a partial (test) word list from them. The word list holds each term, its frequency of occurrence, and the originating doc id. The words have all been stemmed to reduce overhead; I used the Lingua::Stem module for this. I will create a full word list tomorrow if I get time. I also need to find a decent C++ library, because I don't fancy writing my own Singular Value Decomposition library (if you know what sort of maths would be involved in doing this, you also know that I am not at that level, yet! ;-). I also think Perl may be a bit slow for what I am trying to do, although I am always willing to give it a try and see what happens.
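For anyone wondering what such a word list looks like, here is a minimal sketch in Python rather than the Perl I actually used. It is an illustration only: the names `crude_stem` and `build_word_list` are made up for this example, and the suffix stripper is a crude placeholder, not the algorithm Lingua::Stem implements.

```python
import re
from collections import Counter, defaultdict


def crude_stem(word):
    """Placeholder stemmer (an assumption for this sketch, not a real
    stemming algorithm): strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_word_list(docs):
    """Map each stemmed term to a dict of {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Lowercase the text and keep runs of ASCII letters as tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(crude_stem(token) for token in tokens)
        for term, frequency in counts.items():
            index[term][doc_id] = frequency
    return index


# Two tiny hypothetical documents keyed by doc id.
docs = {
    1: "Searching indexed pages",
    2: "The index pages were searched",
}
index = build_word_list(docs)
```

Each entry ties a stemmed term to the documents it came from and how often it appears there, which is roughly the raw material a term-document matrix for the SVD step would be built from.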
67.6 Million links found
8.529 Million unique links found