I just noticed a few sites offering link checking on a commercial basis. Does anyone know if there is any money in this? It is a surprisingly easy thing to do: in one month I managed to check the links on over 500,000 pages and produce an error code for every link, and that was from a very humble machine going out across the internet. On an intranet with a 100Mb connection to the webserver I could trawl an entire site at quiet times in no time at all. Admittedly my tool is not fully automated yet, but one place was charging $7,500 to do a 50,000-page website. There must be more to it than what I am doing 😉
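For anyone wondering, the guts of the check are tiny. A minimal Perl sketch using LWP::UserAgent (reading URLs from standard input is just a placeholder; a real crawl needs parallelism, politeness delays and redirect handling on top of this):

    use strict;
    use warnings;
    use LWP::UserAgent;

    # Issue a HEAD request per URL and record the HTTP status code.
    my $ua = LWP::UserAgent->new(timeout => 10);

    while (my $url = <STDIN>) {
        chomp $url;
        next unless length $url;
        my $response = $ua->head($url);
        printf "%s %s\n", $response->code, $url;
    }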
Latent Semantic Search Engines
After my exams I am going to play with building a search engine, purely in an academic capacity. For the search engine gurus out there, please keep in mind that I only started reading about this stuff in the last few days, so there are probably several gaping holes in my descriptions below. I will correct anything anyone sends to me.
I know someone has already built a vector space engine in Perl and there has been an article written about it, but I would like to look at LSI and how to go about building a search engine that uses it. I also know about not re-inventing the wheel, but I learn by doing. I will probably use some of the code from the Perl article; where I do, I will mention it along with any changes I make.
Basically (a rough Perl sketch of these steps follows the list):
1. Reduce each document using a stop list. I might go ethical here 😉
2. Remove all words that have no semantic meaning.
3. Remove words that appear in every document.
4. Remove words that appear in only one document.
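Something like this, with a toy corpus standing in for whatever the spider fetched and a deliberately tiny stop list:

    use strict;
    use warnings;

    # Toy corpus standing in for the spidered pages.
    my %documents = (
        doc1 => 'The quick brown fox jumps over the lazy dog',
        doc2 => 'A lazy dog sleeps in the sun',
        doc3 => 'Quick foxes and quick hares',
    );

    # Steps 1/2: a (very short) stop list of semantically empty words.
    my %stop = map { $_ => 1 } qw(a an and in is it of or over the to);

    my %doc_terms;   # doc name => { term => count }
    my %doc_freq;    # term => number of documents containing it

    for my $doc (keys %documents) {
        for my $word (split /\W+/, lc $documents{$doc}) {
            next if $word eq '' or $stop{$word};
            $doc_freq{$word}++ unless $doc_terms{$doc}{$word}++;
        }
    }

    # Steps 3/4: drop terms found in every document or in only one.
    my $n_docs = keys %documents;
    for my $term (keys %doc_freq) {
        next unless $doc_freq{$term} == 1 or $doc_freq{$term} == $n_docs;
        delete $doc_terms{$_}{$term} for keys %doc_terms;
        delete $doc_freq{$term};
    }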
The LSI model stores document representations in a matrix called a “Term Document Matrix” (TDM). LSI does not care where a word appears in the document, yet I personally think word position is a good indicator of how relevant multiple words in a search string are to each other. So after the TDM has been built and a result set found, further ranking can be given to documents based on where the search terms sit relative to each other within each document.
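To make the structure concrete, here is how the preprocessing sketch above might be carried on into a TDM (plain Perl arrays for clarity; at real corpus sizes a sparse representation would be essential):

    # Lay the surviving counts out as a term-by-document matrix.
    # Rows are terms, columns are documents, and word order within
    # a document is thrown away entirely - a bag of words.
    my @docs  = sort keys %doc_terms;
    my @terms = sort keys %doc_freq;

    my @tdm;   # $tdm[$row][$col] = count of $terms[$row] in $docs[$col]
    for my $row (0 .. $#terms) {
        for my $col (0 .. $#docs) {
            $tdm[$row][$col] = $doc_terms{ $docs[$col] }{ $terms[$row] } || 0;
        }
    }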
LSI uses a term space with many thousands of dimensions (as many as you have documents). This space can be reduced using a mathematical method called Singular Value Decomposition (SVD), which collapses the high-dimensional space down to a more manageable number of dimensions and, in the process, groups documents with semantically similar meanings closer together.
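I have not picked a library for this yet, but PDL (the Perl Data Language) ships an svd routine in PDL::MatrixOps, so a rank-k reduction might look roughly like the following. Treat it as an unbenchmarked sketch of the idea rather than the way I will actually do it:

    use strict;
    use warnings;
    use PDL;
    use PDL::MatrixOps;

    # A = U * S * V'. Keeping only the k largest singular values gives
    # the low-rank approximation A_k that LSI searches against.
    my $a = pdl([ [1, 0, 0],      # toy 4-term x 3-document matrix
                  [1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1] ]);

    my ($u, $s, $v) = svd($a);

    my $k  = 2;                   # number of dimensions to keep
    my $sk = $s->copy;
    $sk->slice("$k:-1") .= 0;     # zero out the smaller singular values
    my $ak = $u x stretcher($sk) x $v->transpose;   # rank-k approximation

    print $ak;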
I really need to read more Knuth to see if I can find some pointers on how best to build a structure that can be easily manipulated. For development purposes I will probably do most of the spidering and preprocessing in Perl; its string manipulation is second to none. When it comes to the matrix manipulation and the heavy computational bits I am expecting, I am not sure what kind of performance hit Perl will take, so I will use C++ for that part if I can find a library for it; if not, I might try a different method.
Anyway, enough idle rambling; I am off to revise some Maths.
52.18 Million links found
7.17 Million unique links found