I have started the process of building the lexicon for my search engine. It’s actually surprising how slowly the list of words grows. This is partly because I am being quite strict in my definition of what constitutes a word. A normal search engine would need to handle all sorts of arbitrary strings (I am not even considering encodings yet), but due to hardware constraints I have limited myself to Perl’s
m/^\w+$/
If a term doesn’t match this in its entirety, it won’t go in the lexicon. I know this is a bit harsh, but unfortunately I don’t have several hundred machines in a cluster to play with like the other search engines ;). I think if I get over one million terms in the lexicon I will be doing OK.
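Since the parsing side will be in C++ anyway, here is a rough sketch of what that filter might look like there, just mirroring Perl’s \w class ([A-Za-z0-9_]) by hand; the function name is only illustrative:

#include <cctype>
#include <iostream>
#include <string>

// Sketch of the lexicon filter: accept a token only if every
// character is in Perl's \w class ([A-Za-z0-9_]).
bool is_lexicon_word(const std::string& token) {
    if (token.empty()) return false;
    for (unsigned char c : token) {
        if (!std::isalnum(c) && c != '_') return false;
    }
    return true;
}

int main() {
    std::cout << is_lexicon_word("search") << "\n"; // 1: all word characters
    std::cout << is_lexicon_word("can't") << "\n";  // 0: apostrophe rejected
}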
Search engine restarted
Well, I have managed to get my PC back in action and now have a decent-sized disk (160 GB SATA) installed, so I started the robots again. They have been running for a couple of days now, and so far I have collected 100,000 pages and dumped them on disk. This is on a 600k connection, and it’s not running all the time. My target for testing is 1 million pages, so I should have these by the end of May; then the robots will be tamed a bit.
Once the pages are down, I need to figure out how I am going to represent the documents on disk. There are various methods for this, but I am intending to emulate an already popular search engine 😉, or at least do it the way they started, and figure it out as I go along.
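As a first cut, one simple layout (just a sketch; the field names and format here are placeholders of my own, not a final design) is a single repository file of length-prefixed records, one per fetched page:

#include <cstdint>
#include <fstream>
#include <string>

// Sketch of a possible on-disk record: a fixed header followed by the
// URL bytes and the raw page bytes, appended to one big repository file.
struct RecordHeader {
    std::uint32_t doc_id;  // assigned sequentially as pages arrive
    std::uint32_t url_len; // bytes of URL following the header
    std::uint32_t doc_len; // bytes of page content following the URL
};

void append_record(std::ofstream& repo, std::uint32_t doc_id,
                   const std::string& url, const std::string& page) {
    RecordHeader h{doc_id,
                   static_cast<std::uint32_t>(url.size()),
                   static_cast<std::uint32_t>(page.size())};
    repo.write(reinterpret_cast<const char*>(&h), sizeof h);
    repo.write(url.data(), static_cast<std::streamsize>(url.size()));
    repo.write(page.data(), static_cast<std::streamsize>(page.size()));
}

int main() {
    std::ofstream repo("repository.dat", std::ios::binary | std::ios::app);
    append_record(repo, 1, "http://example.com/", "<html>hello</html>");
}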
I intend to use C++ to do all the document parsing etc. This choice was made simply because I have not got time to roll my own binary trees etc. or to learn a new library. I am fairly familiar with the STL, so I will work with it.
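For instance, the lexicon itself can start life as a std::map from term to occurrence count (std::map is typically a balanced binary tree under the hood, so there is nothing to roll by hand). A quick sketch, with illustrative names:

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Sketch: the lexicon as an STL map from term to occurrence count.
// std::map keeps the terms sorted and handles tree balancing itself.
int main() {
    std::map<std::string, std::uint64_t> lexicon;
    for (std::string term : {"search", "engine", "search"}) {
        ++lexicon[term]; // first use inserts the term with count 0
    }
    for (const auto& [term, count] : lexicon) {
        std::cout << term << " " << count << "\n";
    }
}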