Well, I have managed to get my PC back in action and now have a decent-sized disk (160 GB SATA) installed, so I started the robots again. They have been running for a couple of days now and so far I have collected 100,000 pages and dumped them to disk. This is on a 600k connection, and it's not running all the time. My target for testing is 1 million pages, so I should have those by the end of May, after which the robots will be tamed a bit.
Once the pages are down, I need to figure out how I'm going to represent the documents on disk. There are various methods for this, but I intend to emulate an already popular search engine 😉 or at least do it the way they started and figure it out as I go along.
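To make the idea concrete, here is a minimal sketch of one possible on-disk layout. This is my own assumption about a reasonable starting point, not the layout I have settled on: append each fetched page to a single repository file as a small fixed header followed by the URL and the raw HTML, and keep the file offsets in a separate index so a document can be found by its id later.

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Fixed-size record header written before each stored page.
struct DocHeader {
    std::uint32_t doc_id;      // sequential id assigned by the crawler
    std::uint32_t url_len;     // length of the URL in bytes
    std::uint32_t content_len; // length of the raw page in bytes
};

// Appends one document to the repository file and records where it starts,
// so the offsets vector can later be dumped as a lookup index.
void store_document(std::ofstream &repo, std::vector<std::streamoff> &offsets,
                    std::uint32_t doc_id, const std::string &url,
                    const std::string &content) {
    offsets.push_back(repo.tellp()); // remember the start of this record
    DocHeader h{doc_id,
                static_cast<std::uint32_t>(url.size()),
                static_cast<std::uint32_t>(content.size())};
    repo.write(reinterpret_cast<const char *>(&h), sizeof(h));
    repo.write(url.data(), static_cast<std::streamsize>(url.size()));
    repo.write(content.data(), static_cast<std::streamsize>(content.size()));
}
```

Reading a document back is then just a seek to the stored offset, a read of the header, and two further reads for the URL and content.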
I intend to use C++ for all the document parsing and so on. This choice was made simply because I don't have time to roll my own binary trees or learn a new library. I am fairly familiar with the STL, so I will work with it.
My motto: "the faster you can code it, the sooner you find the bugs."
How true. I have been writing the word parser using the standard C++ map to hold the unique word list. I then bumped the file list up to 5000+ files and noticed a major drop in performance, which was quickly fixed by switching to ext/hash_map.
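The reason is that std::map keeps its keys in a balanced tree, so every insert and lookup costs O(log n) comparisons with poor cache behaviour, while a hash map gives roughly constant-time lookups. Here is a rough sketch of the word-counting idea, not the actual parser; std::unordered_map stands in for the older ext/hash_map extension, and the splitting rule (break on anything non-alphanumeric, lower-case everything) is just an assumption for illustration.

```cpp
#include <cctype>
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// Splits text on non-alphanumeric characters and counts each unique word.
void count_words(const std::string &text,
                 std::unordered_map<std::string, std::uint32_t> &counts) {
    std::string word;
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        } else if (!word.empty()) {
            ++counts[word]; // finished a word, bump its count
            word.clear();
        }
    }
    if (!word.empty()) ++counts[word]; // flush the last word
}

int main() {
    std::unordered_map<std::string, std::uint32_t> counts;
    count_words("The cat sat on the mat", counts);
    std::cout << counts["the"] << '\n'; // prints 2
}
```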
Onwards and upwards.