I have left the robots gathering links and pages for the last few days, and the results are as follows. I am off for a couple of weeks, one of which will be spent in a cottage in Wales, which should be nice, so there will be little or no action here for quite a while.
62.0 Million links found
11.9 Million unique links found
Weeding the database 12 Nov 03
You will see that the database has been reduced in size quite a bit. I have been running out of space, so I decided to do some weeding. What I have done is fix all the URLs that had a fragment part. URLs come in the following format: scheme://host/path?query#fragment.
The fragment part of the URL is not really required by us because it only indicates a position within a document. This level of granularity is of no use to us; we are only interested in the document itself. I wrote a simple Perl script, in conjunction with a Postgres function, to weed these out. During the process I deleted all links that were found by following the original URL with the fragment. This is what has led to the reduction in the total links found. If you have a look at the latest robot code you will see that I now cater for this fragment part and strip it off before requesting the document.
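For illustration, here is a minimal sketch of that kind of fragment weeding. It is not the actual script: the database name and the links table with its id and url columns are placeholders, and it only rewrites the URLs; deleting the duplicate links that the rewrite exposes, as described above, is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use URI;

# Placeholder connection details and table/column names -- the real
# schema is not shown in the post.
my $dbh = DBI->connect('dbi:Pg:dbname=spider', '', '', { RaiseError => 1 });

# Drop everything after the '#' so only the document itself is kept.
sub strip_fragment {
    my ($url) = @_;
    my $uri = URI->new($url);
    $uri->fragment(undef);
    return $uri->as_string;
}

my $select = $dbh->prepare(q{SELECT id, url FROM links WHERE url LIKE '%#%'});
my $update = $dbh->prepare(q{UPDATE links SET url = ? WHERE id = ?});

$select->execute;
while (my ($id, $url) = $select->fetchrow_array) {
    $update->execute(strip_fragment($url), $id);
}
$dbh->disconnect;
```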
55.0 Million links found
11.9 Million unique links found
Re-writing the spiders 08 Nov 03
I have been very busy lately re-writing the spiders for the search engine. I have decided to write up what I did to build the spider, in the vain hope that someone may find it useful one day. I digressed several times and had some fun writing a recursive one, but I eventually settled on writing an iterative robot that uses Postgres to store the links, partly because I already had a database with several million links in it. Please see the link above for more details. I have also managed to download a few thousand documents for the search engine, hence the increase in the links found; this was caused by me parsing the documents that I had found when experimenting with the new robots.
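As a rough illustration of the iterative approach (not the real spider, whose code is linked above), the loop below pulls the next unvisited URL from Postgres, fetches the page, and pushes any links it finds back into the table. The database name and the links table with its url and visited columns are assumptions, and duplicate handling is omitted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Placeholder database and schema: a "links" table with id, url and a
# visited flag.  The real spider's schema is not shown in the post.
my $dbh = DBI->connect('dbi:Pg:dbname=spider', '', '', { RaiseError => 1 });
my $ua  = LWP::UserAgent->new(agent => 'example-robot/0.1', timeout => 30);

my $next  = $dbh->prepare(q{SELECT id, url FROM links WHERE NOT visited LIMIT 1});
my $done  = $dbh->prepare(q{UPDATE links SET visited = true WHERE id = ?});
my $store = $dbh->prepare(q{INSERT INTO links (url, visited) VALUES (?, false)});

# Iterate: take the next unvisited URL from the database, fetch it,
# and queue every link found on the page.  No recursion, no call stack.
while (1) {
    $next->execute;
    my ($id, $url) = $next->fetchrow_array or last;   # nothing left to crawl
    $done->execute($id);

    my $resp = $ua->get($url);
    next unless $resp->is_success;

    my $extor = HTML::LinkExtor->new;
    $extor->parse($resp->decoded_content);
    for my $link ($extor->links) {
        my ($tag, %attrs) = @$link;
        next unless $tag eq 'a' && defined $attrs{href};
        my $uri = URI->new_abs($attrs{href}, $url);   # resolve relative links
        $uri->fragment(undef);                        # strip the fragment, as above
        # A real robot would skip URLs already in the table; dedup is omitted here.
        $store->execute($uri->as_string);
    }
}
$dbh->disconnect;
```

Keeping the queue in the database rather than on the call stack also means the crawl can be stopped and resumed at any point, which is presumably part of the appeal of the iterative approach over the recursive one.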
85.0 Million links found