I have left the robots gathering links and pages for the last few days, and the results are as follows. I am off for a couple of weeks, one of which will be spent in a cottage in Wales, which should be nice, so there will be little or no action here for quite a while.
62.0 Million links found
11.9 Million unique links found
Weeding the database 12 Nov 03
You will see that the database has been reduced in size quite a bit. I have been running out of space, so I decided to do some weeding. What I have done is fix all the URLs that had a fragment part. URLs come in the following format: scheme://host/path?query#fragment.
The fragment part of the URL is not really required by us because it only indicates a position within a document. This level of granularity is of no use to us; we are only interested in the document itself. I wrote a simple Perl script, in conjunction with a Postgres function, to weed these out. During the process I deleted all links that were found by following the original URL with the fragment. This is what has led to the reduction in the total links found. If you have a look at the latest robot code you will see that I now cater for this fragment part and strip it off before requesting the document.
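For illustration, here is a minimal sketch of that kind of fragment weeding. It is not the actual script: the database name and the links table with its id and url columns are placeholders, and it only rewrites the URLs; deleting the duplicate links that the rewrite exposes, as described above, is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use URI;

# Placeholder connection details and table/column names -- the real
# schema is not shown in the post.
my $dbh = DBI->connect('dbi:Pg:dbname=spider', '', '', { RaiseError => 1 });

# Drop everything after the '#' so only the document itself is kept.
sub strip_fragment {
    my ($url) = @_;
    my $uri = URI->new($url);
    $uri->fragment(undef);
    return $uri->as_string;
}

my $select = $dbh->prepare(q{SELECT id, url FROM links WHERE url LIKE '%#%'});
my $update = $dbh->prepare(q{UPDATE links SET url = ? WHERE id = ?});

$select->execute;
while (my ($id, $url) = $select->fetchrow_array) {
    $update->execute(strip_fragment($url), $id);
}
$dbh->disconnect;
```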
55.0 Million links found
11.9 Million unique links found
Re-writing the spiders 08 Nov 03
I have been very busy lately re-writing the spiders for the search engine. I have decided to write up what I did to build the spider, in the vain hope that someone may find it useful one day. I digressed several times and had some fun writing a recursive one, but I eventually settled on writing an iterative robot that uses Postgres to store the links, partly because I already had a database with several million links in it. Please see the link above for more details. I have also managed to download a few thousand documents for the search engine, hence the increase in the links found; this was caused by me parsing the documents that I had found when experimenting with the new robots.
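As a rough illustration of the iterative approach (not the real spider, whose code is linked above), the loop below pulls the next unvisited URL from Postgres, fetches the page, and pushes any links it finds back into the table. The database name and the links table with its url and visited columns are assumptions, and duplicate handling is omitted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Placeholder database and schema: a "links" table with id, url and a
# visited flag.  The real spider's schema is not shown in the post.
my $dbh = DBI->connect('dbi:Pg:dbname=spider', '', '', { RaiseError => 1 });
my $ua  = LWP::UserAgent->new(agent => 'example-robot/0.1', timeout => 30);

my $next  = $dbh->prepare(q{SELECT id, url FROM links WHERE NOT visited LIMIT 1});
my $done  = $dbh->prepare(q{UPDATE links SET visited = true WHERE id = ?});
my $store = $dbh->prepare(q{INSERT INTO links (url, visited) VALUES (?, false)});

# Iterate: take the next unvisited URL from the database, fetch it,
# and queue every link found on the page.  No recursion, no call stack.
while (1) {
    $next->execute;
    my ($id, $url) = $next->fetchrow_array or last;   # nothing left to crawl
    $done->execute($id);

    my $resp = $ua->get($url);
    next unless $resp->is_success;

    my $extor = HTML::LinkExtor->new;
    $extor->parse($resp->decoded_content);
    for my $link ($extor->links) {
        my ($tag, %attrs) = @$link;
        next unless $tag eq 'a' && defined $attrs{href};
        my $uri = URI->new_abs($attrs{href}, $url);   # resolve relative links
        $uri->fragment(undef);                        # strip the fragment, as above
        # A real robot would skip URLs already in the table; dedup is omitted here.
        $store->execute($uri->as_string);
    }
}
$dbh->disconnect;
```

Keeping the queue in the database rather than on the call stack also means the crawl can be stopped and resumed at any point, which is presumably part of the appeal of the iterative approach over the recursive one.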
85.0 Million links found