You will see that the database has been reduced in size quite a bit. I had been running out of space, so I decided to do some weeding: I fixed all the URLs that had a fragment part. URLs come in the following format:

scheme://host/path?query#fragment
The fragment part of the URL is not really required by us because it only indicates a position within a document. This level of granularity is of no use to us; we are only interested in the document itself. I wrote a simple Perl script, in conjunction with a Postgres function, to weed these out. During the process I deleted all links that were found by following the original URL with the fragment, which is what has led to the reduction in total links found. If you have a look at the latest robot code you will see that I now cater for this fragment part and strip it off before requesting the document.
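The fragment-stripping step can be sketched as follows. This is only an illustration in Python using the standard library (the actual cleanup was done with a Perl script and a Postgres function, which are not shown here); `strip_fragment` is a hypothetical helper name:

```python
from urllib.parse import urldefrag

def strip_fragment(url):
    """Return the URL with its fragment part removed.

    urldefrag splits a URL into (url-without-fragment, fragment);
    we keep only the first part, since the fragment merely names
    a position inside a document we already have.
    """
    clean, _fragment = urldefrag(url)
    return clean

print(strip_fragment("http://example.com/docs/page.html#section2"))
# http://example.com/docs/page.html
```

Normalising URLs this way before requesting or storing them means two links that differ only in their fragment collapse to a single entry, which is exactly what shrank the link counts below.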
55.0 Million links found
11.9 Million unique links found