Another day, another few million links. Since moving some of the data around and recreating the database, there has definitely been an increase in performance. I vacuum the database regularly because of the number of updates that take place.
I am off into London on Saturday in search of some more hardware; I never intend to use www.scan.co.uk again. I am going to have a trawl around the computer fairs to see what I can find. I would really like to get a dual-CPU motherboard and run a couple of the new Opterons on it. I will have to see what I can afford first, then decide what to do. At the moment it is not really processing power that is limiting me, it’s the I/O on the system. I currently have 1GB of RAM installed, which is the most my motherboard can handle. The disks I am using are not the quickest in the world either, so I need to get some decent 80-conductor cables for the IDE disks. If there were more RAM in the PC and a few more disks to move some of the database files onto, the Athlon XP1700 would start to suffer. I have been looking at the MSI and Tyan motherboards with onboard SATA. They are expensive, but they would be the perfect choice for what I am doing. I really wish I had more room; I could then build some smaller PCs to run more robots.
I am going to start rethinking the layout of the tables. For instance, at the moment I am storing duplicate links in the links_found table that are already in the home_page table. These links vary in size from fairly small to massive, so I think that an integer value taken from the home_page(url_id) column would be a more efficient use of space. I am also thinking that, because the CPU is being underutilised, I should separate the downloading of the pages from the parsing. This would mean I could make more efficient use of all the resources currently available to me.
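As a rough sketch of the integer-key idea, something like the following could work. Only the table names home_page and links_found and the url_id column come from the layout described above; the other columns (url, source_id, target_id) and the helper functions are hypothetical, and the SQL is SQLite's, driven from Python purely so the example is self-contained. The real database will need its own syntax.

```python
import sqlite3


def build_schema(conn):
    # Hypothetical layout: each URL's text is stored once in home_page,
    # and links_found holds only integer references to it.
    conn.executescript("""
        CREATE TABLE home_page (
            url_id INTEGER PRIMARY KEY,
            url    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE links_found (
            source_id INTEGER NOT NULL REFERENCES home_page(url_id),
            target_id INTEGER NOT NULL REFERENCES home_page(url_id)
        );
    """)


def url_id_for(conn, url):
    # Insert the URL if it is new, then return its integer key.
    conn.execute("INSERT OR IGNORE INTO home_page (url) VALUES (?)", (url,))
    row = conn.execute(
        "SELECT url_id FROM home_page WHERE url = ?", (url,)
    ).fetchone()
    return row[0]


def record_link(conn, source_url, target_url):
    # Store the link as a pair of integers rather than repeating the URL text.
    conn.execute(
        "INSERT INTO links_found (source_id, target_id) VALUES (?, ?)",
        (url_id_for(conn, source_url), url_id_for(conn, target_url)),
    )
```

With a layout along these lines, a URL that is found many times costs one text row in home_page plus two integers per occurrence in links_found, however long the URL is.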
44.8 Million links found
6.44 Million unique links found