It's now 03:19 and the robots have been off and on all day. I managed to fill up another partition, so I was forced to move more files around to free up some much-needed space.
Performance is starting to slip, as can be seen from the daily stats. I bought a 160Gb SATA drive with a controller. I am hoping this will last me a few months before I need another one, unless I get some volunteers to help with the project.
I also invested in 1Gb of RAM; unfortunately I am losing the 512Mb from this machine, so I am only doubling my capacity. 1.5Gb would have been lovely. I am not looking forward to installing it. I am pretty sure that the current kernel, 2.4.9-e.5, will not support the HPT374. I am expecting the gear in a couple of days, so after the install we should have some fun.
My maths was the main priority today. I have finished the guts of my final piece of coursework. That goes in the post tomorrow, and then it's cram time for my exams. I have another four years of maths to look forward to on top of the three years I have already done, so I am nearly half way there.
I am going to start adding some more technical content to this site so that anyone who is interested can see what I am doing to Postgres to keep it running. I am not sure exactly what I am going to add yet, but I thought I would start with the scripts that I used to build the database and the psql functions I call from Perl to populate it. I will also start adding the Perl code, and some instructions on how to set up your own link harvester… eventually.
Well, I managed to get back from Portsmouth OK. As soon as I got home I started the robots, surprise surprise. It is actually 04:42 on the 29th, so I am a wee bit tired. My other half decided she wanted to update her website, so I had to curtail the spidering a little because it plays havoc with any other operation on the PC; at least it will until I get more RAM.
I want to extend this project to make it a bit more interesting. I was thinking the other night: what if I could run it like distributed.net, so that we could get a lot more done? My proposal would be for members to volunteer to run their own link harvesters and to upload the results to a central repository after indexing. I am intending to purchase some more RAM and some big IDE drives (unless someone wants to donate some for this project, beg beg).
As far as I am aware, it is not the bandwidth or the harvesting that is costly but the actual searching, so any distributed search engine would need to be able to search across a distributed network. This would probably require some standardisation, i.e. some sort of search data exchange protocol that allows easy calculation at the front end.
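No such protocol exists yet, so purely as a sketch of the front-end calculation idea: assume each harvester node answers a query with a list of (url, score) pairs (an invented exchange format), and the front end just merges and re-ranks them.

```python
# Sketch of front-end merging for a distributed search engine.
# The (url, score) exchange format is an assumption for illustration.

def merge_results(node_responses, limit=10):
    """Combine per-node scores for each URL and return the top `limit`."""
    combined = {}
    for response in node_responses:
        for url, score in response:
            combined[url] = combined.get(url, 0) + score
    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:limit]

node_a = [("http://www.uklug.co.uk/", 3), ("http://example.org/", 1)]
node_b = [("http://example.org/", 3)]
print(merge_results([node_a, node_b], limit=2))
# [('http://example.org/', 4), ('http://www.uklug.co.uk/', 3)]
```

The point of a standard format is exactly this: if every node speaks the same (url, score) dialect, the front end's job reduces to a cheap merge.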
Does anyone want to volunteer for some harvesting? I can provide all the source and directions on how to get started. I would prefer people with some knowledge of Postgres and Perl. You can contact me at harry[ at ]hjackson[ dot ]org; a dial-up connection is probably not much use either. If we got enough members we could even start thinking about building a distributed search engine for a laugh.
Anyway, I am off to have a few beers in Portsmouth at a birthday party, so the robots are going off for a while. Enjoy the rest of the weekend.
I started this morning with just over 1.3M links found. I am going to jack the robots up and leave them for a while.
I am quickly running out of IO on my system. I need a shit load of RAM because the database tables and indexes are getting very big, or at least they are in terms of my system.
So far I have found 12.4 million links and I have confirmed 1.3 million of them. The indexes alone are approaching 2Gb in size, so I also need more disks. The 18Gb Fujitsu SCSI U160 68-pin MAN3184MP is getting tight on usable space. I also need to start thinking about moving the logs to a disk of their own and splitting the tables onto separate disks to avoid having one SCSI disk do all the work.
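A quick back-of-envelope check using the numbers above shows why the disk pressure adds up; this is only a rough average, assuming all 2Gb of index space is attributable to the 12.4 million found links.

```python
# Rough per-link index cost, from the figures quoted above.
index_bytes = 2 * 1024**3      # roughly 2Gb of indexes
links_found = 12_400_000       # links found so far
bytes_per_entry = index_bytes / links_found
print(round(bytes_per_entry))  # ~173 bytes of index per link found
```

At well over a hundred bytes of index per link, every few million new links costs on the order of another gigabyte before the table data is even counted.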
This was actually the date where this website came on line.
No spidering today; I had a much harder task to do. I attempted to teach my better half the art of vimming and, much to my dismay, she picked it up quicker than I did.
I am jealous.
Some numbers on harvested links:
links=# select count(*) from links_found;
links=# select count(*) from home_page;
I finished the robot last night to do the checking, but I am going to wait until I have enough links in the home_page table before I start running the checker. I have been using vmstat and iostat to monitor swap on my system, and it is not being taxed in the slightest, so I am going to start using more than one robot.
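Running more than one robot against the same home_page table means they must not grab the same URLs. One way to do that with the state column is to have each robot "claim" a batch before fetching. A minimal sketch, using an in-memory SQLite database as a stand-in for Postgres; the state values (0 = new, 1 = claimed) are assumptions, not the author's actual scheme.

```python
# Sketch: robots share home_page by claiming batches via the state column.
# SQLite stands in for Postgres here; state 0 = new, 1 = claimed (assumed).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""create table home_page (
                  url_id integer primary key,
                  url    text unique,
                  state  integer default 0)""")
db.executemany("insert into home_page (url) values (?)",
               [("http://www.uklug.co.uk/",), ("http://www.cnn.co.uk/",)])

def claim_batch(db, size):
    """Mark up to `size` unclaimed URLs as claimed and return them."""
    rows = db.execute("select url_id, url from home_page "
                      "where state = 0 order by url_id limit ?",
                      (size,)).fetchall()
    db.executemany("update home_page set state = 1 where url_id = ?",
                   [(url_id,) for url_id, _ in rows])
    return [url for _, url in rows]

robot_1 = claim_batch(db, 1)
robot_2 = claim_batch(db, 1)
print(robot_1, robot_2)  # each robot gets a different URL
```

In real Postgres the select-then-update would need to happen under a lock (e.g. SELECT ... FOR UPDATE inside a transaction) so two robots cannot claim the same rows at the same instant.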
17K downloaded and parsed
I have just finished the first robot, seeded the database with two links, www.uklug.co.uk and www.cnn.co.uk, and started it running. I am going to start writing another robot that checks the headers of the web pages. I am doing this so that I can spot errors and mark those pages not to be retrieved. I also need to compile a list of document types that I do not want to download.
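The header-checking idea boils down to looking at the Content-Type before committing to a full download. A minimal sketch of the filter (the robots here are Perl; this is language-neutral pseudo-logic in Python, and the allowed-types list is an assumption):

```python
# Sketch: decide from a Content-Type header whether a page is worth
# downloading. The WANTED_TYPES list is an assumption for illustration.
WANTED_TYPES = {"text/html", "text/plain"}

def should_download(content_type_header):
    """Return True if the Content-Type header looks worth fetching."""
    if not content_type_header:
        return False
    # Strip parameters such as "; charset=iso-8859-1"
    main_type = content_type_header.split(";")[0].strip().lower()
    return main_type in WANTED_TYPES

print(should_download("text/html; charset=iso-8859-1"))  # True
print(should_download("application/pdf"))                # False
```

Anything that fails the check (or returns an error status) would get its state flagged in the database so no robot wastes bandwidth fetching it again.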
I just got back from N. Ireland and sat down today and drew up some plans. I have Oracle, Postgres and MySQL to choose from as a backend for this system.
Those who have worked with Oracle will know that on a system like mine you would be lucky to get it running at all, due to its memory requirements. I have it installed and working, but I had to raise the shared memory to over 300Mb to get it to work properly, which does not leave much room for anything else on my box. There is also the licensing issue: I may want to put the database online some time, which means lots of cash.
Postgres is a fantastic database. It has a reasonable memory footprint for small apps and can scale very well. I have never thrashed it, but according to reports 7.3 is very fast compared to 7.2.
I am afraid that MySQL, although a very good product, is missing several features that I cannot live without.
I have decided to use Postgres. I have also decided that the robots will be written in Perl. The reasons for using Perl are quite simple: I know it, and it is versatile; you can do pretty much anything you want in it. For those die-hards who may claim that it won't be fast enough, remember that I am not limited by the robots but by several other factors, like bandwidth. The robots will utilise some pre-packaged modules designed for exactly the task at hand. I will post the code at some point.
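The core job those pre-packaged modules do for the robots is pulling the href out of every anchor tag on a fetched page. The actual robots are Perl; this is just a minimal, self-contained sketch of that one step using Python's stdlib parser.

```python
# Sketch of the heart of a link harvester: extract every <a href="..."> from
# a page. The author's robots do this in Perl with pre-packaged modules.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every anchor tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = ('<html><body>'
        '<a href="http://www.uklug.co.uk/">uklug</a> '
        '<a href="/about.html">about</a>'
        '</body></html>')
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['http://www.uklug.co.uk/', '/about.html']
```

Each extracted link would then be inserted into links_found against the url_id of the page it came from, with relative links like /about.html resolved against the parent URL first.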
-- These are the tables that I am going to be using
create sequence url_id_pk;

create table home_page (
    url_id    int4 DEFAULT NEXTVAL('url_id_pk'),
    url       varchar(2000) unique,
    state     int4 default 0,
    date_time timestamp default now(),
    PRIMARY KEY(url_id, url)
);

-- Columns reconstructed from the primary key: parent_url_id is the
-- url_id of the page each link was found on.
create table links_found (
    parent_url_id int4,
    found_url     varchar(2000),
    PRIMARY KEY(parent_url_id, found_url)
);