I have just finished the first robot, seeded the database with two links, www.uklug.co.uk and www.cnn.co.uk, and started it running. Next I am going to write another robot that checks the headers of the web pages, so that I can spot errors and mark those pages not to be retrieved. I also need to compile a list of document types that I do not want to download. A rough sketch of that header-checking robot is below.
links_found: 120K
home_page: 120K
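I have not written this second robot yet, but the heart of it will be something like the sketch below: a HEAD request via LWP::UserAgent, with the status line and Content-Type deciding whether a URL is worth fetching. The agent string and the list of unwanted types here are just placeholders, not anything final.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Document types I do not want to download (placeholder list for now).
my %skip_type = map { $_ => 1 }
    qw(application/pdf application/zip application/octet-stream);

my $ua = LWP::UserAgent->new(agent => 'uklug-robot/0.1', timeout => 30);

# Returns 1 if the page looks worth retrieving, 0 if it should be marked off.
sub check_url {
    my ($url) = @_;
    my $res = $ua->head($url);
    return 0 unless $res->is_success;      # 4xx/5xx etc.: mark as an error
    my $type = $res->header('Content-Type') || '';
    $type =~ s/;.*//;                      # strip any "; charset=..." part
    return $skip_type{$type} ? 0 : 1;
}

print check_url('http://www.uklug.co.uk/') ? "fetch\n" : "skip\n";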
Choosing a Database
I just got back from N. Ireland, sat down today, and drew up some plans. I have Oracle, Postgres, and MySQL to choose from as a backend for this system.
Oracle
Those who have worked with Oracle will know that on a system like mine you would be lucky to get it running at all, given its memory requirements. I have it installed and working, but I had to raise the shared memory to over 300MB to get it to behave properly, which does not leave much room for anything else on my box. There is also the licensing issue: I may want to put the database online some time, which means lots of cash.
Postgres
This is a fantastic database. It has a reasonable memory footprint for small apps and can scale very well. I have never thrashed it myself, but according to reports 7.3 is very fast compared to 7.2.
MySQL
I am afraid that this database, although a very good product, is missing several features that I cannot live without.
I have decided to use Postgres. I have also decided that the robots will be written in Perl. The reasons for using Perl are quite simple: I know it, and it is versatile; you can do pretty much anything you want in it. For those die-hards who may claim that it won't be fast enough, remember that I am not limited by the robots but by several other factors like bandwidth. The robots will use some pre-packaged modules designed for exactly this task. I will post the real code at some point, but a rough sketch of the approach is below.
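To give an idea of the shape of the thing, here is roughly what the main robot looks like: LWP::RobotUA to fetch pages politely, HTML::LinkExtor to pull the links out, and DBI against Postgres to record them. The table and column names match the schema below; the database name, the agent string, and the convention that state = 1 means "processed" are placeholders of my own.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use LWP::RobotUA;
use HTML::LinkExtor;
use URI;

my $dbh = DBI->connect('dbi:Pg:dbname=spider', '', '', { RaiseError => 1 });
my $ua  = LWP::RobotUA->new('uklug-robot/0.1', 'me@example.com');
$ua->delay(1/60);    # delay() takes minutes: at most one request a second

# Grab one URL that has not been processed yet (state = 0 is the default).
my ($url_id, $url) = $dbh->selectrow_array(
    q{SELECT url_id, url FROM home_page WHERE state = 0 LIMIT 1});
exit 0 unless defined $url_id;

my $res = $ua->get($url);
if ($res->is_success) {
    my @links;
    my $p = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
    });
    $p->parse($res->decoded_content);

    my $sth = $dbh->prepare(
        q{INSERT INTO links_found (parent_url_id, found_url) VALUES (?, ?)});
    my %seen;
    for my $link (@links) {
        my $abs = URI->new_abs($link, $res->base)->as_string;
        next if $seen{$abs}++;    # the primary key forbids duplicates per page
        $sth->execute($url_id, $abs);
    }
}

# Mark the page as processed so the next run picks a fresh one.
$dbh->do(q{UPDATE home_page SET state = 1 WHERE url_id = ?}, undef, $url_id);
$dbh->disconnect;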
-- These are the tables that I am going to be using.
-- The url_id default relies on a sequence, which has to exist first:
create sequence url_id_pk;

create table home_page
(
url_id int4 DEFAULT NEXTVAL('url_id_pk'),
url varchar(2000) unique,
state int4 default 0,
date_time timestamp default now(),
last_modified_time timestamp,
PRIMARY KEY(url_id, url)
);
create table links_found
(
parent_url_id int4,          -- url_id of the page the link was found on
found_url varchar(2000),
PRIMARY KEY(parent_url_id, found_url)
);
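Seeding the database is then just a couple of inserts from Perl; the database name here is a placeholder for my own setup, and the two URLs are the seeds mentioned above.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=spider', '', '', { RaiseError => 1 });

# state stays at its default of 0 and url_id comes from the url_id_pk
# sequence, so only the url itself needs supplying.
my $sth = $dbh->prepare(q{INSERT INTO home_page (url) VALUES (?)});
$sth->execute($_) for 'http://www.uklug.co.uk/', 'http://www.cnn.co.uk/';

$dbh->disconnect;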
Building a search engine 08 Sep 03
I am playing with the idea of building a search engine. I think this is biting off a bit more than I can chew, but hey, what the hell; I am sure no one in their right mind has thought of doing it before. Not!
I had a good think about what I could do that would be a bit smaller and marginally easier to manage, so I have decided to store URLs. I know this is still going to be a massive undertaking, but I have never been one to shirk the insurmountable. There is no real purpose to this other than me getting to play with massive amounts of data, something I like to do. I also get to thrash the crap out of my NTL connection and my little box.
I have a 1Mb connection, which in the UK is considered fairly fast. I am running an Athlon XP 1700 with 512MB of RAM. I have four hard disks, one of which is an 18GB SCSI; the rest are 20GB IDE. You can see that I am going to run out of RAM and disk space in a fairly short space of time.
Uklug Coming on
I have now got:
A login facility.
Agency Ratings.
Job Postings.
I have also added a few other bits and bobs. I am hoping to start adding some agencies to the database so that we will at least have some content that users can browse.
I will write all of this myself initially, but I really should investigate some of the forums and other tools that are out there at the moment. No doubt they will have done a better job of this stuff than me.
Started Recruitment Agency Ratings Website
Managed to get started on uklug for the first time today. I have been meaning to get stuck into putting something up for a long time, and this has been my chance. I need to be doing something to hone my programming skills, and I think building a dynamic site in Perl, corny as it is, is as good a place as any to make a start.
This project was originally on SourceForge as Jobix, but due to lack of interest it kind of faded into the background. It is still a one-man show, which means that features and development will be fairly slow.