This was actually the date when this website came online.
1.3M links_found
1.2M home_page
Teaching Jenny Vim
No spidering today; I had a much harder task to do. I attempted to teach my better half the art of vimming and, much to my dismay, she picked it up quicker than I did.
I am jealous.
Some numbers on harvested links
links=# select count(*) from links_found;
count
---------
4159023
links=# select count(*) from home_page;
count
Multiple Robots
I finished the robot that does the checking last night, but I am going to wait until I have enough links in the home_page table before I start running it. I have been using vmstat and iostat to monitor swap usage on my system and it is not being taxed in the slightest, so I am going to start using more than one robot.
1.1M links_found
1.2M home_page
17K downloaded and parsed
Finished the first Robot
I have just finished the first Robot, seeded the database with two links, www.uklug.co.uk and www.cnn.co.uk, and started it running. I am going to start writing another robot that checks the headers of the webpages. I am doing this so that I can spot errors and mark those pages not to be retrieved. I also need to compile a list of document types that I do not want to download.
120K links_found
120K home_page
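The header check described above could look something like the following. This is only a rough sketch in Python rather than the Perl the robots are actually written in; the names should_download, check_headers and SKIPPED_TYPES are made up for illustration, and the list of unwanted document types is a guess, not the real list.

```python
import urllib.error
import urllib.request

# Hypothetical skip list: the real list of unwanted document types
# has not been compiled yet, so these entries are placeholders.
SKIPPED_TYPES = {
    "application/pdf",
    "application/zip",
    "image/gif",
    "image/jpeg",
    "image/png",
}


def should_download(status, content_type):
    """Decide from the response headers whether a page is worth fetching."""
    if status != 200:
        return False  # errors get marked so they are not retrieved
    # Drop parameters such as "; charset=ISO-8859-1" and normalise case.
    media_type = (content_type or "").split(";")[0].strip().lower()
    if not media_type:
        return False  # assumption: no Content-Type header means skip it
    return media_type not in SKIPPED_TYPES


def check_headers(url, timeout=10):
    """Send a HEAD request and return (status, content_type).

    Returns (None, None) when the server cannot be reached at all.
    """
    if "://" not in url:
        url = "http://" + url  # seed links are stored without a scheme
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Content-Type")
    except (urllib.error.URLError, OSError):
        return None, None
```

A robot would call check_headers on each URL and only pass it on to the downloader when should_download says yes.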
Choosing a Database
I just got back from N Ireland and sat down today and drew up some plans. I have Oracle, Postgres and MySQL to choose from as a backend to this system.
Oracle
Those who have worked with Oracle will know that on a system like mine you would be lucky to get it running due to its memory requirements. I have it installed and working but I had to raise the shared memory to over 300MB to get it to work properly. This does not leave much room for anything else on my box. There is also the license issue. I may want to put the database online some time, which means lots of cash.
Postgres
This is a fantastic database. It has a reasonable memory footprint for small apps and can scale very well. I have never thrashed it but, according to reports, 7.3 is very fast compared to 7.2.
MySQL
I am afraid that this database, although a very good product, has several features missing that I cannot live without.
I have decided to use Postgres. I have also decided that the robots will be written in Perl. The reasons for using Perl are quite simple: I know it and it's versatile; you can pretty much do anything you want in it. For those die-hards who may claim that it won't be fast enough, remember that I am not limited by the robots but by several other factors such as bandwidth. The robots will utilise some pre-packaged modules designed just for the task at hand. I will post the code at some point.
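-- These are the tables that I am going to be using
-- The url_id_pk sequence has to exist before home_page can use it as a default
create sequence url_id_pk;
create table home_page
(
url_id int4 DEFAULT NEXTVAL('url_id_pk'),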
url varchar(2000) unique,
state int4 default 0,
date_time timestamp default now(),
last_modified_time timestamp,
PRIMARY KEY(url_id, url)
);
create table links_found
(
parent_url_id int4,
found_url varchar(2000),
PRIMARY KEY(parent_url_id, found_url)
);
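To show how a robot might use these two tables, here is a rough sketch. It is Python with an in-memory SQLite database standing in for Perl and Postgres, so the column types differ slightly from the schema above. The function names are made up for illustration, and two things are assumptions of mine rather than settled design: that state 0 means "not fetched yet" and 1 means "done", and that every found link is also queued in home_page.

```python
import sqlite3


def create_tables(conn):
    """Create SQLite stand-ins for home_page and links_found."""
    # INTEGER PRIMARY KEY takes the place of the url_id_pk sequence.
    conn.executescript("""
        CREATE TABLE home_page (
            url_id INTEGER PRIMARY KEY,
            url TEXT UNIQUE,
            state INTEGER DEFAULT 0,
            date_time TEXT DEFAULT CURRENT_TIMESTAMP,
            last_modified_time TEXT
        );
        CREATE TABLE links_found (
            parent_url_id INTEGER,
            found_url TEXT,
            PRIMARY KEY (parent_url_id, found_url)
        );
    """)


def next_url(conn):
    """Return (url_id, url) for the oldest unfetched page, or None."""
    return conn.execute(
        "SELECT url_id, url FROM home_page "
        "WHERE state = 0 ORDER BY url_id LIMIT 1"
    ).fetchone()


def record_links(conn, parent_url_id, found_urls):
    """Store the links found on one page and mark that page as done."""
    for url in found_urls:
        # The primary key on links_found silently drops repeated links.
        conn.execute(
            "INSERT OR IGNORE INTO links_found (parent_url_id, found_url) "
            "VALUES (?, ?)",
            (parent_url_id, url),
        )
        # Assumption: each new link is queued for a later visit.
        conn.execute(
            "INSERT OR IGNORE INTO home_page (url) VALUES (?)", (url,)
        )
    # Assumption: state 1 means "downloaded and parsed".
    conn.execute(
        "UPDATE home_page SET state = 1 WHERE url_id = ?", (parent_url_id,)
    )
    conn.commit()
```

The unique constraint on url is what stops the same page being queued twice, and the composite key on links_found keeps one row per parent and link pair.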
Building a search engine 08 Sep 03
I am playing with the idea of building a search engine. I think that this idea is a bit more than I can chew but hey, what the hell, I am sure no one in their right mind has thought of doing it before, not!
I had a good think about what I could do that would be a bit smaller and marginally easier to manage, so I have decided to store URLs. I know that this is still going to be a massive undertaking, but I have never been one to shirk from the insurmountable, plus I get to play with lots of data, something I like to do. There is no real purpose to this other than me getting to play with massive amounts of data. I also get to thrash the crap out of my NTL connection and my little box.
I have a 1Mb connection, which in the UK is considered fairly fast. I am running an Athlon XP 1700 with 512MB RAM. I have 4 hard disks, one of which is an 18GB SCSI; the rest are 20GB IDE. You can see that I am going to run out of RAM and HD space in a fairly short space of time.
Politics
I have a few rules about politics:
Rule 1. Do not talk about Politics.
1. You do not talk about Politics.
2. You do not talk about Politics.
3. When someone wants to talk or tries to talk about politics, the talk is over.
4. No two guys to a talk.
5. Zero talks at a time.
6. No left, no right.
7. Talks do not go on for any period of time.
8. If this is your first talk about politics, you have to shut the hell up!
I have many reasons for this, none of which I will go into because of rule 1. Unfortunately I am sorely tempted to speak out about current events going on around the world, and invariably topics close to my heart get close to politics, but if I keep rule 1 in mind I will be much happier and release more good karma into the world 😉
Linking to external sites
I often find links on sites that open a new window rather than opening in the window that the link was found in. I can only assume that the web designer made the assumption that they know what the user wants better than the user. What they have effectively done is create a reverse “pop-behind” advert to themselves which is a bit mad when you think about it. Given the choice I will always use normal links on my sites.
If I want a site to open in a new window then I will make that decision myself. Who are you to tell me what I should be doing?
No time anymore.
I remember when I was a kid that I always seemed to be bored out of my skull. I could never find anything that I wanted to do, I had no hobbies that did not involve something I shouldn’t be doing and there was not really an awful lot to do in Cullybackey.
Things did not change when I joined the Navy. This is one organisation that makes watching a kettle boil positively exhilarating. I decided to leave the Navy after 8 years of mental abuse from which I think I have made a full recovery. I am only thankful I was not in when we went to the Gulf to take part in what I think may become one of the biggest fiascos of this century.
Since leaving the Navy 3 years ago I have not had enough hours in the day and it seems to be getting worse. We were taught time management in the Navy, i.e. if you don’t manage it you are in the shit, but I seem to be struggling to do everything I want to do and the list just keeps getting longer.
Looking back I wouldn’t change anything but I sometimes crave an hour of boredom.