Started on a Semantic Search Engine 14 Oct 03

I have started on the Semantic Search Engine. I have downloaded 250Mb of pages to test with and then constructed a partial (test) word list from them. The word list holds each term, its frequency of occurrence and the originating doc id. The words have all been stemmed to reduce overhead; I used the Lingua::Stem module for this. I will create a full word list tomorrow if I get time. I also need to find a decent library in C++ because I don’t fancy writing my own Singular Value Decomposition library (if you know what sort of maths would be involved in doing this you also know that I am not at that level, yet! ;-). I also think Perl may be a bit slow for what I am trying to do, although I am always willing to give it a try and see what happens.
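For anyone curious, here is a rough sketch of the kind of word-list builder I mean. It is only an illustration, not my actual code: taking the doc id from the filename and printing tab-separated rows ready for loading into Postgres are assumptions made for the example.

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::Stem qw(stem);

# Hypothetical layout: one downloaded page per file, doc id taken from the filename.
my %freq;    # $freq{$doc_id}{$stem} = term frequency in that document
for my $file (@ARGV) {
    my ($doc_id) = $file =~ /(\d+)/ or next;
    open my $fh, '<', $file or die "cannot open $file: $!";
    my @words = grep { length } split /\W+/, lc join '', <$fh>;
    close $fh;
    my $stems = stem(@words);          # Lingua::Stem returns an array ref of stems
    $freq{$doc_id}{$_}++ for @$stems;
}

# Print doc id, stemmed term and frequency as tab-separated rows,
# ready to be loaded into a word list table with COPY.
for my $doc_id (sort { $a <=> $b } keys %freq) {
    print join("\t", $doc_id, $_, $freq{$doc_id}{$_}), "\n"
        for sort keys %{ $freq{$doc_id} };
}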
67.6 Million links found
8.529 Million unique links found

Increasing performance on the database 10 Oct 03

I am in the process of reducing the total size of the database and increasing performance a bit in the process. As it stands the links_found table is holding duplicate copies of the home_page table. This was necessary because I wanted to be able to see what links belong to what web pages at the start for testing purposes. I no longer need to do this. What I am going to change is the format that they are stored in. I am going to make the following changes to the link_found table.
FROM
links_found
(
parent_url_id int4,
found_url varchar(2000)
);
TO
child_links
(
parent_url_id int4,
found_url int4
);
I have written a Postgres function to carry out the migration. As you can see below, the space savings are fantastic. I should also see an improvement in my indexes on this table. I am trying very hard to postpone buying any extra hardware. It will make it a bit more awkward to use the data now, but this is not my main priority at the moment. When I have lots of disk space I can then create temporary tables for any manipulations as I require them.
You can see below that I have created a new table called child_links. I have converted all the URLs into url_ids, which are “int4” types, and I have also created indexes on this table. I can now remove all relations relating to the links_found table. At Postgres’s default 8kB page size, the relpages figures below put links_found at roughly 4.9Gb against about 2.0Gb for child_links, before even counting their indexes.
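I have not reproduced the migration function here, but it boils down to a single INSERT ... SELECT that joins each found URL back to home_page to pick up its integer id. A rough sketch of the same thing driven from Perl with DBI is below; the home_page column names (url and url_id) and the connection details are assumptions for illustration.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Sketch only: replace each found URL in links_found with its integer
# url_id from home_page. The primary key is left out of this sketch.
my $dbh = DBI->connect('dbi:Pg:dbname=links', 'postgres', '',
                       { AutoCommit => 0, RaiseError => 1 });

$dbh->do(q{
    CREATE TABLE child_links (
        parent_url_id int4,
        found_url     int4
    )
});

$dbh->do(q{
    INSERT INTO child_links (parent_url_id, found_url)
    SELECT lf.parent_url_id, hp.url_id
    FROM   links_found lf
    JOIN   home_page   hp ON hp.url = lf.found_url
});

# The two indexes that show up in the pg_class listing further down.
$dbh->do('CREATE INDEX parent_url_id_idx ON child_links (parent_url_id)');
$dbh->do('CREATE INDEX child_url_id_idx  ON child_links (found_url)');

$dbh->commit;
$dbh->disconnect;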
links# select relname, relfilenode, relpages from pg_class order by relpages limit 10;
relname | relfilenode | relpages
-------------------+-------------+----------
links_found | 188163825 | 644114
links_found_pkey | 246168688 | 588185
lf_found_url_idx | 246168682 | 559585
child_links | 246168690 | 255817
child_links_pkey | 246168692 | 216185
home_page | 188163817 | 118338
parent_url_id_idx | 299992353 | 116508
child_url_id_idx | 299992259 | 116231
home_page_pkey | 246168684 | 103223
home_page_url_key | 246168686 | 100120
hp_url_id_index | 246168683 | 15857
hp_url_id_idx | 301324542 | 15660
(12 rows)
Before I removed any of the relations I wanted to record the disk usage so that I could see the actual savings afterwards.
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/hdc2 3938 3125 613 84% /
/dev/hdc1 30 9 20 30% /boot
none 505 0 504 0% /dev/shm
/dev/hdc3 3938 2177 1561 59% /usr
/dev/hdc5 8439 1466 6551 19% /links/pg_xlog
/dev/hdb1 9628 5523 3616 61% /links/tables
/dev/hdb2 9605 7044 2073 78% /links/temp
/dev/sda5 17364 13812 2684 84% /links/database
So that you can see what I had done to the files and where they were all pointing, here is a listing of the links database directory for all the big relations.
-rw------- 1 postgres postgres 8.0k Oct 12 03:54 188163815
lrwxrwxrwx 1 postgres postgres 33 Oct 10 23:17 188163817 -> /links/pg_xlog/postgres/188163817
lrwxrwxrwx 1 postgres postgres 30 Oct 10 22:35 188163825 -> /links/temp/postgres/188163825
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:37 188163825.1 -> /links/temp/postgres/188163825.1
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:38 188163825.2 -> /links/temp/postgres/188163825.2
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:39 188163825.3 -> /links/temp/postgres/188163825.3
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:41 188163825.4 -> /links/temp/postgres/188163825.4
lrwxrwxrwx 1 postgres postgres 32 Oct 10 23:30 246168682 -> /links/tables/postgres/246168682
lrwxrwxrwx 1 postgres postgres 34 Oct 10 23:44 246168682.1 -> /links/tables/postgres/246168682.1
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:41 246168682.2 -> /links/tables/postgres/246168682.2
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:41 246168682.3 -> /links/tables/postgres/246168682.3
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:42 246168682.4 -> /links/tables/postgres/246168682.4
-rw------- 1 postgres postgres 132M Oct 12 03:54 246168683
-rw------- 1 postgres postgres 855M Oct 12 03:55 246168684
-rw------- 1 postgres postgres 871M Oct 12 03:55 246168686
-rw------- 1 postgres postgres 1.0G Oct 11 01:59 246168688
-rw------- 1 postgres postgres 1.0G Oct 11 01:38 246168688.1
-rw------- 1 postgres postgres 1.0G Oct 11 01:54 246168688.2
-rw------- 1 postgres postgres 1.0G Oct 11 01:59 246168688.3
-rw------- 1 postgres postgres 499M Oct 11 01:59 246168688.4
-rw------- 1 postgres postgres 1.0G Oct 11 14:32 246168690
-rw------- 1 postgres postgres 1.0G Oct 12 00:42 246168690.1
-rw------- 1 postgres postgres 52M Oct 12 03:55 246168690.2
-rw------- 1 postgres postgres 1.0G Oct 12 03:52 246168692
-rw------- 1 postgres postgres 750M Oct 12 03:55 246168692.1
-rw------- 1 postgres postgres 1005M Oct 12 03:55 299992259
-rw------- 1 postgres postgres 995M Oct 12 03:55 299992353
-rw------- 1 postgres postgres 130M Oct 12 03:55 301324542
After dropping the relations and deleting all trace of them I had the following results. You may notice that the filenames are completely different. I had a major problem on the SCSI disk again and had to recreate the database using initdb because some of my pg_clog files went missing. To allow me to complete a vacuum I copied the last pg_clog file to the file that was missing. I am pretty sure that this is very dangerous, but it allowed me to complete the vacuum on the table. I was hoping to see some more errors but I got none. I completely dropped the database, reformatted the hard disk and created a new file system on it. I had originally used the “largefile4” option of mke2fs but I have now left it at the default.
]# df -m
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/hdc2 3938 3124 613 84% /
/dev/hdc1 30 9 20 30% /boot
none 505 0 504 0% /dev/shm
/dev/hdc3 3938 2177 1561 59% /usr
/dev/hdc5 8439 2369 5648 30% /links/pg_xlog
/dev/hdb1 9628 2034 7105 23% /links/tables
/dev/hdb2 9605 2007 7110 23% /links/temp
/dev/sda5 17093 7513 8712 47% /links/database
Interesting bits from my base directory
-rw------- 1 postgres postgres 1.0G Oct 12 2003 16992
-rw------- 1 postgres postgres 927M Oct 12 2003 16992.1
-rw------- 1 postgres postgres 121M Oct 12 2003 58021849
-rw------- 1 postgres postgres 783M Oct 12 2003 58021850
-rw------- 1 postgres postgres 752M Oct 12 2003 58021852
-rw------- 1 postgres postgres 1.0G Oct 12 2003 58021854
-rw------- 1 postgres postgres 68M Oct 12 2003 58021854.1
-rw------- 1 postgres postgres 872M Oct 12 2003 58021856
-rw------- 1 postgres postgres 872M Oct 12 2003 58021857
We can see that we have dropped the total size of the database considerably: the space used across the four /links file systems has gone from about 27,800Mb to about 13,900Mb, roughly half.
links=# select relname, relfilenode, relpages from pg_class order by relpages desc limit 9;
relname | relfilenode | relpages
--------------------------------+-------------+----------
child_links | 16992 | 249791
child_links_pkey | 58021854 | 139723
home_page | 16982 | 116267
parent_url_id_idx | 58021856 | 111577
child_url_id_idx | 58021857 | 111577
home_page_pkey | 58021850 | 100253
home_page_url_key | 58021852 | 96209
hp_url_id_index | 58021849 | 15434
pg_proc_proname_args_nsp_index | 16640 | 125
50 Million links found
6.7 Million unique links found

Few million links more 09 Oct 03

Another day, another few million links. Since moving some of the data around and recreating the database there has definitely been an increase in performance. I vacuum the database regularly because of the amount of updates that take place.
I am off into London on Saturday in search of some more hardware; I never intend to use www.scan.co.uk again. I am going to have a trawl around the computer fairs to see what I can find. I would really like to get a dual-chip motherboard and run a couple of the new Opterons on it, but I will have to see what I can afford first and then decide what to do. At the moment it is not really processing power that is limiting me, it’s the I/O on the system. I currently have 1Gb of RAM installed, which is the most my motherboard can handle. The disks I am using are not really the quickest in the world either, so I need to get some decent 80-conductor cables for the IDE disks. If there were more RAM in the PC and a few more disks to move some of the database files onto, the Athlon XP1700 would start to suffer. I have been looking at the MSI and Tyan motherboards with onboard SATA. They are expensive but they would be the perfect choice for what I am doing. I really wish I had more room; I could then build some smaller PCs to run more robots.
I am going to start rethinking the layout of the tables. For instance, at the moment I am storing duplicate links in the links_found table that are already in the home_page table. These links vary in size from fairly small to massive, so I think that an integer value taken from the home_page(url_id) column would be a more efficient use of space. I also think that, because the CPU is being under-utilised, I should separate the downloading of the pages from the parsing, as sketched below. This would let me make more efficient use of all the resources currently open to me.
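Something along these lines is what I have in mind for the download half: a robot that only fetches pages and spools them to disk for a separate parser to pick up later. The spool directory, the input format and the use of LWP::UserAgent here are illustrative assumptions, not the current robot.

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Download-only robot sketch: fetch pages and spool the raw HTML to disk
# so a separate parser process can extract the links later.
my $spool = '/links/spool';                   # assumed spool directory
my $ua = LWP::UserAgent->new;
$ua->timeout(30);
$ua->agent('links-robot/0.1');

while (my $line = <STDIN>) {                  # one "url_id url" pair per line
    chomp $line;
    my ($url_id, $url) = split ' ', $line, 2;
    my $resp = $ua->get($url);
    next unless $resp->is_success;
    open my $out, '>', "$spool/$url_id.html" or die "cannot write $url_id: $!";
    print {$out} $resp->content;              # raw page; no parsing here
    close $out;
}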
44.8 Million links found
6.44 Million unique links found

Creating a new Postgres database 08 Oct 03

After the amount of moving around that I did yesterday it is hard to believe that nothing went seriously wrong. Lots of things went wrong, but I will not bore you with them; they are trivial compared to what I did today.
I started moving the “/usr/local/pgsql/data/base/links_database” files around to make room in various places and managed to corrupt one of them. You can imagine my panic. I quickly put everything back where it was, started Postgres and checked to see whether the file I had corrupted was an index or a table. If it had been an index I could have just dropped and recreated it, but it was the links_found table. However, because I make regular backups of everything, this gave me the chance to test them out.
Just recently I commented that I was going to use gzip, rather than bzip2, for my backups because of the amount of time bzip2 takes to do anything. Trust my luck: the gzip backup that I had taken recently was corrupted, giving all sorts of end-of-file errors. I need to investigate this because I cannot afford the time to use bzip2 for my backups.
I used an old backup file from a few days ago. This is how I recreated the database for anyone that is interested.
]# psql template1
template1=# drop database links;
template1=# \q
]# createdb links
]# cat /links_database_01_10_03_15\:12\:57.sql.bz2 | bunzip2 | psql links
It is a fairly old backup from 7 days ago, but since the robots had not been running for a few days during that time there was little data loss. The whole operation took a couple of hours. I also tried something a bit dangerous: when the database was halfway through re-creating itself I checked to see which data files had been created and had reached the maximum segment size of 1Gb. I then moved the oldest of these to another file system and created a soft link to it. I know that there is probably an easier way to do it than this, but because the database is bigger than any of my file systems I needed a quick and dirty method to free file space and avoid running out of room. If anyone knows a better way to do this I would like to know what it is; you know how to contact me.
Just when I thought everything was OK, I got the following error.
links=# select now (), count(*) from links_found;
ERROR: cannot read block 477160 of links_found: Input/output error
links=# select relname, relfilenode, relpages from pg_class order by relpages desc limit 10;
relname | relfilenode | relpages
--------------------------------+-------------+----------
links_found | 71509890 | 456987
links_found_pkey | 112850939 | 418056
lf_found_url_idx | 112850933 | 397954
home_page | 71509882 | 90268
home_page_pkey | 112850935 | 77280
home_page_url_key | 112850937 | 74141
hp_url_id_index | 112850934 | 11990
pg_proc_proname_args_nsp_index | 16640 | 125
pg_proc | 1255 | 58
pg_depend | 16598 | 20
I knew this could mean I had a problem on my file system; I was having visions of one of my disks being completely screwed. I found out which file system the relation was on using the filename above and then did the following.
[root@harry 71509876]# /etc/init.d/postgres stop
Stopping PostgreSQL: ok
[root@harry 71509876]# cd
[root@harry root]# umount /dev/sda5
[root@harry root]# e2fsck -c /dev/sda5
e2fsck 1.26 (3-Feb-2002)
Checking for bad blocks (read-only test): done
Pass 1: Checking inodes, blocks, and sizes
Duplicate blocks found… invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 766: 206483
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 1 inodes containing duplicate/bad blocks.)
File /data/base/71509876/71509890.3 (inode #766, mod time Wed Oct 8 13:16:55 2003)
has 1 duplicate block(s), shared with 1 file(s):
[root@harry 71509876]#
We can see straight away that there is a problem on my links_found table again. To fix this I ran e2fsck with the “-f” option and chose the defaults when asked questions. I ran it again to make sure that the defaults were not causing any trouble, and the database is now back in business.
40 Million links found
6 Million unique links found

RAM and shmmax 03 Oct 03

Watford Electronics only do 512Mb sticks, so I am limiting my upgrade options in the future, but I need the RAM so they got the sale. I put the two 512Mb sticks into the machine and it worked a treat. Now all I need to do is compile a kernel that has support for the Highpoint HPTxxx controllers and edit the postgresql.conf file and shmmax settings so that we can take advantage of the new RAM.
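For anyone wondering what that involves, the two knobs are the kernel shared memory limit and Postgres’s shared_buffers setting. The values below are purely illustrative figures, not the ones I am actually running with.

# /etc/sysctl.conf -- raise the kernel shared memory ceiling
# (example value: 512Mb in bytes); apply with `sysctl -p`
kernel.shmmax = 536870912

# postgresql.conf -- shared_buffers is counted in 8kB pages,
# so 32768 pages = 256Mb (illustrative figure only)
shared_buffers = 32768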
Unfortunately, compiling a kernel for this beast is not really that straightforward. I needed to get the latest up2date packages and keys etc. from Red Hat, then download the kernel source RPM and install that. I do not have an old .config file for this machine, so I had to use one of the ones provided in the source install and customise it for my requirements. I was a bit surprised when it worked first time. Needless to say I have left out a lot of the details, because it was a very frustrating exercise.
I managed to move the 20Gb disk out of the PC for J’s dad. This involved moving several of the Postgres files around and onto disks that are not really ideal places to have them. I will sort this out before I start the robots again in a few days’ time. So I am off to build a PC; they need it to research cycling across Canada on a tandem.
35.876 Million links found
5.476 Million unique links found

RAM and shmmax 02 Oct 03

I am going to build a PC for J’s parents from old bits that I have in storage, or should I say bits that I have no room for here. I am taking the 512Mb PC133 stick from this machine and adding the 1Gb stick that arrived today, along with the Maxtor 160 SATA drive and the Highpoint 1542-4 controller. I am also giving him one of my 20Gb drives because he has only got a 1.7Gb and a 127Mb drive in his old PC.
Needless to say, the hardware that arrived from www.scan.co.uk today was a bit dodgy. The RAM completely trashed one of our PCs and we are now unable to get it running again. I now need to get more RAM from Watford Electronics to replace the piece of crap www.scan.co.uk sent. Unfortunately Watford Electronics are not open until tomorrow, so it’s back to the old config for a little bit more spidering.
35.876 Million links found
5.476 Million unique links found

Curtailing the robots

Well, I managed to get back from Portsmouth OK. As soon as I got home I started the robots, surprise surprise. It is actually 04:42 on the 29th, so I am a wee bit tired. My other half decided she wanted to update her website, so I had to curtail the spidering a little because it plays havoc with any other operation on the PC, at least until I get more RAM.
28.6M links_found
4.59M home_page

Extend the Project

I want to extend this project to make it a bit more interesting. I was thinking the other night: what if I could run it like distributed.net, so that we could get a lot more done? My proposal would be for members to volunteer to run their own link harvesters and to upload the results to a central repository after indexing. I intend to purchase some more RAM and some big IDE drives (unless someone wants to donate some for this project, beg beg).
As far as I am aware it is not the bandwidth or the harvesting that is costly, it is the actual searching, so any distributed search engine would need to be able to search across a distributed network. This would probably require some standardisation, i.e. some sort of search data exchange protocol that allows easy calculation at the front end.
Does anyone want to volunteer for some harvesting? I can provide all the source and directions on how to get started. I would prefer people with some knowledge of Postgres and Perl, and a dial-up connection is probably not much use either. You can contact me at harry[ at ]hjackson[ dot ]org. If we got enough members we could even start thinking about building a distributed search engine for a laugh.
Anyway I am off to have a few beers in Portsmouth at a birthday party so the robots are going off for a while. Enjoy the rest of the weekend.
27M links_found
4.4M home_page

Running out of IO

I started today with just over 1.3M links confirmed as of this morning. I am going to jack the robots up and leave them for a while.
I am quickly running out of I/O on my system. I need a shitload of RAM because the database tables and indexes are getting very big, or at least they are getting really big in terms of my system.
So far I have found 12.4 Million links and I have confirmed 1.3 Million of them. The indexes alone are approaching 2Gb in size, so I also need more disks. The 18Gb Fujitsu SCSI U160 68-pin MAN3184MP is getting tight on usable space. I also need to start thinking about moving the logs onto a disk of their own and splitting the tables onto separate disks to avoid having the SCSI disk do all the work.
12.4M links_found
1.3M home_page

Multiple Robots

I finished the checking robot last night, but I am going to wait until I have enough links in the home_page table before I start running it. I have been using vmstat and iostat to monitor swap on my system, and it is not being taxed in the slightest, so I am going to start running more than one robot, along the lines of the sketch below.
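Running several robots is mostly a matter of forking a few copies and letting each one work through its own batch of URLs. A minimal sketch of that pattern follows; run_robot() is a placeholder for the real robot code and the count of four is an arbitrary figure.

#!/usr/bin/perl
use strict;
use warnings;

my $ROBOTS = 4;                      # arbitrary number of robots for the example

for my $n (1 .. $ROBOTS) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                 # child process becomes a robot
        run_robot($n);
        exit 0;
    }
}
1 while waitpid(-1, 0) > 0;          # parent waits for all robots to finish

sub run_robot {
    my ($n) = @_;
    # Placeholder: the real robot would connect to the database, claim a
    # batch of unchecked URLs, then download and parse them.
    print "robot $n started\n";
}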
1.1M links_found
1.2M home_page
17K downloaded and parsed