Sorting out the code 27 Oct 03

The last couple of days have been spent sorting out some of the Perl and C++. I have also expanded the stop list quite a bit. The Perl script that I was using to produce the file to build the term document matrix also got a bit of a working over.
I have increased the document list to 15,700, which is still relatively small for an internet search engine but is now a respectable amount of text to search for a small intranet site, like a small law firm. I will gradually increase this as I go along, testing to see what kind of results I get.
I have decided to write up what I have done with example code and put it on another few pages. Hopefully someone will be able to make some use of it.
Please see my:
Vector Space Search Engine
page for more details of what I have done to get this working. I am using the term "working" in its weakest sense here since I have been unable to test it properly yet.
83.1 Million links found
10.9 Million unique links found

Getting the vector space search engine running 25 Oct 03

I have spent the last few days trying to get the Vector Space Search engine running. The code is in a bit of a mess at the moment, but it's coming along. All I can say is thank god for the STL; without it I would have been in for a hell of a job. I have now managed to create a sparse Vector Space Matrix from 2397 documents. This needs to be increased before I can really start testing any weighting algorithms.
At the moment this is using 30MB of memory, which is the maximum used during the entire process. I did have it running at 256MB, but that was my first round at designing the matrix. I then showed my program a copy of Knuth volume 3; it cowered in fear and its shoe size quickly dropped to something more respectable. I am pretty sure that I could drop this even further by writing my own data structure without using the STL, but I am happy with it at the moment.
I am not entirely happy with the output of the program yet because the inner product routine is not producing the correct output, but this should be relatively easy to fix. I really need to do a review to make sure that I am not missing any points in my methodology.
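For anyone curious what the routine should be doing, here is a minimal sketch of a sparse document vector and its inner product, assuming the vectors are held as STL maps from term id to weight. The names here are illustrative rather than my actual code.

#include <cmath>
#include <map>

// A sparse document vector: term id -> weight. Terms with weight
// zero are simply never stored, which is what keeps the memory down.
typedef std::map<int, double> SparseVec;

// Inner product of two sparse vectors: walk the smaller map and
// look each term up in the other one.
double inner_product(const SparseVec &a, const SparseVec &b)
{
    const SparseVec &small = (a.size() < b.size()) ? a : b;
    const SparseVec &large = (a.size() < b.size()) ? b : a;

    double sum = 0.0;
    for (SparseVec::const_iterator it = small.begin(); it != small.end(); ++it) {
        SparseVec::const_iterator jt = large.find(it->first);
        if (jt != large.end())
            sum += it->second * jt->second;
    }
    return sum;
}

// Cosine similarity normalises the inner product by the vector
// lengths so long documents do not win simply by being long.
double cosine(const SparseVec &a, const SparseVec &b)
{
    double na = 0.0, nb = 0.0;
    for (SparseVec::const_iterator it = a.begin(); it != a.end(); ++it) na += it->second * it->second;
    for (SparseVec::const_iterator it = b.begin(); it != b.end(); ++it) nb += it->second * it->second;
    if (na == 0.0 || nb == 0.0) return 0.0;
    return inner_product(a, b) / (std::sqrt(na) * std::sqrt(nb));
}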
I also need to try and compile a better stop list. The one I am using is not particularly good, and a better one is a sure-fire way of reducing the RAM footprint.
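To show what I mean, the stop list itself only needs to be a set lookup applied while the matrix is being built; something along these lines (a sketch only, the function names and file layout are made up):

#include <fstream>
#include <set>
#include <string>

// Load one stop word per line into a set for cheap lookups.
std::set<std::string> load_stop_list(const std::string &filename)
{
    std::set<std::string> stops;
    std::ifstream in(filename.c_str());
    std::string word;
    while (in >> word)
        stops.insert(word);
    return stops;
}

// Any word on the stop list is skipped before it ever reaches the
// term document matrix, so a better list directly cuts the RAM used.
bool is_stop_word(const std::set<std::string> &stops, const std::string &word)
{
    return stops.count(word) != 0;
}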
83.1 Million links found
10.9 Million unique links found

Joys of DOS 5 20 Oct 03

Well, I managed to get to play with DOS 5 today to get the PC working, don't ask. I then had trouble trying to get the data off the old drive that was in the original PC because it insisted that 'it' was the operating system and that it was going to boot no matter where on the IDE channels you put it. It also seemed to ignore various jumper settings, which was very odd. A quick fdisk /mbr taught it who the boss was and we had no more trouble from the cheap seats.
After not having much success with www.realemail.co.uk due to spurious line disconnections, we eventually got online with uklinux, an ISP I used to use when Jason Clifford ran it. Unfortunately he is no longer involved with it for various reasons, but he has started www.ukfsn.org, which I can highly recommend.
I then downloaded ZoneAlarm and AVG for them. Unfortunately I forgot my video cable, or they could have been using a 19″ Iiyama Vision Master instead of the old 14″ jurassic valve-driven thing they had been using previously.
35.876 Million links found
5.476 Million unique links found

20 Oct 03

After all the shuffling that I did two days ago to get the 20GB disk out of the machine, I now need to check all the Postgres files to make sure that I have them in all the correct places, and restart the robots to get the count up a little. For the details, see: Moving Postgres File.
35.876 Million links found
5.476 Million unique links found

C++ Libraries for sparse matrix manipulations are not easy to find 20 Oct 03

After much hunting for a library that I can use to implement an LSI search engine, I have had little luck. The library that seems to be the one for the job is called SVDPACK; it is written in Fortran and has been ported to C++. However, it has not been ported to the humble x86 architecture. It looks like I will have to run with writing the Vector Space search engine instead.
I managed to write a C++ routine to read the output of the Perl Term Document parser. This is a very simple parser that splits all the words in the document on whitespace. I know that there are reasons for not doing it like this, so if I get time I will come up with a better method later, but for now it will do.
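To give an idea of just how simple the parsing is at the moment, splitting on whitespace amounts to no more than this (a sketch rather than the actual routine):

#include <sstream>
#include <string>
#include <vector>

// Split a document into words on whitespace only. This deliberately
// ignores punctuation, case folding and so on, which is exactly the
// weakness mentioned above.
std::vector<std::string> split_on_whitespace(const std::string &text)
{
    std::vector<std::string> words;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
        words.push_back(word);
    return words;
}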
My next task now is to take the input to the C++ program and create a Term Document Matrix from it that I can manipulate easily. I need to be able to carry out the following actions, and quite a few more (there is a rough sketch after the list).
1. Count all occurrences of each word in the entire document set.
2. Calculate mean values for each word.
3. Come up with some method to rank words in the document matrix. This is to avoid the typical abuses that you see where websites saturate pages with keywords to try and manipulate the results of a search engine.
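Here is the rough sketch promised above of the kind of operations I have in mind, using a map of maps for the matrix. This is illustrative only; the structure and the cap value are assumptions, not the final design.

#include <map>
#include <string>

// Term document matrix as term -> (doc id -> count).
typedef std::map<std::string, std::map<int, int> > TermDocMatrix;

// 1. Total occurrences of a word across the whole document set.
int total_count(const TermDocMatrix &tdm, const std::string &term)
{
    TermDocMatrix::const_iterator t = tdm.find(term);
    if (t == tdm.end()) return 0;
    int total = 0;
    for (std::map<int, int>::const_iterator d = t->second.begin(); d != t->second.end(); ++d)
        total += d->second;
    return total;
}

// 2. Mean occurrences per document that the word actually appears in.
double mean_count(const TermDocMatrix &tdm, const std::string &term)
{
    TermDocMatrix::const_iterator t = tdm.find(term);
    if (t == tdm.end() || t->second.empty()) return 0.0;
    return double(total_count(tdm, term)) / double(t->second.size());
}

// 3. One crude way to blunt keyword stuffing: cap the count that any
//    single document can contribute for a term before weighting.
int capped_count(const TermDocMatrix &tdm, const std::string &term, int doc, int cap)
{
    TermDocMatrix::const_iterator t = tdm.find(term);
    if (t == tdm.end()) return 0;
    std::map<int, int>::const_iterator d = t->second.find(doc);
    if (d == t->second.end()) return 0;
    return (d->second > cap) ? cap : d->second;
}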
67.6 Million links found
8.529 Million unique links found

Started on a Semantic Search engine 14 Oct 03

I have started on the Semantic Search Engine. I have downloaded 250MB of pages to test with and constructed a partial (test) word list from them. The word list holds the term, its frequency of occurrence and the originating doc id. The words have all been stemmed to reduce overhead; I used the Lingua::Stem module for this. I will create a full word list tomorrow if I get time. I also need to find a decent library in C++ because I don't fancy writing my own Singular Value Decomposition library (if you know what sort of maths would be involved in doing this, you also know that I am not at that level, yet! ;-). I also think Perl may be a bit slow for what I am trying to do, although I am always willing to give it a try and see what happens.
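For the record, each entry in the word list is essentially a (stemmed term, doc id, frequency) triple. On the C++ side I would expect to hold it as something like the following, which is purely a sketch of the record shape rather than my actual code:

#include <string>
#include <vector>

// One row of the word list produced by the Perl pass:
// the stemmed term, the document it came from, and how
// often it occurred in that document.
struct WordListEntry {
    std::string stemmed_term;
    int         doc_id;
    int         frequency;
};

typedef std::vector<WordListEntry> WordList;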
67.6 Million links found
8.529 Million unique links found

Checking URLs 12 Oct 03

I just noticed a few sites that offer link checking on a commercial basis. Does anyone know if there is any money in this? It is a surprisingly easy thing to do. In one month I managed to check the links on over 500,000 pages and produce error codes for every link, from a very humble machine and across the internet. If I were on an intranet with a 100Mbit connection to the webserver I could trawl an entire site at quiet times in no time at all. Admittedly my tool is not fully automated yet, but one place was charging $7500 to do a 50,000-page website. There must be more to it than what I am doing 😉
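For anyone wondering why I say it is easy: checking one link boils down to issuing a HEAD request and recording the response code. A minimal sketch using libcurl (not necessarily what my robot actually uses) would look something like this:

#include <curl/curl.h>

// Return the HTTP status code for a URL, or 0 if the request failed
// outright (DNS error, connection refused, and so on). Uses a HEAD
// request so no page body needs to be downloaded.
// Note: curl_global_init(CURL_GLOBAL_ALL) should be called once at
// program start before using this.
long check_link(const char *url)
{
    CURL *curl = curl_easy_init();
    if (!curl)
        return 0;

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_NOBODY, 1L);         // HEAD request only
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L); // follow redirects
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 30L);       // give up on slow servers

    long code = 0;
    if (curl_easy_perform(curl) == CURLE_OK)
        curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &code);

    curl_easy_cleanup(curl);
    return code;
}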
Latent Semantic Search Engines
After my exams I am going to play with building a search engine. I want to do this purely in an academic capacity. For the search engine gurus out there, please keep the following in mind: I only started reading about this stuff in the last few days, so there are probably several gaping holes in my descriptions below. I will correct anything anyone sends to me.
I know someone has already built a vector space engine in Perl and there has been an article written about it, but I would like to look at LSI and how to go about building a search engine using it. I also know about not re-inventing the wheel, but I learn by doing. I will probably use some of the code from the Perl article; where I do, I will mention it along with any changes I make.
Basically:
1. Reduce each document using a stop list. I might go ethical here 😉
2. Remove all words that have no semantic meaning.
3. Remove words that appear in every document.
4. Remove words that appear in only one document. (There is a rough sketch of steps 3 and 4 after the list.)
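Here is the sketch of steps 3 and 4 mentioned above, assuming a map from each term to the number of documents it appears in; the names are illustrative only, and the same terms would then also be dropped from the matrix itself.

#include <map>
#include <string>

// term -> number of documents it appears in.
typedef std::map<std::string, int> DocFreq;

// Steps 3 and 4: drop terms that appear in every document (they
// cannot distinguish anything) and terms that appear in only one
// document (they only ever match themselves).
void prune_by_document_frequency(DocFreq &df, int total_docs)
{
    for (DocFreq::iterator it = df.begin(); it != df.end(); ) {
        if (it->second >= total_docs || it->second <= 1)
            df.erase(it++);   // safe erase idiom for std::map
        else
            ++it;
    }
}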
The LSI model stores document representations in a matrix called a "Term Document Matrix". The LSI model does not care where a word appears in the document, although I personally think that word position is a good indicator of how relevant multiple words in a search string are to each other. After the TDM has been built and a result set found, further ranking can be given to documents based on how close the search terms sit to each other within the document.
LSI uses a term space that has many thousands of dimensions (as many as you have documents). This term space can be reduced using a mathematical method called Singular Value Decomposition (SVD) to collapse the high dimensional space into something more manageable, and in the process documents with semantically similar meanings are grouped more closely together.
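For completeness, and in standard SVD notation rather than anything specific to my setup, the term document matrix A is factored as

A = U \Sigma V^{T}

and the reduction keeps only the k largest singular values, giving the rank-k approximation

A_k = U_k \Sigma_k V_k^{T}

where U_k and V_k hold the first k columns of U and V, and \Sigma_k holds the k largest singular values. Queries and documents are then compared in this reduced space, which is where the grouping of semantically similar documents comes from.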
I really need to read more Knuth to see if I can find some pointers on how best to go about building a structure that can be easily manipulated. For development purposes I will probably do most of the spidering and preprocessing using Perl; its string manipulation is second to none. When it comes to the matrix manipulation and the heavy computational bits that I am expecting to see, I am not sure what kind of performance hit I might get with Perl, so I will use C++ for that part of it if I can find a library for it; if not, I might try a different method.
Anyway, enough idle rambling; I am off to revise some Maths.
52.18 Million links found
7.17 Million unique links found

Increasing performance on the database 10 Oct 03

I am in the process of reducing the total size of the database and increasing performance a bit in the process. As it stands, the links_found table is holding duplicate copies of the URLs in the home_page table. This was necessary because I wanted to be able to see which links belong to which web pages at the start, for testing purposes. I no longer need to do this, so I am going to change the format that they are stored in. I am going to make the following changes to the links_found table.
FROM
links_found
(
parent_url_id int4,
found_url varchar(2000)
);
TO
child_links
(
parent_url_id int4,
found_url int4
);
I have written a Postgres function to carry out the migration. As you can see below, the space savings are fantastic. I should also see an improvement in my indexes on this table. I am trying very hard to postpone buying any extra hardware. It will make it a bit more awkward to use the data now, but this is not my main priority at the moment. When I have lots of disk space I can then create temporary tables for any manipulations as I require them.
You can see below that I have created a new table called child_links. I have converted all the URLs into url_ids, which are "int4" types. I have also created indexes on this table. I can now remove all relations relating to the links_found table.
links# select relname, relfilenode, relpages from pg_class order by relpages limit 10;
      relname      | relfilenode | relpages
-------------------+-------------+----------
 links_found       |   188163825 |   644114
 links_found_pkey  |   246168688 |   588185
 lf_found_url_idx  |   246168682 |   559585
 child_links       |   246168690 |   255817
 child_links_pkey  |   246168692 |   216185
 home_page         |   188163817 |   118338
 parent_url_id_idx |   299992353 |   116508
 child_url_id_idx  |   299992259 |   116231
 home_page_pkey    |   246168684 |   103223
 home_page_url_key |   246168686 |   100120
 hp_url_id_index   |   246168683 |    15857
 hp_url_id_idx     |   301324542 |    15660
(12 rows)
Before I remove any of the relations I wanted to see my actual disk savings.
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/hdc2                 3938      3125       613  84% /
/dev/hdc1                   30         9        20  30% /boot
none                       505         0       504   0% /dev/shm
/dev/hdc3                 3938      2177      1561  59% /usr
/dev/hdc5                 8439      1466      6551  19% /links/pg_xlog
/dev/hdb1                 9628      5523      3616  61% /links/tables
/dev/hdb2                 9605      7044      2073  78% /links/temp
/dev/sda5                17364     13812      2684  84% /links/database
So that you can see what I had done to the files and where they were all pointing, here is a listing of the links database directory for all the big relations.
-rw------- 1 postgres postgres 8.0k Oct 12 03:54 188163815
lrwxrwxrwx 1 postgres postgres 33 Oct 10 23:17 188163817 -> /links/pg_xlog/postgres/188163817
lrwxrwxrwx 1 postgres postgres 30 Oct 10 22:35 188163825 -> /links/temp/postgres/188163825
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:37 188163825.1 -> /links/temp/postgres/188163825.1
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:38 188163825.2 -> /links/temp/postgres/188163825.2
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:39 188163825.3 -> /links/temp/postgres/188163825.3
lrwxrwxrwx 1 postgres postgres 32 Oct 10 22:41 188163825.4 -> /links/temp/postgres/188163825.4
lrwxrwxrwx 1 postgres postgres 32 Oct 10 23:30 246168682 -> /links/tables/postgres/246168682
lrwxrwxrwx 1 postgres postgres 34 Oct 10 23:44 246168682.1 -> /links/tables/postgres/246168682.1
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:41 246168682.2 -> /links/tables/postgres/246168682.2
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:41 246168682.3 -> /links/tables/postgres/246168682.3
lrwxrwxrwx 1 postgres postgres 34 Oct 11 00:42 246168682.4 -> /links/tables/postgres/246168682.4
-rw------- 1 postgres postgres 132M Oct 12 03:54 246168683
-rw------- 1 postgres postgres 855M Oct 12 03:55 246168684
-rw------- 1 postgres postgres 871M Oct 12 03:55 246168686
-rw------- 1 postgres postgres 1.0G Oct 11 01:59 246168688
-rw------- 1 postgres postgres 1.0G Oct 11 01:38 246168688.1
-rw------- 1 postgres postgres 1.0G Oct 11 01:54 246168688.2
-rw------- 1 postgres postgres 1.0G Oct 11 01:59 246168688.3
-rw------- 1 postgres postgres 499M Oct 11 01:59 246168688.4
-rw------- 1 postgres postgres 1.0G Oct 11 14:32 246168690
-rw------- 1 postgres postgres 1.0G Oct 12 00:42 246168690.1
-rw------- 1 postgres postgres 52M Oct 12 03:55 246168690.2
-rw------- 1 postgres postgres 1.0G Oct 12 03:52 246168692
-rw------- 1 postgres postgres 750M Oct 12 03:55 246168692.1
-rw------- 1 postgres postgres 1005M Oct 12 03:55 299992259
-rw------- 1 postgres postgres 995M Oct 12 03:55 299992353
-rw------- 1 postgres postgres 130M Oct 12 03:55 301324542
After dropping the relations and deleting all trace of them I had the following results. You may notice that the filenames are completely different: I had a major problem on the SCSI disk again and had to recreate the database using initdb because some of my pg_clog files went missing. To allow me to complete a vacuum I copied the last pg_clog file over the one that was missing. I am pretty sure that this is very dangerous, but it allowed me to complete the vacuum on the table. I was hoping to see some more errors but I got none. I then completely dropped the database, reformatted the hard disk and created a new file system on it. I had originally used the "largefile4" option of mke2fs but I have now left it at the default.
]# df -m
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/hdc2                 3938      3124       613  84% /
/dev/hdc1                   30         9        20  30% /boot
none                       505         0       504   0% /dev/shm
/dev/hdc3                 3938      2177      1561  59% /usr
/dev/hdc5                 8439      2369      5648  30% /links/pg_xlog
/dev/hdb1                 9628      2034      7105  23% /links/tables
/dev/hdb2                 9605      2007      7110  23% /links/temp
/dev/sda5                17093      7513      8712  47% /links/database
Here are the interesting bits from my base directory:
-rw------- 1 postgres postgres 1.0G Oct 12 2003 16992
-rw------- 1 postgres postgres 927M Oct 12 2003 16992.1
-rw------- 1 postgres postgres 121M Oct 12 2003 58021849
-rw------- 1 postgres postgres 783M Oct 12 2003 58021850
-rw------- 1 postgres postgres 752M Oct 12 2003 58021852
-rw------- 1 postgres postgres 1.0G Oct 12 2003 58021854
-rw------- 1 postgres postgres 68M Oct 12 2003 58021854.1
-rw------- 1 postgres postgres 872M Oct 12 2003 58021856
-rw------- 1 postgres postgres 872M Oct 12 2003 58021857
We can see that the total size of the database has dropped considerably.
links=# select relname, relfilenode, relpages from pg_class order by relpages desc limit 9;
            relname             | relfilenode | relpages
--------------------------------+-------------+----------
 child_links                    |       16992 |   249791
 child_links_pkey               |    58021854 |   139723
 home_page                      |       16982 |   116267
 parent_url_id_idx              |    58021856 |   111577
 child_url_id_idx               |    58021857 |   111577
 home_page_pkey                 |    58021850 |   100253
 home_page_url_key              |    58021852 |    96209
 hp_url_id_index                |    58021849 |    15434
 pg_proc_proname_args_nsp_index |       16640 |      125
50 Million links found
6.7 Million unique links found

A few million links more 09 Oct 03

Another day, another few million links. Since moving some of the data around and recreating the database there is definitely an increase in performance. I vacuum the database regularly because of the amount of updates that take place.
I am off into London on Saturday in search of some more hardware; I never intend to use www.scan.co.uk again. I am going to have a trawl around the computer fairs to see what I can find. I would really like to get a dual-CPU motherboard and run a couple of the new Opterons on it, but I will have to see what I can afford first and then decide what to do. At the moment it is not really processing power that is limiting me, it's the I/O on the system. I currently have 1GB of RAM installed, which is the most my motherboard can handle. The disks I am using are not really the quickest in the world either, so I need to get some decent 80-conductor cables for the IDE disks. If there were more RAM in the PC and a few more disks to move some of the database files onto, the Athlon XP1700 would start to suffer. I have been looking at the MSI and Tyan motherboards with onboard SATA. They are expensive but they would be the perfect choice for what I am doing. I really wish I had more room; I could then build some smaller PCs to run more robots.
I am going to start rethinking the layout of the tables. For instance, at the moment I am storing duplicate links in the links_found table that are already in the home_page table. These links vary in size from fairly small to massive, so I think that an integer value taken from the home_page(url_id) column would be a more efficient use of space. I am also thinking that, because the CPU is being under-utilised, I should separate the downloading of the pages from the parsing. This would mean I could get more efficient use of all the resources currently open to me.
44.8 Million links found
6.44 Million unique links found