Page Rank

I turned on Jenny’s PC today and noticed that I had a page rank of 5, although how I have managed to get that is another matter.
I have noticed that the hits have been going up recently and I imagine it has something to do with this. I think people will quickly leave, though, when they realise that there is not actually that much content up on my sites yet, at least none of any noteworthy quality.
I have tried to add my websites to DMOZ a few times and so far I have had no joy getting any of them in anywhere. The first time I tried it, DMOZ was not playing ball at all and refused to work.

14 Jan

Trying to get up to speed with Parrot and IMCC. It is proving quite difficult. There are rumours of a Feb release of Parrot, and I am looking forward to seeing if it stirs up some interest. It is certainly a lot more mature than people give it credit for.

04 Jan 04

I am currently investigating Movable Type. It seems like the danglies when it comes to doing blogs, but this would involve me getting the time to install it and, at the moment, time is a bit of a luxury for me.

Should I add a comment feature 28 Dec 03

I suppose I should add a tool where people can leave comments, but that might be asking for trouble and it is a little bit more involved than writing my own RSS generator. It's still fairly straightforward though, if you have any experience with a database and some free time, something I seem to get in fits and starts. I have a tool that allows people to comment on recruitment agencies but it is fairly basic and, compared to some of the forums around today, it’s not even in the same league.

RSS Job Database 23 Dec 03

I have been playing with RSS for a few days and have now got an RSS Job database. I spent ages trying to find RSS feeds for this and so far have not found very many. The database can be found here. An example URL which can be used to search and create RSS feeds from the database is as follows:
http://www.uklug.co.uk/cgi-bin/getjobs.rss?K=perl%20london&M=2&L=100&D=2592000&C=10&npo=1
This link creates an RSS Version 1.0 feed based on a search from the database. You can see from the URL that we are searching for the terms “perl” and “london”. For more information on how to use the database please see the help page.
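As a rough illustration, a search URL of this shape can be built and picked apart with a few lines of Python. This is only a sketch: the post confirms that K carries the search terms, but the meanings of M, L, D, C and npo are not described here, so the helper (whose names `build_feed_url` and `search_terms` are made up for this example) simply passes them through unchanged.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "http://www.uklug.co.uk/cgi-bin/getjobs.rss"


def build_feed_url(keywords, **other_params):
    """Return a feed URL that searches for the given keywords.

    Only K (the search terms) is explained in the post. Every other
    parameter is passed through as given because its meaning is not
    documented here.
    """
    params = {"K": " ".join(keywords)}
    params.update(other_params)
    # quote_via=quote encodes spaces as %20, matching the example URL.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def search_terms(feed_url):
    """Extract the list of search terms (the K parameter) from a feed URL."""
    query = parse_qs(urlsplit(feed_url).query)
    return query.get("K", [""])[0].split()


url = build_feed_url(["perl", "london"], M=2, L=100, D=2592000, C=10, npo=1)
print(url)
print(search_terms(url))
```

Run as is, this should rebuild the example URL above and then recover the two search terms from it.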

Sorting out the code 27 Oct 03

The last couple of days have been spent sorting out some of the Perl and C++. I have also expanded the stop list quite a bit. The Perl script that I was using to produce the file to build the term document matrix also got a bit of a working over.
I have increased the document list to 15700, which is still relatively small for an internet search engine, but it is now a respectable amount of text to search for a small intranet site, like that of a small law firm. I will gradually increase this as I go along and test to see what kind of results I get.
I have decided to write up what I have done with example code and put it on another few pages. Hopefully someone will be able to make some use of it.
Please see my Vector Space Search Engine page for more details of what I have done to get this working. I am using the term “working” in its weakest sense here, since I have been unable to test it properly yet.
83.1 Million links found
10.9 Million unique links found

Joys of DOS 5 20 Oct 03

Well, I managed to get to play with DOS 5 today to get the PC working. Don’t ask. I then had trouble trying to get the data off the old drive that was in the original PC because it insisted that ‘it’ was the operating system and that it was going to boot no matter where on the IDE channels you put it. It also seemed to ignore various jumper settings, which was very odd. A quick fdisk /mbr taught it who the boss was and we had no more trouble from the cheap seats.
After not having much success with www.realemail.co.uk due to spurious line disconnections we eventually got online with uklinux which was an ISP I used to use when Jason Clifford ran it. Unfortunately he is no longer concerned with it for various reasons, but he has started www.ukfsn.org which I can highly recommend.
I then downloaded ZoneAlarm and AVG for them. Unfortunately I forgot my video cable, or they could have been using a 19″ Iiyama Vision Master instead of the old 14″ Jurassic valve-driven thing they had been using previously.
35.876 Million links found
5.476 Million unique links found

20 Oct 03

After all the shuffling that I did two days ago to get the 20Gb disk out of the machine, I now need to check all the Postgres files to make sure that I have them in all the correct places and restart the robots to get the count up a little. Moving Postgres File
35.876 Million links found
5.476 Million unique links found

Checking URLs 12 Oct 03

I just noticed a few sites that offer link checking on a commercial basis. Does anyone know if there is any money in this? It is a surprisingly easy thing to do. In one month I managed to check the links on over 500,000 pages and produce error codes for every link. This is from a very humble machine and across the internet. If I were on an intranet with a 100Mb connection to the webserver I could trawl an entire site at quiet times in no time at all. Admittedly my tool is not fully automated yet, but one place was charging $7500 to do a 50,000 page website. There must be more to it than what I am doing 😉
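To give a feel for how little is involved, here is a minimal link-checking sketch in Python. It is not the tool described above, just an illustration under a few assumptions: only `<a href>` links are followed, each link gets a single HEAD request, and the names `LinkExtractor`, `extract_links`, `http_status` and `check_links` are invented for this example.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect the absolute form of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # 404, 500 and friends arrive as exceptions
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def check_links(base_url, html, fetch_status=http_status):
    """Map every link found on the page to its status code."""
    return {link: fetch_status(link) for link in extract_links(base_url, html)}


# Offline demonstration with a stand-in status function instead of the network.
page = '<a href="/jobs">Jobs</a> <a href="http://example.org/x">X</a>'
fake_status = lambda url: 404 if url.endswith("/x") else 200
print(check_links("http://example.com/index.html", page, fetch_status=fake_status))
```

Passing the status function in as a parameter keeps the parsing testable without touching the network. A real crawl would also need politeness delays, a queue of pages still to visit, and some care over servers that reject HEAD requests.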
Latent Semantic Search Engines
After my exams I am going to play with building a search engine. I want to do this purely in an academic capacity. For those search engine gurus out there, please keep the following in mind: I only started reading about this stuff in the last few days, so there are probably several gaping holes in my descriptions below. I will correct anything anyone sends to me.
I know someone has already built a vector space engine in Perl and there has been an article written about it, but I would like to look at LSI and how to go about building a search engine using it. I also know about not re-inventing the wheel, but I learn by doing. I will probably use some of the code from the Perl article; where I do, I will mention this and any changes I make.
Basically:
1. Reduce a document using a stop list. I might go ethical here 😉
2. Remove all words that have no semantic meaning.
3. Remove words that appear in every document.
4. Remove words that appear in one document only.
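The four steps above can be sketched in a few lines of Python. This is an illustration only, not the Perl preprocessing script: the stop list is a tiny made-up sample, steps 1 and 2 are both treated as stop-list filtering, and the names `tokenize` and `filter_terms` are invented for the example.

```python
import re
from collections import Counter

# A tiny sample stop list, assumed for illustration; a real one is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def filter_terms(documents, stop_words=STOP_WORDS):
    """Apply the stop list, then drop words found in every document
    or in only one document."""
    tokenized = [
        [word for word in tokenize(doc) if word not in stop_words]
        for doc in documents
    ]

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    n_docs = len(documents)
    keep = {word for word, df in doc_freq.items() if 1 < df < n_docs}
    return [[word for word in words if word in keep] for words in tokenized]


docs = [
    "The perl job in London",
    "A perl contract in Leeds",
    "Java job in London for perl people",
]
print(filter_terms(docs))
```

In this toy corpus "perl" disappears because it is in every document, and "contract", "leeds", "java", "for" and "people" disappear because each is in only one, which shows how aggressive these rules are on a very small collection.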
The LSI model stores document representations in a matrix called a “Term Document Matrix” (TDM). The LSI model does not care where a word appears in the document. I personally think that word position is a good indicator of how relevant multiple words in a search string are to each other, so after the TDM has been made and a result set found, further ranking can be given to documents based on where the search terms sit with respect to each other in the document.
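As a small illustration of what such a matrix looks like, here is a Python sketch that builds one from already-tokenized documents. It assumes raw word counts as the cell values (weighting schemes are a separate topic) with one row per term and one column per document; the function name `term_document_matrix` is made up for this example.

```python
from collections import Counter


def term_document_matrix(tokenized_docs):
    """Return (terms, matrix), where matrix[i][j] is the number of times
    terms[i] occurs in document j."""
    terms = sorted({word for doc in tokenized_docs for word in doc})
    counts = [Counter(doc) for doc in tokenized_docs]
    # A Counter returns 0 for words it has not seen, which fills the gaps.
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


docs = [
    ["perl", "job", "london"],
    ["london", "job", "perl"],  # same words as the first, different order
    ["java", "job"],
]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:8s} {row}")
```

The first two documents contain the same words in a different order and end up with identical columns, which is exactly the loss of position information described above.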
LSI uses a term space that has many thousands of dimensions (one for each distinct term). This term space can be reduced using a mathematical method called Singular Value Decomposition (SVD), which collapses the high-dimensional space into a more manageable number of dimensions; in the process, documents with semantically similar meanings are grouped closer together.
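A full SVD is a job for a proper numerical library, but the idea of collapsing the space can be shown with a much simpler stand-in. The Python sketch below uses power iteration to find only the largest singular value and its two vectors, which amounts to reducing everything to a single dimension. It is a toy under stated assumptions: the matrix is small, non-zero and non-negative (as a count matrix is), and the helper names are invented for the example.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def leading_singular_triplet(matrix, iterations=200):
    """Power iteration on A^T A.

    Returns (sigma, u, v) such that A v is approximately sigma * u, where
    sigma is the largest singular value. Assumes a non-zero matrix.
    """
    matrix_t = transpose(matrix)
    v = [1.0] * len(matrix[0])  # one entry per document (column)
    for _ in range(iterations):
        w = matvec(matrix_t, matvec(matrix, v))
        length = norm(w)
        v = [x / length for x in w]
    av = matvec(matrix, v)
    sigma = norm(av)
    u = [x / sigma for x in av]  # one entry per term (row)
    return sigma, u, v


# Rows are terms, columns are documents. Documents 0 and 1 share both of
# their terms; document 2 uses a term of its own.
tdm = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
sigma, u, v = leading_singular_triplet(tdm)
# Each document's coordinate in the single retained dimension.
print([round(sigma * x, 3) for x in v])
```

In the reduced one-dimensional space the two documents that share terms land on the same coordinate, while the unrelated one sits near zero. Keeping more than one dimension needs the full decomposition, which is where a C++ matrix library would come in.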
I really need to read more Knuth to see if I can find some pointers on how best to go about building a structure that can be easily manipulated. For development purposes I will probably do most of the spidering and preprocessing using Perl; its string manipulation is second to none. When it comes to the matrix manipulation and the heavy computational bits that I am expecting to see, I am not sure what kind of performance hit I might get from using Perl, so I will use C++ for that part if I can find a library for it. If not, I might try a different method.
Anyway, enough idle rambling. I am off to revise some Maths.
52.18 Million links found
7.17 Million unique links found