XSLT and FOP

I was given the job of creating a PDF using FOP and XSLT yesterday. Since I had never used either before to any great degree, I had resigned myself to several days of googling, but I managed to all but finish it today. It was surprisingly easy to get the bulk of the formatting done, but due to FOP’s lack of support for widows and orphans the rest will be a bit harder.
I also had a lot of trouble trying to get “analyze-string” to work, which is not too surprising since it is an XSLT 2.0 instruction. I eventually fudged a regex of my own using “substring” and it seems to do the trick.

Parrot and IMC

We are off up to see my sister in Norfolk in a couple of days’ time. It should be a good laugh. I have been trying to come up with a good way to bodge objects until we get them in IMCC. The best way I have found to date is name mangling, but this is a pain in the ass. I have also considered containers but, although they are easier to implement, the result will not look much like objects by the time I’m finished.

DBDI

I have been talking to Tim Bunce about a DBD interface for Parrot, and it would be extremely nice to get one in for the Parrot release in Feb. However, this is unlikely due to IMCC’s lack of objects. We could bodge it in the meantime and fix it later, but fixing things is always ten times more difficult than bodging them in the first place.

Some RSS 27 Dec 03

I have been playing with RSS for a few days, and since most of the RSS I have seen has been blogs, I decided to RSS-enable my plain old XHTML diary and turn it into a whizzy, RSS-compliant, new-fangled jobby. I have no reason for doing this other than possible self-promotion via my massively increased site traffic… “NOT”.
I can hear people scream “use X or Y, do not write your own”. What would be the fun in using someone else’s RSS generator? I had a look at some of the more noteworthy blogs and I noticed that there is an awful lot of commented-out text in the source of the file. This seems to me to be a bit ignorant, because I am paying for bandwidth and every bit counts ;-). I know that’s a lame excuse, but I could not help it, nor could I think of a better one. To cut a long story short, I used a very crude method to do it.
Using a couple of extra “span” tags I was able to come up with some compliant RSS from my blog (there is a sketch of the markup after the script below). The joy of Perl.
The Script I used
The following script is quite rough around the edges but it gets the job done. If you have any questions about the Perl, or why I just had to write my own, feel free to ask.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
use URI::URL;
use XML::RSS;
use LWP::Simple;

my $base     = "/hjackson";
my $base_url = "http://www.hjackson.org";

# Map each month's XHTML page to the RSS file it should produce.
my $PAGES = {
    "$base_url/cgi-bin/blog/december.html"  => 'htdocs/blog/december.xml',
    "$base_url/cgi-bin/blog/november.html"  => 'htdocs/blog/november.xml',
    "$base_url/cgi-bin/blog/october.html"   => 'htdocs/blog/october.xml',
    "$base_url/cgi-bin/blog/september.html" => 'htdocs/blog/september.xml',
};

# Simple state machine: 0 = not seen yet, 1 = inside the element,
# 2 = the text for that element has been captured.
my $STATE = {
    intext  => 0,
    intitle => 0,
    inlink  => 0,
    inspan  => 0,
};

my $RSS = {
    link        => "",
    title       => "",
    description => "",
};

my $rss;    # the XML::RSS object for the page currently being parsed

sub start_tag {
    my ($self, $tag_name, $attr) = @_;
    if ( lc($tag_name) eq 'span' ) {
        my $class = lc( $attr->{class} || '' );
        $STATE->{intitle} = 1 if $class eq 'blogtitle';
        $STATE->{intext}  = 1 if $class eq 'blogtext';
    }
    # An anchor inside a title span carries the entry's permalink.
    if ( lc($tag_name) eq 'a' and $STATE->{intitle} == 2 ) {
        $STATE->{inlink} = 1;
        $RSS->{link}     = $attr->{href};
    }
}

sub text {
    my ($self, $text) = @_;
    if ( $STATE->{intitle} == 1 ) {
        $RSS->{title}     = $text;
        $STATE->{intitle} = 2;
    }
    if ( $STATE->{intitle} == 2 and $STATE->{inlink} == 1 ) {
        # The anchor text is the title we actually want.
        $RSS->{title}    = $text;
        $STATE->{inlink} = 2;
    }
    if ( $STATE->{intext} == 1 ) {
        $RSS->{description} = $text;
        $STATE->{intext}    = 2;
    }
    # Once we have a title, a link and a description, emit the item.
    if (    $STATE->{intitle} == 2
        and $STATE->{intext} == 2
        and $STATE->{inlink} == 2 )
    {
        create_rss();
    }
}

sub end_tag {
    my ($self, $tag_name, $attr) = @_;
    # Nothing to do on close; the state machine is reset in create_rss().
}

sub create_rss {
    $rss->add_item(
        title       => $RSS->{title},
        link        => $RSS->{link},
        description => $RSS->{description},
    );
    $RSS->{title}       = "";
    $RSS->{link}        = "";
    $RSS->{description} = "";
    $STATE->{intext}    = 0;
    $STATE->{intitle}   = 0;
    $STATE->{inlink}    = 0;
}

while ( my ($html_page, $xml_page) = each %{$PAGES} ) {
    my $content = get($html_page)
        or die "Cannot fetch $html_page\n";
    $rss = XML::RSS->new( version => '1.0' );
    $rss->channel(
        title       => "Harry Jackson's Blog",
        link        => "http://www.hjackson.org",
        description => "Just my Blog",
        dc => {
            date      => '2000-08-23T07:00+00:00',
            subject   => "Harry's Blog",
            creator   => 'harry@hjackson.org',
            publisher => 'harry@hjackson.org',
            rights    => 'Copyright 2003, Harry Jackson',
            language  => 'en-us',
        },
        syn => {
            updatePeriod    => "hourly",
            updateFrequency => "1",
            updateBase      => "1901-01-01T00:00+00:00",
        },
    );
    # We only care about span and a tags.
    my $p = HTML::Parser->new( api_version => 3 );
    $p->report_tags( 'span', 'a' );
    $p->handler( start => \&start_tag, "self,tagname,attr" );
    $p->handler( text  => \&text,      "self,text" );
    $p->handler( end   => \&end_tag,   "self,tagname,attr" );
    $p->parse($content) || die $!;
    open( FILE, ">$base/$xml_page" )
        or die "Cannot open file $base/$xml_page: $!\n";
    print FILE $rss->as_string;
    close(FILE);
}
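For reference, this is roughly the markup the parser expects. It is a reconstruction from the state machine above rather than a paste from the real pages, so treat the entry text and the href as made up; the important parts are the “blogtitle” and “blogtext” classes and the anchor inside the title span. Note that the whitespace between the opening title span and the anchor is what bumps the title state along before the link is seen. The little test below just prints the events the parser would feed the handlers.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical entry markup, reconstructed from the handlers above.
my $sample = <<'END_HTML';
<span class="blogtitle">
<a href="http://www.hjackson.org/cgi-bin/blog/december.html#rss">Some RSS</a>
</span>
<span class="blogtext">I have been playing with RSS for a few days...</span>
END_HTML

# Dump the start-tag and text events so the state transitions are visible.
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { my ($tag, $attr) = @_;
                       print "start: $tag class=", ($attr->{class} || ''), "\n"; },
                 "tagname,attr" ],
    text_h  => [ sub { print "text: $_[0]\n" if $_[0] =~ /\S/; }, "dtext" ],
);
$p->report_tags( 'span', 'a' );
$p->parse($sample);
$p->eof;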

Spidering the Internet 07 Dec 03

I have started to document what I have been doing to construct the spiders. It is not really a tutorial; it’s more about what I did and how I did it. I doubt it is even close to how it should be done, but I am enjoying doing it, and I get to research some interesting areas of information retrieval and processing while doing it, so what the hell.

Finishing the Robots 03 Dec 03

I have been quite busy lately dipping my toes in various waters, hence the lack of entries. I have actually finished with the robots for the short term and have now moved on to the search engine part of the project.
I am enjoying building the search engine because I get to work with C++ again, which is another language I enjoy using. I like it because I feel as close to the hardware as I am when using C, but with various high-level tools at hand when I need them. I picked C++ over C because it has the STL, which I have used before. I imagine that most commercial search engines are using either C or C++ for the same reasons.

Weeding the database 12 Nov 03

You will see that the database has been reduced in size quite a bit. I have been running out of space, so I decided to do some weeding. What I have done is fix all the URLs that had a fragment part. URLs come in the following format:

scheme://host/path?query#fragment

The fragment part of the URL is not really required by us because it indicates a position within a document. This level of granularity is of no use to us; we are only interested in the document itself. I wrote a simple Perl script in conjunction with a Postgres function to weed these out. In the process I deleted all links that were found by following the original URL with the fragment, which is what has led to the reduction in total links found. If you have a look at the latest robot code you will see that I now cater for this fragment part and strip it off before requesting the document.
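The stripping itself is a one-liner with the URI module. This is only a sketch with a made-up URL, not the actual robot code:
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Drop the fragment part before a URL is stored or requested.
my $uri = URI->new('http://www.example.com/docs/page.html?q=1#section-2');
$uri->fragment(undef);    # remove '#section-2'
print $uri, "\n";         # prints http://www.example.com/docs/page.html?q=1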
55.0 Million links found
11.9 Million unique links found

Re-writing the spiders 08 Nov 03

I have been very busy lately re-writing the spiders for the search engine. I have decided to write up what I did to build the spider in the vain hope that someone may find it useful one day. I digressed several times and had some fun writing a recursive one, but I eventually settled on writing an iterative robot that uses Postgres to store the links, partly because I already had a database with several million links in it. Please see the link above for more details. I have also managed to download a few thousand documents for the search engine, hence the increase in the links found; this was caused by parsing the documents I had found while experimenting with the new robots.
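In outline the iterative robot looks something like the sketch below. The table and column names are invented for the example, and the real robot does rather more (robots.txt, politeness delays, error handling and so on):
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Iterative crawl: pull a batch of unvisited links from Postgres, fetch
# each page, and queue any new links found. The schema is hypothetical.
my $dbh = DBI->connect('dbi:Pg:dbname=spider', 'harry', '', { RaiseError => 1 });
my $ua  = LWP::UserAgent->new( timeout => 30 );

my $next = $dbh->prepare('SELECT url FROM links WHERE visited = false LIMIT 100');
my $mark = $dbh->prepare('UPDATE links SET visited = true WHERE url = ?');
my $add  = $dbh->prepare('INSERT INTO links (url) SELECT ? WHERE NOT EXISTS (SELECT 1 FROM links WHERE url = ?)');

while (1) {
    $next->execute;
    my $batch = $next->fetchall_arrayref;
    last unless @$batch;    # queue empty, crawl finished
    for my $row (@$batch) {
        my $url  = $row->[0];
        my $resp = $ua->get($url);
        $mark->execute($url);
        next unless $resp->is_success;
        for my $found ( extract_links( $resp->content, $url ) ) {
            $add->execute( $found, $found );
        }
    }
}

# Pull anchor hrefs out of a page, absolutise them against the page URL
# and strip fragments, as described in the weeding entry above.
sub extract_links {
    my ($html, $page_url) = @_;
    my @links;
    my $ex = HTML::LinkExtor->new( sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' and $attr{href};
        my $abs = URI->new_abs( $attr{href}, $page_url );
        $abs->fragment(undef);
        push @links, $abs->as_string;
    } );
    $ex->parse($html);
    return @links;
}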
85.0 Million links found

Getting the vector space search engine running 25 Oct 03

I have spent the last few days trying to get the vector space search engine running. The code is in a bit of a mess at the moment but it’s coming along. All I can say is thank god for the STL; without it I would have been in for a hell of a job. I have now managed to create a sparse vector space matrix from 2397 documents. This needs to be increased before I can really start testing any weighting algorithms.
At the moment this is using 30MB of memory, which is the maximum used during the entire process. I did have it running at 256MB, but that was my first round at designing the matrix. I then showed my program a copy of Knuth volume 3; it cowered in fear and its shoe size quickly dropped to something more respectable. I am pretty sure that I could drop this even further by writing my own data structure without using the STL, but I am happy with it at the moment.
I am not entirely happy with the output of the program yet because the inner product routine is not producing the correct output, but this should be relatively easy to fix. I really need to do a review to make sure that I am not missing any points in my methodology.
I also need to try to compile a better stop list. The one I am using is not particularly good, and a better one is a sure-fire way of reducing the RAM footprint.
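To make the idea concrete, here is the whole model in miniature in Perl rather than C++: sparse vectors as hashes, a toy stop list and the inner product. The documents and stop list are made up, and the real engine will weight terms rather than use raw counts:
#!/usr/bin/perl
use strict;
use warnings;

# Toy stop list; the real one is compiled from a proper word list.
my %stop = map { $_ => 1 } qw(the a an and or of to in is it);

# A sparse document vector: only terms that actually occur are stored.
sub vectorise {
    my ($text) = @_;
    my %v;
    for my $word ( split ' ', lc $text ) {
        $word =~ s/[^a-z0-9]//g;    # crude normalisation
        next if $word eq '' or $stop{$word};
        $v{$word}++;
    }
    return \%v;
}

# Inner product of two sparse vectors: walk the smaller hash and look
# each term up in the other, so absent terms cost nothing.
sub inner_product {
    my ($u, $v) = @_;
    ($u, $v) = ($v, $u) if keys %$u > keys %$v;
    my $sum = 0;
    $sum += $u->{$_} * ( $v->{$_} || 0 ) for keys %$u;
    return $sum;
}

my $doc   = vectorise('The quick brown fox jumps over the lazy dog');
my $query = vectorise('brown dog');
print inner_product( $query, $doc ), "\n";    # prints 2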
83.1 Million links found
10.9 Million unique links found

C++ Libraries for sparse matrix manipulations are not easy to find 20 Oct 03

After much hunting for a library that I can use to implement an LSI search engine, I have had little luck. The library that seems right for the job is called SVDPACK; it is written in Fortran and has been ported to C++. However, it has not been ported to the humble x86 architecture. It looks like I will have to run with writing the vector space search engine instead.
I managed to write a C++ routine to read the output of the Perl term document parser. This is a very simple parser that splits all the words in the document on whitespace. I know that there are reasons for not doing it like this, so if I get time I will come up with a better method later, but for now it will do.
My next task is to take the output of the C++ program and create a Term Document Matrix from it that I can manipulate easily. I need to be able to carry out the following actions, and quite a few more (there is a rough sketch after this list).
1. Count all occurrences of each word in the entire document set.
2. Calculate mean values for each word.
3. Come up with some method to rank words in the document matrix. This is to avoid the typical abuses you see where websites saturate pages with keywords to try to manipulate the results of a search engine.
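Here is the rough sketch promised above, with a couple of inline stand-in documents in place of the real parser output:
#!/usr/bin/perl
use strict;
use warnings;

# Two stand-in documents; the real matrix is built from thousands.
my %docs = (
    doc1 => 'the cat sat on the mat',
    doc2 => 'the dog sat on the log',
);

# Term Document Matrix as a hash of hashes: $matrix{term}{doc} = count.
my %matrix;
for my $doc ( keys %docs ) {
    $matrix{$_}{$doc}++ for split ' ', lc $docs{$doc};
}

# Total occurrences of each word across the set, and the mean per document.
my $ndocs = keys %docs;
for my $term ( sort keys %matrix ) {
    my $total = 0;
    $total += $_ for values %{ $matrix{$term} };
    printf "%-4s total=%d mean=%.2f\n", $term, $total, $total / $ndocs;
}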
67.6 Million links found
8.529 Million unique links found