PDF Tool

I was asked the other day if there was an easy way in Perl to join two PDF files together. There is, though it might not suit everyone: I managed to find a tool called the PDF Toolkit (pdftk). It's another of those handy tools that gets filed in the magic toolbox for later retrieval.
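As a sketch of the sort of thing pdftk can do (the file names here are just placeholders), joining two PDFs is a single "cat" operation:

```shell
# Join two PDFs with pdftk's "cat" operation; the inputs are
# concatenated in the order listed and written to joined.pdf.
pdftk first.pdf second.pdf cat output joined.pdf
```

If I remember rightly, the same cat operation can also take page ranges, so you can extract or re-order pages as well as join whole files.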

Vim regex

I wanted to add some comments to a large Perl module the other day. I had never really had to use Vim's regexes to any great degree, but since I know Perl I decided to see if it could be done.
I decided to add a little bit of POD above every function. The following was the result. Remember that in Vim the ^M is a control character for carriage return, entered by pressing Ctrl-V and then Enter.
The following regex
:%s/^sub\(.*\) {\(.*$\)/=head2 \1\(\)^M^MDescription:^M^M=cut^Msub \1{\2/
changes
sub summit_sub {
stuff;
}
to
=head2 summit_sub()
Description:
=cut
sub summit_sub{
stuff;
}
The following regex
:%s/^sub\(.*\)\n{\(.*$\)/=head2 \1\(\)^M^MDescription:^M^M=cut^Msub \1{\2/
changes
sub summit_sub
{
stuff;
}
to
=head2 summit_sub()
Description:
=cut
sub summit_sub {
stuff;
}

Mythical beast or Masterpiece

Pagerank! The term gets passed about like penicillin in a brothel. The cure-all for all afflictions! The bandaid for the unmentionables. Bliss in a bottle!
Unfortunately, penicillin is no longer the universal bandaid and hasn’t been for some time. Like everything else, we’re in an arms race. The Search versus the Found and we’re in extra time with search engines at a distinct disadvantage.
It doesn't sound too terrible, but there are billions at stake, both in dollars and in time. Just think how much time Google has saved you over the last few years. Could we quantify this? No we can't; it would be like trying to quantify the spoken word. Google is much more than just a handy tool for when you get a bit stuck looking for something; it's become the biggest research tool in existence and its worth is immeasurable. I can hear people saying that the internet is the biggest research tool. No it's not; it's only the repository. How can I say this?
Example:
A library ain't worth bugger all unless there's some way to find things. Put a blindfold on and go to the British Library. Find me a copy of the Bible; I don't care what version. It's the most common book in existence, so it shouldn't be a problem, should it?
The problem is that the internet is an unordered jumble of things and its net worth is only appreciated if we have some way to navigate it.
I want the search engines to win this arms race because it will increase our capacity to move forward on a global scale, reducing research times and aiding communication. What other technology can claim to do all this?
But at what cost!?
Some of the more popular engines are up to all sorts of tricks to leverage advantage over each other, and I think some of them might be losing sight of what's putting bread on their table. For instance, it's quite common for search engines to accept money from companies to have their adverts shown above and beyond everyone else, regardless of the best results for the search. Much to my dismay, Yahoo does this, amongst others. Is this really what we need? In an age where information is king, should money dictate what we see? It's always been the big fish that have dictated what we see, eat and breathe, so what's new?
Is it wise for search engines to start dictating what we find, or to track what we search for? Would they go that far? Are they already there? Quite a few of the main search engines are, and I don't like it!
No adverts is one of the main reasons I use Google. I am pretty confident that the search results are the net result of some mathematical jiggery-pokery and that money has not been a deciding factor. If this changes then I may move, and I would say that a lot of other people might move as well. Nearly everyone switched from Yahoo, AltaVista and the other engines a few years ago because Google was the best thing since sliced bread.
However, I have found myself using Yahoo on occasion to find what I am looking for when previously I would not even have contemplated using it. Are other users finding this to be the case? If they are then we might start to see a switch from Google to engines previously considered inferior.
Google is by no means mythical, but it is a masterpiece. I am, however, concerned that it might need some refurbishment.

Science Toys

I like to play with gadgets and the like, and managed to come across the following website.
Science Toys
It has some very nifty little experimental toys on it.

New Service on UKlug

I have now added the facility for users to add jobs to their own web pages. It's a simple operation of cutting and pasting two lines of HTML into a web page, and you will get a feed from the database appearing. The feed can be customised with no knowledge of CSS. For those who know some CSS, you can customise the feed completely by writing your own stylesheet.

More flex

I have been doing some more work on the HTML parser to see if I could improve the speed a bit. I decided to change yyin to read from a file rather than stdin, and this has made quite a difference to the speed. It is now faster than HTML::Parser (though it's not as functional or tidy).
My next task is either to find a good extendable hashing library for C or to call the C++ standard library from C. I have never had to do this before, so it could be fun. I need a C equivalent of the C++ standard map or the non-standard hash_map.
So far I have not had much luck finding a C hash library that fits the bill.
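To show the kind of thing I am after, here is a minimal sketch of a chained hash table in plain C with string keys and int values. The names (hash_put, hash_get) are my own, it uses a fixed bucket count rather than extendable hashing, and a real library would resize and handle allocation failure, but it gives the flavour of a C stand-in for std::map:

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 257  /* small prime; a real library would resize */

struct node {
    char *key;
    int value;
    struct node *next;
};

struct hash {
    struct node *buckets[NBUCKETS];
};

/* classic multiply-by-33 string hash */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* insert key=>value, overwriting any existing entry for key */
void hash_put(struct hash *t, const char *key, int value)
{
    unsigned long i = hash_str(key) % NBUCKETS;
    struct node *n;
    for (n = t->buckets[i]; n; n = n->next)
        if (strcmp(n->key, key) == 0) { n->value = value; return; }
    n = malloc(sizeof *n);
    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    n->value = value;
    n->next = t->buckets[i];   /* push onto the bucket's chain */
    t->buckets[i] = n;
}

/* returns 1 and fills *out if key is present, 0 otherwise */
int hash_get(const struct hash *t, const char *key, int *out)
{
    unsigned long i = hash_str(key) % NBUCKETS;
    const struct node *n;
    for (n = t->buckets[i]; n; n = n->next)
        if (strcmp(n->key, key) == 0) { *out = n->value; return 1; }
    return 0;
}
```

For the parser I would use something like this to count tag frequencies: `struct hash t = {0}; hash_put(&t, "p", 1);` and so on. The other route, wrapping std::map behind an extern "C" interface, avoids writing this at all but drags in a C++ link step.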

html and flex

I was writing a custom-built parser for the search engine, but Mark Fowler asked me why I was not using Yacc and friends. At the time, every reference I had found to parsing HTML said that Yacc was not the correct tool: although it is possible for strict HTML, it does not fit well in the real world, where most HTML is rubbish.
I decided to see what flex could offer instead, and I can now safely say that flex is the exact tool I was looking for (thank you, Mark). Admittedly I am asking flex to do a lot more than just spit out tokens, but what it does suits my needs. It gives me the ability to generate arbitrary code dependent upon state while parsing a document, which is exactly what I need.
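The state-dependent trick is flex's start conditions. This is only a toy sketch, not my actual parser, but it shows the shape: an exclusive TAG state switches the rules in force between tag content and ordinary text, and the driver points yyin at a file (the change that made the speed difference):

```lex
%{
#include <stdio.h>
%}
%option noyywrap
%x TAG
%%
"<"             { printf("tag: "); BEGIN(TAG); }
<TAG>[^>]+      { printf("%s", yytext); }
<TAG>">"        { printf("\n"); BEGIN(INITIAL); }
[^<]+           ;   /* text outside tags: ignored here */
%%
int main(int argc, char **argv)
{
    /* reading yyin from a file instead of stdin was the big win */
    if (argc > 1)
        yyin = fopen(argv[1], "r");
    yylex();
    return 0;
}
```

Build it with flex followed by cc on the generated lex.yy.c; each rule's action is arbitrary C, which is where the "generate arbitrary code dependent upon state" part comes in.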
I wrote a simple parser using flex++ and then decided to see how its performance would compare against the Perl module HTML::Parser. Hands down, HTML::Parser is faster, and not by any small margin. I have not looked into why this is. I am now going to write one in plain C, avoiding C++, to see how much difference we can get.
Well! It took me about 5 minutes to change the flex++ parser back to a C parser, and the performance improvement is quite drastic. I imagine this is down to the way flex++ generates the lex.yy.cc file, because C++ is only marginally slower than C, and some would argue it is faster.
I did a bit more playing around with the C version and I think I might be able to make it faster than the HTML::Parser version (of course it won't be half as functional).
Testing was done as follows:
file big.html == 10MB of reasonably formatted HTML
perl HTML::Parser (with XS) == 0.5s
flex == 0.7s
flex++ == 8s
I am no guru at flex, so I imagine it could be a lot quicker once I get to grips with it.

Miragorobot

I have seen plenty of robots spider my site, but by far the most polite, as far as timing goes, has got to be the miragorobot. It waited a full minute between each page request, which is almost unheard of.

New Perl Module

I have just registered another Perl module, or at least I have asked if the name is correct and suitable.
Text::Convert::ToImage
I have been wanting to write another module for a while, but CPAN is big and it's getting hard to find a suitable idea that has not already been done in some way. I had a good look around, because I am not a fan of re-inventing the wheel, and the simplest method of doing what I wanted is to use PerlMagick. PerlMagick is not a simple module, so I decided to write my own wrapper to simplify things a bit for this specific task, hence the new module.
I just hope someone finds it useful.