I have been working on a Distributed Search Engine project as a hobby for about a year now and have decided to put up a few more notes about it. I am starting to gather quite a bit of information on the site, some of it better than the rest.
Free Pages
I had another bright idea the other day. Please have a look at this free page website. I am interested in any feedback that anyone has about it.
Google Page Rank
I have been meaning to throw together some thoughts about Google Page Rank etc for some time and I finally got around to it tonight.
Google Page Rank Explained
More flex
I have been doing some more work on the HTML parser to see if I could improve the speed a bit. I decided to change yyin to read from a file rather than stdin and this has made quite a difference to the speed. It is now faster than HTML::Parser (but it's not as functional or tidy).
My next task is to either find a good extendable hashing library for C or call the C++ standard library from C. I have never had to do this before so it could be fun. I need a C equivalent of both the C++ standard map and the non-standard hash.
So far I have not had much luck finding a C hash library that fits the bill.
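For what it's worth, here is a rough sketch of the kind of thing a small home-grown replacement could look like in plain C. It is only an illustration: it uses a fixed number of buckets with chaining rather than true extendable hashing, it maps strings to ints only, and every name in it (table_new, table_put, table_get, table_free, NBUCKETS) is made up for this example rather than taken from any existing library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024 /* assumption: fixed size, no resizing in this sketch */

struct entry {
    char *key;            /* private copy of the key */
    int value;
    struct entry *next;   /* next entry in the same bucket */
};

struct table {
    struct entry *buckets[NBUCKETS];
};

/* djb2-style string hash */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Allocate an empty table; returns NULL if out of memory. */
struct table *table_new(void)
{
    struct table *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    for (size_t i = 0; i < NBUCKETS; i++)
        t->buckets[i] = NULL;
    return t;
}

/* Return a pointer to the value stored for key, or NULL if absent. */
int *table_get(struct table *t, const char *key)
{
    assert(t != NULL && key != NULL);
    struct entry *e = t->buckets[hash_str(key) % NBUCKETS];
    for (; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return &e->value;
    return NULL;
}

/* Insert or overwrite; returns 0 on success, -1 if out of memory. */
int table_put(struct table *t, const char *key, int value)
{
    int *slot = table_get(t, key);
    if (slot != NULL) {
        *slot = value;
        return 0;
    }
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    size_t len = strlen(key) + 1;
    e->key = malloc(len);
    if (e->key == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->key, key, len);
    e->value = value;
    size_t i = hash_str(key) % NBUCKETS;
    e->next = t->buckets[i];   /* push onto the front of the chain */
    t->buckets[i] = e;
    return 0;
}

/* Free every entry and the table itself. */
void table_free(struct table *t)
{
    if (t == NULL)
        return;
    for (size_t i = 0; i < NBUCKETS; i++) {
        struct entry *e = t->buckets[i];
        while (e != NULL) {
            struct entry *next = e->next;
            free(e->key);
            free(e);
            e = next;
        }
    }
    free(t);
}
```

Usage would be along the lines of calling table_put(t, "perl", 1) while parsing and table_get(t, "perl") later on. A real extendable scheme would also have to grow the bucket array as the table fills up, which this sketch deliberately leaves out.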
html and flex
I was writing a custom-built parser for the search engine but Mark Fowler asked me why I was not using Yacc and friends. At the time, every reference I had found to parsing HTML had said that Yacc was not the correct tool: although it is possible for strict HTML etc., it does not fit well in the real world, where most HTML is rubbish.
I decided to see what flex could offer instead and I can now safely say that flex is the exact tool that I was looking for, “thank you Mark”. Admittedly I am asking flex to do a lot more than just spit out tokens etc., but what it does suits my needs. It gives me the ability to run arbitrary code dependent upon state while parsing a document, which is exactly what I need.
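To give a feel for what "code dependent upon state" means, here is a tiny hand-rolled C sketch of the same idea. It is not flex output and not my actual parser; in flex this would be done with start conditions and rule actions. The names (scan_html, struct counts, IN_TEXT, IN_TAG) are invented for the illustration, and it knows nothing about comments, quoted attributes or entities.

```c
#include <assert.h>
#include <stddef.h>

/* The two states the scanner can be in. */
enum state { IN_TEXT, IN_TAG };

struct counts {
    size_t tags;        /* number of '<' seen while in text */
    size_t text_chars;  /* number of bytes seen outside tags */
};

/* Walk the document once; what each byte does depends on the state. */
struct counts scan_html(const char *html)
{
    assert(html != NULL);
    struct counts c = {0, 0};
    enum state st = IN_TEXT;

    for (const char *p = html; *p != '\0'; p++) {
        switch (st) {
        case IN_TEXT:
            if (*p == '<') {
                c.tags++;        /* action run on entering a tag */
                st = IN_TAG;
            } else {
                c.text_chars++;  /* action run on plain text */
            }
            break;
        case IN_TAG:
            if (*p == '>')
                st = IN_TEXT;    /* tag closed, back to text */
            break;
        }
    }
    return c;  /* an unclosed tag at end of input is simply tolerated */
}
```

The point is that the same byte triggers different code depending on the current state, and that broken input (a tag that never closes, say) does not stop the scan.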
I wrote a simple parser using flex++ and then decided to see how performance would compare against the Perl module HTML::Parser. Hands down HTML::Parser is faster, and not by any short margin. I have not looked into why this is. I am now going to write one in plain C and avoid C++ to see how much difference we can get.
Well! It took me about 5 minutes to change the flex++ parser back to a C parser and the performance improvement is quite drastic. I imagine this is because of the way flex++ is generating the lex.yy.cc file, because C++ is only marginally slower than C and some would argue it is faster.
I did a bit more playing around with the C version and I think I might be able to make it faster than the HTML::Parser version (of course it won’t be half as functional).
Testing was done as follows:
file big.html == 10MB of reasonably formatted HTML
perl HTML::Parser (with XS) == 0.5s
flex == 0.7s
flex++ == 8s
I am no guru at flex so I imagine that it could be a lot quicker once I get to grips with it.
Free Websites
I know it has already been done but I have been thinking about getting my own box hosted. The reason for this is that I seem to be using more and more resources on the one I am currently on and I don’t want to be taking the piss since it's hosted as a group share box.
This would enable me to do some things that I cannot do on the box I’m on at the moment. For instance:
I was thinking about setting up a similar facility to Geocities where people can create an account and they get some disk space to create their own website. The only requirement would be that they display some ads on their site. Other than that it would be free.
If anyone would be interested in this then leave a comment or email me.
M203 TMA04
This was the analysis block of the course and so far it has been the easiest I have done. I think this is because it was one of the more enjoyable blocks I have done on M203. We are now studying Groups and so far it is looking OK, but I will reserve judgment on this until the TMA has been and gone.
Xmlresume Collaboration
I was approached today about a possible collaboration between xresumes.net and xmlresume.org. I have asked what type of collaboration and am now waiting on a reply.
Miragorobot
I have seen plenty of robots spider my site, but by far the most polite as far as timing goes has got to be the miragorobot. It waited a full minute between each page request, which is almost unheard of.
New Perl Module
I have just registered another Perl module, or at least I have asked whether the name is correct / suitable.
Text::Convert::ToImage
I have been wanting to write another module for a while but CPAN is big and it's getting hard to find a suitable idea that has not already been done in some way. I had a good look around because I am not a fan of re-inventing the wheel, and the simplest method of doing what I wanted is to use PerlMagick. PerlMagick is not a simple module, so I decided to write my own wrapper around it to simplify things a bit for this specific task, hence the new module.
Just hope someone finds it useful.