Nutch and Lucene

We have been wanting a search engine at work for some time now, so I started looking at Lucene. I downloaded it, got it running and did some basic stuff, but what we really wanted was something web based, i.e. an out-of-the-box solution.
I suggested we try Nutch, so I spent today getting it running. Nutch itself is a piece of cake to get working; what wasn't so easy was getting Tomcat 4 working with Nutch.
After much swearing and perspiration I finally managed to get it working, and it is as sweet as a nut. We indexed just over 200 Word documents in a few minutes (the test machine is an old Celeron) and gave it a whirl. A straight out-of-the-box solution to your search engine problems. I was very impressed. I may have more to report on this next week because we might be putting it on one of the larger servers for a trial run.
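For anyone wanting to try the same thing, the rough recipe (from memory, so treat the exact paths, the war filename and the property name as approximate for your Nutch version) is: put your seed URLs in a file, run the crawl, drop the Nutch webapp into Tomcat, and start Tomcat from the crawl directory so the search page can find the index.

bin/nutch crawl urls -dir crawl.test -depth 3 > crawl.log 2>&1
cp nutch*.war $CATALINA_HOME/webapps/ROOT.war
cd crawl.test && $CATALINA_HOME/bin/startup.sh

If you would rather start Tomcat normally, pointing searcher.dir at the crawl directory in the webapp's nutch-site.xml should do the same job.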

ICANN & IWILL

What planet are ICANN transmitting from!
They have decided to change the policy on transferring domains: if you are unable to respond to a transfer request and deny it within 5 days, the transfer goes ahead. What does this mean and why is it bad?
I am the sole contact for all of my domains, which means that if I am on holiday when someone initiates a transfer request, I won't respond (because I am on holiday), and I will get home to find my domain has been given to someone else. The same thing would happen if I was in hospital. For the non-techs out there, the following is a good analogy.
You decide you would like to rent in London, so you have a look around, find yourself a nice property and sign a contract for 2 years with a first option to extend if you want. You pay your deposit and move in. It's great: people learn where you live, they know where to find you, and your little flat becomes a prime location. Having the option to always rent this flat is also great because you want to stay.
Then one day you go on holiday and someone who wanted the flat decides to move in. Under the current rules they cannot. Under the new rules, if they knock on the door and there is no reply for five days, they are able to break the lock and move in.
So when you get back, someone has moved into the flat you spent so much time on, and there is not a thing you can do about it because you didn't answer the door.
This is absolute nonsense, and I can only assume ICANN are doing it because there is some way to make money from all the court cases that are going to appear when the fraudsters start trying to snatch domains they shouldn't have.
Luckily for me I use 123-reg.co.uk, which sent me the following today:
Dear Customer,
On 12th November ICANN will introduce a new policy designed to make
transfers of non-UK domain names between Registrars quicker and easier.
From this date, if there is no acknowledgement from the domain
owner/admin contact within 5 days of a transfer request being made, the
transfer will automatically take place.
While a great step forward in ensuring domains can be freely
transferred by their owners, 123-Reg is concerned that this new system
could make it easier for your domain to be fraudulently transferred
away from 123-Reg. We would like to reassure you that we are taking
steps to guard against this happening to you. From the 12th, therefore,
all your non-UK domains registered with us will be automatically locked
so that only you can unlock them and initiate a transfer.
The new system will not affect your ability to manage your domain in
the usual way, and will simply mean that should you wish to change name
servers or transfer a domain away from 123-reg you will first need to
unlock it. This can be done quite simply from your 123-reg Control
Panel.
As we will be unable to accept liability if you unlock your domain and
an unauthorised transfer results, we strongly advise that you make sure
domains are kept locked at all times except when absolutely necessary
to change name servers or initiate a transfer.
Best Wishes,
The 123-Reg Team
Thank you, 123-reg, for protecting me from the idiocy of ICANN, which should now be named ICANN&IWILL.

Diligent Editing of HTML

I am a fan of standards, i.e. XHTML Transitional/Strict etc. To this end I do try to make sure that I am keeping my own sites reasonably compliant. Sites I do commercially are always 100% compliant, but that's because I insist on it and they have placed their trust in me.
Just recently I have had to convert a really bad site to XHTML Transitional, and if you had seen the markup you would have realized how big a task this was. To go through it by hand would have been enormous, and quite frankly I would have been unable to do it at the price I quoted without the following tools:
1. Vim (Bram Moolenaar)
2. Template Toolkit TT2 (Andy Wardley)
3. HTML Tidy (Dave Raggett)
4. W3C Validator (The W3C Validator Team)
The first tool (Vim) could really be any good text editor, e.g. Emacs, ed, or any of the vi children. I just happen to use Vim, and once you have learned the basics it is a joy to use and makes editing text almost an art.
TT2, the second tool, is slightly more specialized and less well known, but just as easy to use, and it deserves a big mention. TT2 is a templating system. Most people won't really understand, or even need to know, the advantages of this until they have to edit a 10+ page website and someone wants to change a font on some item on every page. This could of course be done using server side includes or some other method, but TT makes it easy and also exposes a programmatic API, which makes its functionality and versatility as wide as the programmer's skills. This only scratches the surface of what TT can actually do for you.
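To make that concrete, here is a minimal sketch of the sort of thing I mean (the file names and variables are made up for illustration): a shared header pulled into every page, driven by a few lines of Perl.

templates/header.tt:

<div class="header"><h1>[% title %]</h1></div>

templates/page.tt:

[% INCLUDE header.tt %]
<p>Page body goes here.</p>

build.pl:

use strict;
use warnings;
use Template;

# Process every page template and write the finished HTML to disk.
my $tt = Template->new(INCLUDE_PATH => 'templates')
    or die Template->error();

for my $page (qw(index about contact)) {
    $tt->process('page.tt', { title => ucfirst $page }, "html/$page.html")
        or die $tt->error();
}

Change header.tt once and every page picks it up on the next build, which is exactly the "change a font on every page" problem solved.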
The third tool is Dave Raggett's HTML Tidy. This one tool is what saved me from going stark raving mad this weekend. Visually selecting an area in Vim and then
'<,'>!tidy -asxhtml -icbq -wrap 100
was what kept me sane. This single command will take ANY HTML fragment and sanitize it for you. It adds a lot of guff that you may not want, but you can remove that and you have a sanitized version complete with CSS.
I just wanted the formatting, indenting and validation. I weeded out the CSS and I was left with a nice plain HTML document that I was then able to understand rather than some debauchery of a mess the devil would not have started with.
Using Tidy this way is a great way to get a clear place to start when converting a messy HTML page.
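If you have a whole directory of pages to get through, the same options can be run over every file in one go; the -m flag tells tidy to modify each file in place, so take a copy first (this is only a sketch of what I would do):

for f in *.html; do tidy -asxhtml -icbq -wrap 100 -m "$f"; done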
Last but not least are the W3C's validator pages for both CSS and XHTML. After all the grunt work is over it's time to check the pages, and using the methods above I managed to come in with:
Out of 29 pages:
20 HTML errors
2 CSS errors
This took me about 30 minutes to fix!

HTML Validation

I’m fairly lazy when it comes to validating my own site. I mean, who can be arsed making edits and then validating them all every time 😉
I know there are plenty of people who do it, but I am not one of them. I normally check to make sure that it looks OK and that's about it. I am not even that concerned about how it displays in Internet Explorer (I get minimal real visitors a month and the rest is blog spam touting Viagra). This is because I use Debian almost exclusively at work and at home, and it is a major pain in the ass to check the Windows side of things.
What I have tried to do is be quite strict with myself when I am making edits to my website. What this has resulted in is:
I checked 18 pages of my website and found 5 errors (all silly) all of which were on one page and caused by character references.
For those that have used the W3C validator this is not bad going at all. I know the purists will still think this is crap and that all HTML/XHTML should validate all the time. I believe this would be great too but unfortunately some of us have a life to lead outside the webosphere.
For those that always mean to get around to validating their websites but never do, my final word on HTML validation is this:
“If you can’t validate religiously, at least edit diligently”
How can I say this? Well, it takes more skill to get it right first time than to correct it after you have been shown your mistakes!

Dreamweaver is shit

Or at least my perception of it has been tainted by a website I am attempting to maintain that has been bolted together using Dreamweaver. Note I did not use the word “constructed” or “built”. I prefer “bolted” because it's a mess.
First:
JavaScript is everywhere, and most of it is bug-ridden crap. It's being used to load images and has replaced the humble “link” on half the website. This means Google cannot see half the website, which from a business point of view is critical. If the search engine cannot see your website then no one will find it!
Second:
Images everywhere. Every time a page was requested, over 40 images were requested from the server. This is mad on what appears to be a plain text website with no adverts. 25% of the images happened to be used as 1-pixel spacers. This is absolute madness!
MAD MAD MAD BLOODY MAD

Lexicon

I have started the process of building the lexicon for my search engine. It's actually surprising how slowly the list of words grows. This is partly due to me being quite strict in my definition of what constitutes a word. A normal search engine would need to be able to work with all sorts of arbitrary strings (I am not even considering encodings yet), but due to hardware constraints I have limited myself to Perl's
m/\w/
If a term doesn't match this, it won't go in the lexicon. I know this is a bit harsh, but unfortunately I don't have several hundred machines in a cluster to play with like the other search engines ;). I think if I get over one million terms in the lexicon I will be doing OK.
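For the curious, the filter really is as blunt as it sounds. A minimal sketch (the whitespace tokenising and lowercasing are just for illustration, not necessarily what the real thing does):

use strict;
use warnings;

my %lexicon;
while (my $line = <STDIN>) {
    for my $term (split /\s+/, $line) {
        next unless $term =~ m/\w/;   # the test from above: no word character, no entry
        $lexicon{lc $term}++;
    }
}
printf "%d distinct terms\n", scalar keys %lexicon;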

What is a DSO Exploit

I have noticed that a few people came here to find out information on what exactly a DSO Exploit is so I put together the following. If you need more leave me a comment and I will see what I can do.
Most of you are wondering why SpyBot is reporting a DSO Exploit. First, there is a bug in SpyBot at the moment that means it will always report this error. The bug will be fixed in a newer version of SpyBot.
Don’t panic, your system may be as clean as a whistle.
What is a DSO Exploit?
DSO stands for Data Source Object. A DSO exploit can be very severe when you consider that your hard drive, or pretty much anything else for that matter, is a data source and can be accessed using a method called data binding. A DSO exploit is where someone maliciously uses data binding techniques to gain access to material they are not meant to access. This was a bug in older versions of Internet Explorer, Outlook Express etc. Note I said older versions; the new versions no longer have this problem, and I suggest you upgrade to them to avoid the bug.
This does not mean you have to go and buy the latest Microsoft software. Microsoft releases service packs that come with the patches needed to fix this problem, so get the latest service pack for your system, install it, and you will be safe from this particular bug, or at least until some smart arse finds another way to crack it.
To stop SpyBot reporting the error, do the following:
Open SpyBot in advanced mode
Select: Settings
Select: Ignore Products
On the “All Products” tab scroll to “DSO Exploit” and check it.
sorted.

Secure Email Obfuscator / Captcha

I am sick to death of spam! It's a pain in the ass, but there does not seem to be much we can do about it.
One thing spammers do is harvest email addresses from the internet. This is surprisingly easy to do, because people want to put their email addresses online and it is very easy to write a spider. It is further compounded by applications that expect you to put your email address online; Movable Type is one, and although that can be turned off, doing so would just mean I get more spam.
One partial solution is to store email addresses as a security image. I wrote this utility to create my own email images as PNGs. If other people find it useful then I will extend its functionality. I know it is not unbreakable, as has been proven by those clever clogs at Berkeley who managed to crack the Gimpy captcha.
There are a few other methods to do this, but the hardest ones to crack are those that use some form of captcha; mine is almost there but has a bit to go. Any suggestions are welcome.
Using a secure image is another way to make it harder for spammers to collect emails.

The image above was created by the email obfuscator tool.
The following image was not produced by me but by BobG, one of the guys who left a comment. I have displayed the image here because I don't want to allow images to be displayed in the comments, otherwise we would give the blog spammers another avenue of attack. Anyway, Bob suggested I add the facility to choose a color for the background, so I suppose I will have to do this over the next few days.
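For anyone wondering how such a tool can be knocked together, here is a minimal sketch using Perl's GD module. The address, the colours (including a background colour along the lines Bob suggested), the font and the sizes are all just placeholders:

use strict;
use warnings;
use GD;

my $email = 'someone@example.com';
my $im    = GD::Image->new(8 * length($email) + 10, 20);

# The first colour allocated becomes the background.
my $bg  = $im->colorAllocate(255, 255, 204);
my $ink = $im->colorAllocate(0, 0, 0);

$im->string(gdMediumBoldFont, 5, 3, $email, $ink);

open my $out, '>', 'email.png' or die "email.png: $!";
binmode $out;
print {$out} $im->png;
close $out;

Obviously the harder part, and the bit that makes it more captcha-like, is distorting the text enough that a machine cannot simply OCR it back out.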


Creating a Debian Package

Ever since I started using Debian I have meant to try to create a Debian package, for various reasons:
a) I am just a curious bugger.
b) I am just a curious bugger.
Today was my chance to have a go and see exactly what it involves, or should I say what it involves to create a very minimal package.
The reason for this is that I have written a Content Management System that uses about 20 Perl modules, some internal, some external, and rather than worry every time an install or an upgrade comes around, we have decided to stick the whole thing in Subversion and then wrap each release in a Debian package. We have several sites to run this from, so the more we can automate the better, particularly if I can get the automatic testing sorted. Using Subversion and the Debian packages, the whole system should be relatively low maintenance as far as upgrades are concerned, and this is important. We don't want to scrub ourselves into the upgrade corner and find we have neither the time nor the budget to spend on an upgrade. We want it automated for us, and although it might be a pain in the arse to put in place, it will pay dividends when we come to change things later.
We also have a Postgres schema and some config files that need taking care of, but as I found out today this is relatively simple using the Debian packaging tools.
I suppose I should write a simple howto about how I did it, because the Debian New Maintainers' Guide is not really the best tutorial for those wishing to package their own application for internal use. I imagine there are some other tutorials around, but I didn't find them.
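In case it saves anyone else a day of reading, the bare bones are a debian/ directory alongside your source containing a changelog, a control file and a rules makefile (dh_make will generate sensible skeletons for all of these), plus in our case a postinst script to load the Postgres schema. A trimmed-down control file looks something like this; the names and dependencies here are only illustrative:

Source: mycms
Section: web
Priority: optional
Maintainer: Me <me@example.com>
Build-Depends: debhelper

Package: mycms
Architecture: all
Depends: perl, postgresql
Description: in-house content management system
 Wraps the CMS, its Perl modules and its config in a single .deb.

With that in place, running dpkg-buildpackage -rfakeroot -us -uc in the top-level directory spits out a .deb ready for dpkg -i.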

Spiders

I have not been able to do any work on the search engine for quite a while due to commitments with maths etc, but I now have some free time so I have restarted spidering.
What I am aiming for is about 100 million pages as a base to start working on. I am probably going to implement the whole thing using Postgres, because I do not have the time to write the software required to handle the storage. The pages themselves are stored as flat files on disk; it is the metadata that I will be keeping in Postgres. I will let Postgres do all the nitty-gritty work so that I can concentrate on the ranking and search algorithms.
I am also looking at just using plain text, i.e. stripping out the HTML completely and relying on the text rather than the formatting of a document to rank it. The reasons for this are:
a) It is much, much simpler. I started writing an HTML parser in flex and believe me it's a pain in the ass.
b) Plain text is also where the information is, and it is the information I am interested in. The formatting is not something I want to have to deal with. I intend to store each document raw to disk in case I change my mind later though 😉. There is a rough sketch of the kind of text extraction I mean below.
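Something along these lines, assuming the raw HTML is already sitting on disk; HTML::Parser is just my guess at the simplest route, and the real thing would feed the result into the lexicon and Postgres rather than printing it:

use strict;
use warnings;
use HTML::Parser;

# Collect only the text nodes of a document, throwing the markup away.
my $text = '';
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift() . ' ' }, 'dtext' ],
);
$p->ignore_elements(qw(script style));   # no point indexing code or CSS

open my $fh, '<', 'page.html' or die "page.html: $!";
my $html = do { local $/; <$fh> };       # slurp the whole document
close $fh;

$p->parse($html);
$p->eof;

$text =~ s/\s+/ /g;                      # collapse whitespace before ranking
print $text, "\n";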