Who’s searching on what

I noticed that someone searched for "gimpy" on my blog today, and it got me wondering what terms people are using to find my site, so I ran the following over my logs:
perl -ne '/.*google.*?&q=(.*?)(&|").*$/; print "$1\n" if $1;' *.log | uniq
I am sure there is a shorter and better way to do it, but this was more than enough for a quick look.
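If you want something slightly more robust, here is a sketch (assuming Apache-style logs with the referrer in quotes; sort is needed because uniq only collapses adjacent duplicates, and URI::Escape decodes the %-encoded queries):
perl -MURI::Escape -ne 'print uri_unescape($1), "\n" if m{google[^"]*[?&]q=([^&"]+)}' *.log | tr '+' ' ' | sort | uniq -c | sort -rn
This prints each search term with a count, most popular first.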

Marketing a simple website

I have spent a fair bit of time working on another website that had some of the most horrible HTML I have ever seen. I managed to upload the site last night and it is now live. I didn't design the site; I just converted a Dreamweaver mess into HTML Transitional that validates.
I have already made a few entries about this in my blog, so here's the link:
Aerospace NDT
The people at Aerospace NDT realised they were not getting enough from their website, so they contacted me to see if I could do something with it. I had a look at their site, wrote up what I thought of it, and gave them some advice on what I thought could be done to improve its visibility. They seemed to like what I said, because I got the job.
I am basically tasked with getting their site up the Google ranks, which I have already done, and quite substantially. I was very lucky, and they were unlucky, in that the single greatest change required to the site so far has been the removal of the splash screen. They were unlucky because their last developer had left them with a site that search engines could not see: there was not a single plain link off the splash screen. It also meant that visitors in browsers without Flash could not see the website at all.
I have made some fundamental changes to their site during the conversion from the old one, so we should see an overall increase in the Google ranks, but time will tell. I am keeping a tally for certain search terms to make sure that what we do has a positive effect on the site, so watch this space.

Nutch and Lucene

We have been wanting a search engine at work for some time now, so I started looking at Lucene. I downloaded it, got it running and did some basic stuff, but what we really wanted was something web based, i.e. an out-of-the-box solution.
I suggested we try Nutch, so I spent today getting it running. Nutch itself is a piece of cake to get working; what wasn't so easy was getting Tomcat 4 working with Nutch.
After much swearing and perspiration I finally managed to get it working, and it is as sweet as a nut. We indexed just over 200 Word documents in a few minutes (the test machine is an old Celeron) and gave it a whirl. A straight out-of-the-box solution to your search engine problems. I was very impressed. I may have more to report on this next week, because we might be putting it on one of the larger servers for a trial run.
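For anyone wanting to try it, the basic recipe was roughly the following (from the Nutch tutorial of the day; the URL is a placeholder and exact file names and flags vary by version, so treat this as a sketch rather than gospel):
echo 'http://intranet.example.com/' > urls
# edit conf/crawl-urlfilter.txt so it only lets your own domain through
bin/nutch crawl urls -dir crawl.test -depth 3
# then drop the Nutch war file into Tomcat's webapps directory and point
# the search webapp at the crawl.test directory
The Tomcat side was the fiddly part for me, mostly getting the webapp to find the index.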

Lexicon

I have started the process of building the lexicon for my search engine. It's actually surprising how slowly the list of words grows. This is partly due to me being quite strict in my definition of what constitutes a word. A normal search engine would need to be able to work with all sorts of arbitrary strings (I am not even considering encodings yet), but due to hardware constraints I have limited myself to Perl's
m/\w/
If a token doesn't match this, it won't go in the lexicon. I know this is a bit harsh, but unfortunately I don't have several hundred machines in a cluster to play with like the other search engines ;). I think if I get over one million terms in the lexicon I will be doing OK.
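For the curious, the filtering amounts to something like this (a minimal sketch, not my actual indexer; I am taking the m/\w/ test in its strict reading, i.e. the whole token must be word characters):
#!/usr/bin/perl
use strict;
use warnings;

my %lexicon;
while (my $line = <>) {
    for my $token (split /\s+/, lc $line) {
        # strict definition of a word: nothing but \w characters
        next unless $token =~ /^\w+$/;
        $lexicon{$token}++;
    }
}
print scalar(keys %lexicon), " terms in the lexicon\n";
Run it over a pile of text files and the count of hash keys is the size of the lexicon.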

Robots.txt file

There appears to be some misunderstanding surrounding the usage of the robots.txt file.
The following is just a fraction of the stuff I have found while spidering websites.
Noarchive: /
The "noarchive" statement belongs in a meta tag (e.g. <meta name="robots" content="noarchive">); it should not be in the robots.txt file, and it's not part of the standard.
I believe that the following, or something similar, should be in the standard, but "Crawl-delay" isn't part of it yet:
Crawl-Delay: 10
Crawl-delay: 1
Crawl-delay: 60
It is implemented by a few crawlers, but people insist on doing the following:
User-agent: *
Crawl-delay: 1
The proper way is to scope it to a crawler that is known to support it, as follows:
User-agent: Slurp
Crawl-delay: 1
I know Yahoo's crawler (Slurp) adheres to the Crawl-delay directive, but here we are endorsing a non-standard method; whether this is a good or bad thing is left up to the reader to decide. I think there needs to be a delay-type option in the robots.txt file, having been hammered once by MSN's bot.
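Since it is non-standard, a spider has to pick Crawl-delay out of robots.txt by hand; here is roughly how mine might do it (a sketch with a made-up agent name and a naively simple parser):
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $agent = 'MySpider';                    # hypothetical agent token
my $ua    = LWP::UserAgent->new(agent => "$agent/0.1");
my $res   = $ua->get('http://example.com/robots.txt');

my $delay = 0;
if ($res->is_success) {
    my $current = '';
    for my $line (split /\n/, $res->content) {
        # remember which User-agent block we are in
        $current = lc $1 if $line =~ /^\s*User-agent:\s*(\S+)/i;
        # honour a Crawl-delay aimed at us or at everyone
        if ($line =~ /^\s*Crawl-delay:\s*(\d+)/i
            && ($current eq '*' || $current eq lc $agent)) {
            $delay = $1;
        }
    }
}
sleep $delay if $delay;   # pause between requests to this host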
Then we have the people who think that they need to authorise a spider to spider their website:
Allow: /pages/
Allow: /2003/
Allow: /services/xml/
Allow: /Research/Index
Allow: /ER/Research
Allow: /
The reason for not having an Allow directive is simple: hardly any of the internet would be indexed, because only a fraction of the websites online actually use a robots.txt file. Implementing an Allow directive would mean that websites are closed for business to the spiders by default. For instance, take the following directive:
Allow: /index_me/
Is the spider then to assume that only that directory is available on the entire website? What can the spider assume about the above directive? To me it reads that only the "index_me" directory is to be indexed, and then what is the point of the Disallow directive?
The Disallow directive was chosen because the internet is, for all intents and purposes, a public medium: we all opt in when we put our websites up, then we opt out of the things we don't want indexed.
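That opt-out model is also what the standard Perl tooling implements. WWW::RobotRules (part of libwww-perl) only knows about Disallow, so checking a URL looks like this (example.com and the agent name are placeholders):
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

my $rules      = WWW::RobotRules->new('MySpider/0.1');  # hypothetical agent
my $robots_url = 'http://example.com/robots.txt';
# no robots.txt at all means everything is allowed
$rules->parse($robots_url, get($robots_url) || '');

my $url = 'http://example.com/katrina/';
print $rules->allowed($url) ? "fetch $url\n" : "skip $url\n";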
My favorites, though, are the following honest mistakes:
Diasallow: /editor/
Diasallow: /halcongris/
Disallow /katrina/
Diasallow: /reluz/

Show me NOW

There is an ever-increasing amount of information on the internet. This fact appears obvious in the extreme, but what might not be so obvious is the ever-increasing amount of duplicate information.
Have you ever tried looking for "man find unix" on Google? Nearly every page displayed has the same information. I know that some pages are slightly different, but it's becoming increasingly difficult to find what you are actually looking for. This is not an isolated incident; most search engines are suffering.
Google was a fantastic leap in the right direction, but has anything changed in the last two years that visibly makes a difference to the layman? I haven't seen it; have you?
Everyone assumes that the more pages a search engine has in its database, the better the search engine. As popular as this school of thought is, it's wrong! Very wrong! Why?
Up until two months ago I used Google exclusively and recommended it to everyone who wanted to find something on the internet. Just recently I have found that Google is not providing me with the goods. I have often caught myself switching to Yahoo in order to find what I am looking for; I have even gone as far as LookSmart and got better results.
At first I considered these breaches from the one true search engine as isolated anomalies arising from the eclectic nature of the topics I was researching, but empirical evidence suggests otherwise. I am now going to Google, trying a search, and then going straight to Yahoo and getting what I am looking for. Am I a heretic to suggest that Google is just not cutting it any more? Quite possibly, but I am not the only one.
As much as I love Google, it appears to be slipping compared to other engines. However, I will not give up on it, because, unlike most search engines, I actually trust Sergey Brin and Lawrence Page to act in the best interests of the users. Maybe I am being naive in thinking Google will retain its morals in light of going public, but hey, I'm an eternal optimist.
There goes any chance of ever attaining my dream job 😉