Nutch and Lucene

We have wanted a search engine at work for some time now, so I started looking at Lucene. I downloaded it, got it running and had it doing some basic stuff, but what we really wanted was something web based, i.e. an out-of-the-box solution.
I suggested we try Nutch, so I spent today getting it running. Nutch itself is a piece of cake to get working; what wasn’t so easy was getting Tomcat 4 working with Nutch.
After much swearing and perspiration I finally managed to get it working, and it is as sweet as a nut. We indexed just over 200 Word documents in a few minutes (the test machine is an old Celeron) and gave it a whirl. It is a straight out-of-the-box solution to your search engine problems, and I was very impressed. I may have more to report on this next week because we might be putting it on one of the larger servers for a trial run.

2 Replies to “Nutch and Lucene”

  1. I want to know if this is a potential solution for a website search tool for ecommerce purposes. Search engines can have problems with database content. I intend to translate the URLs to avoid ‘?’ and ‘&’ etc. Any comments would be gratefully received.
    thanks
    Paul

  2. Yes, Nutch is powerful enough to use as an ecommerce search engine. I am not sure why you need to translate the query part of the URI, because Nutch handles query strings just fine.
    I am also not sure what you mean about search engines having problems with database content. If it is text, then a search engine can index it as long as we give it to the engine in a sensible form, i.e. HTML. My own site has over 23,000 jobs on it, all stored in a database, and Google, MSN, Yahoo and various others index them regularly.
    If you have data in a database, the way I have done it in the past is to use Apache’s mod_rewrite module to mangle the URL so that the user (Nutch) gets what appears to be a static page when it is actually a page generated from the database. It is not normally required though.
    Hope this was some help.
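
To make the URL-mangling idea in the reply above a little more concrete, here is a minimal Python sketch of the mapping such a rewrite performs. It is not a mod_rewrite configuration, only an illustration of the logic. The `/jobs/` path, the `jobs.php` script and the `id` parameter are all made up for the example and do not come from the original discussion.

```python
import re

# Hypothetical rule: the crawler requests a static-looking path such as
# /jobs/123.html, and the server internally serves /jobs.php?id=123.
REWRITE_RULES = [
    (re.compile(r"^/jobs/(\d+)\.html$"), r"/jobs.php?id=\1"),
]


def rewrite(path: str) -> str:
    """Return the internal dynamic URL for a static-looking path.

    Paths that match no rule are returned unchanged.
    """
    for pattern, target in REWRITE_RULES:
        if pattern.match(path):
            # \1 in the target is replaced by the captured job number.
            return pattern.sub(target, path)
    return path


if __name__ == "__main__":
    print(rewrite("/jobs/123.html"))  # /jobs.php?id=123
    print(rewrite("/about.html"))     # /about.html (no rule matches)
```

The crawler only ever sees the left-hand form, so the pages look static even though each one is generated from the database.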
