I have started the process of building the lexicon for my search engine. It’s actually surprising how slowly the list of words grows. This is partly because I am being quite strict in my definition of what constitutes a word. A normal search engine would need to be able to work with all sorts of arbitrary strings (I am not even considering encodings yet), but due to hardware constraints I have limited myself to Perl’s
m/\w/
If a string doesn’t match this, it won’t go in the lexicon. I know this is a bit harsh, but unfortunately I don’t have several hundred machines in a cluster to play with like the other search engines ;). I think if I get over one million terms in the lexicon I will be doing OK.
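
For illustration, here is a rough sketch of how a filter like this could look. It is not my actual indexing code: the helper names `is_lexicon_term` and `add_to_lexicon` are made up for the example, and I am assuming that tokens are split on whitespace, lowercased, and must consist entirely of word characters. That last assumption is stricter than a bare `m/\w/`, which matches any string containing at least one word character.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole token is made of \w characters.
# Assumption: anchoring with \A and \z, so "isn't" or "fast," are rejected,
# whereas a bare m/\w/ would accept them.
sub is_lexicon_term {
    my ($token) = @_;
    return $token =~ m/\A\w+\z/ ? 1 : 0;
}

# Hypothetical helper: count qualifying tokens from a piece of text.
# Assumption: whitespace tokenisation and lowercasing, purely for the example.
sub add_to_lexicon {
    my ($lexicon, $text) = @_;
    for my $token (split ' ', $text) {
        next unless is_lexicon_term($token);
        $lexicon->{ lc $token }++;
    }
    return $lexicon;
}

my %lexicon;
add_to_lexicon(\%lexicon, "Perl's regex engine is fast, isn't it");
print join(", ", sort keys %lexicon), "\n";
```

With that input, "Perl's", "fast," and "isn't" are all dropped because of the punctuation, which gives a feel for why the lexicon grows so slowly under a strict rule.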