Ever since I started using Debian I have meant to try creating a Debian package, for a couple of reasons:
a) I am just a curious bugger.
b) I am just a curious bugger.
Today was my chance to have a go and see exactly what it involves, or should I say what it involves to create a very minimal package.
The reason for this is that I have written a Content Management System that uses about 20 Perl modules, some internal and some external, and rather than worry when it comes to an install or an upgrade we have decided to stick the whole thing in Subversion and then wrap each release in a Debian package. We have several sites to run this from, so the more we can automate the better, particularly if I can get the automatic testing sorted. Using Subversion and the Debian packages the whole system should be relatively low maintenance as far as upgrades are concerned, and this is important. We don’t want to paint ourselves into the upgrade corner and find we have neither the time nor the budget to spend on an upgrade. We want it automated for us, and although it might be a pain in the arse to put in place it will pay dividends when we come to change things later.
We also have a Postgres schema and some config files that need taking care of, but as I found out today this is relatively simple using the Debian packaging tools.
I suppose I should write a simple howto about how I did it, because the Debian New Maintainers’ Guide is not really the best tutorial for those wishing to package their own application for internal use. I imagine there are some other tutorials around but I didn’t find them.
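In the meantime, here is a rough sketch of the sort of minimal layout such a package needs. The package name “mycms”, the maintainer details and the paths are made up for illustration, and a postinst script is one obvious place the Postgres schema and config file handling could go.

mycms-1.0/
  debian/changelog   (dh_make generates a template, update it per release)
  debian/control
  debian/rules       (a short makefile; again dh_make gives you a template)
  debian/postinst    (load the schema, drop the config files into place)
  lib/               (the Perl modules, checked out of Subversion)

A minimal debian/control might look something like this:

Source: mycms
Section: web
Priority: optional
Maintainer: Your Name <you@example.com>
Build-Depends: debhelper
Standards-Version: 3.6.1

Package: mycms
Architecture: all
Depends: perl, postgresql
Description: internal content management system
 The CMS and its Perl modules, wrapped up so that installs and upgrades
 are just a dpkg -i away.

With that in place, dpkg-buildpackage builds the .deb and dpkg -i installs it.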
Spiders
I have not been able to do any work on the search engine for quite a while due to commitments with maths etc, but I now have some free time so I have restarted spidering.
What I am aiming for is about 100 million pages as a base to start working on. I am probably going to implement the whole thing using Postgres because I do not have the time to write the software required to handle the storage; the pages themselves are stored as flat files on disk, and it is the metadata that I will be storing in Postgres. I will let Postgres do all the nitty gritty work so that I can concentrate on the ranking and search algorithms.
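To give a feel for the split, here is a rough sketch in Python of storing one fetched page. The psycopg2 driver, the “spider” database, the “pages” table, the spool path and the column names are all assumptions made up for illustration, not the actual spider code.

import hashlib
import os
import psycopg2

STORE = "/var/spool/spider"   # made-up location for the flat-file store

def store_page(conn, url, body):
    # The raw document goes to disk untouched...
    digest = hashlib.sha1(body).hexdigest()
    path = os.path.join(STORE, digest[:2], digest)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    # ...and only the metadata goes into Postgres.
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO pages (url, path, sha1, length, fetched)"
        " VALUES (%s, %s, %s, %s, now())",
        (url, path, digest, len(body)))
    conn.commit()

conn = psycopg2.connect("dbname=spider")   # assumes a local "spider" database
store_page(conn, "http://www.example.com/", b"<html>example page</html>")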
I am also looking at just using plain text, i.e. stripping out the HTML completely and relying on the text rather than the formatting of the document to rank each document. The reasons for this are:
a) It is much much simpler. I started writing an HTML parser in flex and believe me it’s a pain in the ass.
b) Plain text is also where the information is, and it is this that I am interested in. The formatting is not something I want to have to deal with. I intend to store each document raw to disk in case I change my mind later though 😉
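Just to show how little code the plain-text route needs compared with a hand-rolled flex grammar, here is a quick sketch using Python’s standard html.parser module. It is only an illustration of the idea, not what the spider will actually use.

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Keep only the character data; tags and attributes are thrown away.
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())   # collapse whitespace

print(html_to_text("<p>Hello <b>world</b></p>"))      # prints: Hello world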
Robots.txt file
There appears to be some misunderstanding surrounding the usage of the robots.txt file.
The following is just a fraction of the stuff I have found while spidering websites.
Noarchive: /
The “noarchive” statement should be part of a robots meta tag; it should not be in the robots.txt file, and it is not part of the standard.
I believe that the following, or something similar, should be in the standard, but “Crawl-delay” isn’t part of it yet:
Crawl-Delay: 10
Crawl-delay: 1
Crawl-delay: 60
It is implemented by a few crawlers, but people insist on doing the following:
User-agent: *
Crawl-delay: 1
The proper way is as follows:
User-agent: Slurp
Crawl-delay: 1
I know Yahoo’s crawler (Slurp) adheres to the Crawl-delay directive, but here we are endorsing a non-standard method; whether this is a good or bad thing is left up to the reader to decide. Having been hammered once by MSN’s bot, I think there needs to be some kind of delay option in the robots.txt file.
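For what it is worth, respecting the directive takes almost no code on the crawler side. Here is a rough sketch using Python’s standard urllib.robotparser; the user-agent name “mybot” and the URLs are made up for illustration.

import time
import urllib.robotparser

AGENT = "mybot"   # made-up user-agent name

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

if rp.can_fetch(AGENT, "http://www.example.com/some/page.html"):
    # crawl_delay() returns None when no Crawl-delay applies to this agent
    time.sleep(rp.crawl_delay(AGENT) or 1)
    # ... go ahead and fetch the page ...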
Then we have the people who think that they need to authorize a spider to spider their website.
Allow: /pages/
Allow: /2003/
Allow: /services/xml/
Allow: /Research/Index
Allow: /ER/Research
Allow: /
The reason for not having an Allow directive is simple: only a fraction of the websites online actually use a robots.txt file, so if spiders had to be explicitly allowed in, hardly any of the internet would be indexed. Implementing an Allow directive would mean that websites are closed for business to the spiders by default. For instance, take the following directive:
Allow: /index_me/
Is the spider then to assume that only that directory is available on the entire website? What can the spider assume about the above directive? To me it reads that only the “index_me” directory is to be indexed. What then is the point of the Disallow directive?
The Disallow directive was chosen because the internet is, for all intents and purposes, a public medium: we all opt in when we put our websites up, and then we opt out of the things we don’t want indexed.
My favorites, though, are the following honest mistakes:
Diasallow: /editor/
Diasallow: /halcongris/
Disallow /katrina/
Diasallow: /reluz/
M203 Exam
Is over. Thank god I have that out of the way. I now have 3 months before it all starts again and my scheduling is back at the mercy of the Open University.
I am not too sure how I did in the exam. I missed a couple of small bits out and I think I really botched one of the last ten-point questions. I knew what I needed to do for it but the notation completely escaped me, bollix.
We will see just how I did in a few months’ time. Now I need to figure out what would be the most beneficial thing to do with all this free time.
From Here To Infinity
I just finished
From Here To Infinity
ISBN: 0192832026
Author: Ian Stewart
This was tough going. It is not really suitable for people without a decent amount of maths, and I have to say a lot of the topics went clean over my head even though I am meant to have a fair bit of maths beneath my belt, or at least more than most. I would be hard pushed to recommend it because I did not feel I got as much from the book as I had hoped. It would not put me off reading any of his other books; I just don’t think this one was my cup of tea.
Copyright by Blunderbuss or Creative Commons
I had heard of Prof. Lessig from general browsing on the internet, so I knew he had some clout with the online community, blogosphere, whatever you want to call it, but I had never really taken the time to find out what he does that seems to cause such a stir. He seems to have an almost religious following in some circles, so I thought I should go and see just exactly what all the fuss is about.
I had heard of the Creative Commons before, so off I went, umbrella in hand, to University College London’s Edward Lewis Theatre and grabbed myself a seat. I immediately recognized him because I had visited his website for a general nosey before I went to the talk.
This is just the way I heard it I am sure I have probably got some of the ideas and concepts wrong 😉
I loved the way he started his talk: he took us back to the days when George Eastman was setting about pioneering the camera and how a law passed then enabled the camera business to flourish the way it did. He then described a few things that we take for granted, i.e. cultural remix (the first time I had heard this phrase): the act of taking something like a song and putting your own spin on it, or having watched a movie, describing it to our friends and embellishing it the way we see it. This goes on every day, there are no copyrights on it, and there shouldn’t be.
He then moved this on to the digital age and casually pointed out that our cultural remix, which we take for granted every day, is now, in part, a digital phenomenon and no longer limited by distance. Kids today are growing up in this digital age and are making friends across the world without ever meeting up, so our once limited cultural remix has set new boundaries on a global scale. The way we think, eat, speak and go about our business is now wrapped up online in this huge boiling menagerie of digital stuff. People are expressing themselves in ways we would not have dreamed about a few years ago, i.e. we have a new age cultural remix going on, and this is a good thing. What is not good is that the middlemen, i.e. the lawyers, are trying to stifle this from happening. The lawyers and some corporations are doing this by making vast areas of our new remix illegal:
“Using DDT to kill a gnat”
(from memory, used by Prof Lessig in the talk, probably slightly misquoted)
was the way Prof Lessig described it, and this is wrong. It was quite clear that Prof Lessig believes in copyright and so do I, but it was also clear that he does not believe in applying it blindly. The normal blunderbuss approach to copyright seems to get his goat, and quite rightly so; it’s bloody stupid.
Anyway, the talk centered around the Creative Commons license: what it means to us, what we can use it for and why we need it.
At the moment everything written down is copyright to the author or creator of it, regardless of whether they have stuck the big C on it somewhere. This means that everything on the internet is expressly copyrighted unless stated otherwise. People who want to use something they find on the internet, say a DJ wanting a sample from a song, cannot do so unless they have permission from the owner of the copyright, so they have to get lawyers (middlemen) involved to sort out the legal stuff before they can carry on with their mixing. What the Creative Commons enables us to do is release a piece of work and mark it so that people know what they can and cannot do with it without having to get the lawyers involved, i.e. cutting out the middleman. I am all for this; it’s a wonderful idea.
Can I prove that it’s a wonderful idea? Yes I can. During the talk Prof Lessig played part of a soundtrack that had been released under the Creative Commons license, “My Life” by Colin Mulcher, which was then edited by Cora Beth, and the editing certainly added something to the track. It was brilliant. This is not an isolated incident either.
Anyway, I have just found some of the material from the talks online, so your time would be better spent watching these flash movies than reading this.
You might also be interested in
Learning More
Creative commons website
Find CC content
George Eastman
Folding@home Plico Update
We broke the 2000 barrier today, and we also seem to have gained a new member, MarkF.
We seem to have gained a little speed just recently, which is good because we are now getting close to the more active teams, which is where I can see our progress slowing down. What we really need are a few more processors to give us that extra boost, so if you would like to join a friendly bunch of folders then
look no further
Raising Arizona
I have just watched Raising Arizona for about the 3rd time and it’s still as funny as ever. All of the following are brilliant in it:
Nicolas Cage …. H.I. McDunnough
Holly Hunter …. Edwina ‘Ed’ McDonnough
Trey Wilson …. Nathan Arizona (Huffhines) Sr.
John Goodman …. Gale
William Forsythe …. Evelle
Sam McMurray …. Glen
Frances McDormand …. Dot
and of course,
Randall ‘Tex’ Cobb …. Leonard Smalls
More RSS Job feeds
We now have 19 RSS job feeds in the UKlug database and I want more. Ideally I would like to get as many feeds as possible into the database, but without screen scraping this might be difficult. If you know of a feed that I don’t already have then please email me the details.
Jobs via RSS
Some people really are missing the point when trying to use RSS to list jobs. I have noticed several sites posting the title of the job and a link but absolutely no description. I have seen others posting only a few words of the description.
This is absolutely useless because it contains hardly any useful information for the person finding it, and that is assuming they find it at all. Personally I won’t add a feed to UKlug unless there is a description with some helpful text.
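To make that concrete, the sort of check a feed has to pass is along these lines. This is a quick sketch using the third-party Python feedparser module, with a made-up feed URL and a made-up 20-word threshold; it is not the actual UKlug code.

import feedparser   # third-party module for parsing RSS/Atom feeds

def feed_worth_adding(url, min_words=20):
    # A job feed is only useful if every item carries a real description,
    # not just a title and a link.
    feed = feedparser.parse(url)
    if not feed.entries:
        return False
    for entry in feed.entries:
        summary = entry.get("summary", "")   # feedparser maps <description> here
        if len(summary.split()) < min_words:
            return False
    return True

print(feed_worth_adding("http://www.example.com/jobs.rss"))   # made-up URL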