I was writing a custom-built parser for the search engine when Mark Fowler asked me why I was not using Yacc and friends. At the time, every reference I had found on parsing HTML said that Yacc was not the correct tool: it is workable for strict HTML, but it does not fit well in the real world, where most HTML is rubbish.
I decided to see what flex could offer instead, and I can now safely say that flex is exactly the tool I was looking for, “thank you, Mark”. Admittedly I am asking flex to do a lot more than just spit out tokens, but what it does suits my needs. It gives me the ability to run arbitrary code dependent upon state while parsing a document, which is exactly what I need.
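To make "code dependent upon state" concrete, here is a minimal hand-written sketch in plain C of the kind of state-driven scanning that a flex scanner with start conditions would generate for you. It is not my actual parser and not flex output: the names scan_state, scan_counts and scan_html are hypothetical, and it assumes the only thing we care about is counting tags and the bytes of text outside them.

```c
#include <stddef.h>

/* Hypothetical scanner states, playing the role of flex start conditions. */
enum scan_state { IN_TEXT, IN_TAG };

/* Hypothetical result record: what the per-state actions collect. */
struct scan_counts {
    size_t tags;       /* number of '<' tag openings seen */
    size_t text_bytes; /* bytes seen outside of tags */
};

/* Walk the buffer once. The action taken for each byte depends on the
 * current state, so sloppy input (for example an unclosed tag at the end
 * of the file) is simply scanned rather than rejected. */
struct scan_counts scan_html(const char *buf, size_t len)
{
    struct scan_counts c = {0, 0};
    enum scan_state state = IN_TEXT;
    size_t i;

    for (i = 0; i < len; i++) {
        char ch = buf[i];
        switch (state) {
        case IN_TEXT:
            if (ch == '<') {
                state = IN_TAG;   /* enter the "tag" state */
                c.tags++;
            } else {
                c.text_bytes++;   /* ordinary text byte */
            }
            break;
        case IN_TAG:
            if (ch == '>')
                state = IN_TEXT;  /* leave the "tag" state */
            break;
        }
    }
    return c;
}
```

In a flex file the two switch cases would become rules guarded by a start condition, and the bodies of the cases would be the rule actions. The sketch deliberately ignores comments, quoted attribute values and scripts, all of which a real HTML scanner has to deal with.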
I wrote a simple parser using flex++ and then decided to see how its performance would compare against the Perl module HTML::Parser. Hands down, HTML::Parser is faster, and not by a small margin. I have not looked into why this is. I am now going to write one in plain C and avoid C++ to see how much difference we can get.
Well! It took me about 5 minutes to change the flex++ parser back to a C parser, and the performance improvement is quite drastic. I imagine this is down to the way flex++ generates the lex.yy.cc file, since C++ itself is only marginally slower than C and some would argue it is faster.
I did a bit more playing around with the C version and I think I might be able to make it faster than the HTML::Parser version (of course it won’t be half as functional).
Testing was done as follows:
file big.html == 10 MB of reasonably formatted HTML
Perl HTML::Parser (with XS) == 0.5s
flex == 0.7s
flex++ == 8s
I am no flex guru, so I imagine that it could be a lot quicker once I get to grips with it.
Dear Harry,
Can you share the code of your flex-based HTML parser?