Thursday 14 June 2007

BeautifulSoup: Parsing HTML in Python

Parsing HTML to extract the information you need can be very hard when the pages are complex, like the ones generated by the Debian Bug Tracking System. That is what I have to do in my GSoC project until the debbugs people finish the SOAP interface. I was doing it with regular expressions, heavily based on the reportbug-ng code, when my mentor (thanks, Loïc) mentioned BeautifulSoup, a Python module (with a strange name :P) for parsing HTML. If you ever need to parse HTML in Python, I strongly suggest you take a look at it. As usual with Python stuff, it's very well documented, and it has a very good set of features that let you easily find anything inside an HTML document. It also has an XML module, which I haven't tried (yet). BTW, did I already say I think GSoC is a great learning experience? Even I'm surprised by how quickly I've been able to apply GSoC-acquired knowledge in other activities, as I'm already using BeautifulSoup in another project.
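To give an idea of what "finding anything inside an HTML document" looks like, here is a minimal sketch. It assumes the current `bs4` package (the import path differs in older releases), and the markup, the class names and the `extract_bugs` helper are all made up for illustration. This is not the real BTS output, nor the code from my project.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely shaped like a bug listing (not real BTS output).
HTML = """
<html><body>
<h1>Bugs in package example</h1>
<ul>
  <li class="bug"><a href="bug/1001">#1001</a>
      <span class="severity">normal</span> crashes on start</li>
  <li class="bug"><a href="bug/1002">#1002</a>
      <span class="severity">serious</span> fails to build</li>
</ul>
</body></html>
"""


def extract_bugs(html):
    """Return one dict per <li class="bug"> element found in the page."""
    # "html.parser" is the parser from the standard library, so nothing
    # besides bs4 itself has to be installed.
    soup = BeautifulSoup(html, "html.parser")
    bugs = []
    for item in soup.find_all("li", class_="bug"):
        link = item.find("a")
        severity = item.find("span", class_="severity")
        bugs.append({
            "number": link.get_text(strip=True).lstrip("#"),
            "url": link["href"],
            "severity": severity.get_text(strip=True),
        })
    return bugs


bugs = extract_bugs(HTML)
for bug in bugs:
    print(bug["number"], bug["severity"], bug["url"])
```

Compared with regular expressions, the nice part is that the lookups follow the structure of the document (tags and attributes), so small changes in whitespace or attribute order do not break them.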
