Welcome to my blog. I created it mainly to track my progress in Google's Summer of Code 2007, during which I started implementing the Bug Triage and Forward Tool for the Debian Project. These days I use it to write about what I'm doing on free software projects, or whatever is on my mind when I have the time and spirit to write.
Thursday, 14 June 2007
BeautifulSoup: Parsing HTML in Python
Parsing HTML to get the information you need can be a very hard task when you're dealing with complex pages like the ones generated by the Debian Bug Tracking System, which is what I need to do in my GSoC project until the debbugs people finish the SOAP interface. I was doing it with regular expressions, heavily based on the reportbug-ng code, when my mentor (thanks, Loïc) mentioned BeautifulSoup, a Python module (with a strange name :P) for parsing HTML.
If you ever need to parse HTML in Python, I strongly suggest you take a look at it. As usual with Python stuff, it's very well documented, and it has a very good set of features that let you easily find anything inside an HTML document. It also has an XML module, which I haven't tried (yet).
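To give an idea of what "finding anything" looks like in practice, here is a minimal sketch. It assumes the current `bs4` package (`pip install beautifulsoup4`) rather than the BeautifulSoup 3 module that was around when this post was written. The HTML snippet, the class names and the `extract_bugs` helper are all made up for illustration; this is not what debbugs actually outputs.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely shaped like a bug listing. NOT real debbugs output.
HTML = """
<html><body>
  <h1>Bugs in package example</h1>
  <ul>
    <li class="bug"><a href="bug/1001">#1001</a> <span class="severity">normal</span> crashes on start</li>
    <li class="bug"><a href="bug/1002">#1002</a> <span class="severity">wishlist</span> add dark theme</li>
  </ul>
</body></html>
"""


def extract_bugs(html):
    """Return one dict per <li class="bug"> element (illustrative helper)."""
    # "html.parser" is the parser bundled with Python, so no extra dependency.
    soup = BeautifulSoup(html, "html.parser")
    bugs = []
    for item in soup.find_all("li", class_="bug"):
        link = item.find("a")
        severity = item.find("span", class_="severity")
        bugs.append({
            "number": link.get_text(strip=True),
            "url": link["href"],
            "severity": severity.get_text(strip=True),
        })
    return bugs


bugs = extract_bugs(HTML)
for bug in bugs:
    print(bug["number"], bug["severity"], bug["url"])
```

Compared with regular expressions, the lookups here follow the document's structure (tag names and attributes), so small changes in whitespace or attribute order in the page are less likely to break them.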
BTW, did I already say I think GSoC is a great learning experience? Even I'm surprised by how quickly I've been able to apply what I've learned in GSoC to other activities: I'm already using BeautifulSoup in another project.