Thursday, 14 June 2007
BeautifulSoup: Parsing HTML in Python
Parsing HTML to get the information you need can be a very hard task when you're dealing with complex pages like the ones generated by the Debian Bug Tracking System, which is what I need to do in my GSoC project until the debbugs people finish the SOAP interface. I was doing it with regular expressions, heavily based on the reportbug-ng code, when my mentor (thanks, Loïc) mentioned BeautifulSoup, a Python module (with a strange name :P) for parsing HTML.
If you ever need to parse HTML in Python, I strongly suggest you take a look at it. As usual with Python stuff, it's very well documented, and it has a very good set of features that let you easily find anything inside an HTML document. It also has an XML module, which I haven't tried (yet).
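For a rough idea of what that looks like in practice, here is a minimal sketch. It assumes the current `beautifulsoup4` package (imported as `bs4`) with the standard library's `html.parser` backend. The sample markup and the `extract_bugs` helper are made up for illustration; they are not the real BTS page layout or anything from the btsutils code.

```python
from bs4 import BeautifulSoup  # assumption: the modern beautifulsoup4 package

# Hypothetical, heavily simplified bug-list markup; real debbugs pages are messier.
SAMPLE_HTML = """
<html><body>
  <h1>Bugs in package example</h1>
  <ul>
    <li><a class="bug" href="bugreport.cgi?bug=1001">#1001</a> crashes on start</li>
    <li><a class="bug" href="bugreport.cgi?bug=1002">#1002</a> typo in manpage</li>
    <li><a href="/about">About</a></li>
  </ul>
</body></html>
"""


def extract_bugs(html):
    """Return (bug_number, href) pairs for every <a class="bug"> element."""
    soup = BeautifulSoup(html, "html.parser")
    bugs = []
    # find_all filters by tag name and CSS class, so the "About" link is skipped.
    for link in soup.find_all("a", class_="bug"):
        number = int(link.get_text().lstrip("#"))
        bugs.append((number, link["href"]))
    return bugs


if __name__ == "__main__":
    for number, href in extract_bugs(SAMPLE_HTML):
        print(number, href)
```

Compared with a regular expression, the selection logic here is a single `find_all` call, and it keeps working if attributes are reordered or whitespace changes.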
BTW, did I already say I think GSoC is a great learning experience? Even I'm surprised by how quickly I've been able to apply GSoC-acquired knowledge to other activities: I'm already using BeautifulSoup in another project.
São Paulo's Metro Strike
I live in São Paulo, the most important (IMO) city in Brazil. It's also the fifth most populous metropolitan region in the world.
One of the main public transportation systems in São Paulo is its metro, which was once regarded as a high-quality transportation system. Lately, however, it has been sinking. Fast. Very fast. It just can't keep up with demand; the trains are getting more and more crowded, and the schedules are getting more irregular as time passes.
As if this weren't enough, the metro workers' union seems to be made up of a bunch of selfish clowns. So, today, 3.3 million people are without transport, because these clowns want a 13% pay raise.
Now, where are the laws which state that this kind of public service can't be paralyzed? The government should just send these clowns back to the circus they fled from. This brings us to another topic: the pathetic laws that regulate public/government workers. They can just work (or not work) however they want, and can't be fired. The ultimate job security here is to get a government job.
Finally, the solution for the (metro) problems: just privatize the damn thing already.
Well, rant done, so let's get on with our daily schedules (or whatever is possible of them without the metro, for the paulistans).
Sunday, 10 June 2007
btsutils 0.1.1
I've recently released the first version of btsutils, a Python module to interact with debbugs servers (such as the Debian Bug Tracking System). btsutils is part of my Google Summer of Code project, the bug triage and forward tool.
Currently, btsutils can query the BTS by bug number, source package, package, maintainer or submitter.
A Debian package of btsutils 0.1.1 is already waiting to be processed in the NEW queue.
Saturday, 2 June 2007
Python Soul
I find it very interesting how different programming languages have different styles. My Google Summer of Code project, the Bug Triage and Forward Tool, is my first Python software, and while working on it over the last few days I've gotten the feeling that the way I've been structuring the code doesn't fit very well with the way Python packages and namespaces work. I already wanted to separate the project into three independent codebases, so I'll go ahead and do that. These codebases will be:
- python-btsutils: python module to interact with the Debian BTS / Debbugs servers
- python-bugzilla: python module to interact with Bugzilla
- bug-triage: The tool itself.