Pathetic Python Blogging

Dear Lazyweb – can anyone work out why I can’t get useful data out of this page with BeautifulSoup and Python 2.5?

The information is in an HTML table, enclosed by td tags nested in tr tags, and governed by three CSS classes, “flight-data”, “data-head” and “data-row2”. The latter pair are used only within the first. So you would think something like this would work:

for item in soup.findAll('td', {'class': 'flight-data'}):

(The ellipsis is there to make the indentation obvious in this post.) Here soup is, naturally, an instance of BeautifulSoup that’s been fed the webpage as a file-like object. But it doesn’t work: it does grab some of the data, but it also grabs much of the webpage as raw HTML, including the header and a gaggle of JavaScript. And it’s slow, dammit. I can’t be too far off beam, because I’m successfully parsing another very similar website using a near-identical parse command.

I’ve tried various interlocking restrictions, and searching for both data-head and data-row2, but these usually find nothing.

  1. arvind1

    [[td.string for td in tr.findAll('td') if td.string] for tr in soup.findAll('tr', {'class': 'data-row2'})]

  2. arvind1

    oh yes, as for speed, first:

    soup = soup.find('table', {'id': 'dgArrivals', 'class': 'flight-data'})

    (only ‘id’ is enough, though)

    if you need more speed, you’d want to use lxml.

  3. Alex

    Hey, I tried that; it eventually sporked the Python interpreter, but not before producing reams of unparsed HTML.

  4. Alex

    OK; slight change; try again – works!!
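
For anyone landing here later, a minimal sketch of the idea in arvind1’s comments: narrow down to the one table first, then pull the cell text out of the matching rows. To keep it runnable without installing anything, it uses the standard library’s html.parser on Python 3 instead of BeautifulSoup on 2.5, so treat it as an illustration of the approach rather than the code that actually fixed things. The sample markup and the names RowCollector and flight_rows are made up for the example; only dgArrivals, flight-data, data-head and data-row2 come from the thread. It also assumes the target table contains no nested tables.

```python
from html.parser import HTMLParser

# Hypothetical sample page: the cell values and the "nav" table are invented.
SAMPLE_HTML = """
<html><head><script>var x = 1;</script></head>
<body>
<table id="nav"><tr class="data-row2"><td>not flight data</td></tr></table>
<table id="dgArrivals" class="flight-data">
  <tr class="data-head"><td>Flight</td><td>From</td></tr>
  <tr class="data-row2"><td>XX123</td><td>Somewhere</td></tr>
  <tr class="data-row2"><td>YY456</td><td></td></tr>
</table>
</body></html>
"""


class RowCollector(HTMLParser):
    """Collect non-empty <td> text from rows of one class inside one table."""

    def __init__(self, table_id, row_class):
        super().__init__()
        self.table_id = table_id
        self.row_class = row_class
        self.in_table = False  # True only while inside the target table
        self.in_row = False    # True only inside a row with the wanted class
        self.in_cell = False
        self.rows = []
        self._cells = []
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            classes = (attrs.get("class") or "").split()
            self.in_row = self.row_class in classes
            self._cells = []
        elif self.in_row and tag == "td":
            self.in_cell = True
            self._text = []

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td" and self.in_cell:
            self.in_cell = False
            text = "".join(self._text).strip()
            if text:  # skip empty cells, like the "if td.string" filter
                self._cells.append(text)
        elif tag == "tr" and self.in_row:
            self.in_row = False
            self.rows.append(self._cells)

    def handle_data(self, data):
        if self.in_cell:
            self._text.append(data)


def flight_rows(html, table_id="dgArrivals", row_class="data-row2"):
    """Return a list of rows, each a list of cell strings."""
    parser = RowCollector(table_id, row_class)
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in flight_rows(SAMPLE_HTML):
        print(row)
```

On the sample page this gives two rows, the second with a single cell because empty cells are dropped. The row in the other table is ignored even though it carries the same class, which is the whole point of scoping to the table id before looking at rows.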
