HTML scraping in python

Thu Jun 11 12:02:20 EDT 2009

"Shawn O'Shea" <shawn at eth0.net> writes:

> There is. The BeautifulSoup docs/examples page has been invaluable to me

Hmm, I didn't find that page quite as helpful as you seem to have.
Perhaps I need to spend more time with it...

> in the past for learning BS. Anyway, here's an example that should help.
>
> $ python
> Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from BeautifulSoup import BeautifulSoup as BS
> >>> html = "<td>data</td>"
> >>> soup = BS(html)
> >>> soup
> <td>data</td>
> >>> soup.td
> <td>data</td>
> >>> soup.td.contents
> [u'data']

Hmm, that's an interesting approach.  I ended up with:

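  from BeautifulSoup import BeautifulSoup

  # Parse the whole file, then collect the rows of the second table.
  doc      = open('foo.html')
  htmlData = BeautifulSoup(doc.read())
  doc.close()
  tables   = htmlData.findAll(name='table')
  rows     = tables[1].findAll(name='tr')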

(I happen to just need data from the second table on the page).

'rows' ends up as a list of every row in the table.
From there, I can loop over them like this:

 for row in rows:
    currentRow = row.findAll(name='td')

Then, for each element in the currentRow, it appears I can do this:

 for element in currentRow:
    data = element.string
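
Put together, the two loops above might look roughly like the sketch
below. It assumes the newer bs4 package (where findAll is spelled
find_all) rather than the BeautifulSoup 3 module used in this thread,
and the table_rows name, its table_index parameter, and the sample
markup are made up for illustration. It uses get_text() instead of
.string, because .string is None for a cell that mixes text and tags.

```python
from bs4 import BeautifulSoup  # assumption: bs4, not the BeautifulSoup 3 module


def table_rows(html, table_index=1):
    """Return the <td> cells of one table as a list of rows (lists of strings)."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    rows = []
    for row in tables[table_index].find_all("tr"):
        # get_text() joins all text inside the cell, so it still works for
        # cells with mixed content such as "3 <b>kg</b>"
        cells = [cell.get_text(" ", strip=True) for cell in row.find_all("td")]
        if cells:  # skip rows without any <td>, e.g. header rows using <th>
            rows.append(cells)
    return rows


# Hypothetical sample page; the second table is the one of interest.
sample = (
    "<table><tr><td>skip</td></tr></table>"
    "<table><tr><th>name</th><th>qty</th></tr>"
    "<tr><td>apple</td><td>3 <b>kg</b></td></tr>"
    "<tr><td>pear</td><td>5</td></tr></table>"
)

print(table_rows(sample))
```

Each inner list is one row, ready to be handed to whatever does the
database insert.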

From there it's a simple matter to insert that data into the database
where I want it.  Though, it strikes me that there ought to be a less
manual way of doing this.  Perhaps that assumes perfectly structured
html, but it seems that extracting tables is a common thing to do, and
that there ought to be something equivalent to Perl's
HTML::TableExtract module.
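
For what it's worth, a less manual helper is not much code. The sketch
below is not a port of HTML::TableExtract and does not use
BeautifulSoup; it swaps in the standard library's html.parser (Python 3)
to collect every table on a page as a list of rows. The TableExtractor
and extract_tables names are hypothetical, and it assumes tables are
not nested.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def _end_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._end_row()
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._end_row()        # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()       # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()


def extract_tables(html):
    """Return all tables found in html, in document order."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hypothetical sample; the second table leaves out its closing </td> tags.
sample = (
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "<table><tr><th>name</th><th>qty</th></tr>"
    "<tr><td>apple<td>3 <b>kg</b></tr></table>"
)

second_table = extract_tables(sample)[1]
print(second_table)
```

Picking the second table is then just an index into the result, much
like tables[1] above.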

Of course, looking back at some of the code I wrote a while back using
HTML::TableExtract makes me cringe, so maybe this is a better way, or,
maybe, html sucks so badly that there just isn't a good general way of
doing this.

-- 
Seeya,
Paul