HTML scraping in python
Paul Lussier
p.lussier at comcast.net
Thu Jun 11 12:02:20 EDT 2009
"Shawn O'Shea" <shawn at eth0.net> writes:
> There is. The BeautifulSoup docs/examples page has been invaluable to me
Hmm, I didn't find that page quite as helpful as you seem to have.
Perhaps I should spend more time with it...
> in the past for learning BS. Anyway, here's an example that should help.
>
> $ python
> Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from BeautifulSoup import BeautifulSoup as BS
> >>> html = "<td>data</td>"
> >>> soup = BS(html)
> >>> soup
> <td>data</td>
> >>> soup.td
> <td>data</td>
> >>> soup.td.contents
> [u'data']
Hmm, that's an interesting approach. I ended up with:

    from BeautifulSoup import BeautifulSoup

    doc = open('foo.html')
    htmlData = BeautifulSoup(doc.read())
    tables = htmlData.findAll(name='table')
    rows = tables[1].findAll(name='tr')

(I happen to need data only from the second table on the page.)
'rows' ends up as a list of every row in the table.
From there, I can loop over them like this:

    for row in rows:
        currentRow = row.findAll(name='td')

Then, for each element in currentRow, it appears I can do this:

    for element in currentRow:
        data = element.string

From there it's a simple matter to insert that data into the database
where I want it. Though, it strikes me that there ought to be a less
manual way of doing this. Perhaps that assumes perfectly structured
html, but it seems that extracting tables is a common thing to do, and
that there ought to be something equivalent to Perl's
HTML::TableExtract module.
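
For what it's worth, a rough stand-in for HTML::TableExtract can be
sketched with nothing but the standard library's html.parser module
(Python 3 here, not the BeautifulSoup 3 / Python 2.5 setup above). The
TableExtractor class and extract_tables helper are hypothetical names
made up for this illustration, and the sketch assumes reasonably
well-formed markup: nested tables and unclosed <td> tags are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>, in document order
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside a cell is kept; markup inside the cell is dropped.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in the given HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


sample = """
<table><tr><td>ignored</td></tr></table>
<table>
  <tr><th>name</th><th>qty</th></tr>
  <tr><td>apples</td><td><b>3</b></td></tr>
</table>
"""
tables = extract_tables(sample)
for row in tables[1]:  # second table on the page, as in the code above
    print(row)
```

Each table comes back as a list of rows, and each row as a list of cell
strings, so the second table on a page is simply tables[1]. That shape
should also be convenient to hand to a DB-API executemany() call when
loading the rows into a database.
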
Of course, looking back at some of the code I wrote a while back using
HTML::TableExtract makes me cringe, so maybe this is a better way, or,
maybe, html sucks so badly that there just isn't a good general way of
doing this.
--
Seeya,
Paul