HTML scraping in Python

Thu Jun 11 11:00:36 EDT 2009

On Thu, Jun 11, 2009 at 10:20 AM, Paul Lussier <p.lussier at comcast.net> wrote:

> Paul Lussier <p.lussier at comcast.net> writes:
>
> > I stumbled upon BeautifulSoup and am now trying to get that and the
> > mechanize module installed.
>
> Okay, I've got that installed.  I've figured out enough BS to get me a
> single row of the table into a list comprised of elements like:
> '<td>data</td>'
>
> Now I just need to figure out how to strip the html off of the data.
> I could do it by writing a regexp, I suppose, but I'm hoping there's a
> method which already does this.
>

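There is. The BeautifulSoup docs/examples page has been invaluable to me in
the past for learning BS. A tag's .contents attribute is the list of its
children, so for a cell that holds only text it is a one-element list
containing that text. Here's an example that should help: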

$ python
Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup as BS
>>> html = "<td>data</td>"
>>> soup = BS(html)
>>> soup
<td>data</td>
>>> soup.td
<td>data</td>
>>> soup.td.contents
[u'data']
>>>
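
Applied to the whole row from your question, the same idea might look like
the sketch below. Note the assumptions: it uses the newer bs4 package
(BeautifulSoup 4) under Python 3 rather than the BeautifulSoup 3 import in
the session above, and the row contents and the cell_text helper are made
up for illustration.

```python
# Assumes the bs4 package (BeautifulSoup 4) under Python 3; the session
# above uses the older BeautifulSoup 3 import instead.
from bs4 import BeautifulSoup

# Hypothetical row, shaped like the list described in the question.
row = ["<td>data</td>", "<td> 42 </td>", "<td><b>bold</b> text</td>"]


def cell_text(cell_html):
    """Return the text of one '<td>...</td>' string with all tags removed."""
    soup = BeautifulSoup(cell_html, "html.parser")
    # get_text() joins every text node under the tag, nested ones included.
    return soup.get_text().strip()


# Strip the markup from every cell in the row.
values = [cell_text(cell) for cell in row]
print(values)
```

Unlike .contents, get_text() also flattens nested tags such as the <b> in
the last cell, which is usually what you want for table data.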

-Shawn