HTML scraping in Python

Thu Jun 11 08:59:01 EDT 2009

Lloyd Kvam <python at venix.com> writes:

> I assume you want a dict for each row.

Yes, with the column headers as the keys.

> I have not seen a table extract module.  BeautifulSoup is a third party
> module that is usually effective in dealing with any HTML.  Hopefully
> the table is reasonably simple with no colspan/rowspan attributes and
> funny data mixed in.

I stumbled upon BeautifulSoup and am now trying to get that and the
mechanize module installed.  However, mechanize seems dependent upon
ClientForm, and I can't figure out how to get the ClientForm*.egg
installed.  I placed it in sys.path, but it's not getting picked up.  I
tried to manually test that it would work using pkg_resources and
require(), but got this:

  $ python
  Python 2.6.2 (r262:71600, Apr 16 2009, 09:17:39) 
  [GCC 4.0.1 (Apple Computer, Inc. build 5250)] on darwin
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import pkg_resources
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  ImportError: No module named pkg_resources

When I look at sys.path, it seems as if it knows about mechanize, but
not ClientForm, despite my having copied the ClientForm egg into
site-packages:

  $ ls -1 /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/
  ClientForm-0.2.10-py2.6.egg
  README
  easy-install.pth
  mechanize-0.1.12.dev_r62424-py2.6.egg
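
For anyone wanting to check what the interpreter actually sees, here is
a rough sketch.  The helper names egg_entries and can_import are made up
for illustration.  It assumes the usual behaviour that an egg is only
importable once its own path appears in sys.path (typically via a line
in a .pth file), not merely because the file sits in site-packages.

```python
import sys


def egg_entries(path=None):
    """Return the sys.path entries that point at .egg files or directories."""
    if path is None:
        path = sys.path
    return [p for p in path if p.rstrip("/\\").endswith(".egg")]


def can_import(name):
    """Return True if the named top-level module can be imported."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Show which eggs the interpreter has on its path, then probe the
    # modules discussed in this thread.
    for entry in egg_entries():
        print(entry)
    for module in ("pkg_resources", "mechanize", "ClientForm"):
        print("%s importable: %s" % (module, can_import(module)))
```

If an egg is missing from the first list, that would be consistent with
it never having been added to the path in the first place.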

> Are the column headers in th tags?  Can you use the headers to create
> field names?  (e.g. fieldname = '_'.join(head.lower().split()))

I think so.  I'll try that as soon as I can get mechanize,
BeautifulSoup, and ClientForm installed and working correctly :)
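
In the meantime, here is a minimal sketch of the idea that needs no
third-party modules.  It is not BeautifulSoup: it swaps in the standard
library's HTMLParser (written for Python 3's html.parser, with a
fallback import for the Python 2 module name).  The names TableRows,
field_name and table_to_dicts are hypothetical, and it assumes a single
simple table whose first all-th row holds the headers, with no
colspan/rowspan and no nested tables.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name


class TableRows(HTMLParser):
    """Collect the text of every th/td cell, grouped by tr."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []     # one list per tr, holding (tag, text) pairs
        self._cell = None  # text chunks of the open cell, or None
        self._tag = None   # "th" or "td" for the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = []
            self._tag = tag

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            text = " ".join("".join(self._cell).split())
            self.rows[-1].append((self._tag, text))
            self._cell = None


def field_name(header):
    """Turn a header such as 'First Name' into 'first_name'."""
    return "_".join(header.lower().split())


def table_to_dicts(html):
    """Return one dict per data row, keyed by the header field names."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    keys = None
    records = []
    for row in parser.rows:
        if not row:
            continue
        if keys is None and all(tag == "th" for tag, _ in row):
            keys = [field_name(text) for _, text in row]
        elif keys is not None:
            records.append(dict(zip(keys, [text for _, text in row])))
    return records
```

Feeding it a table with headers "First Name" and "Age" should give rows
like {'first_name': ..., 'age': ...}.  Rows that appear before any
header row are skipped, and zip() silently drops cells beyond the number
of headers.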

> I've got to run (a funeral), but am happy to help.  I'll check my email
> when I get back.

Thanks!

-- 
Seeya,
Paul