HTML scraping in Python

Thu Jun 11 08:16:51 EDT 2009

On Thu, 2009-06-11 at 07:21 -0400, Paul Lussier wrote:
> Hi Folks,
> 
> I would like to extract a table from an HTML document and break it
> down to a dict for further processing.  

I assume you want a dict for each row.

> I've googled around a bit and found about 4 different modules that do
> html processing, but nothing on dealing explicitly with tables
> (something like Perl's HTML::TableExtract module).
> 
I have not seen a table-extraction module for Python.  BeautifulSoup is
a third-party module that is usually effective in dealing with any HTML.
Hopefully the table is reasonably simple, with no colspan/rowspan
attributes and no odd data mixed in.
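
For a simple table like that, pulling out the rows could look roughly
like the sketch below.  It swaps in the standard library's html.parser
for BeautifulSoup so that it runs without installing anything, and it
assumes a current Python 3, a single un-nested table, and no
colspan/rowspan.  The names TableRows, extract_rows and SAMPLE, and the
sample data itself, are made up for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every td/th cell, grouped by tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table contents as a list of rows of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample input, not taken from any real page.
SAMPLE = """
<table>
  <tr><th>Host Name</th><th>IP Address</th></tr>
  <tr><td>alpha</td><td>10.0.0.1</td></tr>
  <tr><td>beta</td><td>10.0.0.2</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
```

Real-world pages with unclosed tags or nested tables are where
BeautifulSoup would likely earn its keep over this approach.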

Are the column headers in th tags?  Can you use the headers to create
field names?  (e.g. fieldname = '_'.join(head.lower().split()))
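
As a rough sketch of that idea, assuming the table has already been
read into a list of rows with the header cells first, each remaining
row can be turned into a dict keyed by the normalized header.  The
helper names make_fieldname and rows_to_dicts and the sample rows are
hypothetical.

```python
def make_fieldname(header):
    """Turn a header such as 'IP Address' into 'ip_address'."""
    return "_".join(header.lower().split())


def rows_to_dicts(rows):
    """Treat the first row as headers and return one dict per later row."""
    fieldnames = [make_fieldname(cell) for cell in rows[0]]
    return [dict(zip(fieldnames, row)) for row in rows[1:]]


# Hypothetical sample data: header row first, then the data rows.
rows = [
    ["Host Name", "IP Address"],
    ["alpha", "10.0.0.1"],
    ["beta", "10.0.0.2"],
]
records = rows_to_dicts(rows)
```

Note that zip() stops at the shorter sequence, so a row with fewer
cells than there are headers simply yields a dict with fewer keys.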

I've got to run (a funeral), but am happy to help.  I'll check my email
when I get back.

> Can someone more knowledgeable please point me in the right direction?
> 
> --
> Thanks,
> Paul
> 
> P.S. I'm also looking for a job if anyone knows of anything, or needs
>      a sysadmin with great perl skills and growing python experience :)

Good Luck!

-- 
Lloyd Kvam
Venix Corp
DLSLUG/GNHLUG library
http://dlslug.org/library.html
http://www.librarything.com/catalog/dlslug
http://www.librarything.com/rsshtml/recent/dlslug
http://www.librarything.com/rss/recent/dlslug