Notes from PySIG, 25-Oct-2007, Kent Johnson, Beautiful Soup, the first PySIG sprint!
Ted Roche (Personal)
tedroche at comcast.net
Fri Oct 26 20:52:35 EDT 2007
Thirteen people elected to skip watching the second game of the World
Series (Go Sox!) to attend the October meeting of the Python Special
Interest Group (PySIG), held as usual at the Amoskeag Business Incubator
in Manchester, New Hampshire, on the fourth Thursday of the month, 7 PM
until... well, 10 PM last night!
The usual slew of announcements was made: PySIG won't meet on its usual
night in November because of the Thanksgiving holiday. A meeting might
happen the week after, since November has five Thursdays. Stay tuned for
the official announcement. Other affiliated GNHLUG meetings are posted
to http://www.gnhlug.org and all are welcome.
I had proposed a programming challenge to PySIG: following recent
discussions on the GNHLUG mailing lists about attendance at meetings,
Jim Kuzdrall had suggested we more closely analyze the attendance data
that's been posted to the GNHLUG wiki [1] for the past two years or so.
The data is accessible from there, but the HTML format is not too easy
to manipulate into an analyze-able format.
Enter BeautifulSoup. BS is a library written in Python that parses HTML,
with a lot of tolerance for somewhat malformed HTML, and produces a
parse tree that can be traversed, searched, or broken down into its
various elements. Kent S Johnson continued his great Kent's Korner[2] series
with a presentation on the basics of using BeautifulSoup [3]. Kent noted
that the documentation on BS is remarkably good, with illustrative
examples and exhaustive discussions. BS is in its third major version
and continues to be supported by its original author.
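For anyone who wants to try the idea without installing anything, here
is a minimal sketch of pulling table rows out of HTML. It uses the
standard library's html.parser as a stand-in, so it is not Beautiful
Soup's API and not the code shown at the meeting, and it is far less
forgiving of malformed markup than BS. The TableRows class and the
sample markup are made up for illustration; the real wiki page layout
is not reproduced here.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical sample markup, not copied from the GNHLUG wiki.
SAMPLE = """
<table>
<tr><th>Date</th><th>Group</th><th>Attendance</th></tr>
<tr><td>25 Oct 2007</td><td>PySIG</td><td>13</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE)
parser.close()
rows = parser.rows
```

Each entry in rows is a plain list of strings, which is a convenient
shape to hand to the csv module later.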
After Kent's Korner, Bill Sconce took the driver's seat, set up BS on
his machine and we began with the kernel of code Kent had supplied to
parse the page source. The group participated, suggested, yelled at typos,
experimented, threw out code, started over, changed the angle of attack,
and successfully produced code that not only parsed the existing page,
but generated a comma-separated-value file with proper escaping, thanks
to the csv module. Along the way, we discussed issues of character
conversion (since BS uses the aptly-named UnicodeDammit class and csv
wants ASCII), escaping issues, coding styles, and more.
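As a rough illustration of the escaping the csv module takes care of,
here is a small sketch in present-day Python 3, where csv accepts
Unicode text directly (the character-conversion step we wrestled with
is specific to the Python 2 of the time). The "Topic" column and its
contents are invented for the example; this is not the code the group
wrote.

```python
import csv
import io

# Hypothetical rows; the Topic value contains a comma and double quotes
# on purpose, to show what the writer has to escape.
rows = [
    ["Date", "Group", "Topic", "Attendance"],
    ["25 Oct 2007", "PySIG", 'Kent\'s Korner: "Beautiful Soup", sprint', "13"],
]

# newline="" leaves the writer's own line endings untouched.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back should give exactly the rows we started with.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
```

The writer wraps the awkward field in quotes and doubles the embedded
quote characters, so a reader sees the original value again.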
At the end of the presentation, Kent got the projector again to show a
somewhat different tack he had used to parse the HTML, with an emphasis
on writing small functions to clean each column of the idiosyncrasies
found in the data (a "Saturday" in the date field, a date field for a
two-day event, an approximated attendance of "~24" and so forth) and generate
some results: which groups had the highest attendance for the year? No
one was surprised that Nashua/MerriLUG was #1, but who knew that PySIG
was #2? Woo-hoo! We noted that RubySIG was last, but there's a
sampling problem: they had just started up early in the year, and a
couple of attendance figures were missing.
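The small-cleaning-function approach might look something like the
sketch below. It is only a guess at the style, not Kent's actual code:
the function names, the regular expressions, and the sample records are
all assumptions, and ranking by total attendance is just one way to
read "highest attendance".

```python
import re
from collections import defaultdict

# Matches a weekday name, plus an optional comma and trailing spaces.
WEEKDAYS = re.compile(
    r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b,?\s*", re.IGNORECASE
)


def clean_date(raw):
    """Drop a stray weekday name such as 'Saturday' from a date string."""
    return WEEKDAYS.sub("", raw).strip()


def clean_attendance(raw):
    """Return an int from values like '24' or '~24', or None if missing."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None


def total_by_group(records):
    """Sum cleaned attendance per group, highest total first."""
    totals = defaultdict(int)
    for group, attendance in records:
        count = clean_attendance(attendance)
        if count is not None:  # skip rows with no usable figure
            totals[group] += count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented records with placeholder group names, not real GNHLUG data.
records = [("Alpha", "~24"), ("Beta", "10"), ("Alpha", "6"), ("Beta", "")]
ranking = total_by_group(records)
```

Rows with a missing figure are simply skipped here, which is exactly
the kind of choice that skews a ranking for a group with few meetings.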
To follow up from the meeting, we intend to merge Kent's improvements
into the group's code and generate some CSV files that we can make
available for download from the GNHLUG wiki for all to analyze, graph,
visualize and study.
Thanks to Kent for preparing his Beautiful Soup presentation, to Bill
Sconce and Alex Hewitt for arranging the meeting, to Bill again for
having the patience to type while twelve people tsk'd at him, to the
Amoskeag Business Incubator for providing the fine facilities, and to
all for attending and vigorously participating in the meeting!
[1] http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents, which
actually breaks down to:
http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007,
http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2006, and
http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2005
Adding a "skin=print.pattern" eliminates some of the "chrome"
surrounding the content.
[2] http://personalpages.tds.net/~kent37/kk/index.html
[3] http://personalpages.tds.net/~kent37/kk/00009.html