[GNHLUG] PySIG this Thursday - Beautiful Soup - HTML scraping; actual code developed before your eyes
Bill Sconce
sconce at in-spec-inc.com
Mon Oct 22 23:01:25 EDT 2007
PySIG Manchester, NH 25 October 2007
------------------------------------------------------------------------
Kent Johnson: Beautiful Soup
Us Ourselves: Python in Action
------------------------------------------------------------------------
____________________________________________________________________
PySIG -- New Hampshire Python Special Interest Group
Amoskeag Business Incubator, Manchester, NH
25 October 2007 (4th Thursday) 7:00PM
The monthly meeting of PySIG, the NH Python Special Interest Group,
takes place on the fourth Thursday of the month, starting at 7:00 PM.
Beginners' session precedes at 6:30 PM. (Bring a Python question!)
--------------------------------------------------------------------
Kent's Korner - Kent Johnson: Beautiful Soup
--------------------------------------------------------------------
"Beautiful Soup is a Python HTML/XML parser designed for quick
turnaround projects like screen-scraping. Three features make it
powerful:
1. Beautiful Soup won't choke if you give it bad markup. It
yields a parse tree that makes approximately as much sense as
your original document. This is usually good enough to collect
the data you need and run away.
2. Beautiful Soup provides a few simple methods and Pythonic
idioms for navigating, searching, and modifying a parse tree: a
toolkit for dissecting a document and extracting what you need.
You don't have to create a custom parser for each application.
3. Beautiful Soup automatically converts incoming documents to
Unicode and outgoing documents to UTF-8. You don't have to think
about encodings, unless the document doesn't specify an encoding
and Beautiful Soup can't autodetect one. Then you just have to
specify the original encoding.
"Beautiful Soup parses anything you give it, and does the tree
traversal stuff for you. You can tell it 'Find all the links', or
'Find all the links of class externalLink', or 'Find all the links
whose urls match "foo.com"', or 'Find the table heading that's got
bold text, then give me that text.'
"Valuable data that was once locked up in poorly-designed websites
is now within your reach. Projects that would have taken hours take
only minutes with Beautiful Soup."
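For the curious, the queries quoted above can be sketched roughly as
follows. This is only an illustration, not Kent's material: it assumes
the present-day `bs4` package (beautifulsoup4) is installed, which
postdates this announcement, and the sample HTML, the URLs in it, and
the variable names are all made up for the example.

```python
import re
from bs4 import BeautifulSoup  # assumption: the modern beautifulsoup4 package

# Invented sample markup; note the <p> tag that is never closed.
HTML = """
<p>Some links:
<a href="http://foo.com/a" class="externalLink">Foo A</a>
<a href="/local/page">Local</a>
<a href="http://bar.org/" class="externalLink">Bar</a>
<table>
<tr><th>Plain heading</th><th><b>Bold heading</b></th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
"""

# "html.parser" is the standard-library backend, so no extra parser is needed.
soup = BeautifulSoup(HTML, "html.parser")

# 'Find all the links'
all_links = [a["href"] for a in soup.find_all("a")]

# 'Find all the links of class externalLink'
external = [a["href"] for a in soup.find_all("a", class_="externalLink")]

# 'Find all the links whose urls match "foo.com"'
foo_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"foo\.com"))]

# 'Find the table heading that's got bold text, then give me that text'
bold_heading = next(th.b.get_text() for th in soup.find_all("th") if th.b)

print(all_links, external, foo_links, bold_heading)
```

Each query is a single call on the parse tree; none of them needs a
hand-written parser.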
--------------------------------------------------------------------
1st-ever PySIG development sprint: we try to write Actual Code
--------------------------------------------------------------------
Per a Challenge from Ted Roche. Viz,
"Recently, I started messing with some of the data we have
stored on GNHLUG.org, specifically, the Past Events page:
http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents
"Just like an end user, I had some questions, simple to ask,
tough to answer. We have attendance data for most meetings
since September of 2005. Given those dates to present,
1. What's the average monthly attendance at a GNHLUG event?
2. What's the average attendance over that period per
group/chapter/SIG?
3. What's the most popular meeting?
Most popular for each group?
4. What are the trends in attendance? Up, down?
"It seems like an interesting real-world problem: scrape a
web page of questionable HTML, interpret dirty data (not all
groups are groups, not all attendance numbers are numbers),
dump it into a database (or perhaps a spreadsheet?) and do
the calculations. Pretty graphs get extra points.
"I think the results would be interesting, and the process
of getting to the results interesting, too: presenting how you
take on the problem, what tools you use, how much code is
needed, would make a fun meeting not only inside the SIG,
but to LUG meetings as well."
--posted to PySIG mailing list 26 Sept 07
And the group thought so too... so here we go! We'll throw code
up on the screen, develop the screenscraper Ted envisions (or as
much of it as we can get done in the available time) and publish
the results. A real world test of RAD, as empowered by Python
(and its "batteries-included" libraries -- in this case Beautiful
Soup).
All are welcome. Come to help us code. Or come to laugh :)
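As a warm-up for the "dirty data" half of Ted's challenge, here is one
possible sketch of the cleanup and per-group averaging, using only the
standard library. It is not the code the sprint will produce. The rows
are invented for illustration (they are NOT real GNHLUG attendance
figures), "OtherLUG" is a placeholder group name, and the helper names
`parse_attendance` and `average_by_group` are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

# Invented (group, attendance-as-scraped-text) rows; not real data.
ROWS = [
    ("PySIG", "12"),
    ("PySIG", "15"),
    ("OtherLUG", "20"),
    ("OtherLUG", "n/a"),   # attendance that is not a number
    ("OtherLUG", "~18"),   # number with stray punctuation
    ("", "9"),             # row with no group at all
]

def parse_attendance(text):
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def average_by_group(rows):
    """Average the usable attendance counts for each non-empty group name."""
    counts = defaultdict(list)
    for group, raw in rows:
        name = group.strip()
        count = parse_attendance(raw)
        if name and count is not None:  # skip rows we cannot interpret
            counts[name].append(count)
    return {name: mean(values) for name, values in counts.items()}

averages = average_by_group(ROWS)
print(averages)
```

Rows that cannot be interpreted are simply skipped here; a real run
would probably want to log them so the wiki page can be corrected.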
Plus:
-------------------------------------------------------------------
o Kent's Korner The Real Stuff! - Kent Johnson
This month: Beautiful Soup (see above)
Upcoming Kent's Korner topics:
XML parsing (ElementTree)
Profiling (timeit, prof)
o Our usual roundtable of introductions, happenings, announcements
o Gotcha contest
- Got a favorite "gotcha"? Bring it and share...
And of course, milk & cookies.
Cookies are assured, thanks to Janet. Milk also, thanks to Alex.
-------------------------------------------------------------------
6:30 Beginners' Q&A
7:00 Welcome, Announcements - Bill & Ted & Alex
7:10 Milk & Cookies - Alex & Janet
7:10 Favorite-gotcha contest
7:15 Kent's Korner (Python Module of the Month) - Beautiful Soup
7:45 Development Sprint! Web-Scraping; the Ted Roche Challenge
9:00~ Adjourn
___________________________________________________________________
About PySIG:
PySIG meetings are typically 10-20 people, around a large table
equipped with a projector and Internet hookups (wired and
wireless). We encourage laptops and a hands-on seminar style.
The main meeting starts at 7 PM; officially we finish circa 9 PM.
Everyone is welcome. ("Membership" is anyone who has an interest
in the Python programming language, whether on Microsoft systems
or Linux or OS X; or cell phones, mainframes, or space stations.
We have everyone from object-oriented gurus to recovering COBOL
programmers.) Tell your friends!
Beginners' session:
The half hour before the formal meeting (i.e., starting at 6:30PM)
we have a beginners' session. Any Python question is welcome --
whoever asks the first question gets the half hour! Questions are
equally welcome by mail beforehand (in which case we can announce
them) or at the meeting. (As are all Python questions, anytime.)
Mailing list:
http://www.dlslug.org/mailman/listinfo/python-talk
About Python:
"Python is a dynamic object-oriented programming language that
can be used for many kinds of software development. It offers
strong support for integration with other languages and tools,
comes with extensive standard libraries, and can be learned
in a few days. Many Python programmers report substantial
productivity gains and feel the language encourages the
development of higher quality, more maintainable code."
"NASA uses Python...so does Rackspace, Industrial Light&Magic,
AstraZeneca, Honeywell, and many others."
Google: "Python has been an important part of Google since the
beginning, and remains so as the system grows and evolves."
-Peter Norvig
http://www.python.org
About Amoskeag Business Incubator:
Our gracious hosts are the Amoskeag Business Incubator, an
organization providing a supportive entrepreneurial environment
that stimulates the growth of businesses to ensure economic
vitality and encourage job creation, by providing affordable
office space and technical assistance to early stage companies.
PySIG thanks the ABI for their generous hospitality.
http://www.abi-nh.com
_______________________________________________________________________
Directions (thanks to Ted Roche for improvements to "from the north"):
PySIG NH meetings are held at the Amoskeag Business Incubator,
33 South Commercial Street, Manchester, NH.
Coming in to Manchester using I-293, from the north:
o Use Exit 6 from I-293. Stay to the right on the ramp,
yield twice to traffic incoming from the left, cross back
over I-293 and accept one merge coming in from your right.
o Then get in the right lane, and stay there, over the river,
and onto the Canal Street exit ramp.
o Take the first right off Canal Street onto North Commercial
Street. Enjoy the scenic mill buildings as the street turns
into Commercial Street.
o Coming to the traffic light get in the middle lane. South
Commercial Street starts on the other side of the light.
You go straight through (and join the folks coming from the
south at step * below).
Coming in to Manchester using I-293, from the south:
o Use the Granite Street exit. Turn right (east). Go under
I-293 and cross the bridge over the Merrimack River.
o Turn right (south) at the first light after crossing the
bridge.
* This is South Commercial Street. Go past one parking-lot
entrance, turn right into the second one. 33 South Commercial
Street will be right in front of you. You may go in via
either the ramp or the door and three steps inside.
o Inside. Up the stairs if via the door. Go through the
glass doors - follow the diamonds on the floor. Go left
at the last diamond. (Under a sign which says
"<- Amoskeag Small Bus. Incubator").
o More diamonds, another sign... much glass and office
space for SNHU; turn left there, 4 more diamonds and
you're at the glass doors for the Incubator. An "abi"
sign is above.
o Through the doors, straight down the hall. The ABI
Conference Room is on the left.
________________________________________________________________________
$URL: svn://svn.in-spec-inc.com/isi/trunk/isi/opages/pysig.announcement $
$Id: pysig.announcement 1570 2007-10-23 02:44:57Z sconce $ $Rev: 1570 $