Wanted: OSS to monitor changes to websites or diff HTML files
Larry Cook
lcook at sybase.com
Thu Feb 3 08:18:01 EST 2005
Is anyone aware of Open Source projects that monitor websites for changes, or
that diff HTML files such that tags and M$-inserted crap are ignored?
My search of SourceForge and FreshMeat didn't turn up anything that met my
needs, which are:
1) Identify new pages
2) Identify changed pages
3) For changed pages, indicate changes
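Requirements 1 and 2 can be sketched in a few lines, assuming two local snapshots of the site already exist on disk (for example, two wget downloads). The function names and the directory layout here are hypothetical, and a byte-for-byte comparison is only one possible definition of "changed".

```python
import filecmp
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def compare_snapshots(old_dir, new_dir):
    """Return (new_pages, changed_pages) as sorted lists of relative paths."""
    old_files = list_files(old_dir)
    new_files = list_files(new_dir)

    # Pages present only in the newer snapshot are "new".
    new_pages = sorted(new_files - old_files)

    # Pages present in both snapshots are "changed" if their bytes differ.
    changed_pages = sorted(
        path
        for path in old_files & new_files
        if not filecmp.cmp(
            os.path.join(old_dir, path),
            os.path.join(new_dir, path),
            shallow=False,  # compare contents, not just size and mtime
        )
    )
    return new_pages, changed_pages
```

Calling compare_snapshots("site.old", "site.new") (both directory names are placeholders) would return the two lists. Requirement 3 still needs a readable diff of each changed page.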
The few non-open source products and web-based services I looked at don't even
do all I want. They seem good at highlighting changes, but you have to
specify the pages to watch. They won't identify new pages.
Since I didn't find anything, I wrote a script that uses wget to download the
website and diff it with a previous download. This easily identifies new
pages and changed pages, but the diff of HTML files can look horrendous. So I
wrote a filter script that strips out HTML tags. This is better, but since it
just blindly takes out the tags (using the regex '<[^>]*>') it leaves a lot of
M$ crap. Of course, I can work to improve my filter, but if there is already
something out there, I'd rather use it. A utility that diffs HTML files and
can ignore everything but text, or a utility to convert HTML to text, would be
great.
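As a rough illustration of the kind of filter described above, here is a minimal sketch using only the Python standard library (html.parser and difflib). It is not an existing tool. The class and function names are made up for this example, and skipping only script and style elements is an assumption about what counts as noise. Because it parses the markup instead of applying a regex, comments (including Word's conditional comments) and tag attributes are dropped without extra work.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags, comments, scripts and styles."""

    SKIPPED = {"script", "style"}  # assumption: these never hold page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; only text nodes do.
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string as a list of lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def text_diff(old_html, new_html):
    """Return a unified diff of the text only; empty if the text is unchanged."""
    return list(
        difflib.unified_diff(
            html_to_text(old_html),
            html_to_text(new_html),
            "old",
            "new",
            lineterm="",
        )
    )
```

With this, two versions of a page that differ only in markup give an empty diff, while a wording change shows up as ordinary - and + lines.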
Thanks in advance for any suggestions.
Larry