Wanted: OSS to monitor changes to websites or diff HTML files

Larry Cook lcook at sybase.com
Thu Feb 3 08:18:01 EST 2005


Is anyone aware of Open Source projects that monitor websites for changes, or 
that diff HTML files such that tags and M$-inserted crap are ignored?

My search of SourceForge and FreshMeat didn't turn up anything that met my 
needs, which are:

1) Identify new pages
2) Identify changed pages
3) For changed pages, indicate changes
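
For reference, requirements 1 and 2 can be sketched in a few lines of Python, 
assuming two complete downloads of the site already sit in local directories. 
The function names and the directory layout here are hypothetical, not taken 
from any existing tool, and hashing raw bytes means any markup change counts 
as a change:

```python
import hashlib
from pathlib import Path


def snapshot_hashes(root):
    """Map each file's path (relative to root) to a SHA-256 digest of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_snapshots(old_root, new_root):
    """Return (new_pages, changed_pages) as sorted lists of relative paths."""
    old = snapshot_hashes(old_root)
    new = snapshot_hashes(new_root)
    # Requirement 1: pages present only in the newer download.
    new_pages = sorted(path for path in new if path not in old)
    # Requirement 2: pages present in both downloads whose bytes differ.
    changed_pages = sorted(
        path for path in new if path in old and new[path] != old[path]
    )
    return new_pages, changed_pages
```

Calling compare_snapshots("site-old", "site-new") (directory names made up) 
returns the two lists. Requirement 3 still needs a readable diff of each 
changed page.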

The few non-open-source products and web-based services I looked at don't even 
do all I want.  They seem good at highlighting changes, but you have to 
specify the pages to watch, so they won't identify new pages.

Since I didn't find anything, I wrote a script that uses wget to download the 
website and diff it with a previous download.  This easily identifies new 
pages and changed pages, but the diff of HTML files can look horrendous.  So I 
wrote a filter script that strips out HTML tags.  This is better, but since it 
just blindly takes out the tags (using the regex '<[^>]*>') it leaves a lot of 
M$ crap.  Of course, I can work to improve my filter, but if there is already 
something out there, I'd rather use it.  A utility that diffs HTML files while 
ignoring everything but the text, or a utility that converts HTML to text, 
would be great.
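
To make the idea concrete, here is a rough sketch of the kind of filter 
described above, using only Python's standard library (html.parser and 
difflib) in place of the regex.  The names TextExtractor, html_to_text and 
text_diff are made up for illustration, and the assumption that most of the 
leftover junk lives in comments and <style>/<script> blocks is a guess, not 
something verified against real Word-generated pages:

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping comments and <script>/<style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments are reported separately by the parser, so they never
        # reach this method and are dropped without extra work.
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.lines.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string as a list of lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines


def text_diff(old_html, new_html):
    """Unified diff of the visible text only; markup changes are ignored."""
    return list(difflib.unified_diff(
        html_to_text(old_html), html_to_text(new_html),
        "old", "new", lineterm="",
    ))
```

With this, two versions of a page that differ only in tags, attributes, 
comments or styles produce an empty diff, while a changed sentence shows up 
as a "-" line followed by a "+" line.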

Thanks in advance for any suggestions.

Larry






More information about the gnhlug-discuss mailing list