Wanted: OSS to monitor changes to websites or diff HTML files
    Larry Cook 
    lcook at sybase.com
       
    Thu Feb  3 08:18:01 EST 2005
    
    
  
Is anyone aware of Open Source projects that monitor websites for changes, or 
that diff HTML files while ignoring tags and M$-inserted crap?
My search of SourceForge and FreshMeat didn't turn up anything that met my 
needs, which are:
1) Identify new pages
2) Identify changed pages
3) For changed pages, indicate changes
The few non-open-source products and web-based services I looked at don't even 
do all I want.  They seem good at highlighting changes, but you have to 
specify the pages to watch.  They won't identify new pages.
Since I didn't find anything, I wrote a script that uses wget to download the 
website and diff it with a previous download.  This easily identifies new 
pages and changed pages, but the diff of HTML files can look horrendous.  So I 
wrote a filter script that strips out HTML tags.  This is better, but since it 
just blindly takes out the tags (using the regex '<[^>]*>') it leaves a lot of 
M$ crap.  Of course, I can work to improve my filter, but if there is already 
something out there, I'd rather use it.  A utility that diffs HTML files while 
ignoring everything but text, or a utility that converts HTML to text, would be 
great.
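
One possible direction for the filter described above, offered as a minimal 
sketch and not a finished tool: parse the HTML instead of applying a regex, so 
that comments (including Office-style conditional comments) and the bodies of 
script and style elements are dropped along with the tags.  This assumes 
Python 3 and uses only its standard library (html.parser and difflib).  The 
names TextExtractor, html_to_text, and text_diff are hypothetical, chosen for 
illustration.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; ignore tags, comments, and script/style bodies."""

    SKIP = {"script", "style"}  # elements whose content is never page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []      # one entry per run of text between tags
        self.skip_depth = 0   # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; HTMLParser routes them to
        # handle_comment, which does nothing unless overridden.
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string as a list of lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def text_diff(old_html, new_html):
    """Unified diff of the visible text only; empty if the text is unchanged."""
    return list(difflib.unified_diff(
        html_to_text(old_html), html_to_text(new_html),
        fromfile="old", tofile="new", lineterm=""))
```

Running text_diff on the old and new copies of a page that wget reports as 
changed would then show only differences in the text itself.  A change that 
touches nothing but markup or attributes produces an empty diff, so such pages 
could be skipped in the report.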
Thanks in advance for any suggestions.
Larry
    
    