Wanted: OSS to monitor changes to websites or diff HTML files

Drew Van Zandt drew.vanzandt at gmail.com
Thu Feb 3 09:20:01 EST 2005


Something that converts HTML to text, like this?

http://www.icewalkers.com/Linux/Software/51170/html2txt.html
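
If a standalone converter doesn't fit, a rough sketch of the same idea using only Python's standard library might look like the following. It is not how html2txt works; the names `TextExtractor`, `html_to_text` and `text_diff` are made up for illustration, and the set of tags treated as line breaks is an assumption. Unlike a bare '<[^>]*>' regex, a real parser also skips comments (including Word's conditional comments) and the bodies of script and style elements.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; drop tags, comments, and script/style bodies."""

    SKIP = {"script", "style"}  # element bodies that are never visible text
    BLOCK = {"p", "div", "br", "li", "tr", "table", "title",
             "h1", "h2", "h3", "h4", "h5", "h6"}  # assumed line-breaking tags

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Comments never reach this method; handle_comment ignores them.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def text_diff(old_html, new_html):
    """Unified diff of the visible text only; markup-only changes vanish."""
    old = html_to_text(old_html).splitlines()
    new = html_to_text(new_html).splitlines()
    return list(difflib.unified_diff(old, new, "old", "new", lineterm=""))
```

Because only the extracted text is compared, a page whose markup changed but whose wording did not produces an empty diff.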


On Thu, 03 Feb 2005 08:17:32 -0500, Larry Cook <lcook at sybase.com> wrote:
> Is anyone aware of Open Source projects that monitor websites for changes, or
> that diff HTML files while ignoring tags and M$-inserted crap?
> 
> My search of SourceForge and FreshMeat didn't turn up anything that met my
> needs, which are:
> 
> 1) Identify new pages
> 2) Identify changed pages
> 3) For changed pages, indicate changes
> 
> The few non-open-source products and web-based services I looked at don't even
> do all I want.  They seem good at highlighting changes, but you have to
> specify the pages to watch.  They won't identify new pages.
> 
> Since I didn't find anything, I wrote a script that uses wget to download the
> website and diff it with a previous download.  This easily identifies new
> pages and changed pages, but the diff of HTML files can look horrendous.  So I
> wrote a filter script that strips out HTML tags.  This is better, but since it
> just blindly takes out the tags (using the regex '<[^>]*>') it leaves a lot of
> M$ crap.  Of course, I can work to improve my filter, but if there is already
> something out there, I'd rather use it.  A utility that diffs HTML files and
> can ignore everything but text, or a utility to convert HTML to text, would be
> great.
> 
> Thanks in advance for any suggestions.
> 
> Larry
> 
>
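
The download-and-compare step described in the quoted message (requirements 1 and 2) can be sketched roughly as below. This is not Larry's script; it assumes two wget downloads already sit in separate directories, and the function name `compare_mirrors` and its arguments are hypothetical.

```python
import filecmp
import os


def compare_mirrors(old_dir, new_dir):
    """Return (new_pages, changed_pages) as sorted lists of relative paths.

    old_dir and new_dir are assumed to hold an earlier and a later
    download of the same site.
    """
    new_pages, changed_pages = [], []
    for root, _dirs, files in os.walk(new_dir):
        for name in files:
            new_path = os.path.join(root, name)
            rel = os.path.relpath(new_path, new_dir)
            old_path = os.path.join(old_dir, rel)
            if not os.path.isfile(old_path):
                # Present only in the later download: a new page.
                new_pages.append(rel)
            elif not filecmp.cmp(old_path, new_path, shallow=False):
                # Present in both, but the bytes differ: a changed page.
                changed_pages.append(rel)
    return sorted(new_pages), sorted(changed_pages)
```

The byte-level comparison only flags which pages changed. For requirement 3, each changed page would still need to go through an HTML-to-text filter before diffing, otherwise markup-only edits show up as changes.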


