Wanted: OSS to monitor changes to websites or diff HTML files
Drew Van Zandt
drew.vanzandt at gmail.com
Thu Feb 3 09:20:01 EST 2005
Something that converts HTML to text, like this?
http://www.icewalkers.com/Linux/Software/51170/html2txt.html
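
For illustration only, here is a minimal sketch of the HTML-to-text idea using Python's standard html.parser module. This is not how html2txt works; the names TextExtractor and html_to_text, and the choice of which tags count as block-level, are assumptions made for the example. A real parser drops comments (including Office conditional comments) and the contents of <script> and <style>, which a blind tag-stripping regex leaves behind.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text; skip <script>/<style> bodies and all comments."""

    SKIP = {"script", "style"}  # element contents to drop entirely
    # Assumed set of tags that should start a new line in the text output.
    BLOCK = {"p", "div", "br", "li", "tr", "table", "ul", "ol",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    # handle_comment is left as the default no-op, so comments vanish.


def html_to_text(html):
    """Return the visible text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines so diffs stay small.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)
```

Feeding each page through html_to_text before running diff should give output that reflects text changes only, at the cost of losing changes that live purely in markup (links, attributes).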
On Thu, 03 Feb 2005 08:17:32 -0500, Larry Cook <lcook at sybase.com> wrote:
> Is anyone aware of Open Source projects that monitor websites for changes or
> diff HTML files such that tags and M$-inserted crap are ignored?
>
> My search of SourceForge and FreshMeat didn't turn up anything that met my
> needs, which are:
>
> 1) Identify new pages
> 2) Identify changed pages
> 3) For changed pages, indicate changes
>
> The few non-open source products and web-based services I looked at don't even
> do all I want. They seem good at highlighting changes, but you have to
> specify the pages to watch. They won't identify new pages.
>
> Since I didn't find anything, I wrote a script that uses wget to download the
> website and diff it with a previous download. This easily identifies new
> pages and changed pages, but the diff of HTML files can look horrendous. So I
> wrote a filter script that strips out HTML tags. This is better, but since it
> just blindly takes out the tags (using the regex '<[^>]*>') it leaves a lot of
> M$ crap. Of course, I can work to improve my filter, but if there is already
> something out there, I'd rather use it. A utility that diffs HTML files and
> can ignore everything but text, or a utility to convert HTML to text, would be
> great.
>
> Thanks in advance for any suggestions.
>
> Larry
>
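
The download-and-diff approach described in the quoted message can be sketched roughly as below. This is a guess at the shape of such a script, not Larry's actual code: it assumes two directory trees already exist on disk (for example, two wget downloads), and the names compare_mirrors, strip_tags, and the to_text parameter are made up for the example. The default filter is the regex quoted in the message; a better HTML-to-text converter can be passed in as to_text.

```python
import difflib
import re
from pathlib import Path

TAG_RE = re.compile(r"<[^>]*>")  # the tag-stripping regex from the message


def strip_tags(html):
    """Blindly remove anything that looks like a tag."""
    return TAG_RE.sub("", html)


def compare_mirrors(old_dir, new_dir, to_text=strip_tags):
    """Compare two downloaded copies of a site.

    Returns (new_pages, changed), where new_pages is a list of relative
    paths present only in new_dir, and changed maps each relative path
    whose text differs to a list of unified-diff lines.
    """
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    new_pages, changed = [], {}
    for new_file in sorted(new_dir.rglob("*.html")):
        rel = new_file.relative_to(new_dir)
        old_file = old_dir / rel
        if not old_file.exists():
            new_pages.append(rel.as_posix())  # requirement 1: new pages
            continue
        old_lines = to_text(old_file.read_text(errors="replace")).splitlines()
        new_lines = to_text(new_file.read_text(errors="replace")).splitlines()
        if old_lines != new_lines:  # requirement 2: changed pages
            # requirement 3: show what changed, on the text only
            changed[rel.as_posix()] = list(difflib.unified_diff(
                old_lines, new_lines,
                f"old/{rel.as_posix()}", f"new/{rel.as_posix()}",
                lineterm=""))
    return new_pages, changed
```

Pages that differ only in markup compare equal after filtering, so they are not reported. Only files matching *.html are examined here; pages saved under other names would need a wider pattern.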