External Monitoring and Alerting

Thu Mar 4 14:36:39 EST 2010

On Thu, Mar 4, 2010 at 1:56 PM, Kenny Lussier <klussier at gmail.com> wrote:

> HTTPS POST is the method that we need to use to test our systems
> availability. However, what we are testing is more than just web site
> availability or performance. It would actually be testing into an
> application, gauging response times and response content. We also need
> the ability to identify the IP addresses that the tests are coming
> from for security reasons.
>

I don't think altersite.com would keep you from achieving that goal.

>
> Also, I have noticed that everyone seems to offer either a 15-minute
> or a 5-minute test interval. Is that really the most that is needed?

Five minutes is a common default test interval.  Polling more often than that
can add up to a lot of data that you need to keep over time.  OpenNMS has a
nice poller that polls every 5 minutes until it sees a failure, and then more
frequently until it sees success.  It is all configurable.  I have not heard of
this in other systems, but I expect there must be some others that support it
by now.  In any case, you need to figure out what intervals are appropriate for
your needs, but 5 minutes is a reasonable place to start if you don't
want to think too much about it.
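
To make the "poll slowly until a failure, then faster until success" idea concrete, here is a minimal sketch in Python. It is not OpenNMS code or configuration; the function names and the 300s/30s intervals are assumptions picked purely for illustration.

```python
def next_interval(last_check_ok, normal_interval=300, failure_interval=30):
    """Return how many seconds to wait before the next poll.

    Both interval values are illustrative assumptions, not OpenNMS defaults.
    """
    return normal_interval if last_check_ok else failure_interval


def simulate_schedule(results, normal_interval=300, failure_interval=30):
    """Given a sequence of check outcomes (True = success), return the
    time in seconds at which each check would have run."""
    times = []
    now = 0
    for ok in results:
        times.append(now)
        now += next_interval(ok, normal_interval, failure_interval)
    return times


if __name__ == "__main__":
    # Success, two failures, then recovery.
    print(simulate_schedule([True, False, False, True, True]))
```

The point of the design is that you only pay for fine-grained data while something is actually wrong.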

> I
> would think that a higher frequency would be better, seeing as how 5
> minutes is beyond the "five 9's"  uptime that everyone strives for.
> With a home-grown system on VMs, you could test every 30 seconds or
> so.
>

Five 9's is fairly attainable over a varying period of time for most systems,
but it is very difficult to track and nearly impossible to prove.  For example,
even your suggested frequency of 30 seconds would not be sufficient to
confirm 99.999% uptime for a 30-day month, because you would have to prove
that you had less than 25.92s of downtime:
30 days * 24 h/day * 3600 s/h * (100% - 99.999%) = 25.92 s
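
The same arithmetic as a small Python sketch, in case you want to try other targets or periods. The function names are made up for illustration, and the comparison at the end is only a rough necessary condition, not a proof of anything.

```python
def downtime_budget_seconds(availability_percent, period_days=30):
    """Allowed downtime in seconds for a given availability target."""
    period_seconds = period_days * 24 * 3600
    return period_seconds * (100 - availability_percent) / 100


def interval_exceeds_budget(poll_interval_seconds, availability_percent,
                            period_days=30):
    """True if an outage that uses up the whole budget could fall
    entirely between two consecutive polls and go unseen."""
    budget = downtime_budget_seconds(availability_percent, period_days)
    return poll_interval_seconds > budget


if __name__ == "__main__":
    print(round(downtime_budget_seconds(99.999), 2))   # five 9's, 30 days
    print(interval_exceeds_budget(30, 99.999))         # 30-second polling
```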

Of course there are all sorts of ways to define uptime.  Phone companies
typically derive their uptime from how often someone goes to use the system
and can't because of some system failure, rather than from how much time it is
in perfect operation.  Marketing folks typically pick some reasonable polling
frequency and assume the system is up in between, and/or trust other sources
like system logs that would indicate any failure/recovery.  While not proof
in a mathematical sense, it is perfectly legal as long as all stakeholders
agree on the "good-enough" definition going into it.  You just have to be
careful about how you word your SLAs, etc.  Leave nothing to assumption. =)
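
As a rough illustration of the "pick a polling frequency and assume it is up in between" definition, here is a minimal Python sketch. The function name and the sample data are hypothetical; it simply treats each poll as standing in for the whole interval around it.

```python
def estimated_availability(poll_results):
    """Estimate availability (percent) as the share of successful polls.

    Assumes the system state between polls matches the poll result,
    which is a convention the stakeholders agree on, not a proof.
    """
    if not poll_results:
        raise ValueError("need at least one poll result")
    up = sum(1 for ok in poll_results if ok)
    return 100.0 * up / len(poll_results)


if __name__ == "__main__":
    # 30 days of 5-minute polls = 8640 samples, with one failed poll.
    polls = [True] * 8639 + [False]
    print(estimated_availability(polls))
```

Under this definition a single failed 5-minute poll in a 30-day month already puts the estimate below 99.999%, which is why the wording of the definition matters so much.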