Monitoring of Internet connection

Ben Scott dragonhawk at gmail.com
Wed Jun 6 22:10:36 EDT 2007


Hi all,

  I'm looking to monitor the reliability of our Internet connection at
$DAYJOB, from the inside out.  I'm familiar (in concept at least) with
monitoring a bunch of hosts and services with a tool like Nagios, but
here I'm only looking to monitor one link, and closely.

  I could just run ping in a screen(1) session, of course.  Indeed,
that's what I'm often doing now.  But I'm thinking a more
sophisticated approach is possible, and could yield more informative
results.

  In practical terms, I'm thinking about some kind of reverse
exponential back-off.  When things appear to be working, sending
occasional probes is fine.  But when a probe indicates trouble, follow
up immediately with more.  If it looks like things are working, slowly
back off.  If things remain not-working, then keep hammering at it
until they start working.
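
  A minimal sketch of that scheduling idea, in Python.  Everything here
is an assumption for illustration: the names (next_interval, probe,
monitor), the 1-second and 60-second bounds, and the doubling factor
are all made up, and probe() assumes the Linux iputils ping, where
-c sets the probe count and -W the reply timeout in seconds.

```python
import subprocess
import time

MIN_INTERVAL = 1.0    # assumed: seconds between probes while trouble is suspected
MAX_INTERVAL = 60.0   # assumed: seconds between probes once the link looks healthy
BACKOFF = 2.0         # assumed: growth factor applied after each successful probe


def next_interval(current, ok, lo=MIN_INTERVAL, hi=MAX_INTERVAL, factor=BACKOFF):
    """Return the delay before the next probe.

    A failed probe drops straight to the minimum interval; a successful
    probe lengthens the interval geometrically, capped at the maximum.
    """
    if not ok:
        return lo
    return min(hi, max(lo, current * factor))


def probe(host, timeout_s=2):
    """Send one ICMP echo via the system ping; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host):
    """Probe until interrupted, logging each result with a timestamp."""
    interval = MAX_INTERVAL
    while True:
        ok = probe(host)
        interval = next_interval(interval, ok)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        status = "ok" if ok else "FAIL"
        print(f"{stamp} {host} {status} next probe in {interval:.0f}s", flush=True)
        time.sleep(interval)
```

  Calling monitor() with the peer gateway's address would run until
interrupted.  Keeping the interval logic in its own function means it
can be checked without sending any packets.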

  I'm also wondering about tracking performance (% loss and RTT) vs
packet size and/or padding patterns.
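
  One way the per-size bookkeeping might look, as a rough Python
sketch.  The summarize() function and its input format, a list of
(payload_size, rtt_ms) pairs with rtt_ms set to None for a lost probe,
are assumptions for illustration; collecting the samples is left out.
With the iputils ping, -s sets the payload size and -p the pad pattern.

```python
from collections import defaultdict
from statistics import mean


def summarize(samples):
    """Group (payload_size, rtt_ms) samples by size.

    rtt_ms is None for a probe that got no reply.  Returns a dict keyed
    by payload size with the probe count, loss percentage, and the
    average and maximum RTT of the replies that did arrive.
    """
    by_size = defaultdict(list)
    for size, rtt in samples:
        by_size[size].append(rtt)

    report = {}
    for size, rtts in sorted(by_size.items()):
        replies = [r for r in rtts if r is not None]
        report[size] = {
            "sent": len(rtts),
            "loss_pct": 100.0 * (len(rtts) - len(replies)) / len(rtts),
            "avg_rtt_ms": mean(replies) if replies else None,
            "max_rtt_ms": max(replies) if replies else None,
        }
    return report
```

  The same grouping would work for pad patterns by using the pattern,
or a (size, pattern) tuple, as the key instead of the size alone.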

  The reason I'm asking all this is that our feed at work has recently
(over the past couple of days) developed some weird trouble.  It's
intermittent but persistent.  The provider has been very responsive in
acknowledging it and is trying to figure out WTF is going on, but they
have no immediate answers, and I'm trying to gather more information,
to help them as much as us.

  Can anyone here comment on this?

  (Aside: Fellow IT geeks may be interested in the failure mode.  My
local gateway (Linux box) is not getting an ARP reply for the peer
gateway.  With a sniffer, I see our gateway sending ARP queries, but
never an ARP response from the peer.  But I see the peer's ARP and our
response.  I also see TCP/UDP packets coming in, so the peer has our
MAC.  (Our box never replies, since it can't ARP.)  The provider says
they can see our CPE in their management system.  The trouble comes
and goes, with no pattern I've been able to discern.  Oh, and while
this happens, the ping test I'm running will occasionally report a
burst of packets with RTTs of 300+ seconds.  Not milliseconds,
*seconds*.  I don't know how TTL is even allowing that.  This is one
of those hair-reducing problems.)

-- Ben

