[nzlug] Ping/Traceroute issue.. Nagios involved.. VMWare involved..

Michael Hutchinson mhutchinson at manux.co.nz
Tue Dec 11 09:48:19 NZDT 2007


Hello everyone,

I have a peculiar issue with a Linux server at work. Let me explain a
bit about the setup.

The Linux server, Bob, runs in VMware alongside a couple of Windows
servers on the same metal. Its role is to monitor our network/internet
links and other servers with Nagios (1.3). Bob is a Debian 3.1 "Sarge"
install and is kept up to date with security updates.

We have recently noticed an anomaly in Bob's ability to ping certain
monitored sites. Occasionally, ping will start failing for one site and
continue to fail for some time. If we take any other computer on the
same subnet and get it to ping the site, it works. Bob, however, cannot.
Bob still monitors all other sites just fine. We have tried other PCs
right next to Bob (same network, same external IP address), running both
Windows and Linux, but Bob is the only one with the issue.

If one of our monitored sites has services monitored in addition to
ping, only the ping check stops working. For example, serverA has Remote
Desktop Protocol (RDP) available as well as ping. When this problem
happens, Bob's ping check for that host fails, but the RDP check comes
back fine.
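One way to capture this pattern as it happens is to run an ICMP check
and a TCP check against the same host from Bob and log both results side
by side. The following is a minimal sketch, not a tested diagnostic
tool: the function names, the `classify` labels, and the use of TCP port
3389 for RDP are assumptions for illustration. It also assumes Python 3
and a Linux `ping` that accepts `-c` (count) and `-W` (timeout in
seconds).

```python
import socket
import subprocess


def icmp_ok(host, timeout_s=2):
    """Send one echo request with the system ping; True if a reply arrived."""
    # Assumes a Linux ping that accepts -c (count) and -W (timeout, seconds).
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # ping binary missing or not executable
    return result.returncode == 0


def tcp_ok(host, port=3389, timeout_s=2):
    """True if a TCP connection to host:port can be opened (3389 assumed for RDP)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def classify(icmp, tcp):
    """Label one pair of results (labels are illustrative, not Nagios states)."""
    if icmp and tcp:
        return "ok"
    if tcp:
        return "icmp-only failure"  # the symptom described above
    if icmp:
        return "tcp-only failure"
    return "host unreachable"


def check(host, port=3389):
    """Run both checks against one host and return the label."""
    return classify(icmp_ok(host), tcp_ok(host, port))
```

Running something like `check("serverA")` once a minute and appending
the label with a timestamp to a file would show whether the ICMP-only
failures line up with the Nagios alerts.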

We (here at work) have checked the network over for any firewall or
router rule that might be blocking the ping attempts, but nothing is in
the way. This has even happened to internally monitored sites, so it
doesn't make sense that a firewall is stopping the ping.

I have considered that we might be monitoring too many sites in Nagios,
but it seems to complete a full circuit of checks within about 7
minutes. I also considered hardware failure, but as the problem occurs
approximately every 2 or 3 days and then clears by itself after a couple
of hours, I don't expect it is a hardware issue.

Several people have brought up the fact that Bob is virtualized and that
this could be the issue. We have had VMware issues to do with
timekeeping, and have ensured that our VMware clocks are set up
correctly and reference the hardware clock of the metal the box runs on.
The time keeps really well. It was suggested to me that if the time were
to wander, it could cause an issue with the ping replies (i.e. a ping
reply coming back with a timestamp earlier than when it was sent), but I
would expect that to be a continuous issue, and we would not be able to
monitor sites 24/7 like we do.
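The timestamp concern can be made concrete. A ping-style tool typically
computes the round-trip time by subtracting the send time from the
receive time, both read from the local clock, so a clock that steps
backwards between the two reads yields a negative result. The sketch
below only illustrates that arithmetic. The function names and sample
values are hypothetical, and it does not show how any particular ping
implementation or Nagios plugin reacts to such a value.

```python
import time


def rtt_ms(sent_s, received_s):
    """Round-trip time in milliseconds from two local clock readings."""
    return (received_s - sent_s) * 1000.0


def is_plausible(rtt):
    """A real round trip cannot take negative time."""
    return rtt >= 0.0


# Normal case: reply read 20 ms after the request was stamped.
normal = rtt_ms(100.000, 100.020)

# Hypothetical case: the wall clock was stepped back 0.5 s in between.
stepped = rtt_ms(100.000, 100.020 - 0.5)


def measure(action):
    """Time an action with the monotonic clock, which is not affected by
    wall-clock adjustments (unlike time.time())."""
    start = time.monotonic()
    action()
    return (time.monotonic() - start) * 1000.0
```

Logging Bob's wall clock next to a monotonic reading across a failure
window would be one way to see whether the clock actually steps at those
times.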

Any help with this issue would be greatly appreciated. I have found no
software to help detect the fault, and am not sure what the next step
is. Well, I do know of one, but it involves de-virtualizing Bob and
testing all of the hardware it currently sits on, which would be a
costly exercise and would cause downtime when I don't firmly believe we
need to.

Thanks for any suggestions in advance,

Michael.

Manux Solutions Ltd

mhutchinson at manux.co.nz