[nzlug] Virtualization + ethernet/ping issues.
Michael Hutchinson
mhutchinson at manux.co.nz
Wed Feb 13 11:52:11 NZDT 2008
Hi all,
I've been getting a rather strange problem with the ping program on one
of our Debian Sarge servers. It happens to be a monitoring server,
running Nagios 1.3. Most of our monitoring relies on pinging host
addresses to determine if they are alive or dead. It is worth mentioning
that the monitoring server is virtualized with VMware. Two other Linux
servers and three Windows servers are virtualized on the same bare-metal
host.
Occasionally, a site will be reported as down when it actually is not.
When this happens we check from another server on the same part of the
network as the monitoring server. The ping replies come back just fine
on that other server, so the fault appears to be on the monitoring
server itself.
At first we thought the network itself was at fault, but testing showed
that if we use 'ifconfig' to bring the Ethernet adaptor down and then
back up, the pings resume as normal.
When the anomaly is present and the monitoring server "cannot" ping the
"affected" destination host, I have run tcpdump on the monitoring
server. It clearly shows the ping replies coming all the way back. They
are not fragmented and do not have strange TTLs or anything else weird;
they just look like proper ping replies. But it is as if the ping
program is blind to them.
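
One thing a default tcpdump line does not make obvious is whether each
reply would pass the checks a ping program typically applies before
counting it. Below is a minimal Python sketch of those checks, assuming
the raw ICMP bytes of a reply have been pulled out of a capture by hand
(for example from 'tcpdump -x' output). The function names are
hypothetical, and the identifier check assumes the usual behaviour of
ping matching replies on the ICMP identifier it sent; this illustrates
the idea and is not how ping or fping is actually implemented.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """Ones' complement checksum over 16-bit words (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo(icmp_type: int, ident: int, seq: int, payload: bytes = b"") -> bytes:
    """Build an ICMP echo message (type 8 = request, type 0 = reply)."""
    header = struct.pack("!BBHHH", icmp_type, 0, 0, ident, seq)
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", icmp_type, 0, csum, ident, seq) + payload


def check_echo_reply(icmp: bytes, expected_ident: int) -> list:
    """Return a list of reasons a reply might be ignored (empty if none)."""
    if len(icmp) < 8:
        return ["truncated ICMP header"]
    icmp_type, code, _csum, ident, _seq = struct.unpack("!BBHHH", icmp[:8])
    problems = []
    if icmp_type != 0 or code != 0:
        problems.append("not an echo reply")
    # A message whose checksum field is correct sums to zero overall.
    if internet_checksum(icmp) != 0:
        problems.append("bad checksum")
    if ident != expected_ident:
        problems.append("identifier mismatch")
    return problems
```

An empty list means the reply passes these three checks; running
tcpdump with -v should also flag a bad ICMP checksum directly.
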
Today, when it happened again to a monitored site, I tested the "fping"
program. It was also unable to see the returning ping packets. I then
tested the "arping" program, which worked just fine and gave the
expected response.
My question is, are there any other tests I can run on the monitoring
server to see what is causing this? I find it so peculiar that using
ifconfig to bring eth1 down and then up again fixes the problem. It is
almost as if it is a virtualization issue, perhaps caused by VMware
polling the Ethernet adaptor (virtual or otherwise) in a way that
somehow torments a Linux guest operating system.
I know my last statement seems like a bit of a reach, but it comes about
because our other monitoring server (which is on a different part of the
network) runs on bare metal and has never exhibited this problem. It
also uses Debian Sarge (with the same apt-get upgrades applied) and
Nagios 1.3.
Does anyone have any ideas for troubleshooting this problem? We've run
out :)
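
One further test that might narrow this down, sketched in Python: Linux
keeps ICMP counters in /proc/net/snmp, so comparing them before and
after a failed ping could show whether the replies reach the kernel's
ICMP layer (InEchoReps rising) or are being counted as errors (InErrors
rising). The function names and the sample text below are made up for
illustration, and the exact set of columns varies between kernel
versions.

```python
def parse_icmp_counters(snmp_text: str) -> dict:
    """Return the Icmp counters from /proc/net/snmp-style text as a dict."""
    # The Icmp section is two lines: a header of names, then the values.
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Icmp:")]
    if len(rows) < 2:
        raise ValueError("no Icmp section found")
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def counter_delta(before: dict, after: dict) -> dict:
    """Return only the counters that changed, with the amount of change."""
    return {name: after[name] - before[name]
            for name in before
            if name in after and after[name] != before[name]}


# Invented sample data, not taken from the affected server.
SAMPLE_BEFORE = ("Icmp: InMsgs InErrors InEchoReps OutMsgs OutEchos\n"
                 "Icmp: 100 2 40 90 45\n")
SAMPLE_AFTER = ("Icmp: InMsgs InErrors InEchoReps OutMsgs OutEchos\n"
                "Icmp: 105 7 40 95 50\n")

# On the real server the text would come from the file itself:
#     with open("/proc/net/snmp") as f:
#         before = parse_icmp_counters(f.read())
```

In the invented sample, five echo requests go out and five messages come
in, but InEchoReps does not move while InErrors does, which would point
at the kernel rejecting the replies rather than ping ignoring them.
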
Thanks in advance,
Michael Hutchinson
Manux Solutions Ltd