[nzlug] other question about process
Martin D Kealey
martin at kurahaupo.gen.nz
Fri Sep 1 14:36:04 NZST 2006
On Fri, 1 Sep 2006, anru chen wrote:
> today, some thing happened again, but this time is httpd process,
> the procee just can not be killed , and not in zombie status.
> it just use 99% cpu time.
>
> my question is why process can not be killed and still use cpu time?
> any tool to trap this kind of problem ?
The long answer is that although you can send a signal to a process at any
time, it is only delivered (and hence only takes effect) just before the
process transitions from kernel mode back to user mode. That could be when it
returns from a kernel call, or when it resumes after a context switch or
(sometimes) a hardware interrupt, provided it was in user mode before the
switch or interrupt, so that it is going back into user mode. In fact, a
forced context switch *is* simply the response to a hardware interrupt: the
one from the clock. A process that never gets back to that transition never
sees the signal, not even SIGKILL.
At least, that was the story when I last delved into Unix kernels, and AFAIK
Linux is enough like other Unix kernels that this is *likely* still to
apply.
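
As for a tool: one rough way to see a signal that has been sent but not yet
delivered is to look at /proc/<pid>/status on Linux. The sketch below assumes
a kernel that exposes the State, SigPnd and ShdPnd fields there (documented in
proc(5)); the function names are made up for illustration. If signal 9 shows
as pending, the kill was received but the process has not yet come back out of
the kernel to act on it.

```python
def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pending_signals(fields):
    """Return the signal numbers pending for the thread or the whole process.

    SigPnd (per-thread) and ShdPnd (process-wide) are hex bitmasks in which
    bit n-1 stands for signal n.
    """
    mask = int(fields.get("SigPnd", "0"), 16) | int(fields.get("ShdPnd", "0"), 16)
    return [n for n in range(1, 65) if mask & (1 << (n - 1))]


def describe(pid):
    """Read the live status of `pid`: (state string, pending signal numbers)."""
    with open(f"/proc/{pid}/status") as f:
        fields = parse_status(f.read())
    return fields.get("State", "?"), pending_signals(fields)
```

For example, describe(1234) on the stuck httpd (1234 being a placeholder PID)
returns its state and any pending signals. A state starting with "D" means
uninterruptible sleep; "R" with SIGKILL pending suggests it is busy inside the
kernel.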
Some types of kernel activity are counted as on-going CPU usage, while
others aren't; things like waiting for device drivers generally don't count
towards CPU usage, but things which are normally "short" do because the
overhead of turning off the CPU-usage counter isn't worthwhile. Hardware
interrupt servicing is a good example of this, but things like mmapping a
page that's already physically resident would probably qualify too.
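
To check whether that 99% is being spent in user code or inside the kernel,
you can compare the utime and stime counters in /proc/<pid>/stat (fields 14
and 15 according to proc(5)). This is only a sketch with made-up function
names, and a high kernel share is a hint rather than proof; exactly what gets
charged to stime depends on how the kernel was built.

```python
import os


def parse_stat(line):
    """Return (state, utime_ticks, stime_ticks) from a /proc/<pid>/stat line.

    The command name (field 2) is in parentheses and may itself contain
    spaces or parentheses, so split only after the last ')'.
    """
    rest = line[line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field k sits at index k - 3.
    return rest[0], int(rest[11]), int(rest[12])


def kernel_share(utime, stime):
    """Fraction of the accumulated CPU time that was spent in kernel mode."""
    total = utime + stime
    return stime / total if total else 0.0


def sample(pid):
    """Read the live counters for `pid` and convert clock ticks to seconds."""
    with open(f"/proc/{pid}/stat") as f:
        state, utime, stime = parse_stat(f.read())
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    return state, utime / ticks_per_second, stime / ticks_per_second
```

Calling sample() twice a few seconds apart and comparing the two results shows
which of the counters is still growing.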
So what you have is something that is blocking inside the kernel for seconds
or minutes when it would normally take no more than a few hundred
microseconds. Generally that means some sort of hardware fault, and it
needn't be related to the activity the process is purporting to undertake.
For example, I've seen a system lock up when the acpid process was killed,
because the battery monitoring hardware kept posting interrupts to warn
about something, and when an interrupt wasn't acknowledged, it just posted
another. I discovered that one on a system that had checkinit for
/etc/init.d/acpid set to the wrong sequence number for shutting down (too
early).
USB is another possibility, if you have dirty/corroded connectors that
keep triggering insertion and removal events.
Dud memory is a possibility; check it. So are:
 - CPU overheating
 - PSU instability (due to overheating or otherwise)
 - a faulty or improperly seated PCI card
The possibilities aren't quite endless, but they begin to feel like it ...
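
One way to narrow it down, if the cause is a device that keeps posting
interrupts (as in the acpid and USB cases), is to take two snapshots of
/proc/interrupts and see which line is climbing fastest. A minimal sketch,
assuming the usual Linux layout of a CPU header line followed by one line per
interrupt; the function names are invented for the example.

```python
def parse_interrupts(text):
    """Map each interrupt label to (total count over all CPUs, description)."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header looks like "CPU0 CPU1 ..."
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        counts = []
        # Some summary lines (e.g. ERR) carry fewer counters than CPUs.
        for tok in tokens[:ncpu]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        description = " ".join(tokens[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def busiest(before, after, top=3):
    """Return the `top` largest (delta, label, description) between snapshots."""
    deltas = []
    for label, (count, description) in after.items():
        delta = count - before.get(label, (0, ""))[0]
        deltas.append((delta, label, description))
    deltas.sort(reverse=True)
    return deltas[:top]
```

Read /proc/interrupts, wait a second or so, read it again, and pass both
parsed snapshots to busiest(). The timer will normally be near the top; a
device that outruns it by a wide margin deserves a closer look.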
-Martin