RH8 kernel stack traces

Mon Nov 10 23:41:28 EST 2008

On Fri, 2008-11-07 at 12:40 -0500, Larry Cook wrote:
> I am getting a variety of kernel stack traces (see attached file) from 
> my stock RedHat 8.0 system.  I've been running this system for years.  I 
> have never updated the kernel and I'm not aware of any system changes 
> I've made around the time this started happening.  Sometimes the kernel 
> stack trace will leave the system locked up and I have to do a hard 
> reboot.  Other times the system is still working.
> 
> Due to the apparent variety of stack traces I'm wondering if I have a 
> hardware problem, like the hard drive or memory.  Can anyone offer 
> advice as to what might be causing these problems if I've made no system 
> changes?

This looks like a failed memory chip to me.  It could also be another
component on the path between the CPU and RAM.  For example, the
motherboard traces from your CPU to your memory chips might be shorted
by conductive dust, or your CPU may have overheated and failed.

It doesn't look like a problem with your hard drive or file system
because the problem seems to be occurring in the kernel code that
processes the inode cache (which is stored in RAM).  When I've had hard
drives fail in the past, the Linux kernel struggled with the failed
components in DMA calls and controller/bus calls.

To me, invalid pointers in the inode cache and mapping would indicate
that the data stored in RAM is getting corrupted.  RedHat used to
include a utility called 'memtest86' that you could boot to instead of
the RedHat installer.  The program would write to the full range of all
available RAM and report any errors.  It takes a significant amount of
time to run (overnight usually).  If RedHat no longer includes the
program on their distribution CDs, you can find more info and download
it from these URLs:

http://en.wikipedia.org/wiki/Memtest86
http://www.memtest.org/
http://www.memtest86.com/
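
As a rough illustration of what a tester like memtest86 does, here is a
minimal Python sketch of a write-then-verify pattern test.  It is not
how memtest86 itself is implemented: it runs in user space on whatever
buffer the OS hands it, so it cannot cover all of RAM and is no
substitute for booting the real tool.  The function names
(find_mismatches, pattern_test) and the pattern bytes are my own
choices for the example.

```python
def find_mismatches(buf, pattern):
    """Return (offset, value) for every byte in buf that differs from pattern."""
    return [(offset, value) for offset, value in enumerate(buf) if value != pattern]


def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each pattern in turn, read it back, collect errors."""
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        # Write the pattern byte across the whole buffer.
        buf[:] = bytes([pattern]) * size_bytes
        # Read it back; any byte that changed points at corrupted storage.
        for offset, value in find_mismatches(buf, pattern):
            errors.append((offset, pattern, value))
    return errors


if __name__ == "__main__":
    bad = pattern_test(1024 * 1024)  # 1 MiB, purely illustrative
    print("errors found:", len(bad))
```

Alternating patterns such as 0xAA and 0x55 are used so that every bit
gets written as both 0 and 1, which is what exposes a stuck bit.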

I would recommend running memtest86.  If it finds any failed RAM
modules, I'd replace them.  If not, I'd try cleaning the computer and,
if your motherboard is capable of that kind of monitoring, watching
its temperature and voltages under a system load similar to the one
that was running during the crashes.
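
If you end up on a newer kernel that exposes hardware sensors through
the hwmon sysfs tree, temperatures can be logged with a few lines of
Python.  This is only a sketch under that assumption: a stock RedHat
8.0 kernel will not have /sys/class/hwmon, and there the lm_sensors
tools would be the place to look instead.  The function names below
are hypothetical, and which sensors show up depends entirely on your
motherboard and drivers.

```python
import glob
import os


def millidegrees_to_celsius(text):
    """hwmon temp*_input files hold an integer in thousandths of a degree C."""
    return int(text.strip()) / 1000.0


def read_temperatures(base="/sys/class/hwmon"):
    """Return {sensor file path: degrees C} for every readable temperature input."""
    readings = {}
    for path in sorted(glob.glob(os.path.join(base, "hwmon*", "temp*_input"))):
        try:
            with open(path) as f:
                readings[path] = millidegrees_to_celsius(f.read())
        except (OSError, ValueError):
            # Some sensors exist but cannot be read; skip them.
            continue
    return readings


if __name__ == "__main__":
    for path, celsius in read_temperatures().items():
        print(f"{path}: {celsius:.1f} C")
```

Running this in a loop while the machine is under load, and noting the
readings just before a crash, would show whether heat is a factor.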

          - VAB

-
V. Alex Brennen       vab at mit.edu
Senior UNIX Systems Administrator
MIT Libraries   E25-131   x3-9327