Help! Is this kernel or hardware problem? (fwd)

Sun Nov 27 01:58:01 EST 2005

On Nov 23 at 10:31pm, Steven W. Orr wrote:
> I just grabbed syslog for the event but I don't know what it means.

   Everything everyone else said is good and correct and should be listened 
to.  As I like to do, I'm following up with some more "academic" explanation.

> Nov 23 22:24:36 saturn kernel: scsi0:0:0:0: Attempting to queue an ABORT 
> message

  The SCSI mid-layer in the kernel is attempting to abort a SCSI command, on 
SCSI nexus "scsi0:0:0:0".

   The "SCSI mid-layer" is the part of the SCSI subsystem which sits between 
high-level drivers (like sd (SCSI disk), st (SCSI tape), and so on) and 
low-level drivers for host adapters (like your Adaptec-based one).

   A SCSI nexus is a combination of SCSI driver (read as "card type/family"), 
bus (read as "controller" unless you have multi-bus controllers), target 
(SCSI device ID), and LUN (logical unit number, almost always zero).  For most 
systems, with one single-bus controller and no LUNs in use, all that matters 
is the second-to-last field, which tells you the SCSI device ID.  In your 
case, ID 0 (zero) -- typically a hard disk in most systems.
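
   If you ever want to pull a nexus apart in a script, here is a minimal 
Python sketch.  The names (Nexus, parse_nexus) are made up for illustration, 
and the pattern only assumes the "scsiN:bus:target:lun" shape seen in the log 
lines quoted in this message, where the bus shows up as either a digit or a 
letter.

```python
import re
from typing import NamedTuple


class Nexus(NamedTuple):
    host: int     # the N in "scsiN"
    channel: str  # bus on that card; logged as a digit or a letter
    target: int   # SCSI device ID
    lun: int      # logical unit number


# Matches "scsi0:0:0:0" as well as the parenthesized "(scsi0:A:0:0)" form.
_NEXUS_RE = re.compile(r"scsi(\d+):([0-9A-Za-z]+):(\d+):(\d+)")


def parse_nexus(text: str) -> Nexus:
    """Extract the first nexus found in a log line (hypothetical helper)."""
    match = _NEXUS_RE.search(text)
    if match is None:
        raise ValueError(f"no SCSI nexus found in {text!r}")
    host, channel, target, lun = match.groups()
    return Nexus(int(host), channel, int(target), int(lun))


if __name__ == "__main__":
    print(parse_nexus("scsi0:0:0:0: Attempting to queue an ABORT message"))
```

   The field you usually care about is .target, the SCSI device ID.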

   SCSI aborts are usually triggered by timeouts (often logged immediately 
prior).  In other words, the system got tired of waiting for a SCSI command 
to finish. Timeouts are usually triggered in turn by bad devices, bad 
cabling, improper termination, loose connections, and/or the phase of the 
moon.

> Nov 23 22:24:36 saturn kernel: CDB: 0x2a 0x0 0x3 0x6 0xcc 0x48 0x0 0x4 0x0 
> 0x0

   CDB = Command Descriptor Block.  Or something like that.  Basically, a CDB 
is the in-memory representation of a SCSI command.  The rest is some kind of 
puke which probably means something to a SCSI wizard, but means nothing to a 
mortal like me.
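
   For the curious, those bytes can be decoded by hand.  The sketch below 
assumes the standard 10-byte read/write CDB layout (opcode, flags, 4-byte 
big-endian logical block address, group, 2-byte big-endian transfer length, 
control).  decode_rw10 is a made-up helper name, not anything from the kernel, 
and it only knows the two opcodes listed.

```python
def decode_rw10(cdb_text: str) -> dict:
    """Decode a 10-byte READ(10)/WRITE(10) CDB given as hex tokens."""
    cdb = [int(token, 16) for token in cdb_text.split()]
    if len(cdb) != 10:
        raise ValueError(f"expected 10 bytes, got {len(cdb)}")

    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in opcodes:
        raise ValueError(f"opcode 0x{cdb[0]:02x} is not handled by this sketch")

    return {
        "opcode": opcodes[cdb[0]],
        # Bytes 2..5: logical block address, most significant byte first.
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),
        # Bytes 7..8: number of blocks to transfer, most significant first.
        "blocks": int.from_bytes(bytes(cdb[7:9]), "big"),
        "control": cdb[9],
    }


if __name__ == "__main__":
    print(decode_rw10("0x2a 0x0 0x3 0x6 0xcc 0x48 0x0 0x4 0x0 0x0"))
```

   Fed the bytes from the log line above, this reports a WRITE(10) of 1024 
blocks starting at logical block 50777160.  In other words, the command that 
timed out was an ordinary disk write.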

> Nov 23 22:24:36 saturn kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins
[puke deleted]
> Nov 23 22:24:53 saturn kernel: <<<<<<<<<<<<<<<<< Dump Card State Ends

   All of that puke is typically only useful to someone who has a deep 
understanding of the kernel SCSI mid-layer and/or the device driver of the 
card you're using.

> Nov 23 22:24:53 saturn kernel: (scsi0:A:0:0): Device is disconnected, 
> re-queuing SCB

   "Device is disconnected" may be misleading if you don't know SCSI 
terminology.

   SCSI has a concept called "disconnection", where a device can logically 
"disconnect" from the bus while it is busy doing other things.  The classic 
example is sending a "rewind" command to a tape drive.  That usually takes a 
long time (many seconds or even minutes).  So the tape drive "disconnects" 
from the bus, freeing the bus for use by other devices (e.g., hard disks). 
When the rewind is finished, the drive signals for attention, the host 
reconnects, and things pick up from there.  If disconnection were not 
possible, the entire SCSI bus would be busy while you waited for the tape to 
rewind.

   So, the above message most likely means the kernel mid-layer has noticed 
that puke happened, and has re-queued the SCSI command (SCB).  I'm not sure 
if this log message refers to the original SCB that started all this trouble, 
or some kind of "abort message" SCB.

> Nov 23 22:24:53 saturn kernel: Recovery code sleeping

   The kernel is going to wait a few microseconds for things to settle down 
and stop puking.

> Nov 23 22:24:53 saturn kernel: Recovery code awake
> Nov 23 22:24:53 saturn kernel: Timer Expired

   Kernel is done waiting.

> Nov 23 22:24:53 saturn kernel: aic7xxx_abort returns 0x2003

   "aic7xxx" is a reference to a driver for the very popular Adaptec series of 
SCSI host adapter chips.  0x2003 is the mid-layer's FAILED code (SUCCESS 
would be 0x2002), so this particular abort did not work, which is why the 
kernel escalates to a target reset next.  Still, the fact that the abort 
returned at all is actually a good sign.  Particularly in less-recent 
kernels, Linux was famous for the SCSI subsystem going off into hyperspace 
and never coming back.
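
   For reference, here is a small lookup of the mid-layer's disposition 
codes as I recall them from the kernel's include/scsi/scsi.h.  Treat the 
table as an assumption and check it against your own kernel source; 
eh_code_name is a made-up helper name.

```python
# Disposition codes as recalled from include/scsi/scsi.h (an assumption;
# verify against the kernel source you are actually running).
SCSI_EH_CODES = {
    0x2001: "NEEDS_RETRY",
    0x2002: "SUCCESS",
    0x2003: "FAILED",
    0x2004: "QUEUED",
    0x2005: "SOFT_ERROR",
    0x2006: "ADD_TO_MLQUEUE",
    0x2007: "TIMEOUT_ERROR",
}


def eh_code_name(value: int) -> str:
    """Return the symbolic name for a code, or a marker if it is unknown."""
    return SCSI_EH_CODES.get(value, f"unknown (0x{value:04x})")


if __name__ == "__main__":
    print(eh_code_name(0x2003))
```

   By that table, the 0x2003 in the log is FAILED: the abort itself did not 
succeed, which fits with the kernel moving on to a stronger recovery step.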

> Nov 23 22:24:53 saturn kernel: scsi0:0:0:0: Attempting to queue a TARGET 
> RESET message

   A "SCSI target" means the target of a SCSI command.  Typically, disks and 
tape drives and the like.  A "target reset" means the kernel is trying to 
tell your ID 0 device it should reset itself to a known state.  This is also 
good, relatively speaking.

> Nov 23 22:24:53 saturn kernel: Recovery SCB completes

   Presumably, this means the kernel has finished cleaning up all the puke, 
and will now try things again from the top.  If the system immediately pukes 
again, that's usually a sign of a major failure -- dead disk, buggy driver, 
fried card, etc.  If it doesn't puke again "ever", it may have been a random, 
one-in-a-million glitch.  If it continues puking periodically, that's when 
you have to start getting into the SCSI voodoo.
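
   To tell "once" from "periodically", it helps to count recovery attempts 
per day.  A minimal Python sketch follows, assuming the traditional syslog 
line format shown in this message (with each message on a single line, as it 
is in the real log file, not wrapped as in this email).  The function name 
and the regular expression are illustrative only.

```python
import re
from collections import Counter

# Date, time, hostname, "kernel:", nexus, then the recovery message.
_EVENT_RE = re.compile(
    r"^(\w{3}\s+\d+) \d\d:\d\d:\d\d \S+ kernel: "
    r"(scsi\d+:\w+:\d+:\d+): Attempting to queue an? "
    r"(ABORT|TARGET RESET) message"
)


def count_recovery_events(lines) -> Counter:
    """Count abort/reset attempts keyed by (date, nexus, kind)."""
    counts = Counter()
    for line in lines:
        match = _EVENT_RE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Nov 23 22:24:36 saturn kernel: scsi0:0:0:0: "
        "Attempting to queue an ABORT message",
        "Nov 23 22:24:53 saturn kernel: Recovery code sleeping",
        "Nov 23 22:24:53 saturn kernel: scsi0:0:0:0: "
        "Attempting to queue a TARGET RESET message",
    ]
    for key, n in sorted(count_recovery_events(sample).items()):
        print(key, n)
```

   In practice you would pass it an open log file instead of the sample 
list.  A handful of hits on one day and then silence points at a glitch; 
hits on the same nexus day after day point at that device, its cable, or 
its termination.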

   http://www.scsifaq.org

-- Ben

"SCSI is *not* magic. There are *fundamental* *technical* reasons why you
  have to sacrifice a young goat to your SCSI chain every now and then."
                                         -- John F. Woods