Server issues baffling me...

Bruce Dawson jbd at codemeta.com
Wed Oct 22 20:16:08 EDT 2008


Neil Joseph Schelly wrote:
> I'm hoping someone can say they recognize this and that if I press 'Alt-p' or 
> something, it will all go away.  I am not that optimistic, but I figured it's 
> worth a try.
>
> I have a server that I can turn on in our office, plugged into the wall, and 
> it will work fine for days, weeks, whatever.  I have never had a problem 
> there.  When I bring it to the datacenter, it won't finish booting before it 
> starts to get hda DMA timeouts.  It's the errors I typically associate with a 
> failed drive.  Without fail, the machine does it every time it's booted up in 
> the datacenter.  And without exception again, it works fine in the office.
>
> To replicate the error in the office, I've tried switching the IDE cables, 
> running badblocks or other disk-thrashing sorts of programs like dd 
> if=/dev/hda of=/dev/null many times.  It's run for over a week without any 
> issue.  I've tried it with the network interface at full and half duplex.  I 
> tried running the machine in a closed room that probably got up to about 
> 75-80 degrees or so in temperature.
>
> To prevent the error in the datacenter, I've tried booting it with different 
> kernels. I've disconnected the network cable so that it's just power and a 
> serial console.  I also did just power and a monitor/keyboard.  No matter 
> what I try, it never gets to finish the booting process, not even to 
> single-user mode, before the timeouts start filling the screen.
>
> Has anyone seen any behavior like this?  At this point, I don't even know 
> where to look.  I can't imagine that there's actually an element of our 
> office that provides a better environment for machines and the office power 
> surely can't be any better than what's at the datacenter either.  No other 
> machines are exhibiting this behavior.  The server in question had been 
> running fine in the datacenter for months until this apparent disk failure 
> occurred.  I replaced the disk and it worked for another month.  I replaced 
> that disk under warranty and the new one never booted up right.  I don't 
> believe I've actually got 3 hard drive failures in a month's time, but I 
> don't know what else to look at.
>
> Help...
> -N
>
>   
Many years ago I had a similar problem, but the system had another drive
in it, and I just used that one and gave up on using the bad drive. Many
moons afterward (1-3 years; not sure how long, I just remember it was
long enough for me to forget about the bad disk) the power supply died.
I replaced it. And then both drives started working.

You may want to put the drive on an unused pigtail from the power supply
before actually swapping out the power supply.

Your office probably has "different" power from the data center.

Also, if possible, use an oscilloscope to check the power at both places
and see if there are obvious differences. Some data centers use
"purified" power, which seems to cause problems for some systems
(switching power supplies?), but works fine for "server class" systems.
Unfortunately, I wasn't at that client long enough to chase down the
problem (not that they wanted a "programmer" fooling with a 'scope in a
production environment anyway).
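
If the scope can export raw samples, one rough way to put numbers on "obvious differences" is to compare RMS voltage, peak voltage, and crest factor (peak divided by RMS) at the two sites. A clean sine wave has a crest factor of about 1.414; a squared-off or stepped waveform sits closer to 1.0. The sketch below is only an illustration: the function names are made up for this example, and the synthetic waveforms stand in for real captures, which would have to come from the scope itself.

```python
import math


def waveform_stats(samples):
    """Return (rms, peak, crest_factor) for a sequence of voltage samples."""
    if not samples:
        raise ValueError("need at least one sample")
    rms = math.sqrt(sum(v * v for v in samples) / len(samples))
    peak = max(abs(v) for v in samples)
    crest = peak / rms if rms else float("inf")
    return rms, peak, crest


def synthetic_sine(peak_volts=170.0, n=1000):
    """Hypothetical stand-in for a capture: one full cycle of a sine wave."""
    return [peak_volts * math.sin(2 * math.pi * i / n) for i in range(n)]


def synthetic_square(peak_volts=120.0, n=1000):
    """Hypothetical stand-in for a capture: one full cycle of a square wave."""
    return [peak_volts if i < n // 2 else -peak_volts for i in range(n)]


if __name__ == "__main__":
    for label, samples in (("sine", synthetic_sine()),
                           ("square", synthetic_square())):
        rms, peak, crest = waveform_stats(samples)
        print(f"{label}: rms={rms:.1f} V  peak={peak:.1f} V  crest={crest:.3f}")
```

Running the same function over a capture from the office and one from the data center would show whether the two supplies differ noticeably in shape or level. It says nothing about brief transients, which need the scope's trigger to catch.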

--Bruce

