Update on Eskinews Outage
- To: outages-list@eskimo.com
- Subject: Update on Eskinews Outage
- From: Robert Dinse <nanook@eskimo.com>
- Date: Thu, 17 Feb 2000 17:52:47 -0800 (PST)
- Resent-Date: Thu, 17 Feb 2000 17:52:58 -0800
- Resent-From: outages-list@eskimo.com
- Resent-Message-ID: <"3BOxR2.0.wL5.vNAhu"@mx1>
- Resent-Sender: outages-list-request@eskimo.com
Eskinews was complaining about corrupted page table entries. This isn't
particularly surprising, because there are unfixed race conditions on SMP
platforms under Linux, but they usually cause spin_lock deadlocks instead of
table corruption.
At any rate, if I see those on a box I know it will become unstable,
because it means memory pages are allocated to more than one process, or to
none at all, which is a bad thing.
So I attempted to gracefully reboot the machine before it crashed;
unfortunately, it did not halt gracefully.
But here is where it gets weird. When I tried to reboot, it would hang
when it tried to mount the root partition read/write. I could mount it
read-only, and I could mount any other partition read/write. Fsck could not
find anything wrong. I went in and snooped around with debugfs and a few other
tools, but couldn't see anything.
In short, I figured there must be SOMETHING wrong with the root file system
that the tools couldn't detect but that nonetheless prevented a mount from
completing. The mount did not print any error messages or other clues.
So I tried to steal a swap partition to make a new root partition, and
when I did, it too hung when I tried to mount it read/write.
In January, I had upgraded the disk utils on this machine, which included
a new mount command. Just for kicks I tried the old mount command, and it
worked!
So I put the old mount command back, but I am mystified because I've
booted the machine half a dozen times probably since I upgraded the disk utils
and this is the first time it's ever displayed this behavior.
Now the machine has to run an fsck on all the partitions, and the spool
partition takes about two hours. Once that is done this should be back up.
As soon as I can port the authentication utility to Redhat 6.1 we should
be able to bring up the new CPU, and the race conditions present in the SMP
kernel will no longer cause this instability.
I gotta say, though, this one really has me miffed: why would mount work
half a dozen times and then suddenly refuse to function, and only on the root
partition? Too weird, makes no sense.