Eskimo North



          Update on Eskinews Outage


          • To: outages-list@eskimo.com
          • Subject: Update on Eskinews Outage
          • From: Robert Dinse <nanook@eskimo.com>
          • Date: Thu, 17 Feb 2000 17:52:47 -0800 (PST)
          • Resent-Date: Thu, 17 Feb 2000 17:52:58 -0800
          • Resent-From: outages-list@eskimo.com
          • Resent-Message-ID: <"3BOxR2.0.wL5.vNAhu"@mx1>
          • Resent-Sender: outages-list-request@eskimo.com

          
               Eskinews was complaining about corrupted page table entries.  This isn't
          particularly surprising because there are race conditions on SMP platforms under
          Linux that aren't fixed, but they usually cause spin_lock deadlock instead of
          table corruption.
          
               At any rate, if I see those on a box I know it will become unstable,
          because you've got memory pages allocated to more than one process or not at
          all, which is a bad thing.
          
               So I attempted to gracefully reboot the machine before it crashed, but
          unfortunately it did not halt gracefully.
          
               But here is where it gets weird.  When I tried to reboot, it would hang
          when it tried to mount the root partition read/write.  I could mount it read
          only, and I could mount any other partition read/write.  Fsck could not find
          anything wrong.  I went in and snooped around with debugfs and a few other
          tools, but couldn't see anything.
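
For illustration, a minimal Python sketch of checking whether a given
filesystem is currently mounted read-only or read/write, assuming a
Linux-style mount table such as /proc/mounts.  The function name mount_mode,
the device names, and the sample table are hypothetical and are not taken
from the tools or the machine described here.

```python
def mount_mode(mounts_text, mount_point):
    """Return 'ro' or 'rw' for mount_point, or None if it is not listed.

    mounts_text is the content of a /proc/mounts-style table, where each
    line has the form: device mount_point fstype options dump pass
    """
    mode = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        options = fields[3].split(",")
        # A later entry for the same mount point overrides an earlier one.
        if "ro" in options:
            mode = "ro"
        elif "rw" in options:
            mode = "rw"
        else:
            mode = None
    return mode


# Hypothetical sample table: root mounted read-only, spool read/write.
SAMPLE = (
    "/dev/sda1 / ext2 ro 0 0\n"
    "/dev/sda3 /var/spool ext2 rw 0 0\n"
)
```

On a live system the same function could be given the contents of
/proc/mounts, for example mount_mode(open("/proc/mounts").read(), "/").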
          
               In short, I figured there must be SOMETHING wrong with the root file
          system that the tools couldn't detect but that nonetheless prevented a mount
          from completing.  It did not print any error messages or other clues.
          
               So, I tried to steal a swap partition to make a new root partition, and
          when I did, it too hung when it tried to mount read/write.
          
               In January, I had upgraded the disk utils on this machine, which included
          a new mount command.  Just for kicks I tried the old mount command, and it worked!
          
               So I put the old mount command back, but I am mystified because I've
          booted the machine half a dozen times probably since I upgraded the disk utils
          and this is the first time it's ever displayed this behavior.
          
               Now the machine has to run an fsck on all the partitions, and the spool
          partition takes about two hours.  Once that is done this should be back up.
          
               As soon as I can port the authentication utility to Red Hat 6.1 we should
          be able to bring up the new CPU, and the race conditions present in the SMP
          kernel will no longer cause this instability.
          
               I gotta say, though, this one really has me stumped: why would mount work
          half a dozen times and then suddenly refuse to function, and only on the root
          partition?  Too weird, makes no sense.
          
          
          
          
