Eskimo North

High Strangeness Outages

     In the last few days we've had not one but four servers that were
previously reasonably stable crash or go into some strange state, two of them
repeatedly.

     Of the four, only one is showing any consistency that gives a clue as to
the cause, and presently it is unclear whether it's a hardware or software
issue.

     Xena, now the main file server, was turned up in the co-lo on the 10th of
Dec, and until Tuesday morning had never crashed.  On Xena, the crash indicated
a race condition in a memory scan routine.  CentOS just released a new kernel
RPM, which hopefully resolves this; it is now installed.

     Eskimo has been failing, always while trying to write to a hardware
location that, interestingly, is an MMU control register.  That is, it too is
failing while trying to allocate memory or manipulate the memory map.

     Ultra1, the web server, failed Tuesday evening and the screen was blank,
so no error information was recovered.  However, this morning it crashed again,
but this time it was halted as if someone had plugged a keyboard in and hit
L1-A.  I've never had this occur spontaneously before.  This is in a cabinet
with not one but two locks, and beyond that there is no obvious motivation.
There are other customers in this co-lo facility that leave the backs off their
cabinets and all the equipment exposed for months at a time and nothing happens
to it so I think it's very unlikely that someone is there physically doing
something malicious.

     Ultra7, the mail server for client mail, failed yesterday morning, again
with a blank screen.

     So we've got four different boxes acting up, and really, with the
exception of the eskimo shell server, nothing definitive.  I do not know at
this time if there is an environmental factor, but if there is, it's nothing
obvious: the temperature has remained well regulated, as has humidity, and
certainly there hasn't been any indication of power instability while I've been
there.

     There may be some form of denial of service attack, but if there is, it
affects multiple architectures (SPARC and Xeon) and multiple operating systems
(Red Hat 6.2, CentOS 5.2, and SunOS 4.1.4), and I haven't seen any large packet
floods that would normally be associated with such activity.

     Also, there was something odd about the failure of ultra1 Tuesday morning:
the machine wasn't actually entirely down.  It would respond to ping, but no
application was functioning.
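
     A machine in that state looks healthy to a ping-only monitor.  As a rough
sketch of an application-level check (not something we currently run; the
function name, probe bytes, and timeout are all made up for illustration), the
following only reports success if the service actually sends something back:

```python
import socket


def service_answers(host, port, probe=b"", timeout=5.0):
    """Return True only if the service sends back at least one byte.

    A hung box can still answer ping, and its kernel may even complete the
    TCP handshake, so connecting alone proves little.  Hypothetical helper:
    'probe' is whatever the protocol needs to provoke a reply (empty for
    services such as SMTP that greet first).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            if probe:
                conn.sendall(probe)
            # recv() returns b"" if the peer closed without saying anything,
            # and raises a timeout error if the application never responds.
            return len(conn.recv(1)) > 0
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

     For a web server one might pass probe=b"HEAD / HTTP/1.0\r\n\r\n" on port
80; for a mail server, no probe on port 25, since SMTP sends a greeting
unprompted.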

     And this morning, eskimo also got into a state where it would not accept
passwords from people coming in from outside.  It would if the person connected
from another internal machine.  You would think that would be something related
to a security configuration file but we checked that and found nothing.  After
a reboot it functioned normally again.

     I don't know if someone out there has an Eskimo voodoo doll they're
sticking pins in, but the last time I saw multiple machines crash with no
good explanation was just before the 7.2 earthquake in the South Sound.

     At any rate, something strange is happening and we're still trying to
determine the cause.  In the short term there isn't much we can do until we
can make that determination.  In the long term, we are moving towards
virtualization, in which we run virtual machines on physical hardware and
those virtual machines provide the services.  When all is said and done, this
will mean that when one piece of hardware fails, we'll be able to fire up a
virtual machine on another piece of hardware, restoring service much faster.

     Right now, when something goes into a zombie state, it requires a trip to
the co-location facility, and for the last few days every outage has occurred
during rush hour.  During non-peak hours it's about a 22 minute trip, but
during rush hour it can exceed an hour on bad days.

 Eskimo North Linux Friendly Internet Access, Shell Accounts, and Hosting.
   Knowledgeable human assistance, not telephone trees or script readers.
 See our web site: (206) 812-0051 or (800) 246-6874.