Eskimo North

Server Outages

     We've had a number of server outages over the last couple of weeks, most
of them in the last couple of days.  Most of Friday's and Saturday's outages
were heat-related: the A/C in the co-location facility did not seem to be
functioning properly, it has been very warm, and inside the equipment cabinet
it has been even warmer.

     Today ultra1 crashed again; this affects the web and IRC services.  This
time the console showed an FPU float store alignment error.  This error only
occurs if the CPU/FPU module is bad or the kernel is corrupt.

     To check the latter, I built a new kernel from the existing object files
without recompiling and compared it to the old one: the old kernel had been
corrupted.  Most likely it was corrupted during one of the heat-related
crashes.
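     A byte-for-byte comparison like the one described above can be sketched
in Python.  This is only an illustration, not the procedure actually used here
(a standard tool such as cmp does the same job); the function name
first_difference and the chunk size are assumptions.  It reports the offset of
the first byte where two files differ, or None if they are identical.

```python
def first_difference(path_a, path_b, chunk_size=65536):
    """Return the byte offset of the first difference between two files,
    or None if they are identical (same length, same bytes).

    Hypothetical helper for illustration; chunk_size is an arbitrary choice.
    """
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Look for the first mismatching byte in this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # One chunk is a prefix of the other: the files differ in
                # length, so the first difference is where the shorter ends.
                return offset + min(len(a), len(b))
            if not a:
                # Both files reached end-of-file at the same point.
                return None
            offset += len(a)
```

     Called as first_difference(old_kernel, rebuilt_kernel), where both names
are hypothetical paths, a None result means the installed kernel matches the
freshly linked one; any offset means the installed copy has changed.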

     In any event, I recompiled the kernel and replaced the corrupted one;
hopefully the system will be more stable now.

     There were no disk or memory errors, so the corruption does not appear to
have been caused by a hardware fault.

