Eskimo North


Server Failures




     We had a catastrophic failure of the main NFS server that houses users'
directories, mail spools, and FTP files.

     The file systems were all irrecoverably scribbled and it was necessary
to restore from backups written 6/3/2004.

     This is the first time we've had a system-wide failure of the main file
server since the hack of 1995.

     As near as I can tell, the DLT-4000 drives which we use for backup do not
seem to peacefully coexist on the same SCSI bus as the Seagate 70 GB drives
we use for storage of these files.  Attempting to write a backup with 512k
blocks caused the system to go berserk and just totally scribble the drives.

     On the old file server we had a separate SCSI controller just for the
tape drive, and to avoid a repeat that's probably what I'm going to have to do
with the new machine.

     We put this new file server together in early June and moved files over
to it to address performance issues with the old machine.  The disk I/O
requirements and disk space requirements were exceeding the capacity of the
old system.

     Since the day I got the files moved over I've had problems trying to write
backups on this box, the bus hanging, etc., and it finally resulted in
catastrophe.

     I really hate it when something like this happens because new people are
going to think, "I've only been with you X months or weeks and already there is
a major outage," but this is the first major file server loss like this in nine
years, and the last one was human-induced.

     I'm really uncomfortable with the situation, but until recently I knew
of no way to have redundancy across multiple file servers.  Recently, I
learned of the Coda system, which can provide that redundancy, but I still have
some unresolved issues with it.

     To those of you who have been with us a long time, I would very much
appreciate your support and understanding.