Eskimo North


     Until recently, Linux on an Ultra2 has been a very stable platform for us.
Several months ago, I added a second CPU to the main file server and it became
unstable.  So before we moved the servers to the co-location facility I removed
the second CPU and went back to running it as a single CPU box.

     We put together an entirely new file server, also based on an Ultra2
Creator 3D machine and pretty much identical except for faster, larger drives,
and it locked up today with no errors printed.

     This file server holds the user directories and the mail spool files, so
many services depend on it to function properly.  For most services we can
achieve redundancy by setting up multiple servers, but I haven't found an easy
way to make this function fully redundant, even though almost everything else
depends on it.

     There is a network file system called Coda that looks like it could do
this by letting the file system use several redundant servers.  At present it
isn't usable for us until we get rid of the remaining SunOS boxes, and even
then I don't know whether special files (named pipes or FIFOs, sockets, etc.)
or the Unix permission semantics are supported.

     So I'm interested in other solutions people may know of.

     Also, in the interest of having the existing machine recover from a crash
gracefully, I've already made some changes to the initialization scripts that
should help, but I'd like to know about watchdog timers.  The kernel
configuration menu has a software watchdog timer as an option, but the
documentation doesn't seem to be very helpful.  The PROM configuration for the
box also has an item, "Watchdog-Reboot?", which implies it has a hardware
watchdog timer.

     Anybody have any advice on using either the software watchdog timer or
presumably a hardware timer?

     The documentation in the kernel source tree for the software watchdog
shows how to use /dev/watchdog, but the device does not exist, the MAKEDEV
script supplied with RedHat 6.2 doesn't know about it, and I don't know where
to find the major and minor device numbers.  I also can't help but wonder
whether a hang would "hang" the software watchdog timer as well, so I would
prefer a hardware solution that will work even if the CPU is locked up tight.
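
     For reference, here is a minimal sketch of the kind of userspace keeper
the software watchdog expects, assuming the device node exists.  The kernel's
devices.txt lists /dev/watchdog as a character device with major 10, minor
130, so "mknod /dev/watchdog c 10 130" should create it.  The function name
pet_watchdog and its parameters are made up for illustration, and the 10
second interval is only a guess at a value shorter than the driver's timeout.

```python
import os
import time


def pet_watchdog(path="/dev/watchdog", interval=10.0, beats=None,
                 sleep=time.sleep):
    """Write one byte to the watchdog device every `interval` seconds.

    beats=None loops forever; a number limits the writes (handy for
    trying this against an ordinary file).  Returns the write count.
    """
    # Opening the device is what arms the timer in the software driver.
    fd = os.open(path, os.O_WRONLY)
    count = 0
    try:
        while beats is None or count < beats:
            # Any write resets the countdown; if this process stops
            # being scheduled, the writes stop and the timer expires.
            os.write(fd, b"\0")
            count += 1
            sleep(interval)
    finally:
        # Whether closing disarms the timer depends on the kernel
        # version and configuration, so do not rely on it either way.
        os.close(fd)
    return count


if __name__ == "__main__":
    pet_watchdog()
```

     Since the timer itself runs inside the kernel, this approach presumably
only catches hangs where userspace stops but the kernel keeps running, which
is the concern raised above.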

     Lastly I'm looking for a cost-effective box that will allow us to remotely
power cycle machines and gain console access via serial ports, around 8 ports.
I've found such items but they've been donate-a-limb priced.