Iglulik Spontaneously Booted about 3AM

     Iglulik spontaneously rebooted at about 3 AM.  This is the second time it has done so, and Ice, the server which holds the /mail spool, has also spontaneously rebooted once.  It appears that the 5.7 kernel has some stability issues when NFS file systems are exported.  The machines not exporting file systems have been completely stable.

     I know that the Linux community is doing a lot of work on NFS right now, which is good because NFS in Linux has been buggy forever.  The bugs have not been the kind that crash machines, but the kind where a server goes away and comes back and the clients do not always recover properly.  There was a lot of NFS work in the 5.7 kernel, and they just came out with a new nfs-kernel package, so hopefully they’ll get this resolved soon.

     I have now compiled 5.7.1 on Iglulik.  If it does not spontaneously reboot again before tonight, I will reboot it in the early AM to load the new kernel.

MxLinux Down for Upgrade

     I am taking mxlinux down to upgrade from MxLinux 18.3 to MxLinux 19.2.  Because of a change in the Debian code base upon which MxLinux is based, and because they’ve bastardized Debian beyond the point where normal upgrade procedures can be used, a complete re-install is necessary, so it may be down for a few days.

Mail Issues

     We are having some problems with the mail sub-system that I am still struggling to understand.  It is responding slowly and refusing service to some hosts while permitting it to others, and I have not yet been able to determine why.

Ice Spontaneously Booted Today

     Ice, a machine which holds the /mail partition as well as mail, mx2, and some private virtual machines, spontaneously rebooted today.

     There still appear to be some stability issues in 5.7.  I suspect these are related to NFS, as both Ice and Iglulik export major NFS partitions used by the rest of the machines here, while the physical hosts and virtual machines which do not export NFS partitions have all been stable.

     There were some major changes in the NFS v4.2 code in 5.7 which I hope, when adequately debugged, will result in better reliability.  I’m not giving up on this kernel just yet, but I will keep up with point releases and try to identify in greater detail what is failing.

     Everything is back in service but I still need to check all the mounts.

Iglulik

     Iglulik, the machine which hosts the /home directories, spontaneously rebooted at about 3 PM today.  I do not yet know what caused this.  I will need to re-check all the NFS mounts on the other hosts, since invariably some of them fail to remount correctly.
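
     For anyone curious what re-checking the mounts amounts to, here is a minimal sketch of one way to do it on a Linux client.  It is not the procedure actually used here.  It assumes the client lists its mounts in /proc/mounts, and the function names and the expected mount points (/home and /mail, taken from the exports mentioned in these posts) are purely illustrative.

```python
def mounted_nfs_paths(mounts_text):
    """Return the set of mount points whose filesystem type is nfs or nfs4."""
    paths = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            paths.add(fields[1])
    return paths


def missing_mounts(expected, mounts_text):
    """Return the expected mount points that are not currently NFS-mounted."""
    return sorted(set(expected) - mounted_nfs_paths(mounts_text))


def check_host(expected, mounts_path="/proc/mounts"):
    """Read this host's mount table and report which expected mounts are absent."""
    with open(mounts_path) as f:
        return missing_mounts(expected, f.read())
```

     Calling check_host(["/home", "/mail"]) on a client would return an empty list when both are mounted, and the names of any that failed to remount otherwise.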

Hosts NFS / NIS Mounts / Binding Verified

     NFS mounts and NIS bindings have been checked on all hosts.  A few hosts failed to come up completely after the reboot.  All of these problems have been resolved and all hosts are operational except for OpenSuse.

     OpenSuse has a problem with a library that breaks NIS.  I opened a ticket on this close to half a year ago.  If it is not resolved soon I am going to discontinue this host.

     If anyone has any suggestions for a better Linux distro, please e-mail them to nanook@eskimo.com.

     Thank you.

Reboots Complete – Still Checking Hosts

     The reboots are completed but I am about an hour behind schedule.

     Two things set me back.  First, SOMETHING installed dnsmasq on my stealth master DNS server.  It is a master that is hidden behind a firewall so that hackers can’t inject nastiness into it, and it supplies all the secondary servers with zone records.

     Because it has bind, it does not need dnsmasq.  Further, dnsmasq breaks bind IF it starts first, because it uses the same network port (53) as bind, thus blocking bind’s ability to attach to that port and function.
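
     The "whoever starts first wins" behaviour is easy to reproduce.  The following is only a sketch: it uses an unprivileged UDP port on the loopback address rather than port 53, and the helper names can_bind and demo are made up for illustration.

```python
import socket


def can_bind(host, port, kind=socket.SOCK_DGRAM):
    """Return True if a new socket can bind to (host, port).

    Returns False on any OSError, which covers both "address already in use"
    and "permission denied" (for example port 53 when not running as root).
    """
    sock = socket.socket(socket.AF_INET, kind)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        return False
    finally:
        sock.close()


def demo():
    """Hold a port with one socket and show that a second bind is refused."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    first.bind(("127.0.0.1", 0))      # port 0: let the kernel pick a free port
    port = first.getsockname()[1]
    blocked = not can_bind("127.0.0.1", port)   # refused while 'first' holds it
    first.close()
    freed = can_bind("127.0.0.1", port)         # available again once released
    return blocked, freed
```

     Run as root, can_bind("0.0.0.0", 53) gives a quick indication of whether something is already sitting on the DNS port before bind is started.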

     So when I rebooted at some point about a week ago, dnsmasq evidently started first and bind was left unable to supply zones to the secondaries.  The zone records have only just now expired, and all the secondary servers quit serving them.  When I went to ssh into the servers, my workstation couldn’t resolve their names (and neither could any external computer), so name service was broken for everyone.  But because I had posted about the reboots, everyone was expecting an outage and nobody called, so I was unaware until I tried to connect, and then it took me a little while to figure out what the hell was going on.
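
     The delay between the breakage and the outage is what the SOA expire timer would produce: a secondary keeps answering from its last good copy of a zone until the expire interval has passed without a successful refresh from the master.  As a rough illustration only, the arithmetic looks like this.  The actual SOA values for these zones are not stated; the 604800 seconds (one week) and the dates used in the note below are assumptions chosen to match "about a week".

```python
from datetime import datetime, timedelta


def zone_expiry(last_refresh, expire_seconds):
    """Time at which a secondary stops serving a zone it could not refresh."""
    return last_refresh + timedelta(seconds=expire_seconds)


def is_expired(last_refresh, expire_seconds, now):
    """True once the SOA expire interval has elapsed since the last good refresh."""
    return now >= zone_expiry(last_refresh, expire_seconds)
```

     With a hypothetical last refresh of datetime(2020, 6, 1, 0, 30) and an assumed expire of 604800, zone_expiry returns datetime(2020, 6, 8, 0, 30), so the secondaries would look healthy for a full week after the master stopped answering.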

     And then once that was resolved, one of the engineers at Canonical (the Ubuntu developers) asked me to try an experiment for them in order to try to nail down a problem with an AppArmor profile for libvirtd, and that took additional time.

     Everything is rebooted now but I am still checking for proper NFS mounts and NIS binding of hosts to servers.

Server Reboots

     I am planning to reboot the physical hosts tonight starting at midnight, which will affect all services.  I should be finished by about 12:30 AM, and then I will need another hour or so to check all the servers for proper NFS mounting and NIS binding, which is not 100% reliable under Linux.