Mail

     There was a problem with mail that was related to the upgrade and the issue with NFSv4 not recovering properly.  The client mail server, mail.eskimo.com, did not properly remount users’ home directories after the file server was brought back online.  This was corrected around 7:30 AM Pacific Time.

Upgrade Completed — But it was UGLY

     One server took over half an hour to boot.  The second still hadn’t come up 45 minutes later, so I drove down to the co-lo facility and booted it.  It was running but just hadn’t completed the start-up.

     I’ve discovered several problems.  The new ntp daemon that “fixed” the security problems appears broken; at least, systemd tries to start it unsuccessfully half a dozen times before it takes, and each attempt involves a somewhat lengthy timeout.  I am guessing this may have to do with the reachability of servers, but I really don’t know at this point.

     There is either a problem with my NIS configuration or ypbind doesn’t work correctly.  The behavior is not well documented in the manual pages.  They tell you what the valid entries are but not in what order they are used.

     I have multiple servers configured as:

domain eskimo.com

ypserver (ip address)

ypserver (ip address)

     But rather than try each in order, it seems to try one and if that fails it gives up.  There seems to be no effective way to specify multiple servers except for broadcast and that has serious security issues.
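
     The try-each-in-order behavior described above can be sketched roughly as follows.  This is only an illustration under stated assumptions, not how ypbind itself picks a server: the function names are hypothetical, and the probe simply checks whether a TCP connection to the portmapper port (111) succeeds, which does not prove that NIS itself is answering on that host.

```python
import socket


def is_reachable(host, port=111, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    Assumption: port 111 (portmapper/rpcbind) is used as a cheap
    liveness probe; it says nothing about the NIS service itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(servers, probe=is_reachable):
    """Try each server in list order; return the first that answers, else None."""
    for server in servers:
        if probe(server):
            return server
    return None
```

     Calling first_reachable() with the list of server addresses in the same order as the ypserver lines returns the first one that responds, or None if every one of them fails.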

     Then there is NFSv4.  When an NFS server goes away and then comes back, NFS should recover automatically, but it does so only about 75% of the time.  In one instance, I had to restart the nfsd service on the server to get things to mount again.
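
     One way to notice a client that did not remount after the server came back is to compare the NFS mounts it actually has against the ones it should have.  The sketch below is a hypothetical helper, not something taken from these servers; it assumes a Linux client where /proc/mounts lists the active mounts, and it only detects mounts that are missing, not ones that are present but hung.

```python
def missing_mounts(mounts_text, expected):
    """Return the expected mount points that have no NFS entry in mounts_text."""
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            mounted.add(fields[1])
    return [path for path in expected if path not in mounted]


def check_local_mounts(expected, mounts_file="/proc/mounts"):
    """Read the local mount table and report expected NFS mounts that are absent."""
    with open(mounts_file) as handle:
        return missing_mounts(handle.read(), expected)
```

     For example, check_local_mounts(["/home"]) returns an empty list when /home is mounted over NFS and ["/home"] when it is not.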

     All these issues combined made for a lot of hair pulling.

Upgrade Progressing

     The machine with the mail spool took a long time to boot but did come back up and is now running the new release, Ubuntu 15.10.

     The machine that houses /home directories is still in progress.  I had to interrupt and re-start the upgrade, which can end up with a lot of dpkg --configure -a incarnations before everything is finally current.  It is not good to interrupt an upgrade, but I had no choice as it hung during os-prober.

Upgrade Not Going Well

     The machine with the mail spool completed the upgrade up to the point of reboot, but it doesn’t appear to be coming back up, so I may have to go to the co-lo and physically boot it.  It might be stuck in the boot process or may not have shut down completely; it is hard to say at this time.

     The machine with the home directories got stuck at a point in the upgrade process and I may be forced to interrupt it and finish by hand (which is a painful process).

Release Upgrades – Reboots of Everything Tonight

     I am performing release upgrades on the file servers hosting /home and /mail partitions.  This will require a reboot of pretty much everything at the completion.

     I am doing this because in the recent event where everything Intel went down, the machines with 15.10 recovered on their own, while those with 15.04 required manual intervention.  So I am upgrading the remaining machines on Ubuntu 15.04 to 15.10.

Web Mail

     Yesterday a customer had a problem with web mail that was reproducible on browsers other than Firefox.

     I found that the version of SquirrelMail we had was not compatible with the version of PHP and upgraded it to the latest and greatest.  This resolved the issue for this customer.

     If you notice anything else strange with web mail please let me know.

Fedora Up

     Fedora is back online.  What got hurt was the Ethernet hardware address.  I am not really sure how that happened, but one digit changed, so the address no longer matched the conf file, and the system wouldn’t bring up the network because it couldn’t find the device.
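
     A mismatch like this can be checked for by comparing the address in the configuration against the one the kernel reports.  The sketch below is hypothetical and rests on assumptions not stated above: that the conf file carries an HWADDR= line in the style of Fedora’s ifcfg files, and that the live address can be read from /sys/class/net/<interface>/address on Linux.

```python
import re


def configured_hwaddr(conf_text):
    """Return the HWADDR value from ifcfg-style text, lowercased, or None."""
    match = re.search(
        r'^\s*HWADDR\s*=\s*"?([0-9A-Fa-f:]{17})"?\s*$',
        conf_text,
        re.MULTILINE,
    )
    return match.group(1).lower() if match else None


def read_actual_hwaddr(interface):
    """Read the address the kernel reports for an interface (Linux sysfs)."""
    with open("/sys/class/net/%s/address" % interface) as handle:
        return handle.read().strip().lower()


def hwaddr_matches(conf_text, actual_addr):
    """True only if the conf text has an HWADDR and it equals actual_addr."""
    configured = configured_hwaddr(conf_text)
    return configured is not None and configured == actual_addr.strip().lower()
```

     A single changed digit is enough to make hwaddr_matches() return False, which is the condition that kept the network from coming up.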

Fedora, Ubuntu

     Fedora is trashed; something is broken with the virtual disk image.  I am restoring from backups.  This will take some time, perhaps an hour.

     Ubuntu did not mount all the NFS partitions correctly, in particular home.  I am working on correcting that.

     Everything else appears to be operational.

 

Outage

     Tonight, around 7:30, our router and most of our servers crashed.  The router rebooted, but only one of the Intel servers came back up; the SPARC servers all survived the event.

     I suspect a power hit.  I tried to call Isomedia tech support but only got voice mail with a full mailbox, so I was not able to find out what happened tonight.

     I rebooted and brought all the servers back online.  I did discover some configuration errors with nsswitch.conf on one server and fstab on another that prevented them from fully coming back on their own and corrected those.