Ubuntu Fixed

     For some reason the 19.10 upgrade removed nfs-common from Ubuntu, a package we definitely need in our environment.  As a result, Ubuntu was not mounting your home directory or mail spool.

     This problem has been resolved.
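
     For anyone who wants to check for this kind of breakage after an upgrade, here is a minimal sketch in Python.  It assumes a Debian-style system where dpkg-query is available.  The function names are made up for illustration, and this is not necessarily how the problem was diagnosed here.

```python
import subprocess


def status_means_installed(status: str) -> bool:
    """True when a dpkg status string reports a fully installed package."""
    return status.strip() == "install ok installed"


def package_installed(name: str) -> bool:
    """Ask dpkg-query for the package status; False if dpkg does not know it."""
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Status}", name],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # Not a Debian-style system, so dpkg-query is not available.
        return False
    return result.returncode == 0 and status_means_installed(result.stdout)
```

     Calling package_installed("nfs-common") should return True on a machine where the package is present, and False where it has been removed or where dpkg is not available.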

Maintenance Outage Friday Night

     When it rains it pours.  We have yet another failing hard drive, this time on a server that hosts only a few shell server virtual machines and exports no file systems, so it is less problematic than the last.

     We will be taking a number of shell servers down Friday to replace a failing hard drive on a machine that is the physical host for these virtual machines.

     I haven’t decided exactly how I will go about this yet.  I can copy the data live and then fix the boot block, or I can shut the machine down and dd the data.  The latter might take several hours, so I’ll probably do the former.
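
     To show why the offline route is slow, here is a rough sketch in Python of what a dd-style copy does: read a block, write a block, and repeat until the input runs out.  The function names, the block size, and any transfer rate you plug in are assumptions for illustration; this is not the procedure that will actually be used on the server.

```python
def copy_blocks(src_path: str, dst_path: str, block_size: int = 1024 * 1024) -> int:
    """Copy src to dst one block at a time, roughly as dd does; return bytes copied."""
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:  # an empty read means end of input
                break
            dst.write(block)
            copied += len(block)
    return copied


def estimated_hours(size_bytes: int, bytes_per_second: float) -> float:
    """Rough duration of a sequential copy at a steady transfer rate."""
    return size_bytes / bytes_per_second / 3600
```

     Every byte on the drive has to pass through that loop while the machine is down, so the outage grows with the size of the drive.  A live copy moves the same data but keeps the virtual machines running while it does so.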

 

Friday Morning’s Outage

     I apologize for the long outage this morning.

     Yesterday I had to drive to Spokane and back.  It is 300 miles each way from where I am, so a total of 600 miles on the road, and in a miserable rental because my car is in the shop.

     Normally phone calls would have woken me, but I was so exhausted that I slept through eight of them.  My sincerest apologies!

     Just before that, we had what I thought at the time was a hardware problem with one of our physical servers.  After I moved the software from that machine to another, the other exhibited the same symptoms, so I now know it is a problem with the kernel shipped with Ubuntu 19.10 and not a hardware issue.

     When I moved these virtual machines, I neglected to tick the start-on-boot box on the replacement platform.  Because I had only moved the software bug from one machine to another, the replacement spontaneously rebooted this morning and those virtual machines did not come back up.

     Normally I am a light sleeper and a phone call would wake me, but I was so tired that I slept through eight of your calls before I finally woke and discovered that everything had crashed and burned while I was sleeping.

     I will be going to the co-location facility later today and rebooting to get everything off the defective kernels.  I am sorry that this further interruption is necessary, but if I do not do this, the machines will spontaneously reboot again while I’m sleeping.

     I do not know the exact nature of the problem, but none of the machines running kernels I’ve built since have exhibited it.  I suspect some sort of memory leak, as machines with less memory seem to be more susceptible to early failure.

     The Ubuntu kernel is not well optimized for our workload anyway, so switching kernels will also have some performance benefits.
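
     The reasoning behind the memory leak suspicion can be put as simple arithmetic.  If memory leaks at a roughly steady rate, a machine with less memory runs out sooner.  The sketch below, in Python, assumes a constant leak rate, which is purely hypothetical since the real rate has not been measured, and the function name is made up for illustration.

```python
def hours_until_exhausted(memory_mb: float, leak_mb_per_hour: float) -> float:
    """Hours until a steady leak consumes all of the given memory."""
    if leak_mb_per_hour <= 0:
        raise ValueError("leak rate must be positive")
    return memory_mb / leak_mb_per_hour


# Hypothetical leak rate, chosen only to show the proportion.
ASSUMED_LEAK_MB_PER_HOUR = 100.0

small_guest = hours_until_exhausted(4 * 1024, ASSUMED_LEAK_MB_PER_HOUR)
large_guest = hours_until_exhausted(32 * 1024, ASSUMED_LEAK_MB_PER_HOUR)
```

     Whatever the true rate turns out to be, a guest with eight times the memory would last about eight times as long under this model, which fits the observation that the smaller machines fail first.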

Road Trip

     I will be out of town today on a trip to Spokane.  If you need something and it is not an emergency, send e-mail or use the ticket system; if it is an emergency, call and leave voice mail.  I will check voice mail and e-mail as network access allows.

     To the best of my knowledge everything is back up and stable.  OpenSuse shows as down in rwho / ruptime, but only because I recently installed Tumbleweed, the rolling release, and do not yet have an rwhod that works with it.

Important News

     I will be out of town most of Thursday.  I will be driving most of the day, so I will only occasionally get a chance to check e-mail.  If you call, leave a message; it will be automatically transcribed to e-mail, which I will read at some point.

     This is not good timing.  The recent upgrade to Ubuntu 19.10 turned out to have a kernel that does not play well with our biggest server, and at the same time one of our two smaller machines is ill, so a greater portion of the load is on the big machine.

     I have built a custom kernel that keeps the physical host of the large server stable, but the guest operating systems still run the kernel that doesn’t work well, so they are unstable.

     Ubuntu is down and will probably remain that way until Friday.  Please use an alternative shell server such as debian, mint, julinux.yellow-snow.net, or zorin.

     The Web server is also running the same broken kernel, but because it has 32GB of RAM allocated, it is taking longer to screw up.  Hopefully it will remain stable until my return.  If I can get network connectivity, I will work on it when I get a chance.

     A second issue causing some trouble is that one server is down due to hardware problems, so the two remaining machines are taking up the slack, which puts a greater load on them.

     Lastly, the number of brute-force password guessing attempts has increased 100-fold in the last two days.  It is normal for us to detect and lock out about 200 IP addresses per day, but for the last two days this has been in excess of 20,000, the majority of them aimed at Scientific7.eskimo.com.  Since few people use that server and it is desirable to further constrain the success of these attacks, I’ve temporarily reduced it from the normal eight CPU cores to one.
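
     For the curious, the general idea behind detecting and locking out guessers can be sketched in a few lines of Python.  This is not the system in use here.  The log line shape is assumed to be the OpenSSH "Failed password" style, the threshold of five failures is arbitrary, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed log line shape (OpenSSH style); the real logs may differ.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+"
)


def failures_by_address(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def addresses_to_lock_out(lines, max_failures=5):
    """Return the sorted addresses that reached the failure threshold."""
    counts = failures_by_address(lines)
    return sorted(addr for addr, n in counts.items() if n >= max_failures)
```

     Feeding a day of authentication log lines to addresses_to_lock_out would give a list of addresses to block.  Going from about 200 such addresses per day to more than 20,000 is the 100-fold jump described above.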