An upgrade from 14.04 to 14.10 broke several applications for which no update was available. I will be cleaning those up and then proceeding with an upgrade to 15.x this evening. This is somewhat involved, so it may not be finished for several days.
I will be taking the big server that houses all users' home directories offline Friday evening, probably starting around 10:30 PM, to add additional hard drives. This will also require pausing incoming e-mail, since .procmailrc files and local mailboxes reside in home directories. Shell servers will be unavailable and all web services will be down. I expect this maintenance to take less than an hour.
Later in the morning I will be booting some servers into 5.1 kernels. 5.0.0 has proven to be an extremely unstable kernel, so I am trying to get everything off of it before it breaks anything else, as it did this morning. This will involve only brief outages.
Sometime this morning Igloo, the server that hosts the mail spool, simply stopped serving files in /mail via NFS. It wasn't a case of NFS dying altogether, as other file systems still worked.
I wasn't able to troubleshoot much further because when I got to the co-location facility, the console was dead and the machine would not allow remote logins. I was forced to power cycle it, after which everything appeared normal.
These machines were recently upgraded from Ubuntu 18.10 to 19.04, and that upgrade included the first 5.0.0 Linux kernel. I have had issues with that kernel on other machines and have resolved them by compiling 5.1, which appears to be more stable. I am doing that for this machine now, which means there will be an additional brief interruption when I boot into the new kernel.
For workstations 19.04 is mostly okay; for servers it's a piece of crap, and it has kept me up all night, both last night and tonight, fixing things it broke. The new kernel, 5.0.0-14, is basically non-functional: networking either doesn't work at all or randomly goes away. One machine went into hibernate mode all by its lonesome. I am so disappointed that Canonical would release such a half-baked distribution; it's not at all like them.
Well hopefully I will get home and things will be working this time. I am tired and hungry.
I got about 30% of what I set out to do tonight done. Consequently there will be additional downtime tonight and next weekend. I’m going to have to order an external case. I thought I had more internal drive bays than I do.
We found servers in our logs that were not in our server configuration.
Further investigation determined that GlobalPOPs had added additional RADIUS servers but had not informed us. Requests coming from new servers not in our configuration were rejected, causing authentication failures.
I obtained a complete list of their RADIUS servers, added it to our clients.conf file, and authentication is now functioning again.
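For reference, each of the provider's servers gets a stanza in FreeRADIUS's clients.conf along these lines. The address, secret, and names below are placeholders of mine, not our real values:

```
client globalpops-radius1 {
        ipaddr    = 192.0.2.10     # one of GlobalPOPs' RADIUS servers (placeholder address)
        secret    = changeme       # shared secret agreed with the provider (placeholder)
        shortname = globalpops1
}
```

Any server not listed here has its requests silently rejected, which is exactly the failure mode we saw when they added machines without telling us.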
Presently, our dial-up customers are unable to authenticate.
We have tested our RADIUS authentication servers and those of our wholesale provider and found the issue to be with the provider's authentication servers.
I attempted to generate a ticket via the normal means at GlobalPOPs' loginto.us website and found the ticket link on their website broken as well.
Since they normally do not offer phone tech support and have no means of contact other than the now-broken ticket system, I called their retail dial-up center and got them to generate a ticket for me. I was assured I would receive a copy by e-mail and that someone would contact me, but so far neither has occurred.
Work I had planned for Friday evening / Saturday morning will be rescheduled for Saturday evening / Sunday morning, owing to preparations taking longer than I had anticipated.
Late Friday evening into early Saturday there will be some downtime for system maintenance. First, probably around 10:30, I will be taking the server that hosts home directories out of service for up to an hour. The reason is that I will be adding a pair of 10TB drives to allow us to increase the size of the /home partition.
Next I will be moving home directories to this new partition. While this process will take some time, additional downtime will be required only during the final step, when I am ready to switch partitions; at that point I will need to reboot the machine.
Then I will need to reboot a number of other machines in order to make a kernel upgrade active.
Warning! This security exploit has not been widely publicized, but it IS being actively exploited. Someone caused the server that houses our customers' /home directories to spontaneously reboot while trying to exploit it. Fortunately the kernel logged their attempts. See: https://www.kernel.org/doc/html/latest/admin-guide/l1tf.html

In our case I performed measurements of system load, web page loading times, and latency with and without this CPU feature turned off, and it made no measurable difference, so I turned it off with: echo off > /sys/devices/system/cpu/smt/control. I put this in /etc/rc.local, which is enabled on our machines for this and some other adjustments.
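The rc.local entry amounts to the following. This is a sketch of how I would write it; the writability guard and the commented sanity check are additions of mine, not necessarily what is in the actual file:

```shell
#!/bin/sh
# /etc/rc.local -- runs once at boot (the rc-local service is enabled on our machines).
# Disable SMT (HyperThreading) as the L1TF mitigation; in our measurements
# this made no difference to load, page-load times, or latency.
if [ -w /sys/devices/system/cpu/smt/control ]; then
    echo off > /sys/devices/system/cpu/smt/control
fi
# Sanity check: the kernel reports its L1TF mitigation status here.
# cat /sys/devices/system/cpu/vulnerabilities/l1tf
exit 0
```

The setting does not survive a reboot on its own, which is why it lives in rc.local rather than being run once by hand.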