I deeply apologize for the lengthy downtime of the website and MySQL database tonight.
Ubuntu kicked out an upgrade to MySQL Server that seriously broke it. With the update installed, as soon as someone tried to connect from anywhere but localhost, the server would get stuck in an infinite loop complaining of "bad file descriptor" and "not a socket" errors.
I tried to revert to the previous, and only other, version in the Ubuntu repository, but it was also broken.
In addition, they hard-coded a limit of 65536 file descriptors, which is not adequate for our site's traffic.
I tried to migrate to MariaDB but there was no clean path from MySQL 5.7 to MariaDB and nothing I tried would work.
I then attempted to install Community MySQL directly from the site, but the install broke because I had a mysql user in the NIS database. It did not log the cause or print an error that would give me a clue as to what the problem was, so it took a lot of tracing and plodding around to find the cause and fix it.
We are now up and running with Community MySQL Server version 5.7 and it is back to functioning properly.
MySQL continues to run for a while and then exhausts file descriptors even though the limit is set to more than a million. We are running a version that previously was stable. I’ve restored the changes in the startup script that it removed.
I am still trying to identify the cause and correct it.
We continued to experience database problems. I found that the upgrade had erased a change I made to the systemd script to increase the system limit on file descriptors, so MySQL was running out of descriptors. I will probably update again and re-apply my fixes to the script, which the update will no doubt erase again, but I am going to let it run with this fix for a while to make sure it is stable first.
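One way to confirm that a raised descriptor limit is actually in effect after an upgrade is to read the limit the kernel reports for the running process. The following is only a minimal sketch, assuming a Linux host with /proc mounted; the function names are made up for illustration and are not part of MySQL or systemd.

```python
import os


def _to_int(value):
    # /proc reports "unlimited" when no numeric limit is set
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text."""
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units
            fields = line[len(label):].split()
            return _to_int(fields[0]), _to_int(fields[1])
    raise ValueError("no 'Max open files' line found")


def read_open_file_limits(pid):
    """Read the limits the kernel is enforcing for a running process."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_max_open_files(handle.read())


def count_open_descriptors(pid):
    """Count descriptors currently open (needs root or the same user as the process)."""
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Comparing count_open_descriptors(pid) against the soft limit from read_open_file_limits(pid) for the mysqld process would show how close the server is to running out, and whether the limit silently fell back to a lower value after an update.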
At 12:49 AM the client mail server, which provides SMTP, IMAP, and POP3 for clients and web mail, crashed.
This is a virtual machine sitting on a physical host. The physical host was still running clean. This is the first time this particular machine has crashed since it was loaded in 2013; the software is stable. However, CentOS 6 did release a kernel upgrade, in spite of the age of the software, with absolutely no notes as to what they changed or fixed. I have applied this upgrade.
MySQL has also been somewhat unstable on the web server since an upgrade several days ago. I have downgraded MySQL to 5.7.15, the version prior to the recent upgrade that introduced the instability.
There will be a hopefully brief maintenance outage during this time frame as the 100 Mbit switch that was temporarily put in service when the gigabit switch failed Monday morning is being replaced with a new gigabit switch.
If all goes well this should only take a few seconds; however, the last time we did this, a couple of the machines failed to recognize the speed change without a reboot. Hopefully this will not be necessary this time.
The outage this morning from 1:20 AM until 5:10 AM was caused by a failed switch. The gigabit switch has temporarily been replaced with a 100 Mbit switch that I had on hand. Some services relating to mail took somewhat longer to restore, owing to the machine not recognizing the speed change without a reboot, and then the reboot loaded a new kernel that apparently lacked NFS support.
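Whether a machine has picked up a link speed change can be checked without rebooting by reading the speed the network driver reports. The following is a rough sketch, assuming Linux, where sysfs exposes the negotiated speed in Mbit/s under /sys/class/net; the interface name eth0 and the function name are placeholders for illustration.

```python
from pathlib import Path


def link_speed_mbit(interface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mbit/s, or None if it is unknown."""
    speed_file = Path(sysfs_root) / interface / "speed"
    try:
        value = int(speed_file.read_text().strip())
    except (OSError, ValueError):
        # Missing interface, link down, or a driver that does not report speed
        return None
    # Some drivers report -1 when the speed cannot be determined
    return value if value > 0 else None


if __name__ == "__main__":
    print(link_speed_mbit("eth0"))
```

A machine still reporting 1000 after being moved to a 100 Mbit switch (or the reverse) would be a candidate for renegotiating the link or, failing that, the reboot described above.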
I had to make some changes to our router configuration this afternoon.
The way the software in our router works, you go into a configuration mode, make all the changes you want to make, then commit or save those changes, at which point they become active.
I did this and hit the SAVE button. It said “Save Failed” and then crashed.
I had to drive down to the co-location facility and reconfigure the router to bring it back online.
To do this I had to change the IP address on one of my machines to 192.168.1.2 so that it could communicate with the router at its factory default address and reconfigure it.
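The reason for the address change is that the machine has to sit on the same subnet as the router to reach it directly. As a small illustration, assuming a /24 netmask and a router default address of 192.168.1.1 (both are assumptions here, since the actual factory default is not stated above), the check looks like this:

```python
import ipaddress


def same_subnet(host_ip, peer_ip, prefix_len=24):
    """Return True if peer_ip falls inside host_ip's network of the given prefix length."""
    # strict=False lets us pass a host address rather than the network address
    network = ipaddress.ip_network(f"{host_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(peer_ip) in network


if __name__ == "__main__":
    # 192.168.1.1 is a hypothetical router default used only for illustration
    print(same_subnet("192.168.1.2", "192.168.1.1"))
```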
That all went well and I had the router back online by 3 PM, but when I went to change the IP address back on the machine I used to configure the router, the graphical tools failed and screwed up its configuration past the point where it could be fixed by the graphical client.
I was not familiar with where Ubuntu keeps all of its network configuration, but I found the files and got that machine restored to health as well. Most things were online by 3 PM: mail service, web service, and some of the shell servers. This particular box hosts a number of shell servers, so some were down until about 4:30 PM, when I got it completely restored.