Mail Client Server Stability

     We have been plagued with a rash of hardware and software problems as of late.

     A current problem I am seeing is that the older Red Hat 6 servers are mysteriously powering themselves down.

     It may be related to kernel upgrades I attempted recently.  The newer kernels required a newer version of acpid.  The acpid daemon listens for hardware events such as power and reset button presses and acts accordingly.
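     For context, acpid's behavior is driven by small event/action rule files.  A minimal sketch of a conventional power-button rule (the path and pattern shown are typical distribution defaults, not taken from these servers):

```
# /etc/acpi/events/powerbtn -- conventional location; name varies by distro
event=button[ /]power
action=/sbin/shutdown -h now
```

     If a newer acpid ships a default rule along these lines, a stray or misread power-button event would halt the box cleanly, which would look exactly like a machine mysteriously powering itself down.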

     I had to back out the kernels because mandatory file locks, which mail requires to operate properly, were broken under 4.8 at least.  However, I neglected to back out the acpid updates at the same time.
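     For reference, mandatory locking on Linux is opt-in twice over: the filesystem holding the spool must be mounted with the `mand` option, and each file is flagged with an unusual permission pattern (setgid bit set, group-execute bit clear).  A small sketch demonstrating only the permission side, on a throwaway temporary file rather than a real mail spool:

```shell
# Mark a file with the mandatory-locking permission pattern and show it.
# The filesystem would also need the 'mand' mount option, e.g.:
#   mount -o remount,mand /var      (not run here; requires root)
f=$(mktemp)
chmod 2640 "$f"        # setgid on, group-execute off
stat -c '%A' "$f"      # prints: -rw-r-S--- (capital S = setgid without x)
rm -f "$f"
```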

     In addition, on mail, SOGo, a package we have in place to provide some interoperability with MS Exchange clients, overwrote some of the system Python modules and broke some of the admin tools.

     I’ve fixed these things and turned on some additional logging so that, if it happens again, we will hopefully have more information for troubleshooting.

Web Server

     I deeply apologize for the lengthy downtime of the website and MySQL database tonight.

     Ubuntu pushed out an upgrade to the MySQL server that seriously broke it.  With the update installed, as soon as someone tried to connect from anywhere but localhost, the server would get stuck in an infinite loop complaining of bad file descriptors and “not a socket” errors.

     I tried to revert to the previous, and only other, version in the Ubuntu repository, but it was also broken.

     In addition, they hard-coded a limit of 65,536 file descriptors, which is not adequate for our site’s traffic.
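     For reference, the descriptor ceiling MySQL actually uses is the smaller of its own `open_files_limit` setting and the limit the service manager grants the process.  A sketch of the configuration side (path and value are illustrative, not copied from this server):

```
# /etc/mysql/my.cnf -- illustrative path; fragment only
[mysqld]
open_files_limit = 1048576
```

     Raising this alone is not enough if the packaged service definition still caps the process itself at 65,536.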

     I tried to migrate to MariaDB but there was no clean path from MySQL 5.7 to MariaDB and nothing I tried would work.

     I then attempted to install Community MySQL directly from the site, but the fact that I had a mysql user in the NIS database broke the install.  However, it did not log the cause or print an error that would give me a clue as to what the problem was, so it took a lot of tracing and plodding around to find the cause and fix it.
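     A quick way to see where an account resolves from, since package install scripts often assume a local entry in /etc/passwd rather than one served over NIS; a sketch using the mysql account name from our case:

```shell
# Distinguish a local 'mysql' account from one served only by NIS/LDAP.
# getent consults all configured name services; /etc/passwd is local only.
if getent passwd mysql >/dev/null; then
    if grep -q '^mysql:' /etc/passwd; then
        echo "mysql is a local account"
    else
        echo "mysql resolves only via a network directory"
    fi
else
    echo "no mysql account found"
fi
```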

     We are now up and running with Community MySQL Server version 5.7, and it is functioning properly again.

MySQL Database

     MySQL continues to run for a while and then exhausts its file descriptors, even though the limit is set to more than a million.  We are running a version that was previously stable.  I’ve restored the settings in the startup script that the upgrade removed.

     I am still trying to identify the cause and correct it.


Continued Database Problems

     We continued to experience database problems.  I found that the upgrade had erased a change I made to the systemd script to increase the system limit on file descriptors, so MySQL was running out of descriptors.  I will probably update again and re-apply my fixes to the script, which the update will no doubt erase again, but I am going to let it run with this fix for a while first to make sure it is stable.
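     One conventional way to keep a change like this from being erased by upgrades is a systemd drop-in, which overrides the packaged unit file without editing it.  A sketch, assuming the service is named mysql.service (names and the value are illustrative):

```
# /etc/systemd/system/mysql.service.d/override.conf
# Created with: systemctl edit mysql   (then systemctl daemon-reload)
# Package upgrades replace the unit file but leave drop-ins alone.
[Service]
LimitNOFILE=1048576
```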

Mail Server / Web Server

     At 12:49 AM, the client mail server, which provides SMTP, IMAP, and POP3 for clients and webmail, crashed.

     This is a virtual machine sitting on a physical host, and the physical host was still running cleanly.  This is the first time this particular machine has crashed since it was loaded in 2013; the software is stable.  However, CentOS 6 did release a kernel upgrade in spite of the age of the software, with absolutely no notes as to what was changed or fixed.  I have applied this upgrade.

     MySQL has also been somewhat unstable on the web server since an upgrade several days ago.  I have downgraded MySQL to 5.7.15, the version prior to the recent upgrade that introduced the instability.

Maintenance Outage 1/18/17 1:30-2:30

     There will be a hopefully brief maintenance outage during this time frame, as the 100 Mbit switch that was temporarily put in service when the gigabit switch failed Monday morning is being replaced with a new gigabit switch.

     If all goes well, this should only take a few seconds; however, the last time we did this, a couple of the machines failed to recognize the speed change without a reboot.  Hopefully that will not be necessary this time.


Bad Switch

     The outage this morning, from 1:20 AM until 5:10 AM, was caused by a failed switch.  The gigabit switch has temporarily been replaced with a 100 Mbit switch that I had on hand.  Some services relating to mail took somewhat longer to restore, owing to the machine not recognizing the speed change without a reboot; the reboot then loaded a new kernel that apparently lacked NFS support.
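     A quick check for whether a freshly booted kernel actually has NFS client support, before mounts fail; a sketch:

```shell
# Does the running kernel support NFS, either built in or as a module?
if grep -qw nfs /proc/filesystems; then
    echo "nfs support already present"
elif modprobe -n nfs 2>/dev/null; then   # -n: dry run, loads nothing
    echo "nfs available as a loadable module"
else
    echo "no nfs support in this kernel"
fi
```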

SquirrelMail

     An update replaced PHP 7.0 with PHP 7.1, which SquirrelMail does not work with yet.  I have reverted the PHP version to 7.0 to restore SquirrelMail to service.
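     For the record, when two PHP versions are co-installed on Ubuntu, the Apache module is typically switched with a2dismod/a2enmod; the module names below are the conventional ones and are assumed here, not confirmed from this server.  A small sketch that just reports which PHP the command line currently resolves to:

```shell
# Report the active CLI PHP version.  With 7.0 and 7.1 co-installed,
# the Apache side is conventionally switched with:
#   a2dismod php7.1 && a2enmod php7.0    (illustrative; needs root)
if command -v php >/dev/null 2>&1; then
    php -v | head -n 1
else
    echo "php not installed"
fi
```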