Web Server Reboot

     I will be rebooting the web server at approximately 2 AM in order to make kernel and glibc updates effective.  This will take approximately five minutes.  A reboot on this machine is somewhat slow because it caches file systems in memory, which provides faster web response but takes some time to flush during a reboot.
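
     For anyone wondering how to tell that a glibc update really does require restarts, one rough check (output format varies by lsof version) is to list processes still mapped to the now-deleted library:

          # processes still using the old, deleted glibc after the update
          lsof 2>/dev/null | grep 'DEL.*libc'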

Switch Replaced

     I replaced the 100 Mbit switch with a new Linkeyes gigabit switch.  This is a made-in-China brand like TP-Link, but unfortunately LinkSys no longer makes rack-mount equipment; Cisco bought them out and discontinued those products, and I’m not a big fan of Crisco equipment.  Hopefully this will last longer than the two TP-Link switches did.

Switch Replacement

     Tonight I am going to replace the 100 Mbit switch, which I put in place temporarily when our last gigabit switch failed, with a new gigabit switch made by a different manufacturer.

     There will be very brief (under 2 seconds) interruptions in connectivity that may cause a momentary pause in videos being watched or in keystroke echoes.  There is a remote possibility that I may have to reboot the mail server if it does not automatically recognize the speed change.  In the past it has always recognized increases but not decreases, so this probably will not be necessary, but it is a possibility.
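
     If the mail server’s NIC does fail to renegotiate, the link speed can be checked and, if need be, forced with ethtool (the interface name here is an example):

          ethtool eth0                                         # shows current Speed, Duplex, Auto-negotiation
          ethtool -s eth0 speed 1000 duplex full autoneg on    # renegotiate at gigabit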

Yahoo Group

     The bounce message below is why I do not use the Yahoo group for status updates when we have an outage, and instead ask people to check here if our own website is down.


Hello,

Your message to the EskimoNorthUsers group was not approved.
The owner of the group controls the content posted to it and has the
right to approve or reject messages accordingly.

In this case, your message was automatically rejected because the
moderator didn’t approve it within 14 days. We do this to provide a
high quality of service for our users.

A complete copy of your message has been attached for your
convenience.

Thank you for choosing Yahoo Groups

Regards,

Yahoo Groups Customer Care

Switch Failed Again

     The replacement gigabit switch, a TP-Link, failed less than a month after being placed into service.  Suffice it to say that I’m going to try to find something other than another TP-Link to replace it with.

     I bought a different brand; it will be here in a few days.  With any luck I’ll be able to replace it Saturday night.

Iglulik / Mail

     No process has run up against the memory limits I put in place on Iglulik, and memory usage has remained stable since the kernel upgrade to 4.8.

     This makes it likely that the issue was a bug in the 4.4 kernel on that machine.  It was the only machine I had running that particular kernel.

     It is possible that whatever went wrong before just hasn’t happened again yet, but if it does, hopefully the limits in place will identify the process, prevent the virtual machines from being killed by the Linux OOM killer, and provide some diagnostic information.
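
     If it does recur, the kernel log and the per-process limit files are the first places to look.  A quick check might be (the process name is just an example; substitute whatever daemon is suspect):

          # OOM-killer activity, including the process table it logs
          dmesg -T | grep -i -A 20 'out of memory'
          # confirm the limits actually applied to a running process
          cat /proc/$(pidof master)/limits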

PHP on CentOS 6

     CentOS 6 ships with PHP 5.3, a rather antique version and not the version we use on our web server.  I installed php70 from the Remi repositories, but it did not work as advertised: the upgrade did not remove all pieces of 5.3, so we ended up with a mix of PHP 5.3 and PHP 7.0 that did not play together well and caused errors on every update.

     I deleted ALL PHP from this machine today and am in the process of re-installing PHP 7.0, so temporarily there will be many missing modules until this process is completed.
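
     For anyone doing something similar, the clean-slate approach looks roughly like this (this assumes the remi-release package is already installed; the module list at the end is only an example):

          yum remove 'php*'                        # remove every PHP package, 5.3 and 7.0 alike
          yum install yum-utils                    # provides yum-config-manager
          yum-config-manager --enable remi-php70   # take PHP from Remi's 7.0 repository
          yum install php php-cli php-fpm php-mysqlnd php-gd php-xml php-mbstring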

Maintenance Completed

     I’ve completed the maintenance I had planned for tonight.  Iglulik, the server that hosts the mail spool and several virtual machines, including mail, has been sporadically running out of memory and invoking the Linux OOM killer.  The killer kills the largest process, and that is usually the mail server process; however, I don’t know that it is actually what is consuming the memory, as it appears normal sized in the kernel error messages.  I suspect something else is, but it is very transient, so by the time the killer walks the process table, the real culprit is gone.

     Tonight I upgraded this machine’s kernel from 4.4 to 4.8, implemented some process limits, and changed the kernel configuration so that it should deny additional memory to whatever is eating it instead of killing the virtual host.  Also, a new version of libvirt, the software that runs the virtual machines, came out today.  So if these things don’t fix the issue, I am hoping they will at least help me diagnose the cause.
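
     That deny-additional-memory behavior is what Linux’s overcommit settings control; a sketch of the configuration (illustrative values, not necessarily the ones on Iglulik) would be:

          # /etc/sysctl.d/90-overcommit.conf
          vm.overcommit_memory = 2   # refuse allocations past the commit limit instead of OOM-killing
          vm.overcommit_ratio = 80   # commit limit = swap + 80% of physical RAM

          # apply without a reboot
          sysctl -p /etc/sysctl.d/90-overcommit.conf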

Maintenance Work 2/8-2/9/17

     Tonight, February 8th, leading into early Thursday morning, February 9th, there will be maintenance outages lasting approximately 15 minutes per service, but not all services will be down simultaneously.

     Maintenance work is necessary to address a problem where the physical host for the mail client server keeps invoking the Linux OOM killer and killing the mail virtual machine.

     I have not yet been able to identify the culprit that is eating all of the memory.  The machine normally has around 16GB-18GB of free memory, but something will suddenly eat it all up, and the OOM killer will kill some essentially random process.  It is usually the largest process, and more often than not that is the mail virtual machine.
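
     One common way to keep a critical guest from being the OOM killer’s first choice, though not necessarily what was done here, is to lower its badness score (“mail” below is an example domain name):

          # make the mail VM's qemu process effectively exempt from the OOM killer
          echo -1000 > /proc/$(pgrep -f 'qemu.*mail' | head -1)/oom_score_adj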

     This machine is also the only machine with a 4.4 kernel, so it may be a kernel issue.  I did not upgrade this machine to 4.8 along with the others because 4.8 had issues with mandatory locks in its NFSv4 code.  I have since learned that this is only a problem on the client side, so it will not affect the server.

     I have implemented limits in /etc/security/limits.conf which should be adequate for normal operation and at the same time keep memory consumption below what would cause the machine to invoke the OOM killer.
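
     For reference, entries in that file take the form <domain> <type> <item> <value>; something along these lines (the numbers are illustrative, not the actual values in use):

          # /etc/security/limits.conf
          *    hard    as       16777216   # address-space cap per process, in KB (16 GB here)
          *    hard    nproc    1024       # maximum processes per user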

     The systemd scripts are not reliable on machines with RAID disk partitions and sometimes hang instead of rebooting.  They also do not work properly with some CPUs, but that is a kernel issue.

     So I will be going to the co-location facility so that I can be there live and in person in case anything goes wrong.  Under the best of circumstances it takes about 15 minutes to boot these machines, because the host saves the existing virtual machine states before rebooting, boots, then restores those machines to their previous state.  These saves can involve writing some very large files.
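
     That save-and-restore cycle is what libvirt’s libvirt-guests service automates.  A typical configuration (the file path is the stock CentOS location; the values shown are examples) looks like:

          # /etc/sysconfig/libvirt-guests
          ON_SHUTDOWN=suspend   # managedsave each guest's state to disk at host shutdown
          ON_BOOT=start         # restore saved guests when the host comes back up

          # the same thing can be done by hand for a single guest
          virsh managedsave mail   # "mail" is an example domain name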

     It is possible, if the limits I have set are too low, that more than one reboot may be required to adjust them.

     Then, in addition, I will be rebooting the server that is the NFS server for home directories and the host for a few virtual machines, in order to load a kernel that addresses some security issues.

CentOS 6 Maintenance Completed

     The kvm/qemu configuration issues have been corrected for CentOS 6.  This problem came about because these virtual machines were moved from older physical hosts with different CPUs, and I did not correct the configuration at the time.

     Until recently this did not seem to cause any problems, as the older CentOS kernels apparently did not utilize any of the unsupported CPU features, but recent upgrades changed that and caused the systems to become unstable.
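
     A common way to avoid this class of problem when guests move between hosts with different CPUs is to let libvirt choose the CPU model at each start rather than hard-coding host features.  In the guest’s domain XML, edited with virsh edit (the element is standard libvirt syntax; “mail” is an example domain):

          <!-- virsh edit mail -->
          <cpu mode='host-model'/>   <!-- expose the closest named CPU model the current host supports -->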