This is why I do not use the Yahoo group for status when we have an outage and instead ask people to check here if our own website is down.
Your message to the EskimoNorthUsers group was not approved.
The owner of the group controls the content posted to it and has the
right to approve or reject messages accordingly.
In this case, your message was automatically rejected because the
moderator didn’t approve it within 14 days. We do this to provide a
high quality of service for our users.
A complete copy of your message has been attached for your
Thank you for choosing Yahoo Groups
Yahoo Groups Customer Care
The new replacement Gigabit switch, a TP-Link, failed less than a month after being placed into service. Suffice it to say that I’m going to try to find something other than another TP-Link to replace it with.
I bought a different brand; it will be here in a few days. With any luck I’ll be able to replace the failed switch Saturday night.
No process has run up against the memory limits I put in place on Iglulik and memory usage has remained stable since the kernel upgrade to 4.8.
This makes it likely that the issue was a bug in the 4.4 kernel on that machine. It was the only machine I had running that particular kernel.
It is possible that whatever went wrong before just hasn’t happened again yet. If it does, the limits in place will hopefully identify the process, prevent the virtual machines from being killed by the Linux OOM killer, and provide some diagnostic information.
Centos6 ships with PHP 5.3, a rather antique version and not the version we use on our web server. I installed php70 from the Remi repositories, but it did not work as advertised: upgrading PHP did not remove all pieces of 5.3, so we ended up with a mix of PHP 5.3 and PHP 7.0 that did not play well together and caused errors on every update.
I deleted ALL PHP from this machine today and am in the process of re-installing PHP 7.0, so many modules will be missing until this process is completed.
I’ve completed the maintenance I had planned for tonight. Iglulik, the server that hosts the mail spool and several virtual machines, including mail, has been sporadically running out of memory and invoking the Linux OOM killer. The OOM killer kills the largest process, which is usually the mail server process. However, I don’t know that the mail server is what is consuming the memory, as it appears normal-sized in the kernel error messages. I suspect something else is, but it is very transient, so by the time the killer walks the process table the real culprit is gone.
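To compare the size of the killed process across several OOM events, the kernel's "Killed process" lines can be pulled out of a saved log. This is a minimal sketch: the regular expression assumes the general 4.x-era wording of that line, which varies between kernel versions, and the sample line is made up for illustration, not taken from Iglulik's logs.

```python
import re

# Assumed shape of the kernel's "Killed process" line; wording differs
# between kernel versions, so adjust the pattern to match the real log.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
    r".*?total-vm:(?P<total_vm>\d+)kB"
    r".*?anon-rss:(?P<anon_rss>\d+)kB"
)


def parse_oom_kills(log_text):
    """Return one dict per 'Killed process' line found in log_text."""
    kills = []
    for line in log_text.splitlines():
        match = KILLED_RE.search(line)
        if match:
            kills.append({
                "pid": int(match.group("pid")),
                "comm": match.group("comm"),
                "total_vm_kb": int(match.group("total_vm")),
                "anon_rss_kb": int(match.group("anon_rss")),
            })
    return kills


# Hypothetical sample line, for illustration only.
SAMPLE = (
    "[12345.678901] Killed process 4321 (qemu-system-x86) "
    "total-vm:9000000kB, anon-rss:8000000kB, file-rss:1000kB\n"
)
```

Calling `parse_oom_kills` on the text of a saved kernel log gives a list that can be sorted or compared by `anon_rss_kb`, which would show whether the victim was really larger than usual when it was killed.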
I have upgraded this machine’s kernel from 4.4 to 4.8 tonight, implemented some process limits, and changed the kernel configuration so that it should deny additional memory to whatever is eating memory instead of killing the virtual host. Also, a new version of libvirt, the software that manages the virtual machines, came out today. So if these things don’t fix the issue, I am hoping they will at least help me diagnose the cause.
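The post does not say which kernel setting was changed. One common knob for making the kernel refuse allocations rather than overcommit is `vm.overcommit_memory` (0 = heuristic, 1 = always allow, 2 = strict), and in strict mode `/proc/meminfo` reports a `CommitLimit` that `Committed_AS` may not exceed. The sketch below assumes that setting is the relevant one; the sample numbers are invented.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(fields):
    """kB that can still be committed before CommitLimit is reached."""
    return fields["CommitLimit"] - fields["Committed_AS"]


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the current overcommit mode (0, 1 or 2) on a Linux host."""
    with open(path) as handle:
        return int(handle.read().strip())


# Hypothetical values, for illustration only.
SAMPLE = """MemTotal:       65536000 kB
MemFree:        17000000 kB
CommitLimit:    40000000 kB
Committed_AS:   25000000 kB
"""
```

On a live machine, `commit_headroom_kb(parse_meminfo(open("/proc/meminfo").read()))` gives the remaining headroom. The limit is only enforced when `read_overcommit_mode()` returns 2.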
Tonight, February 8th, leading into early Thursday morning, February 9th, there will be maintenance outages lasting approximately 15 minutes per service, but not all services will be down simultaneously.
Maintenance work is necessary to address a problem where one physical host, which hosts the mail client server, keeps invoking the OOM killer and killing the mail virtual machine.
I have not yet been able to identify the culprit that is eating all of the memory. The machine normally has around 16GB-18GB of free memory, but something will suddenly eat it all up and the OOM killer will kill a process, usually the largest one, and more often than not that is the mail virtual machine.
This machine is also the only machine with a 4.4 kernel, so it may be a kernel issue. I did not upgrade this machine to 4.8 along with the others because 4.8 had issues with mandatory locks in its NFSv4 code. I have since learned that this is only a problem on the client side, so it will not affect the server.
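Because the culprit seems to come and go quickly, one way to catch it would be to sample per-process resident memory at short intervals and log the largest consumers. The following is a rough sketch of such a sampler, not something described in the post; it assumes a Linux `/proc` filesystem, and the function names are made up for illustration.

```python
import os


def parse_status(text):
    """Return (name, rss_kb) from /proc/<pid>/status text.

    Kernel threads have no VmRSS line, so rss_kb is 0 for them.
    """
    name, rss_kb = "?", 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            rss_kb = int(value.split()[0])
    return name, rss_kb


def top_consumers(count=5, proc="/proc"):
    """Return the `count` largest processes as (rss_kb, pid, name) tuples."""
    rows = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as handle:
                name, rss_kb = parse_status(handle.read())
        except OSError:
            continue  # the process exited between listdir() and open()
        rows.append((rss_kb, int(entry), name))
    return sorted(rows, reverse=True)[:count]
```

Running `top_consumers()` from a loop with a short sleep and appending the result to a file would leave a record of what was large just before the next OOM event, at the cost of some polling overhead.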
I have implemented limits in /etc/security/limits.conf which should be adequate for normal operation and at the same time limit memory consumption below what would cause the machine to invoke the OOM killer.
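The post does not list the limits that were set. As a sketch of how to confirm what a process actually ends up with, the snippet below reads the address-space limit, which corresponds to the `as` item in limits.conf (given there in KB, while the `resource` module reports bytes). It assumes `as` is among the items that were limited, which the post does not confirm.

```python
import resource


def format_limit(value):
    """Render a limit in bytes as MiB, or 'unlimited' for RLIM_INFINITY."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // (1024 * 1024)} MiB"


def address_space_limits():
    """Return the (soft, hard) address-space limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    return format_limit(soft), format_limit(hard)
```

Since limits.conf is applied by pam_limits when a session is opened, running `address_space_limits()` from a fresh login shell is a quick check that a new entry took effect.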
The systemd scripts are not reliable on machines with RAID disk partitions and sometimes hang instead of rebooting. They also do not work properly with some CPUs but that is a kernel issue.
So I will be going to the co-location facility so that I can be there live and in person in case anything goes wrong. Under the best of circumstances it takes about 15 minutes to boot these machines, because each one saves the existing virtual machine states before rebooting, boots, then restores those machines to their previous state. These saves can involve writing some very large files.
It is possible, if the limits I have set are too low, that more than one reboot may be required to adjust them.
In addition, I will be rebooting the server that is the NFS server for home directories and the host for a few virtual machines in order to load a kernel that addresses some security issues.
The kvm/qemu configuration issues have been corrected for Centos6. This problem came about because these virtual machines were moved from older different physical hosts with different CPUs and I did not correct the configuration at the time.
Until recently it did not seem to cause any problem as the older CentOS kernels apparently did not utilize any of the unsupported features but recent upgrades changed that and caused the systems to become unstable.
I am taking Centos6 down briefly to fix a similar issue with the qemu/kvm configuration specifying CPU features that are not supported.
KVM/Qemu emulation configuration for mail has been corrected so that it is no longer attempting to use non-existent CPU features (trying to use IvyBridge features on a SandyBridge CPU).
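A mismatch like this can be spotted by comparing the flags a guest CPU model needs against the `flags` line of the host's `/proc/cpuinfo`. The sketch below is illustrative only: the flag set is what QEMU's IvyBridge model adds over SandyBridge as far as I know, and should be checked against the QEMU version in use; the sample cpuinfo text is invented.

```python
# Assumed difference between the IvyBridge and SandyBridge CPU models;
# verify against the QEMU/libvirt CPU definitions actually installed.
IVYBRIDGE_EXTRA = {"f16c", "rdrand", "fsgsbase", "smep", "erms"}


def host_flags(cpuinfo_text):
    """Return the set of CPU flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(required, cpuinfo_text):
    """Return the required flags that the host CPU does not advertise."""
    return sorted(set(required) - host_flags(cpuinfo_text))


# Hypothetical host without any of the newer flags.
SAMPLE = "model name\t: Hypothetical CPU\nflags\t\t: fpu sse2 avx aes\n"
```

On a real host, `missing_flags(IVYBRIDGE_EXTRA, open("/proc/cpuinfo").read())` returning a non-empty list would suggest the guest is configured for a newer CPU model than the hardware provides.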
I need to shut down the client mail server (provides pop3, imap, smtp for clients and web mail) for just a few minutes to correct a virtual machine setting that is requesting CPU features not supported on the current hardware. I believe this may be causing some of the instabilities. At the very least it’s a configuration error that needs to be fixed.