Maintenance is completed for the evening.
Now we know the E1000 drivers broke in the 4.14.0 kernel and in the mainline kernel, so the fault is not a bad patch from Canonical but some bad juju in the mainline kernel.
Really hate it when they fart with a device driver for a device that has been out for at least 13 years, and break it. Argh!
Mx2 is back in service on Ubuntu Artful 17.10. I was unable to determine what the upgrade to 18.04 broke, but I was able to document the error and file a ticket on Launchpad, so Canonical is working on it.
Mx2 isn’t needed from a capacity standpoint; it is there only to provide redundancy for Mx1 by receiving mail in the event Mx1 is down.
Tonight, I will be installing a test kernel on the machine that hosts home directories and some virtual machines, and re-enabling hardware offloading.
The goal is to determine exactly which kernel broke the E1000 drivers, so the developers can compare the last working kernel with the first broken one and figure out which change caused the problem.
This means we will be back in a state where there may be pauses and NFS problems again, but only temporarily, so they can determine which kernel is at fault, fix it, and get us back to a working system with the performance of hardware offloading.
Because almost everything depends upon access to home directories, this will break pretty much all services for a hopefully brief period.
Mx2 is down for all practical purposes. I have the SMTP ports firewalled off so that I can upgrade this server to 18.04 LTS, compare it to Mx1 on 17.10, and figure out what difference is breaking lists. 17.10 is only supported until July, so I can’t just wait and hope that someone else fixes it.
I was able to stop the Ethernet hang by disabling some of the hardware offload functions of the Ethernet chip, so things are stable again with current kernels. This comes at the expense of some performance, but not a lot, as I was able to leave CRC32, which consumes more CPU than just moving the data, enabled.
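For anyone curious what disabling offload functions looks like in practice, here is a minimal sketch in Python that builds an `ethtool -K` command line. The interface name `eth0` and the particular features turned off (`tso`, `gso`, `gro`) are assumptions for illustration only; this post does not record which interface or which offloads were actually changed, and checksum offload is deliberately left untouched.

```python
import shlex

# Assumed for illustration: segmentation and receive offloads only.
# The features actually disabled on these servers are not recorded here.
DEFAULT_DISABLE = ("tso", "gso", "gro")


def build_ethtool_command(iface, disable=DEFAULT_DISABLE):
    """Return the argv list for an `ethtool -K` call turning offloads off."""
    if not iface:
        raise ValueError("interface name is required")
    argv = ["ethtool", "-K", iface]
    for feature in disable:
        # ethtool -K takes pairs of "<feature> on|off"
        argv += [feature, "off"]
    return argv


if __name__ == "__main__":
    # Only prints the command; running it needs root and a real interface.
    print(shlex.join(build_ethtool_command("eth0")))
```

The sketch only prints the command rather than running it, so it is safe to try on any machine. The current offload settings of an interface can be listed with `ethtool -k` (lowercase) before changing anything.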
Things went south in a big way. The new kernel they provided caused so many other problems that I could not let it run long enough to see if it fixed the original one: it broke NFS, broke the console, and broke the mouse. Getting the system back up on the old kernel was a challenge.
There was another experimental kernel in the developers' repository. I am running on it now. It does not fix the problem, but it provides much more detailed diagnostics that will hopefully help pin down the cause.
It is 6:30AM now, so I will not be available until around 3PM this afternoon. Please feel free to leave a message or send mail and I will get back to you as soon as possible.
Commencing shortly after midnight and lasting anywhere from 15 minutes to several hours, depending upon how smoothly things go, services will be interrupted for kernel maintenance. There is a problem with the 3.15.0 kernel drivers for E1000 Ethernet chips, which my machines happen to use, that causes them to periodically hang for about ten seconds. Most people will experience this as a pause in I/O, but some non-Linux versions of ssh will time out and disconnect. I have been working with Canonical to come up with a fix. The first kernel they provided did improve the situation, so that it happened a few times a day instead of a few times an hour. They have another kernel for me to try.
So tonight, I will be installing that kernel and rebooting to make it active. I am going to attempt this remotely, but in the past the network has sometimes not come up after a reboot. If that happens, I will need to drive down to the co-lo facility and it will take longer. This will interrupt all services, as it affects both the machines hosting the virtual machines and the NFS servers that hold your home directory, mail directory, and a few other common file systems.
I plan to start this work around 12:30AM. If things go well, I should be finished by 1AM; if not, it may be as late as 3AM Pacific time.