From kernels 5.15 forward, we’ve had issues with expedited RCU CPU stalls on our servers.
I’ve experimented with kernels configured per the stock Ubuntu configuration and these same kernels do NOT show expedited RCU CPU stalls.
RCU stands for Read-Copy-Update. It is a means of allowing reads to run concurrently with updates without requiring a lock, which gives greater efficiency on modern multi-core CPUs and on multiple-CPU systems. If you are interested in the details, read https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html.
The kernels I will be putting in place tonight are 5.19.12 and, in a few cases, 5.19.11 (I started the update with 5.19.11, then kernel.org came out with 5.19.12), with a configuration closely resembling the Ubuntu “generic” kernels. That is to say, it won’t be entirely tickless, only idle tickless, and it won’t be entirely non-preemptive, but will allow voluntary preemption. This is less efficient than our normal kernels, but a week’s worth of testing has shown it to be stable on four of our busiest servers.
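For anyone who wants to check which of these choices a given kernel was built with, here is a minimal sketch in Python. It assumes the usual mainline option names (CONFIG_NO_HZ_IDLE, CONFIG_NO_HZ_FULL, CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT). The SAMPLE fragment is made up for illustration and is not our actual configuration.

```python
def parse_kconfig(text):
    """Return a dict of the CONFIG_* options that are set in .config-style text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and comments such as "# CONFIG_FOO is not set" are skipped.
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        options[name] = value
    return options


def describe_scheduling(options):
    """Summarize the tick mode and preemption model of a parsed config."""
    if options.get("CONFIG_NO_HZ_FULL") == "y":
        tick = "full tickless"
    elif options.get("CONFIG_NO_HZ_IDLE") == "y":
        tick = "idle tickless"
    else:
        tick = "periodic tick"

    if options.get("CONFIG_PREEMPT_NONE") == "y":
        preempt = "no preemption"
    elif options.get("CONFIG_PREEMPT_VOLUNTARY") == "y":
        preempt = "voluntary preemption"
    elif options.get("CONFIG_PREEMPT") == "y":
        preempt = "full preemption"
    else:
        preempt = "unknown"
    return tick, preempt


# Hypothetical fragment, for illustration only.
SAMPLE = """\
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
"""

print(describe_scheduling(parse_kconfig(SAMPLE)))
```

On a real machine you would feed it the contents of the running kernel's config file (often /boot/config-<version>, though where it lives depends on the distribution) instead of SAMPLE.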
I will be further testing a kernel that is voluntarily preemptive but completely tickless on the four busiest servers to see if that is stable. I’ve been testing this configuration on my workstation and, other than the higher latency you get with non-preemptive kernels, it has been stable. If this works out, we will adopt this configuration on the rest of the servers next Friday.
This will affect all Eskimo services: shared web hosting, shell services, e-mail, and virtual private servers, as well as our Fediverse services, https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and http://yacy.eskimo.com/.
The updates will begin at 11PM and should be completed by 11:30PM Pacific Daylight Time (GMT-0700) tonight Friday September 30th, 2022. No single service should be down for more than about ten minutes.
A customer wrote:
I think the only reason manjaro isn’t used as much is because it isn’t available on eskimo.com, only on yellow-snow. I’ve tried:
$ ssh email@example.com
ssh: Could not resolve hostname manjaro.eskimo.com: Name or service not known
and gave up.
So, I’ve added a CNAME in eskimo.com so you can ALSO reach it as manjaro.eskimo.com.
According to the logs, nobody has used Manjaro since April, so I am considering removing it if it is just a waste of resources. But the logs do not capture visual sessions because x2go does not write to the /var/log/wtmp and /var/run/utmp files.
If anyone has any objection to this machine going away, please make your concerns known now.
We recently upgraded Squirrelmail to be compatible with PHP 8.0+.
The upgrade unfortunately disabled some features with respect to theme color and font choices that some of our customers preferred.
Consequently, I’ve restored the old version and configured it to use PHP 7.4, for which it was designed.
I’ve been trying to find some newer, more modern web mail clients. We do also have raintree and roundcube, but those tend to be optimized for portable devices. Squirrelmail is the oldest and least well supported of the web mail clients that we have, yet it remains by far the most popular, which makes it a challenge.
From 5.15 onwards there seems to be an incompatibility between the tickless and non-preemptive options. If I select either one by itself, the kernel seems to be stable; if I select both, I get RCU expedited CPU stalls. This is not easy to sort out, because each of these options by itself triggers a dozen or more other selections, so the problem cannot easily be isolated to a specific bit of code. For now I’m going to go with tickless and voluntary preemption. This seems to suffice for stopping RCU expedited CPU stalls and isn’t really harming efficiency, since any job that voluntarily gives up a CPU isn’t that high priority anyway.
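To compare one build against another, it helps to count how many stall reports show up in the kernel log. The following is a rough sketch only. The marker phrase is my assumption about how the kernel words these reports, so check it against your own dmesg output before relying on it. The sample log lines are invented for illustration.

```python
def count_expedited_stalls(log_text, marker="detected expedited stalls"):
    """Count kernel log lines that contain the expedited-stall marker phrase."""
    return sum(1 for line in log_text.splitlines() if marker in line)


# Invented log excerpt, for illustration only (not copied from our servers).
SAMPLE_LOG = """\
[  100.000000] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 3 }
[  100.000100] some unrelated driver message
[  160.000000] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 3 }
"""

print(count_expedited_stalls(SAMPLE_LOG))
```

A count of zero over several days on a busy server is what I mean by "stable" here. A kernel that logs even an occasional stall does not qualify.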
As a consequence, I am scheduling a kernel upgrade for next Friday, Sept 30th, starting at 11pm. I will be installing the new kernels on all the machines sooner, just not rebooting them, so if they spontaneously reboot they will boot into a new kernel.
The kernel issues are not resolved but I am making progress.
I have been running customized kernels for eons because I can get more efficiency and better response than with the kitchen-sink kernels that are distributed with a Linux distribution like Ubuntu.
From 5.15 forward there have been some minor stability issues that became major in 5.17 and remain through 6.0, although 6.0 rc0-rc4 were stable.
I wasn’t aware that my specific configuration was causing issues, but the fact that nobody else on bugzilla.kernel.org seemed to be having this issue caused me to take a look at that possibility.
To test it, I installed Ubuntu 22.10 on my workstation. I try to avoid non-LTS releases in general, but I installed this release because it had a 5.19 kernel. I then tried that kernel on four servers, and it was stable.
Now two possibilities remained: either Ubuntu had fixed stability issues in their fork of the kernel, OR their configuration was stable. To test which of these was the case, I took 5.19.11, the most current release, and compiled it using Ubuntu’s configuration file. When I did this, it altered many settings on its own. I’m not sure if this was because Ubuntu has kernel hacks the mainline kernel doesn’t, or because of changes between 5.19.0, upon which Ubuntu’s kernel was based, and 5.19.11, upon which my kernel was based. Nonetheless, this kernel also appears stable. I am going to let it run for a few days on four of our busiest servers to test.
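To see exactly which settings get altered when a configuration file is carried from one kernel tree to another, the two files can be compared option by option. This is a minimal sketch that assumes the standard .config layout ("CONFIG_X=value" and "# CONFIG_X is not set"). The option names containing EXAMPLE are hypothetical placeholders, not real kernel options.

```python
import re

NOT_SET = re.compile(r"^# (CONFIG_\w+) is not set$")


def read_options(text):
    """Parse .config-style text, recording 'is not set' options as 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = NOT_SET.match(line)
        if match:
            options[match.group(1)] = "n"
        elif line and not line.startswith("#") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(old_text, new_text):
    """Map each changed option to (old value, new value); None means absent."""
    old, new = read_options(old_text), read_options(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


# Hypothetical before/after fragments, for illustration only.
BEFORE = """\
CONFIG_HZ=250
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_EXAMPLE_OLD is not set
"""
AFTER = """\
CONFIG_HZ=250
CONFIG_PREEMPT_VOLUNTARY=y
CONFIG_EXAMPLE_NEW=y
"""

for name, (was, now) in diff_configs(BEFORE, AFTER).items():
    print(f"{name}: {was} -> {now}")
```

Options that appear only on one side (shown with None) are the ones that exist in one tree but not the other, which is the kind of difference I could not otherwise attribute to Ubuntu's patches or to the 5.19.0 to 5.19.11 changes.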
If it runs clean for a few days, I will install this kernel on all of our servers. It will slow them down a tad, but hopefully this is temporary. My intent is that once I reach a point of stability, I will add my changes one at a time and test. This way I can identify exactly which configuration option is causing issues, then look at what code is affected, and from that file a much more detailed and specific bug report, which hopefully will ultimately lead to a fix.
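The one-change-at-a-time plan can be sketched as follows. The function names are hypothetical, and the is_stable check here is a stand-in: in practice "stable" means building the kernel and letting it run for days on busy servers, which no script can shortcut. The fake check below simply mirrors the combination that has been failing for me.

```python
def incremental_steps(baseline, changes):
    """Yield (option, config) pairs, adding one more change at each step."""
    config = dict(baseline)
    for name, value in changes:
        config[name] = value
        yield name, dict(config)


def first_failing_option(baseline, changes, is_stable):
    """Return the option whose addition first makes the config unstable."""
    for name, config in incremental_steps(baseline, changes):
        if not is_stable(config):
            return name
    return None


# Stand-in for days of real testing: treat "fully tickless plus
# non-preemptive" as the unstable combination (illustrative assumption).
def fake_is_stable(config):
    return not (
        config.get("CONFIG_NO_HZ_FULL") == "y"
        and config.get("CONFIG_PREEMPT_NONE") == "y"
    )


BASELINE = {"CONFIG_NO_HZ_FULL": "n", "CONFIG_PREEMPT_NONE": "n"}
CHANGES = [("CONFIG_NO_HZ_FULL", "y"), ("CONFIG_PREEMPT_NONE", "y")]

print(first_failing_option(BASELINE, CHANGES, fake_is_stable))
```

Note that the option reported is only the one that tipped things over. As with tickless and non-preemptive, the real fault may lie in the combination rather than in that option by itself.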
The 5.18.19 kernel, which previously was stable, is now showing instabilities and the same CPU stalls as the 6.0.0 kernel, which points to a possible configuration issue on my part. The only thing I’ve intentionally changed between this and previous 5.18.19 kernels was to enable a bunch of Intel retpoline CPU security protections. It may be that one or more of those is broken. I’m going to go back to a stock configuration with only the scheduling changed, as I previously had, and then, if it is stable, introduce one of these options at a time to try to determine which is at fault. I will not be able to do this tonight but should be able to get it in place tomorrow night. I’m going to try this with 6.0.0-rc6, since it is based on the same conf as the current 5.18.19 kernel AND is showing the same symptoms (CPU stalls).
I’ve had several people ask about the issues with the 6.0.0-rc5 and 6.0.0-rc6 kernels. So far the developers have not picked up my bug report. For those interested in the details, here is the URL: https://bugzilla.kernel.org/show_bug.cgi?id=216501
I did not finish downgrading all the machines to 5.18.19, but I did finish the physical hosts, the mail servers (client and both incoming), the web server, ubuntu, debian, centos7, and slinux-7. All the machines still on 6.0.0-rc5 are virtual machines, so I can reboot them from here if necessary.
I will be working on changing the remainder Thursday evening.
Correction: the time zone is Pacific Daylight Time (PDT), not PST.