The kernel issues are not resolved but I am making progress.
I have been running customized kernels for eons because I can get more efficiency and better responsiveness than the kitchen-sink kernels distributed with a Linux distribution like Ubuntu.
From 5.15 forward there have been some minor stability issues that became major in 5.17 and remain through 6.0, although rc1-rc4 were stable.
I wasn’t aware that my specific configuration was causing issues, but the fact that nobody else on bugzilla.kernel.org seemed to be having this issue caused me to take a look at that possibility.
To test it, I installed Ubuntu 22.10 on my workstation. I try to avoid non-LTS releases in general, but I installed this release because it had a 5.19 kernel. I then tried that kernel on four servers; it was stable.
Now, two possibilities remained: either Ubuntu had fixed the stability issues in their fork of the kernel, OR their configuration was stable. To test which of these it was, I took 5.19.11, the most current release, and compiled it using Ubuntu’s configuration file. Doing this altered many settings on its own; I’m not sure whether that was because Ubuntu has kernel hacks the mainline kernel doesn’t, or because of changes between 5.19.0, upon which Ubuntu’s kernel was based, and 5.19.11, upon which my kernel was based. Nonetheless, this kernel also appears stable. I am going to let it run for a few days on four of our busiest servers to test.
If it runs clean for a few days I will install this kernel on all of our servers. It will slow them down a tad, but hopefully this is temporary. My intent is, once I get to a point of stability, to add my changes back one at a time and test. This way I can identify exactly which configuration option is causing issues; then I can look and see what code is affected, and from that be able to file a much more detailed and specific bug report which hopefully will ultimately lead to a fix.
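The option-at-a-time comparison above can be automated by diffing the two .config files. Here is a minimal sketch of that idea; the sample configs and option values below are purely illustrative, not Ubuntu’s actual settings:

```python
# Hypothetical sketch: find which CONFIG_ options differ between two
# kernel .config files, so changes can be reintroduced one at a time.

def parse_config(text):
    """Map CONFIG_FOO -> value ('n' for '# CONFIG_FOO is not set' lines)."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            opts[line[2:-len(" is not set")]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            opts[name] = value
    return opts

def diff_configs(stable, test):
    """Yield (option, stable_value, test_value) for every difference."""
    a, b = parse_config(stable), parse_config(test)
    for name in sorted(set(a) | set(b)):
        if a.get(name, "n") != b.get(name, "n"):
            yield name, a.get(name, "n"), b.get(name, "n")

# Illustrative fragments standing in for two real .config files.
stable = """\
CONFIG_HZ=250
# CONFIG_RETPOLINE is not set
CONFIG_PREEMPT=y
"""
test = """\
CONFIG_HZ=1000
CONFIG_RETPOLINE=y
CONFIG_PREEMPT=y
"""

for name, old, new in diff_configs(stable, test):
    print(f"{name}: {old} -> {new}")
```

The kernel source tree also ships scripts/diffconfig, which does the same comparison more thoroughly; the sketch just shows the principle.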
The 5.18.19 kernel, which previously was stable, is now showing instabilities, the same CPU stalls as the 6.0.0 kernel, which points to a possible configuration issue on my part. The only things I’ve intentionally changed between this and previous 5.18.19 kernels were to enable a bunch of Intel retpoline CPU security protections. It may be that one or more of those is broken. I’m going to go back to a stock configuration with only the scheduling changed, as I previously had, and then, if it is stable, introduce one of these options at a time to try to determine which is at fault. I will not be able to do this tonight but should be able to get it in place tomorrow night. I’m going to try this with 6.0.0-rc6 since it is based on the same config as the current 5.18.19 kernel AND showing the same symptoms (CPU stalls).
I’ve had several people ask about the issues with the 6.0.0-rc5 and 6.0.0-rc6 kernels; so far the developers have not picked up my bug report. For those interested in the details, here is the URL: https://bugzilla.kernel.org/show_bug.cgi?id=216501
I did not finish downgrading all the machines to 5.18.19, but I did finish the physical hosts, the mail servers (client and both incoming), the web server, ubuntu, debian, centos7, and slinux-7. All the machines still on 6.0.0-rc5 are virtual machines, so I can reboot them from here if necessary.
I will be working on changing the remainder Thursday evening.
Correction: the time zone is Pacific Daylight Time (PDT), not PST.
I am going to revert as many machines as I can tonight back to the 5.18.19 kernel starting at 11pm. I do not have time to prepare all the machines, but what I do not get done will be finished tomorrow evening. The 6.0.0 kernel up through rc4 was good, but rc5 and later have severe CPU stalls just like 5.19.x did, and the kernel development people seem to be basically ignoring it. 5.18.19 is at end of life, which means it is not getting any security updates or fixes, but at least it is stable; nothing after it, currently available, works. If you’re seeing long load times in mail, etc., this is the cause. So there will be some outages tonight from 11pm until I can’t work any later, and then again tomorrow at 11pm, though tomorrow’s will probably be less service-affecting since tonight I will focus on the physical servers and the busier servers: web, mail, and the ubuntu and debian shell servers.
This also affects all of our Fediverse services, https://friendica.eskimo.com, https://www.hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.
I’ve put the PHP default back to 7.4 because the APCu and Memcache PHP modules are not working for versions 8-8.2.
I’ve upgraded the versions of PHP available to include 5.6, 7.0, 7.1, 7.2, 7.3, 7.4, 8.0, 8.1, and 8.2. The 8.2 installation isn’t entirely complete yet, so I don’t recommend its use; it is fundamentally functional, but there is no redis cache, no memcache, etc. I have not yet updated the documentation on how to select 8.1 or 8.2, but it is basically the same scheme as all the other versions.
I have changed the default version of PHP from 7.4 to 8.0 and am in the process of upgrading various web applications to the latest versions to support PHP 8.0, so if something isn’t working on our website yet, it should be fixed by the end of this evening.
Scientific7.eskimo.com is down. I broke it in an attempt to convert the /boot file system from 128-byte inodes to 256-byte inodes to accommodate 64-bit timestamps so that it can work beyond 2038, though why bother, given that its end of life is ten years before that, is somewhat lost on me. At any rate, I am restoring from backups, which will probably take about an hour.
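For anyone wondering where the 2038 limit comes from: a 32-bit signed time_t counts seconds from the Unix epoch and overflows early in 2038, which is why filesystems need the wider timestamp fields that larger inodes leave room for. A quick sketch of where the rollover actually lands:

```python
from datetime import datetime, timezone

# A 32-bit signed time_t tops out at 2**31 - 1 seconds past the
# Unix epoch; one second later it overflows (the "year 2038 problem").
T32_MAX = 2**31 - 1
rollover = datetime.fromtimestamp(T32_MAX, tz=timezone.utc)
print(rollover)  # 2038-01-19 03:14:07+00:00
```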
I will be performing a kernel upgrade of all Eskimo North servers starting at 11PM tonight; I expect to conclude by 11:30PM. If all goes well, no individual service should be down for more than about ten minutes, most for less.
This will include shell servers, e-mail, web hosting, and fediverse services including https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.