At this point I have Linux kernel 6.0 installed on everything except the private virtual machines and the physical hosts. So far it’s running MUCH better than rc5 or rc6, but not perfectly. I’ve gotten ONE expedited RCU CPU stall in a day’s time across a dozen or so machines. With 6.0-rc5 and 6.0-rc6 I would see dozens in the same time frame.
The scheduler has been substantially reworked in the 6.0 kernel. Most of these changes were aimed at AMD CPUs, but they made for substantial performance improvements on Intel as well. One stall in one day across a dozen machines means one process hung for up to two minutes in that time frame. That’s not a HUGE concern, but it’s enough that I don’t want to put it on the physical hosts yet. The scheduling changes cut latency for the web server roughly sixfold, and that ain’t chicken feed!
I’ve added the most recent expedited RCU CPU stall to the bug report I filed at bugzilla.kernel.org. For anyone interested, it is bug #216501, and you can read the details here: https://bugzilla.kernel.org/show_bug.cgi?id=216501.
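For anyone who wants to tally these stalls on their own machines, here is a minimal sketch in Python. It assumes the kernel reports an expedited stall with a line containing "detected expedited stalls" (for example "rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: ..."); the exact wording can vary between kernel versions, and the function name is just illustrative.

```python
import re

# Assumption: expedited RCU stall reports look roughly like
#   rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 3-... }
# The RCU flavor name (rcu_preempt, rcu_sched, ...) is matched loosely.
EXPEDITED_STALL = re.compile(r"rcu: INFO: \w+ detected expedited stalls")


def count_expedited_stalls(log_text):
    """Return the number of log lines that report an expedited RCU stall."""
    return sum(
        1 for line in log_text.splitlines() if EXPEDITED_STALL.search(line)
    )
```

Feeding it the text of a day's kernel log (for instance the saved output of `journalctl -k` or `dmesg`) from each machine and summing the results gives a number comparable to the "one stall across a dozen machines" figure above.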
I received a note from bugzilla.kernel.org that another user found 6.0.0rc7 no longer had the expedited RCU CPU stalls issue, but before I could grab and try it, the official release of 6.0 came out, so I’m going to build it and give it a try on a few select servers. If it runs clean during the week I’ll propagate it to the other machines this coming weekend.
Kernel upgrades are complete on the physical hosts and critical machines. They are not done on vps1-vps7, but those machines are all single-core, and single-core machines are not experiencing the CPU stall bugs in the 5.16-6.0 kernels.
Some of the other shell servers are also not updated yet, but these are quiet enough that I can easily update them when nobody is logged in, and for many of them I still have to build kernels before I can update. 5.15.71 booted cleanly on all the machines it is presently installed on.
Things are running smoothly, but load is high with the stock 5.15.0 kernel from Ubuntu. I’ve configured the latest 5.15 kernel (5.15.71) and installed it on a number of machines, and it is also running well with no forced preemption, a 100 Hz clock, and a fully tickless configuration. This reduces overhead somewhat, so I am going to install it on the physical servers tonight at 11PM, which will require rebooting everything. This will result in some downtime between 11PM and 11:30PM, not more than about 10 minutes for any given service.
I will be installing this kernel on most of the other machines during the week but I need to get it on the physical hosts tonight.
This will affect all Eskimo North services, including private virtual servers, shell servers, shared web hosting packages such as virtual domains, personal and business web hosting packages, and e-mail.
It will also affect our fediverse services https://friendica.eskimo.com/ (a fediverse social media site), https://hubzilla.eskimo.com/ (another fediverse social media site), https://nextcloud.eskimo.com/ (a federated cloud service), and https://yacy.eskimo.com/ (a federated non-censored search engine).
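The kernel settings mentioned above (no forced preemption, 100 Hz clock, fully tickless) can be checked against a kernel config file with a short script. This is only a sketch: the mapping of those three settings to the Kconfig symbols below is my assumption about what they correspond to, and the helper names are made up for illustration.

```python
# Assumed mapping from the settings described above to Kconfig symbols.
WANTED = {
    "CONFIG_PREEMPT_NONE": "y",  # no forced preemption
    "CONFIG_HZ_100": "y",        # 100 Hz timer tick
    "CONFIG_HZ": "100",
    "CONFIG_NO_HZ_FULL": "y",    # full dynticks ("fully tickless")
}


def parse_kconfig(text):
    """Parse 'CONFIG_FOO=value' lines; comment lines such as
    '# CONFIG_FOO is not set' are skipped, so those symbols are absent."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        options[name] = value
    return options


def mismatches(config_text, wanted=WANTED):
    """Return {symbol: (wanted, found)} for each symbol that differs.
    'found' is None when the symbol is not set in the config."""
    options = parse_kconfig(config_text)
    return {
        name: (value, options.get(name))
        for name, value in wanted.items()
        if options.get(name) != value
    }
```

On Ubuntu the config of an installed kernel is normally found at /boot/config-<release>, so passing the contents of that file to `mismatches` shows which of the assumed settings differ; an empty result means all of them match.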
Even though 5.19.11 ran okay on four busy servers for a week, 5.19.12 is NOT running well, and 5.19.11 is no longer available, so I’m going to go back to a stock 5.15 Ubuntu kernel on most of the machines. To that end, there will be additional reboots tonight between 11PM and 11:30PM.
It seems to me that some fundamental change was made to the Linux kernel in 5.16 and forward that greatly improved scheduling and context switching efficiency but introduced serious stability problems that are not yet addressed. So at this point I’m going back to a stock 5.15 kernel and when the official release of 6.0 comes out we’ll experiment with that. I may also experiment with some custom configurations of 5.15 to see if we can’t improve the scheduling and context switching efficiency of that kernel somewhat.
The new kernel is not running well on our main NFS server, so another reboot of iglulik, the server that provides the /home directories, was required to load a different kernel. Iglulik is now running the stock Ubuntu 22.04 5.15.0 kernel. Less than ideal, but 5.19.12 did not run well on it. We had CPU stalls, but they weren’t the usual expedited RCU CPU stalls; it was a more generic 2 minute CPU stall that periodically broke NFS.
Also, I had to switch iptables from legacy to nftables (which uses a BPF-like in-kernel virtual machine) because the legacy iptables is no longer supported. I had to do this for vps4, vps5, and vps7.
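To confirm which backend a machine ended up on after a switch like this, the output of `iptables --version` can be inspected. The sketch below assumes iptables 1.8 or later, which prints the backend in parentheses (for example "iptables v1.8.7 (nf_tables)"); the function name is hypothetical.

```python
import re


def iptables_backend(version_output):
    """Return 'nf_tables' or 'legacy' as named in `iptables --version`
    output, or None if no backend is mentioned (older iptables)."""
    match = re.search(r"\((nf_tables|legacy)\)", version_output)
    return match.group(1) if match else None
```

On Debian and Ubuntu systems the backend is usually selected with `update-alternatives --set iptables /usr/sbin/iptables-nft`, though that is a general note and not necessarily how it was done on vps4, vps5, and vps7.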