Just found out a long-time customer, Dave Bruels, who had Interlake China Tours, passed away yesterday. I’ve spent many lunches with him and enjoyed his photos from China. He was a great man and I will miss him. His wife had passed not long before, and whatever the medical reason, I’m sure he died of a broken heart. She was a wonderful woman who fought cancer with a courage you rarely see in a human being, hiking and doing the things she loved right up to the end.
Zorin Borked
Zorin will be partially down for a few days; it won’t be fully operational again for several days. Because Zorin 15 is based upon Ubuntu 16.04, there is NO support for bpfilter, and the current Linux kernel has deprecated the old iptables filtering scheme, so fail2ban and other firewall features do not work.
Even though Zorin is a complete rip-off of Ubuntu, performing a do-release-upgrade only updates parts of the system, and you end up with a mix of Ubuntu 16.04 and 20.04 components that does not play well together.
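For anyone curious what such a mix looks like, checking which Ubuntu base a Zorin install reports, and whether repositories from two different releases are configured, goes roughly like this (a sketch; the release codenames assume the 16.04/20.04 mix described above):

    # which Ubuntu release does this install claim to be built on?
    grep -E '^(NAME|VERSION|UBUNTU_CODENAME)=' /etc/os-release
    lsb_release -a

    # look for package sources from more than one Ubuntu release (xenial = 16.04, focal = 20.04)
    apt-cache policy | grep -oE 'a=(xenial|focal)[^,]*' | sort -u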
Thus the ONLY way to upgrade to Zorin 16, which is STILL based on a two-and-a-half-year-old Ubuntu release (20.04), is a fresh install, which means re-installing all the applications and re-configuring everything, several days of work. I paid for the “Pro” version of the previous release on the promise by the Zorin developers that they were working on an in-place upgrade, but a year and a half later it is still vaporware, so I am only installing Zorin Core this time around.
6.0 Kernel Working Well
The 6.0 kernel is working well; so far there has been only one expedited RCU CPU stall, and that was on a virtual host during boot-up on a machine that has quite a few guests. These can occur at that point simply because ALL of the guests are busy starting up and briefly exceed the available CPU cycles. If it continues to run this clean until this Friday, then we’ll upgrade the physical servers to this kernel that night at 11PM.
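For the record, spotting these stalls is just a matter of grepping the kernel log; a minimal sketch, assuming kernel messages land in the systemd journal:

    # scan the kernel log of the current boot for RCU stall warnings
    journalctl -k -b | grep -i rcu | grep -i stall

    # or, straight from the kernel ring buffer
    dmesg -T | grep -i rcu | grep -i stall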
Linux Official 6.0 Release
At this point I have Linux kernel 6.0 installed on all but the private virtual machines and the physical hosts. So far it’s running MUCH better than rc5 or rc6, but not perfect. I’ve gotten ONE expedited RCU CPU stall in a day’s time across a dozen or so machines. With 6.0-rc5 and -rc6 I would see dozens in the same time frame.
The scheduler has been substantially re-worked in the 6.0 kernel; most of these changes were aimed at AMD CPUs, but they made for substantial performance improvements on Intel as well. One stall in one day across a dozen machines means one process hung for up to two minutes in that time frame, not a HUGE concern but enough that I don’t want to put it on the physical hosts yet. The scheduling changes cut the latency for the web server by approximately a factor of six, and that ain’t chicken feed!
I’ve posted the most recent expedited RCU CPU stall to the bug report I had filed with bugzilla.kernel.org; for anyone interested it is bug #216501 and you can read the details here: https://bugzilla.kernel.org/show_bug.cgi?id=216501.
Kernel Progress?
I received a note from bugzilla.kernel.org that another user found 6.0-rc7 no longer had the expedited RCU CPU stall issue, but before I could grab and try it, the official 6.0 release came out, so I’m going to build that and give it a try on a few select servers. If it runs clean during the week I’ll propagate it to the other machines this coming weekend.
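The build itself is the usual routine; a rough sketch, assuming the running kernel’s config under /boot is used as the starting point and Debian packages are the desired output:

    # fetch and unpack the 6.0 source from kernel.org
    wget https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-6.0.tar.xz
    tar xf linux-6.0.tar.xz && cd linux-6.0

    # start from the running kernel's configuration and accept defaults for new options
    cp /boot/config-"$(uname -r)" .config
    make olddefconfig

    # build installable .deb packages (or: make -j"$(nproc)" && sudo make modules_install install)
    make -j"$(nproc)" bindeb-pkg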
Service Affecting Kernel Upgrades Completed
Kernel upgrades are completed on the physical hosts and critical machines. They are not done on vps1-vps7, but those machines are all single-core, and single-core machines are not experiencing the CPU stall bugs seen in the 5.16-6.0 kernels.
Some of the other shell servers are also not updated yet, but these are sufficiently non-busy that I can easily hit them when nobody is logged in, and for many I still have to build kernels before I can update. 5.15.71 booted cleanly on all the machines it is presently installed on.
Kernel Upgrades 11pm Oct 2nd PDT (GMT-0700)
Things are running smoothly, but load is high with the stock 5.15.0 kernel from Ubuntu. I’ve configured the latest 5.15 kernel (5.15.71) and installed it on a number of machines, and it is running well with no forced preemption, a 100Hz clock, and a fully tickless configuration. This reduces overhead somewhat, so I am going to install it on the physical servers tonight at 11PM, which will require rebooting everything. This will result in some downtime between 11PM and 11:30PM, but not more than about 10 minutes for any given service.
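For those building their own kernels, those three choices map onto ordinary Kconfig options; a sketch of setting them from inside the kernel source tree (option names per the upstream Kconfig, with dependencies left to olddefconfig):

    # no forced preemption, a 100Hz timer tick, and full tickless operation
    scripts/config --enable PREEMPT_NONE --disable PREEMPT \
                   --enable HZ_100 --disable HZ_250 --disable HZ_1000 --set-val HZ 100 \
                   --enable NO_HZ_FULL
    make olddefconfig   # resolve whatever the above options depend on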
I will be installing this kernel on most of the other machines during the week but I need to get it on the physical hosts tonight.
This will affect all Eskimo North services, including private virtual servers, shell servers, shared web hosting packages such as virtual domains, personal and business web hosting packages, and e-mail.
It will also affect our Fediverse services: https://friendica.eskimo.com/ (a Fediverse social media site), https://hubzilla.eskimo.com/ (another Fediverse social media site), https://nextcloud.eskimo.com/ (a federated cloud service), and https://yacy.eskimo.com/ (a federated, uncensored search engine).
5.19.12 Broken
Even though 5.19.11 ran okay on four busy servers for a week, 5.19.12 is NOT running well, and 5.19.11 is no longer available, so I’m going to go back to a stock 5.15 Ubuntu kernel on most of the machines. Tonight there will be additional reboots between 11pm and 11:30pm to this end.
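Falling back is straightforward on Ubuntu; a minimal sketch, assuming the stock generic kernel image is wanted and GRUB is the boot loader (removing the custom 5.19 packages, or pointing GRUB_DEFAULT at the 5.15 entry, is a separate step):

    # make sure the stock Ubuntu generic kernel is installed and the boot menu knows about it
    sudo apt install --reinstall linux-image-generic
    sudo update-grub

    # after the reboot, confirm which kernel is actually running
    uname -r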
It seems to me that some fundamental change was made to the Linux kernel in 5.16 and onward that greatly improved scheduling and context-switching efficiency but introduced serious stability problems that have not yet been addressed. So at this point I’m going back to a stock 5.15 kernel, and when the official release of 6.0 comes out we’ll experiment with that. I may also experiment with some custom configurations of 5.15 to see if we can’t improve the scheduling and context-switching efficiency of that kernel somewhat.
New Kernel Failed
The new kernel is not running well on our main NFS server, so another reboot of iglulik, the server that provides the /home directories, was required to load a different kernel. Iglulik is now running the stock Ubuntu 22.04 5.15.0 kernel. Less than ideal, but 5.19.12 did not run well on it. We had CPU stalls, but not the usual expedited RCU CPU stalls; these were more generic two-minute CPU stalls that periodically broke NFS.
Also, I had to switch iptables from the legacy backend to the nftables backend, because legacy iptables is no longer supported. I had to do this for vps4, vps5, and vps7.
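On Ubuntu and Debian the backend switch is handled through the alternatives system; a minimal sketch, assuming the standard iptables package layout:

    # point the iptables and ip6tables commands at the nftables-backed implementations
    sudo update-alternatives --set iptables /usr/sbin/iptables-nft
    sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-nft

    # verify: the version string should now say (nf_tables) rather than (legacy)
    sudo iptables -V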
Kernel Upgrades and Expedited RCU CPU Stalls
From kernel 5.15 forward, we’ve had issues with expedited RCU CPU stalls on our servers.
I’ve experimented with kernels configured per the stock Ubuntu configuration, and those same kernel versions do NOT show expedited RCU CPU stalls.
RCU stands for Read-Copy-Update; it is a mechanism that allows reads to proceed concurrently with updates, without requiring a lock, which makes for greater efficiency on modern multi-core CPUs and multi-CPU systems. If you are interested in the details, read https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html.
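Incidentally, the stall detector itself is tunable at runtime; a hedged sketch of where the relevant knobs usually live (paths assume the standard rcupdate module parameters and sysfs entries):

    # how long (in seconds) a CPU may stall before the detector complains
    cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

    # whether grace periods are currently being forced onto the expedited path
    cat /sys/kernel/rcu_expedited

    # the timeout can also be raised at boot, e.g. rcupdate.rcu_cpu_stall_timeout=60
    # on the kernel command line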
The kernels I will be putting in place tonight are 5.19.12 and, in a few cases, 5.19.11 (I started the update working with 5.19.11, then kernel.org came out with 5.19.12), with a configuration closely resembling the Ubuntu “generic” kernels, which is to say it won’t be entirely tickless, only idle-tickless, and it won’t be entirely non-preemptive, but will allow voluntary preemption. This is less efficient than our normal kernels, but a week’s worth of testing has shown it to be stable on four of our busiest servers.
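A quick way to see how any installed kernel was configured in these respects (assuming the distro drops its config under /boot, as Ubuntu does):

    # show the tick and preemption settings of the currently running kernel
    grep -E '^CONFIG_(NO_HZ_IDLE|NO_HZ_FULL|HZ|PREEMPT_NONE|PREEMPT_VOLUNTARY|PREEMPT)=' \
        /boot/config-"$(uname -r)"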
I will be further testing a kernel that is voluntarily preemptive but completely tickless on the four busiest servers to see if that is stable. I’ve been testing this configuration on my workstation, and other than the higher latency you get with non-preemptive kernels, it has been stable. If this works out, we will adopt this configuration on the rest of the servers next Friday.
This will affect all Eskimo services: shared web hosting, shell services, e-mail, and virtual private servers, as well as our Fediverse services, https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and http://yacy.eskimo.com/.
The updates will begin at 11PM and should be completed by 11:30PM Pacific Daylight Time (GMT-0700) tonight, Friday, September 30th, 2022. No single service should be down for more than about ten minutes.