Emergency Kernel Reversion Tonight/Tomorrow 11PM PDT (GMT-0700)

     I am going to revert as many machines as I can to the 5.18.19 kernel tonight, starting at 11pm.  I do not have time to prepare all the machines, but whatever I do not get done tonight will be finished tomorrow evening.  The 6.0.0 kernel up through RC4 was good, but RC5 and later have severe CPU stalls just like 5.19.x did, and the kernel developers seem to be basically ignoring the problem.  5.18.19 is at end of life, which means it is not getting any security updates or fixes, but at least it is stable.  Nothing currently available after it works.  If you’re seeing long load times in mail, etc., this is the cause.  So there will be some outages tonight from 11PM until I can’t work any later, and then again tomorrow at 11PM.  Tomorrow’s work will probably be less service affecting, as tonight I will focus on the physical servers and the busier servers: web, mail, and the Ubuntu and Debian shell servers.

    This also affects all of our Fediverse services, https://friendica.eskimo.com, https://www.hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.

PHP Upgrade

     I’ve upgraded the versions of PHP available, which now include 5.6, 7.0, 7.1, 7.2, 7.3, 7.4, 8.0, 8.1, and 8.2.  8.2 isn’t entirely complete yet, so I don’t recommend its use; it is fundamentally functional but has no redis cache, no memcache, etc.  I have not yet updated the documentation on how to select 8.1 or 8.2, but it is basically the same scheme as for all the other versions.

     I have changed the default version of PHP from 7.4 to 8.0 and am in the process of upgrading various web applications to the latest versions to support PHP 8.0, so if something isn’t working on our website yet, it should be fixed by the end of this evening.

Scientific7 Down

     Scientific7.eskimo.com is down.  I broke it in an attempt to convert the /boot file system from 128-byte inodes to 256-byte inodes to accommodate 64-bit timestamps so that it can work beyond 2038, though why that matters, given that its end of life is ten years before that, is somewhat lost on me.  At any rate, restoring from backups will probably take about an hour.
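
     For readers wondering where the 2038 limit comes from, here is a minimal sketch of the arithmetic.  It assumes an ext4-style layout, where a small inode holds a signed 32-bit seconds count and a larger inode adds two extra epoch bits; the function and constant names are made up for illustration and have nothing to do with the tools used on the server.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # Unix epoch, treated as UTC


def last_representable(max_seconds: int) -> datetime:
    """Return the latest UTC time a seconds-since-epoch field can hold."""
    return EPOCH + timedelta(seconds=max_seconds)


# Signed 32-bit seconds field: the classic year-2038 limit.
LIMIT_32 = last_representable(2**31 - 1)

# Assumption: two extra epoch bits on top of the signed 32-bit field,
# which is how ext4 extends timestamps when the inode has room for them.
LIMIT_34 = last_representable(2**31 - 1 + 3 * 2**32)

print("32-bit limit:", LIMIT_32.isoformat())
print("extended limit falls in year", LIMIT_34.year)
```

     The small field runs out in January 2038, while the extended one lasts into the 25th century, which is why the larger inode size is needed for timestamps past 2038.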

Kernel Upgrade Tonight 11PM PDT (GMT-0700)

     I will be performing a kernel upgrade of all Eskimo North servers starting at 11PM tonight, and I expect to conclude by 11:30PM.  If all goes well, no individual service should be down for more than about ten minutes, and most for less.

     This will include shell servers, e-mail, web hosting, and fediverse services including https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.

Reboots Completed

     Reboots are completed.  I did not have any issues with the web server running for five days on the new kernel, but half an hour after the reboot it failed, so I’m not sure whether that was a fluke or a real problem.  I’ll keep an eye on it.

Kernel Upgrades Friday Sept 9th 11pm PDT (-0700GMT)

     The release candidate kernel, 6.0.0-rc4, is working better than any kernel I’ve used to date, and far better than the last three stable releases in terms of performance, stability, and security.  I’ve got it running on four of our busiest servers presently, and unlike 5.19.x it has not produced any CPU stalls or oopses, and it runs even faster, all with the known retpoline mitigations in place.  5.18 was stable but unusable owing to the kernel spending more time in system than user space.  5.19 remedied this but introduced some nasty instabilities.  6.0.0-rc4 appears to fix everything.

     So if it continues to run clean on those servers, we will be upgrading the remaining servers to 6.0.0-rc4 (or rc5 if it comes out between now and then) on Friday September 9th, starting at 11pm.  We should finish by about 11:30pm Pacific Daylight Time (-0700 GMT).

     This will affect all Eskimo North services including https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, https://yacy.eskimo.com/, and our own website and related services at https://www.eskimo.com/ as well as virtual private servers, shared hosting, and Linux shell servers.  I do not expect downtime to be more than about ten minutes on any one service.

Kernel Issues

     At present, as near as I can determine there is no Linux kernel which is simultaneously:

     1) Up to date with regards to outstanding security issues.

     2) Entirely stable and functional.

     3) Performs well and doesn’t spend more time in the kernel than in userland.

     It is possible to get any two out of three of these but not all three in any one kernel at present.  5.15 provides #1 and #2, but does not provide #3 and isn’t adequate, especially for busy machines like the web and mail servers.  5.18 provides #2 and #3, but not #1, and 5.19 provides #1 and #3 but not #2.  6.0.0-rc4 is looking promising, but there has not been enough testing yet to see if it’s going to be stable.
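
     The trade-off can be written down as a small lookup table.  This is only an illustrative sketch of the assessments stated in this post, with made-up names; 6.0.0-rc4 is left out because it has not been tested enough to rate.

```python
# The three properties from the numbered list above.
SECURE, STABLE, FAST = "security-current", "stable", "performs well"
ALL = {SECURE, STABLE, FAST}

# Which properties each kernel series provides, as assessed in this post.
KERNELS = {
    "5.15": {SECURE, STABLE},
    "5.18": {STABLE, FAST},
    "5.19": {SECURE, FAST},
}


def missing(kernel: str) -> set:
    """Return the properties a kernel series lacks."""
    return ALL - KERNELS[kernel]


def candidates(required: set) -> list:
    """Return the kernel series that provide every required property."""
    return sorted(k for k, have in KERNELS.items() if required <= have)


for name in sorted(KERNELS):
    print(name, "lacks:", ", ".join(sorted(missing(name))))
print("kernels with all three:", candidates(ALL))
```

     Asking for all three properties returns an empty list, while any pair of properties returns exactly one kernel series.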

     So we are still working with developers and testing various kernels and configurations to attempt to achieve all three of these.  At present, I am running 5.19 on most non-critical servers that give customers shell access, since security is important on these, and 5.18.19 on machines that have very little public exposure.  This seems like the best compromise at this time, but I will be trialing 6.0.0-rc4 on various machines.  There will be brief outages of web and mail as I reboot into this kernel to test.

Kernel Regression Testing

     We will be performing kernel regression testing to try to determine which commit introduced the RCU expedited timeout bug in the 5.19 kernel, to assist kernel developers in correcting the situation.  Because it can take several weeks for this to manifest on an unbusy workstation, I will be trialing various kernels on our web server and mail server, which are busy enough that they generate errors in a short time.  This may result in brief (under 2 minute) interruptions in service on our mail and web services, but these are necessary to determine the cause and eliminate this issue on Linux kernels moving forward.
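
     For the curious, one common way to narrow a regression down to a single commit is a binary search over the commit history, which is what git bisect automates.  The sketch below shows the idea only; the commit names, the culprit, and the stalls() check are hypothetical stand-ins for building a kernel, booting it on a busy server, and watching for the stall.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the first bad commit.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is
    also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return commits[bad]


# Hypothetical history of 16 commits with the regression introduced at c11.
HISTORY = [f"c{i:02d}" for i in range(16)]
CULPRIT = "c11"
tested = []


def stalls(commit: str) -> bool:
    """Stand-in for a real test boot; records each commit it is asked about."""
    tested.append(commit)
    return commit >= CULPRIT


FOUND = first_bad(HISTORY, stalls)
print("first bad commit:", FOUND, "after", len(tested), "test boots")
```

     Each test boot halves the remaining range, so even a long list of commits needs only a handful of reboots, which is why the interruptions should be few and brief.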