I’ve put the PHP default back to 7.4 because the APCu and Memcache PHP modules are not working for versions 8.0 through 8.2.
I’ve expanded the available PHP versions to include 5.6, 7.0, 7.1, 7.2, 7.3, 7.4, 8.0, 8.1, and 8.2. 8.2 isn’t entirely complete yet, so I don’t recommend its use; it is fundamentally functional, but there is no redis cache, no memcache, etc. I have not yet updated the documentation on how to select 8.1 or 8.2, but it is basically the same scheme as for all the other versions.
I have changed the default version of PHP from 7.4 to 8.0 and am in the process of upgrading various web applications to the latest versions to support PHP 8.0, so if something isn’t working on our website yet, it should be fixed by the end of this evening.
Scientific7.eskimo.com is down. I broke it in an attempt to convert the /boot file system from 128-byte inodes to 256-byte inodes to accommodate 64-bit timestamps so that it can work beyond 2038, though why, given the end of life is ten years before that, is somewhat lost on me. At any rate, restoring from backups will probably take about an hour.
I will be performing a kernel upgrade of all Eskimo North servers starting at 11PM tonight and expect to conclude by 11:30PM. If all goes well, no individual service should be down for more than about ten minutes, and most for less.
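For context on the 2038 limit mentioned above: a timestamp stored as a signed 32-bit count of seconds since 1970 runs out in January 2038. A quick sketch of the arithmetic in Python (the helper name is only for illustration):

```python
from datetime import datetime, timezone

def last_32bit_timestamp() -> datetime:
    """Return the latest instant a signed 32-bit seconds counter can hold."""
    max_seconds = 2**31 - 1  # largest value of a signed 32-bit integer
    return datetime.fromtimestamp(max_seconds, tz=timezone.utc)

print(last_32bit_timestamp().isoformat())  # 2038-01-19T03:14:07+00:00
```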
This will include shell servers, e-mail, web hosting, and fediverse services including https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.
Reboots are completed. I did not have any issues with the web server running for five days on the new kernel, but 1/2 hour after reboot it failed, so I’m not sure if it was a fluke or a real problem. I’ll keep an eye on it.
The release candidate kernel, 6.0.0-rc4, is working better than any kernel I’ve used to date, and better by far than the last three stable releases in terms of performance, stability, and security. I’ve got it running on four of our busiest servers presently, and unlike 5.19.x it has not produced any CPU stalls or oopses, and it runs even faster, all with the known retpoline mitigations in place. 5.18 was stable but unusable owing to the kernel spending more time in system than user space. 5.19 remedied this but introduced some nasty instabilities. 6.0.0-rc4 appears to fix everything.
So if it continues to run clean on those servers, we will be upgrading the remaining servers to 6.0.0-rc4 (or rc5 if it comes out between now and then) on Friday, September 9th, starting at 11pm and finishing by about 11:30pm Pacific Daylight Time (UTC-7).
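One way to see whether a machine is spending more time in the kernel than in user space is to compare the cumulative counters on the first line of /proc/stat. This is only a rough sketch: the function names are made up for illustration, and it looks at totals since boot rather than a recent interval, which may hide short bursts.

```python
import os

def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line from /proc/stat into named counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    return dict(zip(names, (int(v) for v in fields[1:])))

def kernel_heavier_than_user(counters: dict) -> bool:
    """True if time in system mode exceeds time in user mode (including nice)."""
    return counters["system"] > counters["user"] + counters["nice"]

# Only attempt the read where /proc/stat exists (i.e. on Linux).
if os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        counters = parse_cpu_line(f.readline())
    print("user:", counters["user"], "system:", counters["system"],
          "kernel-heavy:", kernel_heavier_than_user(counters))
```

Sampling the counters twice a few seconds apart and comparing the differences would give a better picture of current load.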
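For anyone outside the Pacific time zone, the window above can be converted to UTC as follows. This sketch assumes the year is 2022 (the announcement does not state it) and uses the fixed -0700 offset given above.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")  # fixed offset from the announcement

# Assumption: the year is 2022; the announcement gives only "Friday September 9th".
start = datetime(2022, 9, 9, 23, 0, tzinfo=PDT)
end = start + timedelta(minutes=30)

start_utc = start.astimezone(timezone.utc)
end_utc = end.astimezone(timezone.utc)
print(start_utc.isoformat(), "to", end_utc.isoformat())
```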
This will affect all Eskimo North services including https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, https://yacy.eskimo.com/, and our own website and related services at https://www.eskimo.com/ as well as virtual private servers, shared hosting, and Linux shell servers. I do not expect downtime to be more than about ten minutes on any one service.
At present, as near as I can determine there is no Linux kernel which is simultaneously:
1) Up to date with regard to outstanding security issues.
2) Entirely stable and functional.
3) Performing well, without spending more time in the kernel than in userland.
It is possible to get any two out of three of these, but not all three in any one kernel at present. 5.15 provides #1 and #2 but not #3, and isn’t adequate, especially for busy machines like the web and mail servers. 5.18 provides #2 and #3 but not #1, and 5.19 provides #1 and #3 but not #2. 6.0.0-rc4 is looking promising, but there has not been enough testing yet to see if it’s going to be stable.
So we are still working with developers and testing various kernels and configurations to attempt to achieve all three of these. At present, I am running 5.19 on most non-critical servers that give customers shell access, since security is important on these, and 5.18.19 on machines that have very little public exposure. This seems like the best compromise for the time being, but I will be trialing 6.0.0-rc4 on various machines. There will be brief outages of web and mail as I reboot into this kernel to test.
We will be performing kernel regression testing to try to determine which commit introduced the RCU expedited timeout bug in the 5.19 kernel, to assist kernel developers in correcting the situation. Because it can take several weeks for this to manifest on a lightly loaded workstation, I will be trialing various kernels on our web server and mail server, which are busy enough that they generate errors in a short time. This may result in brief (under 2 minute) interruptions of our mail and web services, but these are necessary to determine the cause and eliminate this issue in Linux kernels moving forward.
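The two-out-of-three situation can be written down as a small lookup table. This is just a restatement of the paragraph above in Python; 6.0.0-rc4 is left out because it has not been tested enough to classify.

```python
# Properties as numbered above: 1 = security fixes, 2 = stable, 3 = performs well.
KERNELS = {
    "5.15": {1, 2},
    "5.18": {2, 3},
    "5.19": {1, 3},
}
ALL_THREE = {1, 2, 3}

def missing(version: str) -> set:
    """Return the numbered properties a kernel series does not provide."""
    return ALL_THREE - KERNELS[version]

for version in KERNELS:
    print(version, "lacks", sorted(missing(version)))
```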
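The regression hunt described above is essentially a binary search over an ordered list of kernel builds. The sketch below is a generic illustration of that idea, not the exact procedure used here; `builds` and `is_bad` are hypothetical, and in practice each `is_bad` check means booting a kernel on a busy server and waiting to see whether the stall appears.

```python
from typing import Callable, Sequence

def first_bad(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first build for which is_bad() is true.

    Assumes builds are in chronological order, the first is known good,
    the last is known bad, and once a build is bad all later ones are too.
    """
    good, bad = 0, len(builds) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(builds[mid]):
            bad = mid   # bug is already present here; look earlier
        else:
            good = mid  # still clean here; look later
    return builds[bad]

# Hypothetical example: builds labelled b0..b9, with the bug introduced at b6.
builds = [f"b{i}" for i in range(10)]
culprit = first_bad(builds, lambda b: int(b[1:]) >= 6)
print(culprit)  # b6
```

The appeal of this approach is that the number of test boots grows with the logarithm of the number of candidate builds, which matters when each test can take hours.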
Kernel upgrades have been completed. All servers are back in service. All NFS and NIS relationships are re-established. All services are up.
We will be upgrading to kernel 5.19.4 this Friday at 11PM. I don’t know if this will address the CPU stalls or not. The changes to the kernel since 5.19.3 are so voluminous that I don’t have time to review all of them to see if any of them impact this problem. I do know that the issue we had is apparently not in the KVM code itself, as it has also occurred on the physical hosts and in at least four different parts of the kernel, so it is probably in a function all of these are using, or something is doing wild writes in memory. The ticket I opened is still not resolved. But since we know that this problem exists at least from 5.19.0 through 5.19.3, and going back to 5.17 is not an option since it’s past EOL and there are some serious exploits not fixed in that kernel, moving forward is the only viable option at this point.
The expected maintenance window is approximately 1/2 hour, with no single service being down for more than about 10-15 minutes.
This will affect ALL of Eskimo’s host services including virtual private servers, web and e-mail hosting, and our fediverse services https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com.