Manjaro Response

A customer wrote:

I think the only reason Manjaro isn't used as much is because it isn't available on eskimo.com, only on yellow-snow. I've tried:

$ ssh user@manjaro.eskimo.com
ssh: Could not resolve hostname manjaro.eskimo.com: Name or service not known

and gave up.

So I've added a CNAME record in the eskimo.com zone, so you can ALSO reach it as manjaro.eskimo.com.
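To confirm that the new alias resolves, a quick lookup can be run from any machine. This is a minimal sketch using only Python's standard `socket` module; it asks the resolver for the canonical name, which for a CNAME is normally the name the alias points to. The function name `resolve` is hypothetical, and the actual target name on yellow-snow is not given in the post, so the sketch simply reports whatever the resolver returns.

```python
import socket


def resolve(hostname):
    """Return (canonical_name, sorted addresses) for hostname.

    Raises socket.gaierror if the name does not resolve, which is the
    same failure ssh reports as "Name or service not known".
    """
    infos = socket.getaddrinfo(hostname, None, flags=socket.AI_CANONNAME)
    # With AI_CANONNAME the canonical name is reported on the first entry.
    canonical = infos[0][3] or hostname
    # sockaddr[0] is the address string for both IPv4 and IPv6 results;
    # the set removes duplicates returned once per socket type.
    addresses = sorted({info[4][0] for info in infos})
    return canonical, addresses
```

For example, `resolve("manjaro.eskimo.com")` should now succeed where the earlier ssh attempt failed, and its first element shows which host the alias points to.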

Manjaro – Anyone Using It?

     According to the logs, nobody has used Manjaro since April, so if it is just wasting resources I am considering removing it. However, the logs do not capture graphical sessions, because x2go does not write to the /var/log/wtmp and /var/run/utmp files.
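For reference, "the logs" here are the login accounting records that tools such as `last` read. The sketch below is one way to find the most recent login recorded in a wtmp file. It assumes the glibc x86_64 record layout (384-byte records) and counts only USER_PROCESS entries, so, as the post notes, x2go desktop sessions would not show up in its result. The function and constant names are illustrative, not part of any existing tool.

```python
import struct
from datetime import datetime, timezone

# Assumed layout: glibc struct utmp on x86_64 (384 bytes per record).
# "<" disables native alignment, so the 2 pad bytes after ut_type are
# written explicitly as "xx".
UTMP_FORMAT = "<hxxi32s4s32s256shhiii4i20s"
RECORD_SIZE = struct.calcsize(UTMP_FORMAT)  # 384
USER_PROCESS = 7  # ut_type of an ordinary login session


def _text(raw):
    """Decode a NUL-padded fixed-width C string field."""
    return raw.split(b"\0", 1)[0].decode("utf-8", "replace")


def last_login(path="/var/log/wtmp"):
    """Return (user, host, UTC datetime) of the newest login record, or None."""
    with open(path, "rb") as f:
        data = f.read()
    newest = None
    # Walk whole records only; a truncated trailing record is ignored.
    for offset in range(0, len(data) - RECORD_SIZE + 1, RECORD_SIZE):
        fields = struct.unpack_from(UTMP_FORMAT, data, offset)
        ut_type, user, host, tv_sec = fields[0], fields[4], fields[5], fields[9]
        if ut_type != USER_PROCESS:
            continue  # skip boot, run-level, logout and other entries
        if newest is None or tv_sec >= newest[2]:
            newest = (_text(user), _text(host), tv_sec)
    if newest is None:
        return None
    user, host, tv_sec = newest
    return user, host, datetime.fromtimestamp(tv_sec, tz=timezone.utc)
```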

     If anyone has any objection to this machine going away, please make your concerns known now.

 

SquirrelMail

     We recently upgraded SquirrelMail to a version compatible with PHP 8.0+.

     Unfortunately, the upgrade disabled some of the theme color and font options that some of our customers preferred.

     Consequently, I've restored the old version and configured it to run under PHP 7.4, the version it was designed for.

     I've been trying to find some newer, more modern web mail clients. We also have raintree and roundcube, but those tend to be optimized for portable devices. SquirrelMail is the oldest and least well-supported of our web mail clients, yet it remains by far the most popular, which makes it a challenge.

Kernel Issues

     From 5.15 onwards there seems to be an incompatibility between the tickless and non-preemptive kernel options. If I select either one by itself, the kernel seems to be stable; if I select both, I get RCU expedited CPU stalls. This is not easy to sort out, because each of these options by itself triggers a dozen or more other selections, so the problem cannot easily be isolated to a specific bit of code. For now I'm going to go with tickless and voluntary preemption. This combination seems to suffice to stop the RCU expedited CPU stalls, and it shouldn't cost much efficiency, since voluntary preemption only adds explicit points in kernel code where a waiting higher-priority task can be scheduled.
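Both settings are Kconfig "choice" groups: the preemption model (CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT, and CONFIG_PREEMPT_RT in newer trees) and the timer tick mode (CONFIG_HZ_PERIODIC, CONFIG_NO_HZ_IDLE, CONFIG_NO_HZ_FULL). The post doesn't say which tickless variant is in use, so this hypothetical helper only reports what a given .config selects and flags the tickless plus no-preemption pairing described above.

```python
import re

PREEMPT_MODELS = ("PREEMPT_NONE", "PREEMPT_VOLUNTARY", "PREEMPT", "PREEMPT_RT")
TICK_MODES = ("HZ_PERIODIC", "NO_HZ_IDLE", "NO_HZ_FULL")


def enabled_options(config_text):
    """Return the set of option names set to y in a kernel .config."""
    return set(re.findall(r"^CONFIG_(\w+)=y$", config_text, re.MULTILINE))


def scheduling_summary(config_text):
    """Return (preemption model, tick mode, suspect) for a .config.

    'suspect' is True for the tickless + no-preemption combination
    that produced RCU expedited CPU stalls here from 5.15 onwards.
    """
    enabled = enabled_options(config_text)
    preempt = next((m for m in PREEMPT_MODELS if m in enabled), None)
    tick = next((m for m in TICK_MODES if m in enabled), None)
    suspect = preempt == "PREEMPT_NONE" and tick in ("NO_HZ_IDLE", "NO_HZ_FULL")
    return preempt, tick, suspect
```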

     As a consequence, I am scheduling a kernel upgrade for next Friday, Sept 30th, starting at 11pm. I will be installing the new kernels on all the machines ahead of time, just not rebooting them sooner, so if any of them reboots spontaneously it will boot into the new kernel.

Kernel Issues

     The kernel issues are not resolved but I am making progress.

     I have been running customized kernels for eons, because I can get more efficiency and better response than with the kitchen-sink kernels that ship with Linux distributions like Ubuntu.

     From 5.15 forward there have been some minor stability issues; they became major in 5.17 and have remained through 6.0, although the 6.0 release candidates up through rc4 were stable.

     I hadn't suspected that my specific configuration was causing the issues, but the fact that nobody else on bugzilla.kernel.org seemed to be having this problem led me to consider that possibility.

     To test it, I installed Ubuntu 22.10 on my workstation. I try to avoid non-LTS releases in general, but I installed this release because it had a 5.19 kernel. I then tried that kernel on four servers, and it was stable.

     Two possibilities remained: either Ubuntu had fixed the stability issues in their fork of the kernel, OR their configuration was stable. To find out which, I took 5.19.11, the most current release, and compiled it using Ubuntu's configuration file. When I did this, the configuration step altered many settings on its own. I'm not sure whether this was because Ubuntu's kernel has patches the mainline kernel doesn't, or because of changes between 5.19.0, on which Ubuntu's kernel was based, and 5.19.11, on which my kernel was based. Nonetheless, this kernel also appears stable. I am going to let it run for a few days on four of our busiest servers to test it.
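One way to see exactly which settings were altered when a configuration file is carried over to a different kernel tree is to compare the two .config files. The kernel source ships scripts/diffconfig for this; the sketch below is a simplified, hypothetical Python version of the same idea, treating "# CONFIG_FOO is not set" lines as the value n.

```python
import re

SET_RE = re.compile(r"^CONFIG_(\w+)=(.*)$")
UNSET_RE = re.compile(r"^# CONFIG_(\w+) is not set$")


def parse_config(text):
    """Map option name -> value ('y', 'm', 'n', quoted strings, numbers)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if m := SET_RE.match(line):
            options[m.group(1)] = m.group(2)
        elif m := UNSET_RE.match(line):
            options[m.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return (removed, added, changed) between two .config texts.

    removed/added map name -> value; changed maps name -> (old, new).
    """
    old, new = parse_config(old_text), parse_config(new_text)
    removed = {k: old[k] for k in old.keys() - new.keys()}
    added = {k: new[k] for k in new.keys() - old.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return removed, added, changed
```

Comparing the Ubuntu configuration file that was copied in against the .config the 5.19.11 build actually used would list the altered settings, though it cannot by itself say whether a change came from Ubuntu's patches or from the 5.19.0 to 5.19.11 differences.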

     If it runs cleanly for a few days, I will install this kernel on all of our servers. It will slow them down a tad, but hopefully this is temporary. Once I reach a point of stability, my intent is to add back my changes one at a time and test each. That way I can identify exactly which configuration option is causing the issues, look at what code it affects, and file a much more detailed and specific bug report, which will hopefully lead to a fix.
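The plan of starting from a stable configuration and re-adding changes one at a time can be sketched as a simple loop. This is a hypothetical illustration: `is_stable` stands in for the real, slow process of building, installing, and running a kernel for several days, and the option names used in testing it are only examples.

```python
from typing import Callable, Dict, List, Optional, Tuple

Config = Dict[str, str]
Change = Tuple[str, str]  # (option, value)


def find_culprit(
    baseline: Config,
    changes: List[Change],
    is_stable: Callable[[Config], bool],
) -> Optional[Change]:
    """Re-apply changes to a stable baseline one at a time, in order.

    Returns the first (option, value) whose addition makes the
    configuration unstable, or None if every change tests clean.
    Changes that pass are kept, so an option that only misbehaves in
    combination with an earlier one is caught when it is added.
    """
    if not is_stable(baseline):
        raise ValueError("baseline configuration is not stable")
    current = dict(baseline)
    for option, value in changes:
        candidate = {**current, option: value}
        if not is_stable(candidate):
            return option, value
        current = candidate
    return None
```

Because each stability test takes days, a binary search over the list of changes (a different technique from the one described in the post) would need fewer test kernels when a single option is at fault, but it is harder to interpret when options only fail in combination, as the tickless and non-preemptive pairing did.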

Kernel Issues

     The 5.18.19 kernel, which was previously stable, is now showing instabilities, including the same CPU stalls as the 6.0.0 kernel, which points to a possible configuration issue on my part. The only thing I've intentionally changed between this and previous 5.18.19 kernels was to enable a bunch of Intel retpoline CPU security protections, so one or more of those may be broken. I'm going to go back to a stock configuration with only the scheduling changes I previously had, and then, if that is stable, introduce these options one at a time to try to determine which is at fault. I won't be able to do this tonight, but I should be able to get it in place tomorrow night. I'm going to try this with 6.0.0-rc6, since it is built from the same config as the current 5.18.19 kernel AND shows the same symptoms (CPU stalls).

Kernel Downgrades

     I did not finish downgrading all the machines to 5.18.19, but I did finish the physical hosts, the mail servers (client and both incoming), the web server, ubuntu, debian, centos7, and slinux-7. All the machines still on 6.0.0-rc5 are virtual machines, so I can reboot them from here if necessary.

     I will be working on the remaining machines Thursday evening.

 

Emergency Kernel Reversion Tonight/Tomorrow 11PM PDT (GMT-0700)

     Starting at 11pm tonight, I am going to revert as many machines as I can back to the 5.18.19 kernel. I do not have time to prepare all the machines, so whatever I do not get done tonight will be finished tomorrow evening. The 6.0.0 kernel up through RC4 was good, but RC5 and later have severe CPU stalls, just like 5.19.x did, and the kernel developers seem to be basically ignoring it. 5.18.19 is at END of LIFE, which means it is not getting any security updates or fixes, but at least it is stable; nothing currently available after it works. If you're seeing long load times in mail, etc., this is the cause. So there will be some outages tonight from 11PM until I can't work any later, and then again tomorrow at 11PM, though tomorrow should be less service-affecting, since tonight I will focus on the physical servers and the busier servers: web, mail, and the ubuntu and debian shell servers.

     This also affects all of our Fediverse services: https://friendica.eskimo.com, https://www.hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com/.