Iglulik Status

     Sorry the downtime was longer than anticipated.  I ran many stress tests and was not able to get the machine to error, freeze, overheat, or otherwise act up.

     First, I upgraded the BIOS because experience has taught me that newer Asus BIOS software is generally more stable than older versions.  On my workstation I had to set the CPU core voltage to 1.39 V to be stable at 5 GHz with the older BIOS; the newest allowed me to reduce that to 0.95 V, which is easier on the CPU.

     The BIOS has a function where you can save the configuration to a thumb drive, and generally you want to do this before an update because the update erases all the settings.  It saved just fine, but the new BIOS would not read the file created by the old one, so I had to reconfigure everything by hand.

     I did find some less than optimal settings.  For example, I had decode above 4G disabled.  The problem with this is that it forced Linux to use bounce buffers rather than letting the hardware DMA directly to or from the location where the data is required, which makes I/O less efficient, so I fixed that.

     I also increased the CPU core voltage from 1.29 V to 1.35 V, which makes the CPU run slightly hotter, and decreased the clock from 4.3 GHz to 4.2 GHz, so that if it was on the edge of stability it should now have more margin.  However, I ran many stress tests and was unable to get it to fail even before I made the changes.

     I ran additional stress tests after completing these changes to check temperatures, and they were well within safe limits even with the higher CPU core voltage.  So at this point I am just going to watch it and see if it is still a problem.

Iglulik Web Outage

     I have reverted Iglulik to an older, known-stable kernel and it still spontaneously rebooted last night (and did not start the web server upon recovery), so now I know there is a hardware problem.

     Tonight, shortly after midnight, I will be taking this machine down for a while to run some diagnostics to try to identify the hardware problem, since nothing is showing up in the logs.  Most likely it is a CPU or memory error; most other things would have been logged.

     It’s been a while since I last did a BIOS update on this machine, and the last Asus BIOS update I did on my workstation in April greatly improved stability, so I will check for a BIOS update while I’m at it.

     Because this server has the /home directories, ALL shell servers and the web server will be out of service and pop/imap will ONLY be able to access your INBOX and no others during this maintenance.

Iglulik Instability

     I believe I have located the major source of instability, but unfortunately the fix comes at some cost to performance.

     I have an Nvidia 210 video card in this machine for the console.  It’s a very low end card but adequate for that purpose.  However, in 2019 Nvidia discontinued driver support, so I had to switch to the Linux nouveau driver, which, given the relatively low performance of the card, was not a big deal.

     Well, recent Linux kernels have a bug in the driver for this card which results in the card DMA’ing into memory that it has not allocated, and when that memory happens to be used by something else, the result is a crash.

     But as it happens, Nvidia has again decided to support that card.  However, the drivers, now at 340.108, are not compatible with newer kernels, so I was forced to go back to 5.4.0, which is considerably less efficient than 5.7.

Iglulik Still Unstable

     The 5.7.7 kernel was still unstable, and so was 5.8rc3, but at least the latter logged some information showing that some memory allocations failed in the contiguous memory allocator, a feature recently introduced into the Linux kernel.

     I am building a new kernel with that disabled.  It really isn’t required, since there are no huge streaming I/O devices like video that might need it, and most everything can DMA through the IOMMU on this particular machine (which can map disparate memory regions into contiguous memory).  If the machine does not spontaneously reboot into the new kernel first, I will reboot it this evening.
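
     After a rebuild like this it can be worth confirming that the option really ended up disabled.  The following is only a rough sketch: it assumes the feature in question is controlled by the CONFIG_CMA build option, the function name kernel_option_state is made up for illustration, and the sample text is invented rather than taken from Iglulik's actual configuration.

```python
import re


def kernel_option_state(config_text, option):
    """Return the value assigned to a kernel config option, or None.

    In a kernel .config file an enabled option appears as "CONFIG_FOO=y"
    (or "=m" for a module), while a disabled one appears only as the
    comment "# CONFIG_FOO is not set" or is missing entirely.
    """
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


# Invented sample for illustration only, not a real machine's config.
sample = """\
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_CMA is not set
CONFIG_IOMMU_SUPPORT=y
"""

if kernel_option_state(sample, "CONFIG_CMA") is None:
    print("CMA is disabled in this config")
```

     To check a real build, read the kernel's .config from the build tree and pass its contents in place of the sample.  Many distributions also install a copy under /boot, but where it ends up depends on how the kernel was installed.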

     There is also the possibility of hardware errors, but so far it has not logged any.

Iglulik Spontaneous Boot

     Iglulik spontaneously rebooted again tonight.  On 5.7.7 it made it four days between spontaneous boots, but this time I discovered what triggered it, so I’ve filed a bug report at bugzilla.kernel.org and I’m going to give 5.8pre4 a try if it proves semi-stable on my workstation.  I normally avoid pre-release kernels, but 5.7 has been buggy and so far 5.8pre3 has been totally stable on my workstation.

Iglulik Reboot Tonight

     One of our servers has been unstable on 5.7.6 and rebooted spontaneously twice in the last few days.  Oddly, only this server seems to be impacted, but it has a newer CPU than the others, so it may be a kernel problem specific to this CPU.

     I am going to reboot into 5.7.7 tonight IF it hasn’t spontaneously booted into it on its own between now and then.  This will happen just after midnight.

     This machine services the web, /home directories, and several shell servers.  Because basically everything relies on /home, everything will be briefly interrupted shortly after midnight except virtual private servers which will not be affected.

     If you are not on Mint, Debian, or Ubuntu, you should just see things lock up briefly.  If you are on one of these servers, you will be disconnected and will need to re-establish your connection after the boot completes.

OpenSuse Shell Server to be Discontinued

     I am discontinuing the opensuse.eskimo.com shell server because users have been unable to authenticate for several months, owing to a broken library in OpenSuse that incorrectly attempts to originate connections to ypserv from an unprivileged port.

     I filed a bug report with Suse several months ago, but nothing has come of it in the way of a fix.

     I attempted to log in to the bug reporting system but was informed that authentication methods had changed and I would have to convert my password.  I’ve tried to do that many times, but their system keeps telling me the system module is down.

     Obviously maintenance is not happening there anymore, so I’m going to abandon OpenSuse and take suggestions for a new Linux distro to replace it.  Given the choice, I prefer distros that are based upon .deb packages versus RPMs; the former just has much less tendency to scramble its database and have problems with dependencies.

Spontaneous Boot and Resulting Issues

     Iglulik spontaneously booted today.

     NFS partitions did not properly remount on one mail server.  This may result in some spam not being properly filtered, and in mail that should have gone to spam and/or other folders being placed in your INBOX instead.  I apologize for this inconvenience.

     I have modified the systemd unit files for postfix to make these mounts a requirement for postfix to start, but unfortunately there is no provision to kill postfix if they go away.  I may be able to script something if I can find a way to check a mount point without the check itself hanging.
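
     One possible approach to the hanging-check problem, offered only as a sketch and not as what is actually running here, is to do the check in a throwaway child process and give up waiting after a timeout.  The function name mount_is_healthy and the five second timeout are arbitrary choices for illustration.

```python
import subprocess
import sys

# Code run by the child: exit 0 only if the path is currently a mount point.
_CHILD_CODE = "import os, sys; sys.exit(0 if os.path.ismount(sys.argv[1]) else 1)"


def mount_is_healthy(path, timeout=5.0):
    """Return True if path is a mount point and answers within timeout seconds.

    The check runs in a child process, so if the NFS server has gone away
    it is the child that blocks inside the kernel, not the caller.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", _CHILD_CODE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # A child stuck on a hard NFS mount may ignore the kill until the
        # server returns, so we deliberately do not wait for it again.
        child.kill()
        return False
```

     Popen is used rather than subprocess.run because run waits for the child to exit after a timeout, and that wait could itself hang on a dead mount.  A watchdog script could call mount_is_healthy on each required mount point every minute or so and stop postfix when it returns False.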

Mail / VPS 1-7

     I am going to take the mail subsystem and the vps1-7 virtual private servers down for about half an hour to troubleshoot a kernel problem.  I had planned this for yesterday, but problems building kernels delayed it.