Stability – Interim

     I have the physical hosts up on 5.4.0 now; this is the kernel that shipped with Ubuntu 20.04.  It is not entirely stable on machines that serve as NFS servers, which two of these machines do.

     I have built 5.3.0 kernels, a version that was previously stable, but I cannot boot into them right now because automatic backups are running.  I also don’t know whether MY 5.3 kernel will be stable, as Canonical applies some patches that will not be present in my build, and Canonical’s own 5.3 kernel is no longer available because 19.10 is past END OF LIFE now.  I will attempt to boot into the new kernels tomorrow evening.

Kernel Issue Emergency Reboot

     Sorry, I had to do an emergency reboot of the servers today.  I discovered what caused this morning’s crash: a bug in the kernel I was running kept it from recovering resources as processes exited and new ones were created, so it would continue to eat memory until it ran out, then crash and reboot.

     I don’t know if the present kernel is trouble-free, but I do know it does not have this particular problem, and NFS seems to work correctly in it, which has been the main reason for all the kernel experimentation lately.
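
     For anyone wanting to watch for the kind of memory exhaustion described above, here is a minimal sketch in Python.  It is not what I ran on the servers; the function names and the thresholds are made up for illustration.  It reads the MemAvailable field from /proc/meminfo-style text and flags a run of samples that only ever goes down.

```python
import re


def mem_available_kb(meminfo_text):
    """Return the MemAvailable value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found")
    return int(match.group(1))


def steadily_shrinking(samples_kb, min_drop_kb):
    """True if every sample is lower than the one before it and the total
    drop across the window is at least min_drop_kb (hypothetical threshold)."""
    if len(samples_kb) < 2:
        return False
    falling = all(later < earlier
                  for earlier, later in zip(samples_kb, samples_kb[1:]))
    return falling and (samples_kb[0] - samples_kb[-1]) >= min_drop_kb


# Illustrative values only, not measurements from these servers.
sample_text = "MemTotal:       16384000 kB\nMemAvailable:    8000000 kB\n"
current_kb = mem_available_kb(sample_text)
leaking = steadily_shrinking([8000000, 7400000, 6900000, 6100000], 1000000)
```

     In practice the samples would come from reading /proc/meminfo at a fixed interval, and a normal busy machine will also show dips, so a result of True is only a hint to look closer.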

Mail Servers

     After I installed a newer version of fail2ban on the client mail server, the attackers were no longer able to overload the machine, as fail2ban could keep up with the changing attacking IP addresses.

     Now they are attacking our incoming mail servers.  Although they are increasing the load on these machines, they cannot elicit login credentials from them, as these machines are not set up for authentication.

     I am installing the newer fail2ban on these machines to bring the load down, as it did on the client server.

Client Mail Server Under Attack

     The client mail server is still under attack.  The attack consists of a botnet that is trying to obtain user logins and passwords by brute force through postfix auth.  So far a little over 3,000 IP addresses are blocked, but fail2ban can’t work fast enough to catch them all in real time.
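
     To illustrate the general idea behind this kind of blocking, here is a rough sketch in Python.  It is NOT fail2ban’s actual code; the function name, the regular expression, and the threshold are my own assumptions.  It counts postfix SASL authentication failures per IP address and lists the addresses that reach a retry limit.

```python
import re
from collections import Counter

# Matches postfix/smtpd warnings such as:
#   warning: unknown[203.0.113.7]: SASL LOGIN authentication failed: ...
FAILED_AUTH = re.compile(
    r"\[(\d{1,3}(?:\.\d{1,3}){3})\]: SASL \S+ authentication failed"
)


def offenders(log_lines, max_retry):
    """Return the sorted IPv4 addresses with at least max_retry failed
    SASL logins in log_lines (max_retry is a hypothetical threshold)."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_AUTH.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(ip for ip, n in counts.items() if n >= max_retry)
```

     A per-address threshold like this only acts once an address has piled up enough failures, so a botnet that spreads its attempts over thousands of addresses keeps the blocker busy.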

Reboots and DDoS

     Iglulik spontaneously rebooted again this morning; I still have no idea what is causing this.

     Mail has been under a DDoS attack since around 9:30 this morning, and it was still ongoing at 10 AM; fail2ban is essentially saturated, locking out attacking IPs as fast as it can.  Load is heavy, and the server is still functional but a bit slower than normal.

NFS on Linux

     I’ve had problems with NFS on Linux basically forever, and it became significantly better AND worse under NFSv4.  It got better in that locking mostly works under version 4.2, so things like alpine work correctly where the mail spool is NFS mounted.  It got worse in that mounts sometimes fail, especially if a server goes away and returns.

     I noticed recently that this is a much larger problem with servers that have a lot of entries in /etc/exports.  This led me to investigate whether there might be a limit on the file size.  I did not find a limit, but I did find that the sanctioned method of exporting the same file system to multiple hosts is to put all the hosts on the same line, separated by spaces, with each host followed by its options, whereas I had each host on a separate line.

     So I changed the format of /etc/exports from the old form, which is how SunOS expects it to be formatted, to the form officially sanctioned by Linux, and will see if that helps.
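
     The change described above can be sketched with a small Python function.  This is only an illustration, not the tool I used: the function name, host names, and paths are hypothetical, and it skips comments and does not handle quoted paths or continuation lines.  It folds one-host-per-line entries into a single line per exported file system.

```python
def merge_exports(text):
    """Merge repeated export paths into one line per path, keeping the
    order in which paths and hosts first appear."""
    merged = {}  # path -> list of "host(options)" entries
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are dropped in this sketch
        path, *clients = line.split()
        merged.setdefault(path, []).extend(clients)
    lines = []
    for path, clients in merged.items():
        lines.append(" ".join([path] + clients))
    return "\n".join(lines)


# Hypothetical example in the old one-host-per-line form.
old_form = """\
/home hosta(rw,sync)
/home hostb(rw,sync)
/srv/mail hosta(ro)
"""
new_form = merge_exports(old_form)
```

     With the input above, new_form holds two lines: "/home hosta(rw,sync) hostb(rw,sync)" and "/srv/mail hosta(ro)".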

2:30 Crash

     Around 2:30pm everything spontaneously rebooted.  Just prior to that, our mail server had been under a DDoS attack by a botnet.  Although fail2ban had been working hard to lock out attacking IPs, the load on the machine was still 800, and while I was trying to chase that down everything crashed.

     When things came back up, NFS did not work correctly, and mail had quota on the /root partition AND had mounted it read-only.

     I’ve got the physical hosts, web server, mail infrastructure, and Ubuntu shell server up and running; I am now checking the rest of the servers for proper NIS/NFS connectivity.

Iglulik Status

     Sorry the downtime was longer than anticipated.  I ran many stress tests and was not able to get the machine to error, freeze, overheat, or otherwise act up.

     First, I upgraded the BIOS because experience has taught me that newer Asus BIOS software is generally more stable than older Asus BIOS software.  On my workstation I had to set the CPU core at 1.39 V to be stable at 5 GHz with the older BIOS; the newest allowed me to reduce that to 0.95 V, which is easier on the CPU.

     The BIOS has a function that saves the configuration to a thumb drive, and generally you want to do this before an update because the update erases all the settings.  It saved just fine, but the new BIOS would not read the file created by the old one, so I had to reconfigure everything by hand.

     I did find some less-than-optimal settings.  For example, I had decode above 4G disabled.  The problem with this is that it forces Linux to use bounce buffers rather than letting the hardware DMA directly to or from the location where the data is required, which makes I/O less efficient, so I fixed that.

     I also increased the CPU core voltage from 1.29 V to 1.35 V, which makes the CPU run slightly hotter, and decreased the clock from 4.3 GHz to 4.2 GHz, so that if it was on the edge of stability it should now be better.  However, I ran many stress tests before making these changes and was unable to get it to fail.

     I ran additional stress tests after completing these changes to make sure temperatures were still acceptable, and they were well within safe limits even with the higher CPU core voltage.  So at this point I am just going to watch it and see if it is still a problem.

Iglulik Web Outage

     I have reverted Iglulik to an older, known stable kernel, and it still spontaneously rebooted last night (and did not start the web server upon recovery), so now I know there is a hardware problem.

     Tonight, shortly after midnight, I will be taking this machine down for a while to run some diagnostics to try to identify the hardware problem, since nothing is showing up in the logs.  It is most likely a CPU or memory error, as most other things would have been logged.

     It’s been a while since I last did a BIOS update on this machine, and the last Asus BIOS update I did on my workstation in April greatly improved stability, so I will check for a BIOS update while I’m at it.

     Because this server has the /home directories, ALL shell servers and the web server will be out of service during this maintenance, and POP/IMAP will ONLY be able to access your INBOX and no other folders.