Maintenance Outage

I plan to take most services offline between 11pm and midnight tonight, March
12th, 2024, for about twenty minutes to replace a failed network interface
card in one of the servers. Because this server provides some disk storage
for most of the machines via NFS, most services will be unavailable
during that interval.

Outage

     The machine that is doing double duty as a webhost and router is ill.  I have it running in a crippled state but if it reboots it will not come up automatically.

     For those who are interested in the technical details, and who may have encountered this before and can provide some hints: when it boots, it tries to create system users with systemd-sysusers.service, which runs systemd-sysusers as a one-shot.  That never completes, however, and times out.

     The other thing that is not working is mdmonitor which uses mdadm to build the RAID devices (everything on this machine is on RAID).

     So after both of these time out, I can log in to a single user shell, run them by hand, and they run fine.  Hence the big mystery.  I am going to go back tonight and try rebuilding the initramfs on the off chance it is broken.  If that fails I’m going to attempt an upgrade to 24.04; I did manage to get the old web server, which was doing something similar, working this way.

     But if worst comes to worst I am going to have to re-install the machine and that may take a while.  So service may be spotty tonight.  I apologize for that but sometimes things just don’t give you an option.

Server Issues

    We had an issue with the web server eating itself after an upgrade introduced a library that conflicted with a library I had compiled in order to enable http2 protocol before Ubuntu included it in their distribution.

     I ended up having to restore this server from backups and bring it forward again, removing the offending library in advance so that this would proceed properly.

     I upgraded the server to PHP 8.1, however even some apps which were supposedly 8.1 compatible did not work.  This suggests a problem with our 8.1 install.  I am working to move existing users and apps off this server and onto a new server.

     I am laying some things out differently.  In particular, some webapps, like roundcube, which is presently at https://www.eskimo.com/roundcube, will instead get their own subdomain, https://roundcube.eskimo.com/.  There are several reasons for this.  First, it allows each to exist in the root directory of its subdomain; most code doesn’t care but there are some applications that do.  Second, it allows each application to have its own .htaccess file so I can tailor the server environment for that specific application.  Third, it allows moving applications to different servers for load balancing reasons.

      I am upgrading some servers to 24.04 early just because it is more compatible with a lot of my self-compiled libraries than 22.04, which is two years old now.  There are some elements of 24.04 that are improvements, but pretty much all the systemd bugs from 22.04 were retained.  There seems to be a new bug in which NIS reports to systemd that it’s up and running about three seconds before this is actually the case.  This causes issues with applications that depend upon NIS running first.  I’ve worked around this by adding automatic restarts to the services thus affected, so that if they fail to start the first time, they will restart three seconds later.
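
     A minimal sketch of that kind of workaround, for anyone who wants to try the same thing.  The unit name example.service and the function name write_restart_dropin are placeholders rather than the actual services or scripts involved; Restart=on-failure and RestartSec=3 are standard systemd service options, picked here to match the three second delay described above.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that restarts a unit 3 seconds after it fails.
# write_restart_dropin UNIT [BASEDIR]
#   UNIT    - unit name, e.g. example.service (placeholder)
#   BASEDIR - where unit overrides live; defaults to /etc/systemd/system
write_restart_dropin() {
    unit="$1"
    basedir="${2:-/etc/systemd/system}"
    dropin_dir="$basedir/$unit.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/restart.conf" <<'EOF'
[Service]
Restart=on-failure
RestartSec=3
EOF
}

# Usage (as root), followed by a reload so systemd picks up the change:
#   write_restart_dropin example.service
#   systemctl daemon-reload
```

     A drop-in keeps the distribution's own unit file untouched, so a package upgrade should not overwrite the change.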

Carl Jung

“The spirit of evil is fear, negation, the adversary who opposes life in its struggle for eternal duration and thwarts every great deed, who infuses into the body the poison of weakness and age through the treacherous bite of the serpent; he is the spirit of regression, who threatens us with bondage to the mother and with dissolution and extinction in the unconscious. For the hero, fear is a challenge and a task, because only boldness can deliver from fear. And if the risk is not taken, the meaning of life is somehow violated.” —C. G. Jung, Symbols of Transformation, par 551.

SSH Key Vulnerability

     A new ssh key vulnerability has been found affecting RSA keys, which are the default in many older Linux implementations.  I strongly suggest generating a new ed25519 key using the command ssh-keygen -t ed25519 and removing any RSA keys you may be using.  To do this, remove the relevant lines from your ~/.ssh/authorized_keys file.  RSA keys will all start with ssh-rsa.
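
     The removal step can be sketched as a small shell function.  This is only an illustration: the function name strip_rsa_keys is made up, it keeps a .bak copy of the original file, and it only catches lines that begin with ssh-rsa, so an entry with options in front of the key type (for example from="...") would need to be removed by hand.

```shell
#!/bin/sh
# Sketch: remove RSA public keys from an authorized_keys file, keeping a backup.
# strip_rsa_keys [FILE]
#   FILE defaults to ~/.ssh/authorized_keys
strip_rsa_keys() {
    file="${1:-$HOME/.ssh/authorized_keys}"
    [ -f "$file" ] || return 1
    # Keep a copy of the original next to it as FILE.bak.
    cp -p "$file" "$file.bak" || return 1
    # Keep every line that does not begin with "ssh-rsa".
    # grep exits 1 when no lines are left to keep, which is not an error here.
    grep -v '^ssh-rsa[[:space:]]' "$file.bak" > "$file" || [ $? -eq 1 ]
}

# Usage:
#   strip_rsa_keys                 # edits ~/.ssh/authorized_keys
#   strip_rsa_keys /path/to/file   # edits another file
```

     It is worth confirming that you can still log in with the new key before deleting the .bak copy.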

     Although any keys other than RSA will be safe from this particular attack, I recommend ed25519, a modern elliptic curve algorithm.  Be aware that it is not a defense against quantum computers: any algorithm that depends upon the difficulty of factoring the product of two large primes, as RSA does, is vulnerable to them because they would make this a very fast and easy task, and the same approach (Shor’s algorithm) also applies to elliptic curve keys.

     After you make this new key and delete your old RSA keys you will need to use ssh-copy-id login@hostname to be able to use ssh-key authentication on that machine.

     Here is the article where I learned about this new exploit.  Unfortunately they don’t get into detail about what comprises a computational error, whether this is a hardware or an algorithmic error.

https://arstechnica.com/security/2023/11/hackers-can-steal-ssh-cryptographic-keys-in-new-cutting-edge-attack/amp/

Outage

Sorry it took me so long to get this back up.  I thought the machine that was serving as router had crashed.  It had not; however, the device driver for the Ethernet card had unloaded.

When I tried to load it, it told me “invalid argument”.  Odd, since I hadn’t changed any arguments; in fact I had no arguments at all.

I spent about an hour and a half futzing with that and then decided to see if that driver was included with the generic kernel (different version); it was.  I loaded the generic kernel (sub-optimal for our needs but okay until we get the new router running), and then I was able to get the card operational.  However, I had messed up the network settings by this time and had to spend another two hours figuring it out, primarily because at some point I transposed the two connectors for network and LAN.

Stability or the Lack Thereof

This morning I finally figured out the source of the most recent instability (since we changed out the bad NIC card).

We kept having these incidents where I’d go to the co-lo and think I had
everything working, but in minutes or hours or sometimes a few days it would
just stop talking to the Internet.

One of those occurred this morning. I went down and looked at the settings;
nothing appeared to have changed, but it wasn’t routing. I rebooted the server
and it started routing again. I went back home and couldn’t ping anything.

And I’m really half asleep, and my workstation is busted on account of the
fact that the night before I tried to upgrade the OS, it failed, and so I tried
to restore from backup but afterwards I could not boot. And at this point I
have had approximately two hours of sleep in the past 48 hours.

So I drove back down to the co-lo center again, and keep in mind it’s
22 miles each way and pretty close to rush hour so not at all a pleasant
drive. This time I rebooted again but still no route, several times, still
the same.

So at this point I got the NOC involved, neither of us could ping the
other end of the wire and that should not have been difficult since it’s just
a single Ethernet to Ethernet cable. But we couldn’t, so I got the idea of
unplugging the LAN interface and after doing that the WAN interface immediately
came up, so this pointed to something wrong on my end, didn’t know what but
had to be something on my end.

So I took a look at the routing table and it quickly became apparent what
was wrong: there were not one but TWO default routes. Two competing default
routes like this will not route properly under Linux, so how did it come about?

Well, on the WAN interface I had the correct gateway address, but on the
LAN side I was pointed at my own machine instead of his router. Why I did
this is because if I had a router in between my machines and LAN and his router,
my router’s IP would be the gateway IP we point all the local machines to.

So I corrected this to the correct gateway IP, everything came up from a
routing perspective, and it has remained that way ever since.
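
The duplicate default route described above is easy to check for from a shell.
What follows is a rough sketch, not the procedure actually used at the co-lo:
the function names are made up, and it simply counts lines beginning with
"default" in output of the kind that ip route show prints. It does not look at
route metrics, so treat a warning as a prompt to look closer.

```shell
#!/bin/sh
# Sketch: count default routes in "ip route show" style output read from stdin.
count_default_routes() {
    # grep -c prints the count; "|| true" hides its exit status of 1 on zero matches.
    grep -c '^default ' || true
}

# Warn and return non-zero when more than one default route is present.
check_default_routes() {
    n=$(count_default_routes)
    if [ "$n" -gt 1 ]; then
        echo "warning: $n default routes found" >&2
        return 1
    fi
}

# Usage:
#   ip -4 route show | check_default_routes
```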

But once I got home, I got another telephone call: not able to receive
e-mail from Gmail. At first I had the problem of not having a workstation
to use; I totally forgot about my laptop, probably from sleep deprivation.

But I did remember I had an antique Dell loaded with Linux, so I fired it up
and SSH’d into the mail servers. I found both were operational but both had a
bunch of jobs stopped, with no logged errors indicating why. I rebooted them
and they came up and ran fine, except that the load went up to around 200 for
about half an hour, then settled down to a normal load below 1.

At this point I’m speculating that without a network, queue runs got stuck
until they exhausted memory, and then things died.

At this point I returned to my workstation. I had restored a corrupted
root partition from backups but after doing so it would not boot. I finally
chased this down to the fact that I did a reformat prior to reloading from
backups and this changed the partition UUID so that it no longer matched the
fstab file. Fixed that up and now it’s running properly again.
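
For anyone who hits the same thing after a reformat and restore, here is a
rough sketch of a check for stale entries. It assumes fstab refers to
filesystems with UUID= entries; the function name and the idea of feeding it a
list of existing UUIDs from a file are illustrative choices, not the steps
actually taken above.

```shell
#!/bin/sh
# Sketch: list UUID= entries in an fstab that match no known filesystem UUID.
# missing_fstab_uuids FSTAB UUID_LIST
#   FSTAB     - path to an fstab file
#   UUID_LIST - file with one existing filesystem UUID per line
missing_fstab_uuids() {
    fstab="$1"
    known="$2"
    # Take the first field of lines whose first field starts with UUID=
    # (commented-out lines do not match), and strip the UUID= prefix.
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1 }' "$fstab" |
    while read -r uuid; do
        # Print the UUID only if it is absent from the known list.
        grep -qixF "$uuid" "$known" || echo "$uuid"
    done
}

# Usage (as root, so blkid can see every device):
#   blkid -s UUID -o value > /tmp/uuids
#   missing_fstab_uuids /etc/fstab /tmp/uuids
```

Any UUID it prints is one that fstab expects but no current filesystem carries,
which is the mismatch a fresh format leaves behind.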

I expect to receive the new router sometime between this Friday and next
Monday. I don’t know how long it will take for me to learn how to use it well
enough to put it into service, but it is very thoroughly documented: it has eight
manuals, and the administrative manual is 3061 pages. There are also some free
online courses, and if you want to get into it deeply you can spend as much as
$6000 on non-free training. It supports damned near every communication protocol
known to man.

Outage

Sorry for the downtime.  Our second router to die in two weeks died yesterday afternoon.  At first I thought it was something I did, but after restoring from a backup made when it was working, it still didn’t work.  The vendor had changed the software in such a way that the translation of firewall rules from the user interface to iptables internally was no longer happening correctly, resulting in a broken firewall that can’t be disabled.

Since I had no further spares to drop back to, I configured one of the Linux servers to act as a router until we can get one.

While we were not able to add a Diaspora at this time, We DID Add a Mastodon Instance

     You can view our latest social media instance either by using Web Apps on our main web site https://www.eskimo.com/ and selecting Mastodon, or you can go there directly https://mastodon.eskimo.com/.

     Mastodon has a Twitter/X-like interface, but unlike Twitter, Mastodon is part of the Fediverse, like Friendica and Hubzilla, which consists of tens of thousands of sites with no one owner of the entire network.  Each site is responsible for moderating its own servers, if it chooses to moderate at all.