Network Backbone Maintenance June 3rd 12AM-5AM

ISOMEDIA Service Affecting Network Maintenance 6/3/2021

On Thursday, June 3rd, beginning at 12:01 AM PDT and continuing until 05:00 AM PDT, engineers will be performing maintenance on the network backbone ring. This maintenance has the potential to be service affecting, with the possibility of multiple periods of up to five minutes with limited connectivity.

ISOMEDIA is where our equipment is co-located and provides our Internet connectivity.

Spam Filtering Issues

     Recently we’ve seen huge floods of spam and, more disturbing, scams and viruses that are not yet caught by clam-antivirus.  They come from providers that either market to spammers but also have a few legitimate customers, or that are simply hacked to death.  Affected sites include Digital Ocean (lots of “.tech” spam), Amazon (random spam, scams, phishing attempts), mailspike and sendgrid (massive amounts of spam), and many smaller sites, primarily in Eastern Europe.

     The floods from these sites have been so severe that I’ve been forced to block part of their address space.  However, this has the side effect of also blocking legitimate clients from their servers.  This is less of a problem with Google’s cloud because they have all their legitimate customers properly configured with SPF, DKIM, and DMARC, and I can simply block anything from their servers that is not properly configured.  But with these others, most of their customers do not implement any of these protocols.

     I have our servers configured so that we collect all the relevant mail address data before either accepting or blocking the connection, and then log that data whether the connection is accepted or rejected.  I also have an access list for domains and/or e-mail addresses so we can either accept domains from sites that are otherwise blocked, or reject known spam domains.
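
     As a rough illustration only, the decision order described above might be sketched like this.  This is not the actual server configuration: the network range, the domains, and the function names are all made up for the example, and it assumes that an access list entry always takes precedence over a blocked address range.

```python
import ipaddress

# Hypothetical data: a documentation-range network and example domains.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
ACCESS_LIST = {
    "customer.example": "accept",        # legitimate domain on a blocked range
    "alice@partner.example": "accept",   # single address exception
    "spam.example": "reject",            # known spam domain
}


def decide(client_ip, sender):
    """Return 'accept' or 'reject' for one connection."""
    sender = sender.lower()
    domain = sender.rpartition("@")[2]
    # Access list entries win; a full-address entry is checked before a domain entry.
    for key in (sender, domain):
        if key in ACCESS_LIST:
            return ACCESS_LIST[key]
    # Otherwise fall back to the blocked address ranges.
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return "reject"
    return "accept"


def log_line(client_ip, sender, recipient):
    """Build a log entry whether the connection is accepted or rejected."""
    action = decide(client_ip, sender)
    return f"{action} client={client_ip} from={sender} to={recipient}"
```

     The point of logging both outcomes is that a report naming the sending domain and the recipient address can be matched against a rejected entry and turned into an exception.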

     But if we are rejecting legitimate addresses, I do not know until you tell me.  An e-mail or phone call saying “I am not getting e-mail from XYZ” is not helpful unless you include the sending domain and your e-mail address.  Even then, the best way for me to receive this information and act on it is for you to generate a ticket.

     To do this you can go to our website, https://www.eskimo.com/ and under the Support pull-down select Tickets.  If you do not already have an account on the ticket system, create one.  You do not have to be a customer to create tickets, so if you are someone trying legitimately to e-mail a customer here and getting a bounce, you can also use this system to report the issue.  For those address spaces we have to block, I can create exceptions for domains or e-mail addresses that are legitimate.  Please note, however, that I CANNOT create an exception that will allow mail from misconfigured sending sites to get through.  In those cases the SENDER will have to fix their configuration, though I will be happy to assist in identifying the nature of the misconfiguration.

Mail Server Work Complete

     I’ve completed work on the mail server.  To get it to not hang, I had to remove all the xserver-xorg-video drivers for hardware that wasn’t present (which is all of them, it’s a virtual machine).  That made it happy, and it no longer hangs when rebooting.

     While I was in there I removed some other cruft and optimized some settings, like enabling huge memory pages and write caching, all of which should help performance a tiny bit.  I also discovered the problem with postfwd: between groovy and hirsute they changed group nobody to group nogroup, but postfwd was still expecting group nobody.  I just created a duplicate entry in the group file and that made it happy.

Mail Server Interruptions 11PM tonight Unknown Duration

     I will be performing multiple reboots of the client mail server tonight after 11pm to troubleshoot the issue that causes it to hang during a systemd-initiated reboot.  This is a problem with the new Hirsute release of Ubuntu, but I managed to fix my workstation by removing a bunch of unnecessary cruft and hope that I can do the same with the mail server.  Because I did quite a lot at once on my workstation, I don’t know exactly what it was getting stuck on.  This may also cause a delay if you attempt to log in to a shell server while a reboot is in progress, but if this happens just wait a couple of minutes and try again.  The down times will be relatively brief (2-3 minutes).

Mail Server Repaired

     I’ve undone the majority of new bugs introduced by Ubuntu in their Hirsute Hippo release.  It took quite a while to chase some of them down.  I normally avoid using short term releases, but dovecot had some serious problems in 20.04, so I upgraded to groovy to resolve them.  Groovy was a clean update and, unlike this one, did not break anything.  But since these short term releases are only supported for nine months, one is obligated to keep upgrading until one gets back to a stable release, and the next stable release will be 22.04.

     The most disastrous is a Poettering effect: “systemctl reboot” now hangs the machine hard rather than rebooting.  This, combined with the fact that NFS never really worked 100% correctly under Linux, created a situation where I could not get into the physical host to force a reboot of mail.eskimo.com, which is a virtual machine, so I needed to drive down to the co-location facility to fix it.

     Then they opted to replace my systemd scripts for postfix and dovecot.  This is problematic because on our server various things are mounted via NFS, among them encryption certificates.  Since it would be a pain to maintain 30 separate certificates for each machine, I have a few wildcard certs for various domains and I have the certs mounted on an NFS partition.  That way I only have to update in one place.

     This necessitates delaying the startup of postfix, dovecot, and anything else that needs access to encryption certificates until AFTER the partition is mounted.  This is easily accomplished with systemd scripts, provided upstream operating system providers don’t change them for you and remove these things.
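
     One way to express that ordering, sketched here under stated assumptions rather than as the configuration actually in use, is a systemd drop-in containing RequiresMountsFor=, which makes a unit require and start after the mount for a given path.  The mount point, the drop-in file name, and the unit names below are hypothetical, and the real unit names may differ on a given system.

```python
def render_dropin(mount_path):
    """Return drop-in text that makes a unit wait for the mount holding mount_path."""
    if not mount_path.startswith("/"):
        raise ValueError("RequiresMountsFor= needs an absolute path")
    return "[Unit]\nRequiresMountsFor=" + mount_path + "\n"


# Hypothetical mount point for the shared certificates.
CERT_MOUNT = "/mnt/certs"

# Print the drop-in text and the path it would be saved under for each service.
for service in ("postfix", "dovecot"):
    target = f"/etc/systemd/system/{service}.service.d/wait-for-certs.conf"
    print(f"# {target}")
    print(render_dropin(CERT_MOUNT))
```

     Drop-ins under /etc/systemd/system are kept separate from the unit files a package installs, so they are generally less likely to be overwritten by an upgrade than an edited copy of the unit itself.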

     Then they changed the PAM configuration, adding a check for weak passwords.  That seemed like a good idea, except the actual module wasn’t there, so it resulted in PAM failures because it couldn’t find the referenced module.

     Lastly they broke postfwd, and this was particularly challenging since postfix, which calls it, rather than reporting a problem with it, gave the rather generic error “server misconfiguration”, yet postfix check, which is supposed to check the configuration file, reported no errors.  I finally found it by turning peer debugging on for my workstation IP address, attempting to send a message, and tracing through the logs.

     That revealed that postfwd didn’t start.  Postfwd is used here to check for spambots using someone’s account if their password is compromised, and to force a password change if this occurs.  Postfwd is a perl script and I’m not fluent in perl, so that made it a particularly difficult challenge.  It was complaining of missing perl modules in the logs; however, the modules it said were missing were in fact installed.  I finally removed the Ubuntu postfwd package and installed it directly from github, which fixed it.

     So after much gnashing of teeth and pulling of hair, I believe our mail system is back to fully operational status again.  Since it required many reboots, I checked the NFS status of all servers mounting from it and they all appear to be okay.

Mail

     I’ve identified the issue with mail, but I’ve managed to hang the physical server in an attempt to fix it, and it is going to require a drive to the co-location facility, so things may be broken for the next 1-2 hours.

Postfix

     Last night’s upgrade broke something in the postfix configuration of the mail client server used for sending mail.  Unfortunately, it is giving only an extremely generic error, which makes identifying the cause difficult.  It claims the server is misconfigured, yet postfix check shows no errors.  Argh!

     I am working on it.

Tonight’s Maintenance Completed

     All Debian based kernels were upgraded to 5.12.4.

     This took an hour and 15 minutes to complete on the physical server hosting the web server because the iomemory drivers did not want to compile under 5.12.4.  After much pulling of hair I found that it was a libc6 version mismatch between the machine that I installed the kernel on and the one I built it on.  The DKMS module for iomemory required this.

     Mail has been upgraded to Ubuntu 21.04.  Most of the servers here are on long term releases, but dovecot was barely usable on the 20.04 release, prompting a rapid upgrade to a short-term intermediate release.

Tonight’s Upgrades

     The web server will be down longer than normal after tonight’s kernel updates because I have to recompile a driver used with the flash drive for the newer kernel.

     The client mail server, mail.eskimo.com, will be intermittently available for several hours after the kernel update as I will also be performing an operating system update.

Kernel Upgrade Friday 11PM PDT

     We will be doing kernel upgrades on all of our servers.  The expected time frame is 11PM-11:30PM, with perhaps a straggler machine or two.  We’ve had a lot of systemd issues recently that have made startup less than 100% reliable.

     This will affect all of our shell customers and web hosting customers as well as https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, and https://nextcloud.eskimo.com/