Drive Replacement Successful

Drive replacement went extremely smoothly, with total downtime of 23 minutes.  The new drive is now syncing from the other drive in the RAID array.  Indications are that it will take another seven hours (it’s been going for 45 minutes), so my projection of 6-8 hours seems to be spot on.  The system may be a bit slow during this interval since it is effectively flushing the buffer cache continuously with new data.
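For anyone who wants to watch a rebuild like this themselves, here is a minimal sketch of pulling the progress figures out of /proc/mdstat. The sample text is illustrative only (it was not captured from our server), and parse_recovery is a hypothetical helper name.

```python
import re

# Illustrative sample only; not captured from the server described here.
SAMPLE_MDSTAT = """\
md0 : active raid1 sdb1[2] sda1[0]
      3906886464 blocks super 1.2 [2/1] [U_]
      [=>...................]  recovery =  9.7% (378968064/3906886464) finish=420.0min speed=139999K/sec
"""

def parse_recovery(text):
    """Return (percent_done, minutes_remaining), or None if no rebuild is running."""
    m = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min", text)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

progress = parse_recovery(SAMPLE_MDSTAT)
if progress is not None:
    percent, minutes = progress
    print(f"{percent:.1f}% done, about {minutes / 60:.1f} hours remaining")
```

On a live system the text would come from reading /proc/mdstat instead of the sample string.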

Maintenance April 7th 2:00AM ~2:30AM

The entire network will be offline for up to about half an hour, starting around 2AM Sunday morning, to replace a failing drive in the machine which is also acting as a router at present.  This drive is part of a RAID array so all data is duplicated and none will be lost.  If things go smoothly, it could be as short as 15 minutes; if not, then maybe half an hour or slightly longer.

The big unknown is that when software RAID comes up in degraded mode, which it will do initially until the new drive is synced, systemd will sometimes hang, necessitating going through emergency mode and bringing things up by hand.  In my experience this happens about 30% of the time.  It usually takes about 6-8 hours for the system to sync a new 4TB drive, and the system can operate while this is in progress; it is just that Poetteringware sometimes adds some challenges.
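As a rough sanity check on the 6-8 hour figure, the sketch below works out the sustained throughput needed to copy a full 4TB drive in that time. It assumes decimal terabytes and a constant sync rate, which a real rebuild on a busy system will not have.

```python
DRIVE_BYTES = 4 * 10**12  # 4 TB, counted in decimal as drive vendors do (assumption)

def required_mb_per_s(hours):
    """Sustained MB/s needed to copy the whole drive in the given number of hours."""
    return DRIVE_BYTES / (hours * 3600) / 10**6

for hours in (6, 8):
    print(f"{hours} h -> {required_mb_per_s(hours):.0f} MB/s sustained")
```

That works out to somewhere between roughly 140 and 185 MB/s.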

Mail Back to Normal

Today, after some three hundred updates, the original SPF checker I was using, the python3 version, still was not working, so I installed the perl version of policyd.  I don’t really like perl as I don’t find it very readable relative to python, but presently it is working.

I also found the clamav virus check was dead and re-installed it.  Now all the mail milters, the clamav virus check, SPF, DKIM, and DMARC, are once again functional, so this should reduce the flood of “we’re going to make your life miserable if you don’t send 50,000 bitcoins to X” messages.

Also, the perl SPF policyd is actually somewhat better in that it checks both the EHLO host and the mail-from: host to make sure both are allowed by the site’s SPF record, while the old checker only checked the mail-from.  So this will be somewhat more thorough, requiring a consistency that the old checker did not.
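To illustrate the difference, here is a toy sketch. It is not how policyd is implemented and it does no DNS lookups; the ALLOWED table is a hypothetical stand-in for published SPF records, using documentation-only addresses and domains.

```python
# Hypothetical stand-in for published SPF records: domain -> permitted sender IPs.
ALLOWED = {
    "example.com": {"192.0.2.10"},
    "mail.example.com": {"192.0.2.10"},
    "example.net": {"198.51.100.7"},
}

def identity_ok(domain, client_ip):
    # Simplification: a domain with no entry is treated as failing.
    return client_ip in ALLOWED.get(domain, set())

def check_mail_from_only(client_ip, ehlo_host, mail_from):
    """Old behaviour in this toy model: only the mail-from domain is checked."""
    domain = mail_from.rsplit("@", 1)[-1]
    return identity_ok(domain, client_ip)

def check_both(client_ip, ehlo_host, mail_from):
    """Stricter behaviour: the EHLO host and the mail-from domain must both pass."""
    domain = mail_from.rsplit("@", 1)[-1]
    return identity_ok(ehlo_host, client_ip) and identity_ok(domain, client_ip)

# A sender whose mail-from is permitted but whose EHLO name is not.
args = ("192.0.2.10", "example.net", "user@example.com")
print("mail-from only:", check_mail_from_only(*args))
print("both identities:", check_both(*args))
```

In this made-up case the message passes the mail-from-only check but fails once the EHLO host is checked as well.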

I sent myself mail from gmail to make sure incoming was working and also watched the logs a while.

Eskimo Site Status

Ubuntu is back.  Sorry it took so long, but there were many snags along the way.

Our old web server is running without a Network Manager because Ubuntu clods broke it.  I have to set the network interface manually after a boot.

Inuvik is also broken because Ubuntu 24.04 engineers mistakenly put new libs in the main repository instead of the proposed repository before they had recompiled everything built against them, breaking many things, like the Network Manager on the old www/ftp machine and the mail milters on the mail servers.  They are feverishly working on correcting this, but in the meantime some of our machines are hanging on by a thread, and I will be re-loading Inuvik with 22.04 to get it back online.

Ice has a hard drive with one flaky sector.  Normally this would just get re-mapped onto an alternate sector, but the firmware on this drive is defective.  I could update the firmware, but the drive is 11 years old, has only a 64KB cache, and has 512-byte physical blocks, all of which make it slow by today’s standards.  So I have ordered a replacement drive with a 256MB cache, 4K physical blocks, and 7200 RPM rotational speed, all of which will provide better performance.  The failing drive is part of a RAID1 array so no data will be lost, as it is duplicated on its mate.

Ubuntu Down

     Ubuntu is down owing to a failed update yesterday.

     I attempted to restore from backups but during the update procedure things went south again.

     I am trying again; if this fails I will re-install.  In the meantime please use any other machine: debian, mint, zorin, and mxlinux are all similar Debian-derived systems.

New Server and Social Media Sites

     Still working on getting the new server resurrected.  I am having difficulties getting the RAID to auto-assemble at boot time and haven’t figured out why.  While investing $300 in a hardware controller could solve that issue, someone pointed out to me a reason to avoid that if possible: a hardware controller uses a proprietary on-disk format, and if it fails, unless I can get another controller from the same vendor, the data is lost, whereas software RAID can be read by ANY Linux system.  So software RAID is desirable from that standpoint.

Outage Difficulties

First, some background.

     Our old web server was overburdened, particularly when it came to RAM.  Also, it booted off a rotating disk and only the mariadb data was on NVMe storage, so it was slow to boot.  Linux likes having a lot more RAM than it needs because it uses any RAM not required by something else as I/O cache, and this reduces average disk latency considerably because frequently required items will already be in memory.  The cache was configured as write-back so the system never slowed waiting on writes.  The disks themselves had 512MB buffers, so even if it waited on the drive it would not have to wait for the physical write to media.

     So I decided to build a new server, and for this new server I had several things on the wish list.  One, it would address more RAM, and primarily for this reason I went with an i9-10900X CPU.  This CPU could address 256GB of RAM and it had four memory channels instead of two.  It also had ten cores and twenty threads, a step up from six cores and twelve threads.  The primary limit to this CPU’s performance is cooling.  It’s rated at a TDP of 165 watts but this is at the stock 3.7GHz clock.  One does not buy a binned ‘X’ CPU to run at stock speed.

     Some testing revealed this was electrically stable up to about 4.7GHz, but at 4.7GHz under load it drew 360 watts of power.  I used a Noctua NH-D15 cooler, but rather than use the stock quiet fans, I used some noisy aftermarket fans that produced about twice the CFM and about 10x the noise level; if you’ve ever been in a data center, you know noise is not a big concern.  With these fans, testing revealed that it could keep the CPU at or below 90C at 4.6GHz, and at that speed it drew 320 watts.
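The two measurements above can be compared directly. This is just arithmetic on the quoted figures; treating clock speed as a stand-in for performance is an assumption, since real workloads do not scale perfectly with clock.

```python
# Figures quoted in this post: (clock in GHz, power draw in watts).
FAST = (4.7, 360)
COOL = (4.6, 320)

clock_loss_pct = (FAST[0] - COOL[0]) / FAST[0] * 100
power_saving_pct = (FAST[1] - COOL[1]) / FAST[1] * 100

print(f"clock reduction: {clock_loss_pct:.1f}%")
print(f"power reduction: {power_saving_pct:.1f}%")
```

So backing off by 100MHz trades a clock reduction of roughly 2% for a power reduction of roughly 11%.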

     I wanted to avoid a water-based cooler because at home, if you get a leak, you ruin a few thousand dollars’ worth of equipment.  In a data center, if you get a leak, it goes into the under-floor power and you burn down a building and go out of business.

     So I only had to give up about 2-1/2% of the performance of this CPU to avoid water cooling, not bad.  Then I wanted everything on RAID and I wanted all the time-sensitive data on NVMe so it would go fast.  I tried to find a hardware NVMe RAID controller but if they make such a beast I was unable to find one.  I could only find “fake raid” devices; these work with Whenblows but not Linux.

     So I ended up going with software RAID.  The one thing I could not RAID was the EFI system partition, because this is read by the machine’s UEFI firmware, which does not know about Linux software RAID.  So while that was un-raided, I had duplicated the EFI system partition on each NVMe drive so if one drive failed the system would still be bootable, and all I had to do to keep them in sync was modify the scripts that installed a kernel to do a grub-install to both devices.
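A minimal sketch of the "grub-install to both devices" idea follows. The mount points are hypothetical, and the sketch only prints the commands rather than running them; the actual kernel-install hook scripts on our server are not shown here.

```python
import shlex

# Hypothetical mount points, one per NVMe drive's EFI system partition.
ESP_MOUNTS = ["/boot/efi", "/boot/efi2"]

def grub_install_commands(esp_mounts):
    """Build one grub-install command line per EFI system partition."""
    return [
        ["grub-install", "--target=x86_64-efi", f"--efi-directory={esp}"]
        for esp in esp_mounts
    ]

# Print the commands instead of executing them, so this is safe to run anywhere.
for cmd in grub_install_commands(ESP_MOUNTS):
    print(shlex.join(cmd))
```

A real hook would run each command and stop on the first failure, so that a broken copy does not go unnoticed.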

     And it worked for a while.  Then we lost our fourth router there (fried) and at that point I decided to spring for a Juniper router.  The reason I went with this brand is that when we first moved our equipment to the co-lo at ELI, they used Junipers and we never once had a data outage there, and they were not at all easy to packet flood, which is what made it possible for us to run IRC servers there.  After Citizens bought them, they sold the Junipers and replaced them with Ciscos, and packet flooding then took down the whole co-lo center, which basically left us in a situation where either we got rid of the IRC server or they got rid of us.  So having had such a good experience with the routers there I decided to go that route.  But it’s a command syntax I’m not entirely familiar with and I’m still learning (it is similar to Cisco’s but not the same).

     Meanwhile I decided to use one of the Linux boxes as a router, and I used the newest server only because at the time it was the only machine with multiple interfaces.  But it was not stable routing and I did not understand why, so after a bit I moved the routing to another machine into which I had put a 1G Intel ethernet card.  It ran for a bit, then ate its interface card and became unstable.  I had some spare cards but they all had Realtek chipsets.  What I didn’t know about Realtek is that the Linux drivers for them are absolute crap.  They work ok at 100Mb/s but at 1Gb/s they randomly lose carrier or cycle up and down.  So I put one of these cards in a machine and set it up to act as a router; that lasted about two days before it crashed.  I went over and found no carrier lights, but after playing with it for a while I thought ok, this is just a bad card, and so went to replace it thinking it was a 20 minute job.

     Three cards later, and now 10AM the next day, it still wasn’t working, so I drove from the co-location facility down to Re-PC and picked up an Intel-based industrial model 4-port card.  These are much more robust and require multiple PCIe lanes, so you need to use a big slot, but that’s ok as I only had a wimpy graphics card that only required one lane.  That solved the networking issue for now.  The Juniper will still be a better solution, but I could completely saturate the 1G interface so we’re not losing any speed with this arrangement.

     But the fun and games were still not over.  I got all of the machines up and running except the new web server.  For some reason it would not automatically assemble the RAID arrays and come up online.  It would go into emergency mode.  There I could type mdadm --assemble --scan and it would assemble the RAID partitions and I could mount them and bring the machine up, but if it crashed while I wasn’t there it would not come up on its own.  I spent until 6pm trying to troubleshoot and fix it.  In the past when this has happened it has always been either an issue with the EFI system partition, and I had already re-installed grub 32 times to no avail, or a problem with the initramfs, solved by re-creating it.  But neither of those things was the cause, and I wasn’t successful at locating the error causing it in the logs.

     So finally at 6pm I just re-installed Linux and resigned myself to recovering everything from backups.  I went home, and by then, 8pm, I had been working on this for about 33 hours without sleep (I had started working on it at home before deciding to go down and swap out the network interface cards).  So I went to sleep.

     This morning I proceeded to work on installing software, restoring things from backups, and getting the machine configured again.  Part of that process required a reboot, from which it did not recover.  So I drove to the co-lo thinking I had just forgotten to configure the proper boot partition in the UEFI bios or something like that, and instead found it in the same condition it was in before I re-installed Linux.

     But this time, after a number of attempts, I caught an error message it threw that was only on the screen for, I would guess, less than a tenth of a second.  What I noticed was that it started with initrd, suggesting an issue with the initramfs.  It took about ten more reboots to make out that the message was: initrd: duplicate entry in mdadm.conf file.

     So I checked and sure enough the system had added an entry identical to one of those I had entered by hand.  So I took the extra entry out, did a chattr +i on the file to mark it immutable so the operating system couldn’t modify it for me again, and went home, hoping I could finish restoring it to service, but when I got home it was again dead.
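A check like the one I did by eye can be sketched in a few lines. The config text and UUIDs below are made up for illustration, and duplicate_array_lines is a hypothetical helper; it only catches ARRAY lines that are identical apart from spacing.

```python
from collections import Counter

# Illustrative config text; the UUIDs are made up.
SAMPLE_CONF = """\
# mdadm.conf
ARRAY /dev/md0 metadata=1.2 UUID=11111111:22222222:33333333:44444444
ARRAY /dev/md1 metadata=1.2 UUID=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd
ARRAY /dev/md0 metadata=1.2 UUID=11111111:22222222:33333333:44444444
"""

def duplicate_array_lines(conf_text):
    """Return the ARRAY entries that appear more than once, whitespace-normalized."""
    entries = [
        " ".join(line.split())
        for line in conf_text.splitlines()
        if line.split() and line.split()[0].upper() == "ARRAY"
    ]
    counts = Counter(entries)
    return [entry for entry, n in counts.items() if n > 1]

for entry in duplicate_array_lines(SAMPLE_CONF):
    print("duplicate:", entry)
```

Running something like this against the real file before rebuilding the initramfs would have saved a lot of reboots.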

     I drove back to the co-location center (and it is 25 miles each way) and found it switched on but without power.  So I power cycled the power supply and it came back up, but by the time I got home it was dead again.  If I move the cord around it goes on and off, so I am assuming there is a bad connection from the pin on the end side, maybe a cold solder joint or something.  At any rate, I ordered a new supply which should get here between 2pm-6pm tomorrow and I will go back and replace it when it arrives.  Right now I also have one customer on this new machine, MartinMusic.com, so before I replace it I will try to grab the data for his website, just in case the problem is something else, so that I can put it on the old server until this one is solid.

     So hopefully I can get this stable and then go back to learning the Juniper syntax and get that installed.  Then I’m going to work on upgrading the old web server for other work.  The motherboard has one bad USB port on it now so not really sure how long it is going to last.

Outage

     I still have not restored some services, particularly the newer web server on which a few customers’ sites are hosted.  I’ve been at this constantly for about 33 hours without sleep and I have reached my physical endurance limit.  It was necessary to completely re-install one system’s operating system as something in the boot process had become corrupted and I could not figure out what.

     I will go into greater details when it’s all done, but right now I really must sleep.

Maintenance Outage

I plan to take most services offline between 11pm and 12am tonight, March 12th, 2024, for about twenty minutes to replace a failed network interface card in one of the servers.  Because this server provides some disk storage for most of the machines via NFS, most services will be unavailable during that interval.

Outage

     The machine that is doing double duty as a webhost and router is ill.  I have it running in a crippled state but if it reboots it will not come up automatically.

     For those who are interested in the technical details, and who may have encountered this before and can provide some hints: when it boots, it tries to create system users with systemd-sysusers.service, which runs systemd-sysusers as a one-shot.  It does not complete, however, and times out.

     The other thing that is not working is mdmonitor, which uses mdadm to monitor the RAID devices (everything on this machine is on RAID).

     So after both of these time out, I can log in to a single user shell, run them by hand, and they run fine.  Hence the big mystery.  I am going to go back tonight and try rebuilding the initramfs on the off chance it is broken.  If that fails I’m going to attempt an upgrade to 24.04; I did manage to get the old web server, which was doing something similar, working this way.

     But if worst comes to worst I am going to have to re-install the machine and that may take a while.  So service may be spotty tonight.  I apologize for that but sometimes things just don’t give you an option.