Inuvik

     Our Inuvik server, which hosts manjaro, friendica, hubzilla, mastodon, and yacy, proved to be as unstable under kernel 6.11.4 as it was on 6.11.3, so I am headed over to the co-lo facility to reboot it back onto 6.11.2, which is stable.  Estimated return to service: 01:15 Pacific Daylight Time.

Maintenance Work Is Complete

     Inuvik now has a brand new Seasonic 1200 watt power supply.  It also has the good thermal paste, which lowered maximum CPU temperatures by about 10C, and I wire-tied the fans to the heat sink because the fan clips kept slipping off.

     It took me longer than expected because I forgot that two of the drives require a power adapter: they have a remote power control feature that repurposes one of the power pins, and without that adapter they will not power up.  In addition, I accidentally hit the BIOS erase button instead of the power button and had to reset all the settings, but it is done and operational.

     All of the services it supports, roundcube, yacy, friendica, hubzilla, and mastodon, are now operational.

Planned Maintenance

Just a heads up: I will be taking Inuvik down for about four hours tonight to replace a power supply.  The machine was initially very spotty after I brought it back online following a month or so of downtime to get a working motherboard in place, but after a week or so it settled down.

This is typical of failing electrolytic capacitors: the plates deform when there is no power, so the capacitors have leakage and less than their rated capacity when first used, but with power applied over time the plates re-form.  Still, they are on their way out and will eventually fail hard, and on a 1kW power supply a short circuit is not something you want.

Therefore I am going to replace it tonight, swapping it out for a Seasonic 1200 watt unit with a 12-year warranty, so hopefully I won't have to deal with this issue on this machine again any time soon.

Priorities, Immediate Work, and Future Plans

     During this outage, I learned that four resources are particularly important to have available on more than one machine:

  • DNS – Without this, incoming mail is returned with "no such address."
  • SSL Certificates – Without these, no encrypted services (mail, web, databases) can be started.
  • Mail spool – Without this, no mail services.
  • Home directories – Without these, no mail folders other than INBOX, no customer websites, and no shell services.

     All of these are single points of failure that bring important services down entirely.  Here are some general priorities I try to achieve.

  • No loss of customer data, e-mail or home directory contents.  I have been very successful at this objective.
  • No loss of incoming mail.  Not 100% on this mostly because of network outages making DNS unavailable.
  • The uninterrupted ability to send and receive mail.  There have been more severe issues in this area.
  • Uninterrupted availability of your websites.
  • The availability of some shell servers.  I do not strive to have all of them up all of the time because there is enough overlap that services available on any one server are also available on others. Because of the direct access to the OS on these machines, security is a more difficult challenge than on other services so I do require more time to address issues on these.
  • Ancillary services such as Nextcloud, Friendica, Hubzilla, Mastodon, and Yacy.

     Here are the plans I have at this point to address these issues.  They are not fully developed, which is part of the reason I am sharing this now: to get your input during the process.

My workstation at home is an 8-core i7-9700k with 128GB of RAM and about 20TB of disk.  It is on 24×7, though occasionally I boot Windows to play games, so it is not available 100% of the time; obviously, though, I am not going to be playing games during an outage.  Nothing I do really requires 128GB on this machine, I just happened to have an opportunity to acquire 128GB of RAM for about $36, so I did.  Since I am also running Linux on it, and I have five static IPs with Comcast, my idea for never disappearing from the net or losing incoming mail is to set up a virtual machine on this box and install bind and postfix on it.  Bind would give us a name server outside of the co-lo facility, so if our network connection goes down we will still have a working name server.  Postfix would be set up as a store-and-forward server, that is, a lower-priority MX: if the first two MX hosts are unreachable, mail will come to it and be stored, and when the primary servers come back online it will forward that mail to them.  This would address the second issue, never losing incoming mail.
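
     As a rough sketch of how that could be wired up, and purely illustrative (the mx3 host name, the priorities, and the postfix settings below are assumptions, not our actual configuration), the DNS side is just an additional lower-priority MX record and the postfix side is a small backup-MX relay configuration:

; zone file excerpt – mx3 would be the store-and-forward host at my home
eskimo.com.    IN  MX  10  mx1.eskimo.com.
eskimo.com.    IN  MX  20  mx2.eskimo.com.
eskimo.com.    IN  MX  30  mx3.eskimo.com.

# /etc/postfix/main.cf on the backup MX
# Accept and queue mail for the domain, relaying it to the primaries when they return.
relay_domains = eskimo.com
# Hold queued mail for up to ten days while the primaries are down (the default is five).
maximal_queue_lifetime = 10d

     Postfix queues anything it cannot deliver immediately and retries on its own, which is what makes it work as a store-and-forward server with essentially no extra machinery.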

     The third issue is more difficult to address because the mail spool is a single point of failure.  I could use rsync to maintain a near-real-time duplicate on another machine, but here is the issue: if we switch to that duplicate during an outage of the primary server, then when the primary returns we would stop primary incoming mail, let new mail go to the store-and-forward server, and rsync any changed mail spools back to the original spool directory; any mail that arrived on the primary after the last rsync to the secondary spool would be lost when the secondary spools are synced back.  I have to do some experimentation to determine how often rsync can reasonably be run and how small that window can be made.
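
     A minimal sketch of the duplication side, assuming a hypothetical standby host named spoolbackup and spools under /var/spool/mail (both are placeholders), would be a cron job along these lines:

# /etc/cron.d/spool-sync (sketch): push the mail spool to the standby every five minutes.
# -a preserves ownership, permissions, and times; --delete drops spools for removed accounts.
*/5 * * * * root rsync -a --delete /var/spool/mail/ spoolbackup:/var/spool/mail/

     How small that five-minute window can reasonably be made is exactly the experimentation mentioned above.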

     I can do something similar with home directories.  This is less problematic than the mail spool, because the mail spool holds all of a given user's INBOX mail in one file, while most home directory files do not change nearly as rapidly; only those people who use procmail to sort mail into folders risk any loss in this case, and we can rsync any files with a more recent update back when primary storage comes back online.  If we can duplicate home directories, then duplicating the web server is pretty trivial; in fact, when we get the big machine stable we will have two web servers operational under normal circumstances.
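
     The reverse sync when primary storage returns can lean on rsync's -u (--update) flag, which skips any file that is already newer on the destination, so only files changed on the standby during the outage get copied back.  Host and path names here are again placeholders:

# Copy back only files modified on the standby while the primary was down.
rsync -au /home/ primary:/home/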

So while this is not totally thought out, I am letting you know how I plan to address these issues, and I am open to input.  In particular: given that there is some risk of losing mail received between the last rsync and the primary system going down, is that risk worth having access to mail during an outage of the primary server?

     Now, in the more immediate future: the replacement motherboard for Ice arrived.  I am not 100% sure whether the problem is the motherboard or the power supply; I replaced the supply with one I had on hand and still had the same problem, but I am not certain that supply isn't also dodgy, as it is from the same vendor and I do not remember its history.  At any rate, I am going to try replacing the motherboard tonight, and if the machine works I will return it to the co-lo facility Friday evening, take down Inuvik, which has friendica, hubzilla, mastodon, yacy, and roundcube on it, and bring it home to replace its power supply, probably returning it Saturday depending on the time frame.  The power wiring on that machine is kind of a nightmare, but I should be able to replace the supply in one night.

     Lastly, I am preparing kernel 6.11.1 for installation; 6.11 fixes a couple of issues.  6.10.x had a problem with some of our CPUs when it came to changing clock speeds in response to load: the kernel detects an error when writing an MSR, a register in the CPU that controls, among other things, the clock multiplier.  The write actually succeeds, so clock speeds do change appropriately, but the kernel does not know it succeeded and so generates kernel splats.  This is fixed in 6.11.x.  I will apply the update while I am at the co-lo, so there will be a brief (around 2-3 minute) interruption in every service.

Outage – Post Report

     To the best of my knowledge, everything is back online now, but not all of the hardware is, so some things will not be as fast as usual.

     We have four physical hosts providing the services, with numerous virtual machines running on them.  Two hosts are i7-6700k 4-core/8-thread systems with 64GB of RAM, one is an i7-6850k 6-core/12-thread system with 128GB of RAM, and one is an i9-10980xe 18-core/36-thread system with 256GB of memory.  Of these four machines, the last is really the big workhorse, as it has more CPU than, and as much memory as, all of the other machines combined.  It also has dual NVMe drives in RAID and 16TB hard drives in RAID.  I had intended this to be the model for our next-generation servers.  Even though this CPU design is five years old, Intel has not produced any non-Xeon systems capable of addressing this much memory since, and memory is our biggest constraint.

     The newest system has become unstable, and unstable in a bad way: instead of merely rebooting and coming back up, it hard hangs.  I have had this very issue with this server before, and the last time around it ended up being a bad power supply.  In the meantime, the Asus motherboard also developed an issue where it would not see one of the memory channels.  That is typical of a bent pin on the CPU socket, except I had not had the CPU out of the machine.

     So I bought a replacement Asus motherboard, and it had exactly the same issue.  Asus support told me the memory I was using was not compatible (even though it had previously worked), so at that point I decided to try another company and went with an ASRock motherboard.  That motherboard ran four hours and then died with a puff of smoke; upon examination, it had melted the soldered connection between the power input and the CPU socket.  The i9-10980xe is an extremely power-hungry chip and can draw as much as 540 watts with all cores fully busy at 4.8GHz.  Even though the ASRock motherboard was designed for the i9-xx9xx series of CPUs, it was designed for earlier models that had fewer cores, addressed less memory, and so were not as power hungry.

     So I then bought a Gigabyte board.  I went this route because, like Asus boards, they are designed for overclocking and thus have much more robust power and ground traces to handle the requirements of monster CPUs.  Initially all was well; it ran stable at 4.8GHz with all cores loaded and no issues.

     However, after a bit of operation it started locking up.  When I checked the temperatures they were high, even though I had previously tested under full load and the CPU never got hotter than 62C.  What had happened is that I had not used enough thermal paste, and an air gap had developed between the CPU heat spreader and the cooler, right in the middle of the heat spreader, so the cores near that area were overheating.  I fixed that, but it still wasn't entirely stable.

     The power supply I used initially, which subsequently died, was a Thermaltake.  When it failed, I replaced it with a Gigabyte PSU, my thinking being that since they make components designed for overclocking, the PSU should be, like the motherboard, more robust.  Apparently my thinking was wrong.  Net wisdom seems to suggest the best units are now made by Seasonic; I actually ordered through Phanteks, but the supply is a rebranded Seasonic.  This time around I went with a higher-rated supply so it will be less taxed.  The prior two supplies were 1000-watt units, which, with a CPU maxing out at 540 watts, a very minimal graphics card at maybe 50 watts, perhaps 100 watts worth of drives, and another 100 watts of fans, should have been enough, but at full load that puts them at the upper end of their capability.  So this time I bought a 1200-watt unit so it has a bit more headroom.

     That power supply will arrive Monday.  Then, at 4:00AM Sunday morning, we disappeared from the Internet.  The machine I was using as a router, which had also been rock solid, died.  So I moved the network to another machine with dual NICs, but one of those NICs was a Realtek, and the Linux Realtek drivers do not work well and cannot operate at 1Gb/s, so it had to run at 100Mb/s.  That proved totally inadequate: lots of packet loss and very bad performance.

     I went back to the co-lo, took the network card out of the failed machine (Ice), and put it into Iglulik; when I powered Iglulik back up, it would not boot.  I took the card out and it still would not boot, so I put the card into a third machine, and then that machine would not boot either.  So now I was in a situation where I had three dead machines plus one that periodically locks up and has an interface that only works at 100Mb/s, so I moved the net back to that machine and proceeded to try to diagnose the others.  The easiest machine to get back online was Igloo.  I could get a grub prompt but not boot fully into Linux; the fact that I could get to the grub prompt suggested the hardware was OK and just the boot configuration had gotten mangled, so I rebuilt the initramfs, re-installed grub, and it came up and ran.  This at least allowed us to have DNS and a working incoming mail server, although it could not accept SSL-encrypted connections because the encryption certificates were not available.
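
     For those curious, that kind of repair runs roughly along the following lines from a rescue boot on a Debian-family system; the device names are placeholders, and the exact steps on Igloo differed in the details:

# Mount the root filesystem and chroot into it (/dev/sda2 stands in for the real root partition).
mount /dev/sda2 /mnt
for fs in dev proc sys; do mount --bind /$fs /mnt/$fs; done
chroot /mnt /bin/bash
# Inside the chroot: rebuild the initramfs for all installed kernels, then re-install grub.
update-initramfs -u -k all
grub-install /dev/sda
update-grub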

     I brought Iglulik back home; this machine is particularly important because it has the /home directories and the SSL certificates.  I could not even get a grub prompt, and what is more, I could only see six of the seven drives present in the machine.  Everything on this machine is RAIDed except for the root partition, because at the time I did this build I did not know of any way to boot off software (mdadm) RAID.  I have since figured that out, so Inuvik is 100% RAIDed except for the EFI partition, and even that is replicated, just manually rather than by software RAID.  So of the seven drives, did one of the RAIDed drives fail?  No, the drive with the root partition failed.
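
     Replicating the EFI partition manually, for what it is worth, amounts to little more than copying the contents of the primary EFI system partition to a matching partition on the second drive whenever the bootloader changes.  Something like the following sketch, where the device name is a placeholder rather than Inuvik's actual layout:

# Keep the second drive's EFI system partition in sync with the primary ESP.
mkdir -p /mnt/efi2
mount /dev/nvme1n1p1 /mnt/efi2
rsync -a --delete /boot/efi/ /mnt/efi2/
umount /mnt/efi2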

     So I replaced the drive and then tried to restore from backups.  The problem was that when I mounted the partition labeled backup, there was nothing on it.  At this point I began to wonder if I had been hit by some malicious virus, but at any rate I also had backups on my home workstation as a guard against a ransomware attack.  I tried to restore from those, but they were corrupt.  Now I was faced with having to rebuild from scratch, which could potentially take weeks.  But then I mounted all of the partitions and found that the one labeled libvirt, which was supposed to hold virtual machine images, actually contained the backups, and I was able to restore.

     While it was restoring (at this point I had been up for nearly 48 hours straight, and at 65 I don't handle that well anymore), I slept for four hours.  When I got up the restoration was finished, but the machine still would not boot.  There was something wrong with the initramfs, but I could not determine exactly what.  Eventually I noticed it was trying to mount the wrong root partition UUID: when I restored the system the file systems had to be re-created, and so they had new UUIDs.  I fixed the fstab and it still would not work, and I was up all night again last night chasing this down.  I finally discovered that /etc/initramfs-tools/conf.d/resume also had the wrong UUID in it and fixed that.  Now the machine was bootable and ran.

     Because I knew it was going to have to do more than it was originally intended to do until I get the remaining machines repaired, I attempted to remove some unnecessary bloatware.  For example CUPS: not needed, since we do no remote printing from this machine, and there is a security issue that makes it wise not to have it on servers anyway.  Also the bluetooth software: the machine has no bluetooth hardware, so that did not make a lot of sense, so I removed it.  Then I found wpa_supplicant.  I did not know what it was for, so I looked it up, and the material said it was for managing wireless connections; there is no wireless hardware either, so I removed it, and then the machine got very, very sick.  What the online material did not tell me is that wpa_supplicant is also a back end to NetworkManager and is tied into the dbus daemon, and removing it breaks both.  With both broken, the machine was so insane that I had a very hard time getting wpa_supplicant re-installed, and even once re-installed it still did not work.  I finally determined it was necessary to re-install NetworkManager as well, and got it working.  I took the machine back to the co-lo facility around 7am and installed it.
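
     In general terms, the UUID cleanup after restoring onto re-created file systems boils down to the following; the file paths are the standard ones for the Debian-style initramfs-tools setup that the resume file implies, and the details naturally vary by machine:

# Find the new UUIDs of the re-created file systems.
blkid
# Update /etc/fstab and /etc/initramfs-tools/conf.d/resume (the RESUME=UUID=... line)
# to match the new UUIDs, then rebuild the initramfs so the change takes effect at boot.
update-initramfs -u -k all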

     This was enough to mostly get us back into service, except that I had to pull backups from the other machines to get their functionality up on this hardware.  So I got things restoring and went to sleep, got up after about four hours and started up those services which had been recovered, mostly the virtual private servers, went back to bed, slept another six hours, then got up and restored the remaining things to service.

     During this time, particularly on the second day, Tuesday, I got some calls while I was frantically trying to figure out what was wrong with Iglulik even after I had replaced the drive and restored from backups, and I was somewhat rude to a few people.  I apologize for this, but as I said earlier, at 65 I do not have the endurance I had at 25.

     I have also gotten a bunch of hate calls and hate mail about our reliability, but here is my dilemma: I have not raised prices in 30 years, yet if I even whisper a hint of doing so, people threaten to bolt, and by the same token the only way to provide more reliability is more redundancy.  For example, with enough disk to maintain near-time copies of home directories and mail directories, it would have been possible to maintain at least minimal services.  I ask people to refer more people to us, because that is another way to increase income and provide some of these things, but that only happens minimally.  This is a rather niche operation, so I do understand that.

     And on the opposite side of the hate mail and calls, I also got calls from people who appreciated my efforts, and I want you to know I appreciate your patience.

Another Machine Died around 4AM

     This machine was acting as our router; it is the only machine aside from Inuvik that has more than one Intel NIC.  So presently Iglulik is playing router, but it only has one Intel and one Realtek interface.  The Realtek is supposed to be capable of 100Mb/s, 1Gb/s, and 2.5Gb/s, but the Linux drivers for it are seriously broken and only function at 100Mb/s, hence we are at 1/10th our normal network speed.
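
     For reference, pinning a link at 100Mb/s when a driver misbehaves at higher speeds is a one-liner with ethtool; the interface name below is a placeholder, not necessarily what is in use on Iglulik:

# Force the flaky Realtek interface to 100Mb/s full duplex with autonegotiation off.
ethtool -s eth1 speed 100 duplex full autoneg off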

     The newly broken machine houses some shell servers and all of the private virtual machines, and it died too rapidly to copy any data off.  This is the machine in which I just replaced the failed drive the other day.  There are indications it may just be a BIOS battery: it will not save settings nor let me load defaults.  So I am going to start by changing that.  If that does not fix it, I am going to put its drive into my workstation, transfer all the files onto a 2TB flash drive, take them back to the co-lo to load onto the remaining machines, and also stop by RE-PC and pick up another Intel NIC.  And then go from there.

     Right now the following shell servers are operational:

        popos, rocky, fedora, debian, mxlinux, and manjaro.

Inuvik Restored to Service

     Inuvik is back in service.  I fixed the cable connector by epoxying the 4-pin section to the 20-pin section, effectively making it a single 24-pin connector, so the latch on the 20-pin section holds everything in.  I no longer get intermittent power when I move the cable around.

     I discovered that an air gap about the size of a penny had developed between the CPU heat spreader and the cooler, so I re-gooped the whole thing, though this time I did not have the higher quality compound.  Still, at 4.8GHz under the most extreme CPU torture test I have, it maxed at 82C and averaged about 70C; with the better compound and no gaps it will max at about 62C, but this is adequate.

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +33.0°C (high = +76.0°C, crit = +86.0°C)
Core 0: +30.0°C (high = +76.0°C, crit = +86.0°C)
Core 1: +29.0°C (high = +76.0°C, crit = +86.0°C)
Core 2: +29.0°C (high = +76.0°C, crit = +86.0°C)
Core 3: +31.0°C (high = +76.0°C, crit = +86.0°C)
Core 4: +33.0°C (high = +76.0°C, crit = +86.0°C)
Core 8: +30.0°C (high = +76.0°C, crit = +86.0°C)
Core 9: +31.0°C (high = +76.0°C, crit = +86.0°C)
Core 10: +28.0°C (high = +76.0°C, crit = +86.0°C)
Core 11: +31.0°C (high = +76.0°C, crit = +86.0°C)
Core 16: +31.0°C (high = +76.0°C, crit = +86.0°C)
Core 17: +30.0°C (high = +76.0°C, crit = +86.0°C)
Core 18: +31.0°C (high = +76.0°C, crit = +86.0°C)
Core 19: +30.0°C (high = +76.0°C, crit = +86.0°C)
Core 20: +32.0°C (high = +76.0°C, crit = +86.0°C)
Core 24: +31.0°C (high = +76.0°C, crit = +86.0°C)
Core 25: +30.0°C (high = +76.0°C, crit = +86.0°C)
Core 26: +29.0°C (high = +76.0°C, crit = +86.0°C)
Core 27: +30.0°C (high = +76.0°C, crit = +86.0°C)

     This is what the temps look like now; you will notice there is only about a 5C spread between the hottest and coolest cores, where before that spread was 20C.

     So I would like to have the additional headroom, but this should suffice for now.  I have a new power supply on order, as this one is a little dodgy and sags under load more than I would like.  The current supply is a 1000-watt Gigabyte; the replacement is a Phanteks 1200-watt, which is a rebranded Seasonic.  It has a 12-year warranty, so at least financially, if it dies it will be someone else's problem.

     I don't know whether this is going to fix the stability issues or not, but I did run the most extreme CPU torture test software I have, at 4.8GHz, all cores, two threads per core, for an hour before I took it back to the co-lo, with no CPU errors.

     So back in service are roundcube.eskimo.com, friendica.eskimo.com, hubzilla.eskimo.com, and yacy.eskimo.com.  With all of these running full tilt, the CPU is still about 97% idle, and I am working on some additional new services.

Inuvik Update

     I was able to fix the existing power connector with epoxy; it no longer loses power if I wiggle it around.  I also pulled the cooler off the CPU, and it had definitely developed a bad air gap, almost the size of a penny, right in the middle of the heat spreader.  I could tell because the compound had dried up there; Thermal Grizzly extreme is a red color that turns pink if it dries up, so it is visually very obvious.  I am putting it back together with some lesser quality paste because that is all I have until the new compound gets here, at which point I will change that and the power supply out at the same time, switching from a Gigabyte 1000 watt to a Seasonic 1200 watt unit.  I will do all of that about a week and a half from now when everything arrives.  For now I am going to let it burn in for a couple of hours, and if it does not crash, bring it back to the co-lo facility tonight.

Update on Inuvik (the newest server)

     Inuvik presently provides service for: roundcube.eskimo.com, friendica.eskimo.com, hubzilla.eskimo.com, mastodon.eskimo.com, and yacy.eskimo.com (and I am working on some additional services).

     It has been unstable, and after bringing it home I discovered the power connector was flaky.  I ordered a new cable, but it was the wrong cable.  The motherboard has a 24-pin plus two 8-pin connectors, but the cable provided by Gigabyte with the PSU was a 20+4, and the issue is that the +4 section does not have its own latch and as a result worked its way out of the socket.

     So while I have ordered a new PSU, a 1200-watt platinum from Phanteks (actually manufactured by Seasonic), it has the same connector arrangement as the Gigabyte and will take a week to get here.  I don't want to leave the machine out of service that long, so…

     I am going to attempt to remedy this by placing a dab of epoxy between the 20-pin and the 4-pin connector on the motherboard side to lock the 4-pin in lockstep with the 20-pin which does have a latch.

     In about a week I will get the other supply and replace this whole mess.  I like the wiring on the new supply better because it is not ribbon cable but individually sheathed wires, and they appear to be of a heavier gauge.

     Other 20+4 connectors I have had included a mechanism to lock the two sections together, or a separate latch.  This Gigabyte cable has no such thing and no separate clamp, a really piss-poor design in my opinion.

Friendica.eskimo.com, Hubzilla.eskimo.com, Mastodon.eskimo.com, Yacy.eskimo.com, roundcube.eskimo.com

     The above services will be down until some time Saturday evening.  This machine has been unstable and getting increasingly so.  It does have a heat issue, which I will resolve when some new heat sink compound arrives, but even downclocked so it did not get too hot, it was still crashing, sometimes when almost entirely idle.

     When I got it back home to work on, I discovered the ATX power cable was flaky: I could make the entire motherboard lose power just by wiggling the cable.  This was probably introducing noise and unstable voltage to the CPU.

     I have ordered a replacement cable and it should be here Saturday.  There may be other issues but a bad power cable is definitely an issue so I will start by replacing it.