I was able to fix the existing power connector with epoxy; it no longer loses power if I wiggle it around. I also took the CPU off and it had definitely developed a bad air gap, almost the size of a penny, right in the middle of the heat spreader. I could tell because the compound had dried up there. The Thermal Grizzly Extreme is a red color that turns pink if it dries up, so it is very visually obvious. I am putting it back together with some lesser quality paste because that’s all I have until the new paste gets here, so I will change that and the power supply out at that time. I am going to switch from a Gigabyte 1000 watt to a Seasonic 1200 watt unit. So I will do all of that about a week and a half from now when everything arrives. I am going to let it burn in for a couple of hours and, if it does not crash, bring it back into the co-lo facility tonight.
Update on Inuvik (the newest server)
Inuvik presently provides service for: roundcube.eskimo.com, friendica.eskimo.com, hubzilla.eskimo.com, mastodon.eskimo.com, and yacy.eskimo.com (and I am working on some additional services).
It has been unstable, and after bringing it home I discovered the power connector was flaky. I ordered a new cable but it was the wrong cable. The motherboard has a 24-pin plus two 8-pin connectors, but the cable provided by Gigabyte with the PSU was a 20+4, and the issue is that the +4 does not have its own latch and as a result worked its way out of the socket.
So while I have ordered a new PSU, a 1200 watt platinum from Phanteks (actually manufactured by Seasonic), it has the same connector arrangement as the Gigabyte and will take a week to get here. I don’t want to leave the machine out of service again this long so…
I am going to attempt to remedy this by placing a dab of epoxy between the 20-pin and the 4-pin connector on the motherboard side to lock the 4-pin in lockstep with the 20-pin which does have a latch.
In about a week I will get the other supply and replace this whole mess. I like the wiring with the other supply better because it is not ribbon cable but individually sheathed wires, and they appear to be of a heavier gauge.
Other 20+4 connectors I have had included a locking mechanism to lock the two pieces together or a separate latch. This Gigabyte cable has no such mechanism and no separate clamp, a really piss-poor design in my opinion.
Friendica.eskimo.com, Hubzilla.eskimo.com, Mastodon.eskimo.com, Yacy.eskimo.com, roundcube.eskimo.com
The above services will be down until some time Saturday evening. This machine has been unstable and getting increasingly so. It does have a heat issue which I will resolve when some new heat sink compound arrives, but even downclocked so that it did not get too hot it was still crashing, sometimes when almost entirely idle.
When I got it back at my home to work on, I discovered the ATX power cable was flaky, and I could make the entire motherboard lose power by just wiggling the cable. This is probably introducing noise and unstable voltage to the CPU.
I have ordered a replacement cable and it should be here Saturday. There may be other issues but a bad power cable is definitely an issue so I will start by replacing it.
Inuvik Too Hot
Inuvik is running too hot. This machine was running at 4.8GHz through the small FFT torture test with 36 threads (2 threads per core) before I brought it over to the co-lo, but it is exceeding 96C now, though only on a couple of cores.
When you have a couple of cores running hot on a multi-core CPU but the rest are normal, this is usually indicative of an air bubble between the CPU and cooler, so part of the heat spreader is not receiving cooling. This is more pronounced with the i9-109x0 series of CPUs because the heat spreader is soldered to the die. On most microprocessors there is thermal compound between the die and the heat spreader. This creates some diffusion that does not occur when the die is soldered to the heat spreader, so any air bubbles are more critical.
I’ve ordered some more Kryonaut Extreme which should get here between October 1st and 3rd, at which time we will pull the machine from the co-lo for a few hours to clean the CPU and heat sink and re-paste it. I will perhaps be just a smidgen more generous with the paste this time. I am stingy not because of cost but because no matter how conductive thermal paste is, it is less conductive than the metals you are trying to transfer heat between, so you want as thin a layer as you can get away with. But the worst thermal paste is better than the best air, so a little too much is less bad than not quite enough, which appears to be the case presently.
Between now and then I’ve reduced the speed of the machine from 4.8GHz to 4.4GHz and the CPU voltage from 1.37V to 1.2V to reduce heat generation. This will reduce performance by slightly less than 10%, but given it’s around 97% idle time on the CPUs this should not be a problem, and it’s only temporary.
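The "a couple of hot cores, the rest normal" pattern can be checked mechanically. The following is a minimal sketch, not a diagnostic tool: the function name `hot_cores`, the 10 degree margin, and the sample temperatures are all hypothetical, and real readings would have to come from whatever sensor tool is in use.

```python
from statistics import median

def hot_cores(temps_c, margin_c=10.0):
    """Return indices of cores running more than margin_c above the median core."""
    mid = median(temps_c)
    return [i for i, t in enumerate(temps_c) if t - mid > margin_c]

# Hypothetical per-core readings (degrees C) for an 18-core CPU under load:
# most cores sit in the low 70s while two are far above the rest.
sample = [71, 72, 70, 73, 96, 71, 72, 97, 70, 71, 72, 73, 71, 70, 72, 71, 73, 72]
print(hot_cores(sample))
```

If only a few indices come back while the median stays reasonable, that is consistent with a localized contact problem rather than the cooler being overwhelmed as a whole.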
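To get a feel for how much heat this change sheds, here is a rough sketch using the common first-order approximation that dynamic CPU power scales with frequency times voltage squared. That model is an assumption on my part (it ignores leakage and everything else on the package), and the function name is made up for illustration; only the clock and voltage figures come from the post.

```python
def relative_dynamic_power(f_old, f_new, v_old, v_new):
    """Ratio of new to old dynamic power, assuming P is proportional to f * V^2."""
    return (f_new / f_old) * (v_new / v_old) ** 2

# Figures from the post: 4.8GHz -> 4.4GHz and 1.37V -> 1.2V.
clock_ratio = 4.4 / 4.8
power_ratio = relative_dynamic_power(4.8, 4.4, 1.37, 1.20)

print(f"clock: {clock_ratio:.1%} of original")
print(f"estimated dynamic power: {power_ratio:.1%} of original")
```

Under that approximation the clock drops by a bit over 8% while the estimated dynamic power drops by roughly 30%, which is why lowering the voltage matters more than lowering the clock.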
Right now this is more of an issue than it otherwise would be because there exists a bug in the kernel code when it writes to the MSR to change the CPU speed in response to excess temperature. If this bug did not exist the machine would simply have automatically downclocked, but this is a current bug affecting these particular CPUs.
Network Outage
When ice, the server that previously experienced the hard drive failure, failed, I moved the network connection to another server, inuvik. However, I had been using ice because it is the only machine which has all Intel network interfaces.
The other machines all have one Intel and one Realtek, thus to use them as a router requires the use of a Realtek NIC on those machines. The device drivers for Realtek network interfaces under Linux have been dodgy. Last time I had to do this we could not get the NIC to see carrier at 1Gb/s, so we had to run at 100Mb/s temporarily, but this time it saw carrier fine and was stable for a week, which is some kind of record. But when I got to the co-lo the NIC was totally locked up. Not even a reboot got it going; I had to power cycle the machine.
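For anyone wanting to check the same things on their own Linux box, carrier and negotiated speed are exposed under `/sys/class/net/<interface>/`. Below is a small sketch that reads them; the function names and the interface name `eth0` are placeholders, not anything from our setup.

```python
from pathlib import Path

def _read(path):
    """Read a sysfs attribute, returning None if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        # The kernel returns an error for these files when the interface is down.
        return None

def link_status(iface, sysfs_root="/sys/class/net"):
    """Report whether iface sees carrier and its negotiated speed in Mb/s."""
    base = Path(sysfs_root) / iface
    carrier = _read(base / "carrier")
    speed = _read(base / "speed")
    try:
        speed_mbps = int(speed) if speed is not None else None
    except ValueError:
        speed_mbps = None
    if speed_mbps is not None and speed_mbps <= 0:
        speed_mbps = None  # -1 means the speed is unknown
    return {"carrier": carrier == "1", "speed_mbps": speed_mbps}

# "eth0" is a placeholder interface name.
print(link_status("eth0"))
```

A link that negotiated at 100Mb/s instead of 1Gb/s would show up here as `speed_mbps` of 100 with carrier still true.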
So now that ice has a new drive, the network connection is moved back to it and things should be stable (I know, famous last words).
Apple iOS 18 and Apple iPadOS 18 Mail Bug
If you are thinking of upgrading to iOS 18, you may want to hold off as there is a bug in the mail application. You may be able to work around this by adding just a ‘/’ for the mail path, and you may not.
If you have already upgraded to iOS 18, you can use webmail on our website (under web apps) to access e-mail IF the ‘/’ fix does not work for you. Here are the contents of a ticket sans user info:
The mail app on my iPhone and iPad stopped working a few days ago. My subscription expires on 10/5, so I need to know if this is a fixable problem. I reloaded the app several times with all the correct info (below). It returns the following error messages:
Updated Just Now
Account Error: Eskimo.
Cannot Get Mail
The mail server "mail.eskimo.com" is not responding. Verify that you have entered the correct account info in Mail settings.
Server message "Mailbox doesn't exist: mboxes/buy/ebay/guns/s&wjframegrips"'
--------------------------------
I checked and the referenced file does exist. I also reloaded the app several times with all the correct info:
Incoming mail server
Host name: mail.eskimo.com
Outgoing mail server
SMTP - mail.eskimo.com
Primary server - mail.eskimo.com
Host name - mail.eskimo.com
Use SSL - On
Authentication - Password
Server port - 465
It turns out this is a bug introduced into the iOS 18 and iPadOS 18 mail app. One user has found that adding a / to the path (initially a null path) made it work for them.
The entire thread is here: https://discussions.apple.com/thread/255760038?sortBy=rank
Please keep me updated if the status changes, and/or if this workaround works for you.
Last Night – Projects
Last night I spent several hours fine tuning our newest machine, Inuvik. I was not able to get any faster CPU speeds; in fact I am running at 4.8GHz now because I did find that with some tests, notably long FFT tests, it showed some instabilities, even though in a week’s operation these have not manifested.
But I was able to significantly improve memory performance from 2133MHz to 3400MHz, and on an 18 core CPU every memory cycle you can get is precious.
I was also able to increase the mesh frequency slightly from 2.6GHz to 2.8GHz. The mesh is similar to the rings in lower core count CPUs; it is used to provide communications between the cores. I don’t know how much traffic this actually is in a Linux environment so I don’t know to what degree this helps, but every cycle one can get anywhere helps.
Ok, after that I turned to kernel upgrades, and the last machine to upgrade was the one used for our router. It had a drive that had previously shown some SMART errors, but after running a diagnostic they went away and it behaved until last night. Last night it absolutely would not boot. Strangest damned thing: I could read the drive and write the drive, but could not boot off of it. I’ve never seen a drive failure cause this so I assumed a software issue and re-installed grub, re-installed kernels, and re-built the initramfs. These are pretty much all the software components you should need to be able to boot, but no go. It would find and load grub but grub couldn’t find the kernel, very very odd. I was so convinced this had to be a software issue that I had to re-install 25 times to convince myself otherwise. I finally stole one of the drives out of the RAID array and turned it into a new system disk, and that worked. But it does not have everything it needs and it’s not a healthy young pup itself.
So for now I’ve moved the routing and all the virtual machines off this box. At present the two services that are down are the NIS master, which means you can’t change your password or login shell at the moment, and a DNS server, but we have six so that isn’t going to seriously impair things.
I have a new drive which I had purchased when the first drive started puking out SMART errors, and it also is a 7200 RPM drive with 4x the cache the old drive had. At present I’m copying all the data off the failed drive to the drive I stole out of the RAID array to bring the machine back up. Then I’m going to replace that old drive that has failed with the new one, recover any data I need for NIS and for the name server off that drive, then reformat the borrowed drive and return it to the RAID array.
I am working on a new video conferencing feature to add to our site shortly. I tried to get another suite working but it depended upon a message protocol that we are not able to get to work. This one uses infrastructure I am more familiar with so my chances are somewhat better.
I am also looking at RustDesk as a possible replacement for Guacamole because the developers have really turned Guacamole into an unmaintainable disaster. It really is oriented towards LDAP auth and our network isn’t, so that doesn’t work so well for us.
Inuvik
Our newest and most powerful server, Inuvik, is now restored to service. It has an i9-10980xe CPU clocked at 4.9GHz with 256GB of RAM. The big challenge to getting this operational was finding a motherboard that could reliably supply the enormous power requirements of this chip. While it is rated at a TDP of 165 watts, this is with only a single core clocked at no more than 4.8GHz and the remaining cores at a baseline of 3GHz. With all cores clocked at 4.9GHz under a heavy load such as the prime95/mprime #16 torture test, test #2 (small FFT), with 36 processes (all 18 cores provide hyperthreading), it can draw 540 watts. With a CPU core voltage of 1.32 volts, this works out to 409 amps, a lot for a PCB trace to handle, and in fact on the Asrock motherboard it melted the solder at the CPU socket. Better boards handle this by having a number of planes dedicated to power and ground.
At this point, friendica, hubzilla, roundcube, yacy, and the Manjaro shell server are all again operational. There are issues with Mastodon; an update left it broken and I’m still troubleshooting. Because of its refusal to run under a modern OS, I have Ubuntu 20.04 installed on a virtual machine that is then proxied to through the main machine via a private network. Something seems to have gone awry with this but I’m still trying to nail it down.
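The current figure is just power divided by voltage. A quick sketch of the arithmetic, using the numbers from the post (the function name is only for illustration):

```python
def core_current_amps(power_watts, core_volts):
    """Current drawn at a given power and core voltage (I = P / V)."""
    return power_watts / core_volts

# All-core torture test load at 1.32V, as measured above.
print(round(core_current_amps(540, 1.32)))  # 409

# For comparison, the rated 165W TDP at the same voltage.
print(round(core_current_amps(165, 1.32)))  # 125
```

So at this voltage the overclocked load asks the board to deliver more than three times the current that the rated TDP would.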
System Issues Resolved
There was some major weirdness this morning and afternoon. It started with the mail server not responding to NFS requests. Mail is a virtual machine on the Igloo physical host, so to reboot it I had to log in to Igloo. However, Ubuntu in their infinite wisdom has some system-wide scripts that run when you log in and, among other things, check the mail by looking at the local mail spool, which on all of the machines is NFS mounted from mail. At this time mail was still responding to imap, pop3, and smtp so it was still possible to use via Thunderbird, but this broke shortly after I posted about it.
It used to be NFS had a timeout and when a server did not respond, if you were patient you would eventually get past this. But apparently the default is no longer to time out.
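Whether an NFS mount gives up or retries forever is controlled by the `soft` and `hard` mount options, and per the nfs(5) man page a mount with neither option behaves as `hard`, which retries indefinitely. Here is a small sketch that reports the mode of each NFS mount from text in `/proc/mounts` format; the function name and the sample line are hypothetical and not taken from our servers.

```python
def nfs_mount_modes(mounts_text):
    """Return (mount_point, 'hard' or 'soft') for each NFS entry in /proc/mounts text."""
    results = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = fields[3].split(",")
        # With neither option present, NFS behaves as a hard mount.
        mode = "soft" if "soft" in options else "hard"
        results.append((fields[1], mode))
    return results

# Hypothetical sample line in /proc/mounts format.
sample = "mail:/var/mail /var/mail nfs4 rw,relatime,vers=4.2,hard,timeo=600,retrans=2 0 0"
print(nfs_mount_modes(sample))
```

On a live system the same function could be fed `open("/proc/mounts").read()`. A `hard` result would be consistent with a login that never gets past an unresponsive server.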
So I had to drive down to the co-lo facility to reboot that machine. I apologize that I did not hear the phone ring; I was sleeping heavy and late owing to being sick last night. Not sure what upset my stomach, but I upchucked in the middle of the night and my upset stomach made it difficult for me to get to sleep for many hours.
So when I got to the co-lo I could not reboot the physical host even with the three finger salute. It hung on shutting down guests. I had to forcibly reboot it with the magic SysRq key, Alt+SysRq (Print Screen)+B, to force a boot. Now I had a newer kernel prepared, three releases newer than the one in service, and I knew there were some memory leaks among other things fixed, so I thought I might as well install the new kernel on the physical hosts and mail while I was there.
This went ok on Igloo and Iglulik, but when I went to reboot Ice, it would not come up. Strangely, ice had swapped its drive letters between sda and sdb; sda had become sdb and vice versa. I had not moved the drives. A while back I had changed the UUIDs to device names because the blkid program at the time was unreliable, leading to occasional failed reboots. Now the machine was randomly swapping drive letters, so I put it back to UUIDs so it doesn’t care about the drive letters. If blkid becomes a problem again I’ll probably switch to labels, which honestly makes more sense anyway.
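To spot mounts that would break if the kernel enumerates drives in a different order, one can scan an fstab for entries keyed on raw device names rather than `UUID=` or `LABEL=`. This is a rough sketch only; the function name, the pattern, and the sample entries are all made up for illustration, and the pattern covers just the common sd/hd/vd/nvme naming schemes.

```python
import re

# Kernel-assigned names such as /dev/sda1 or /dev/nvme0n1p2 can change between boots.
UNSTABLE_DEVICE = re.compile(r"^/dev/((sd|hd|vd)[a-z]+\d*|nvme\d+n\d+(p\d+)?)$")

def unstable_fstab_entries(fstab_text):
    """Return (device, mount_point) pairs that depend on kernel device naming."""
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2 and UNSTABLE_DEVICE.match(fields[0]):
            flagged.append((fields[0], fields[1]))
    return flagged

# Hypothetical fstab contents; the UUID is a dummy value.
sample = """\
# <device> <mount> <type> <options> <dump> <pass>
/dev/sda1 / ext4 defaults 0 1
UUID=00000000-0000-0000-0000-000000000000 /home ext4 defaults 0 2
LABEL=swap none swap sw 0 0
/dev/sdb2 /var ext4 defaults 0 2
"""
print(unstable_fstab_entries(sample))
```

Entries using `UUID=` or `LABEL=` are left alone because they identify the filesystem itself rather than the order in which the drives were detected.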
I will be rebooting some of the shell servers and other non-physical hosts tonight to upgrade the kernels on them. I’m also going to try to find the script that is checking for mail and eliminate it on the physical hosts so I can reliably get into them if mail goes down again.
System Issues
There appears to be an issue with the mail server this morning. None of the NFS mounts to other systems are working and ssh isn’t working, but mail services are working so mail clients like Thunderbird are working. I am trying to get into the physical servers so I can reboot mail but it’s hanging on NFS. It will eventually time out but this may take some time.