Inuvik

Our newest and most powerful server, Inuvik, is now restored to service. It is an i9-10980XE CPU clocked at 4.9GHz with 256GB of RAM. The big challenge in getting this operational was finding a motherboard that could reliably supply the enormous power requirements of this chip. It is rated at a TDP of 165 watts, but that assumes only a single core clocked at no more than 4.8GHz and the remaining cores at a baseline of 3GHz. With all cores clocked at 4.9GHz under a heavy load, such as the prime95/mprime small FFT torture test with 36 processes (all 18 cores provide hyperthreading), it can draw 540 watts. With a CPU core voltage of 1.32 volts, this works out to 409 amps, a lot for a PCB trace to handle, and in fact on the ASRock motherboard it melted the solder at the CPU socket. Better boards handle this by having a number of planes dedicated to power and ground.
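As a quick sanity check on that current figure, here is a minimal sketch in Rust of the arithmetic (current is power divided by voltage). The function name is made up for illustration, and it ignores voltage regulator losses.

```rust
/// Current in amps for a given power draw in watts and supply voltage in volts.
/// Hypothetical helper for illustration: I = P / V.
fn current_amps(power_watts: f64, voltage_volts: f64) -> f64 {
    power_watts / voltage_volts
}

fn main() {
    // Figures from the paragraph above: 540 watts at a 1.32 volt core voltage.
    let amps = current_amps(540.0, 1.32);
    println!("{:.0} A", amps); // prints "409 A"
}
```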

At this point, friendica, hubzilla, roundcube, yacy, and the Manjaro shell server are all operational again. There are issues with Mastodon: an update left it broken and I’m still troubleshooting. Because of its refusal to run under a modern OS, I have Ubuntu 20.04 installed on a virtual machine that is then proxied through the main machine via a private network. Something seems to have gone awry with this, but I’m still trying to nail it down.

System Issues Resolved

     There was some major weirdness this morning and afternoon.  It started with the mail server not responding to NFS requests.  Mail is a virtual machine on the Igloo physical host, so to reboot it I had to log in to Igloo.  However, Ubuntu in their infinite wisdom has some system-wide scripts that run when you log in and, among other things, check for mail by looking at the local mail spool, which on all of the machines is NFS mounted from mail, so the login hung as well.  At this time mail was still responding to IMAP, POP3, and SMTP, so it was still possible to use via Thunderbird, but this broke shortly after I posted about it.

     It used to be that NFS had a timeout, and when a server did not respond you would eventually get past this if you were patient.  But apparently the default is no longer to time out.
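     For reference, the Linux NFS client retries forever unless a mount is marked soft.  The fstab line below is only a sketch of what a timing-out mount could look like: the server name mail is from above, but the /var/mail paths and the numbers are assumptions, not what is actually configured here.  Soft mounts also carry a risk of data corruption if a write times out, so this is not a free fix for a mail spool.

```
# /etc/fstab sketch (paths and values are assumptions)
# soft     = give up and return an error instead of retrying forever
# timeo    = wait per attempt, in tenths of a second (150 = 15 seconds)
# retrans  = number of retries before the error is returned
mail:/var/mail  /var/mail  nfs  soft,timeo=150,retrans=3  0  0
```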

     So I had to drive down to the co-lo facility to reboot that machine.  I apologize that I did not hear the phone ring; I was sleeping heavily and late owing to being sick last night.  Not sure what upset my stomach, but I upchucked in the middle of the night and my upset stomach made it difficult for me to get to sleep for many hours.

     When I got to the co-lo I could not reboot the physical host even with the three finger salute.  It hung on shutting down guests.  I had to forcibly reboot it with the magic SysRq key, Alt+SysRq (Print Screen)+B.  I had a newer kernel prepared, three releases newer than the one in service, and I knew it fixed some memory leaks among other things, so I thought I might as well install the new kernel on the physical hosts and mail while I was there.

     This went OK on Igloo and Iglulik, but when I went to reboot Ice, it would not come up.  Strangely, Ice had swapped its drive letters: sda had become sdb and vice versa.  I had not moved the drives.  A while back I had changed from UUIDs to device names because the blkid program at the time was unreliable, leading to occasional failed reboots.  Now that the machine was randomly swapping drive letters, I put it back to UUIDs so it doesn’t care about the drive letters.  If blkid becomes a problem again I’ll probably switch to labels, which honestly makes more sense anyway.
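     To illustrate the difference, here is a sketch of the three ways a root filesystem can be named in /etc/fstab.  The UUID, the label, and the ext4 type are placeholders, not the real values on Ice; blkid will show the actual UUID and label of a partition.

```
# /etc/fstab sketch; the UUID, label, and ext4 type are placeholders
# By device name: breaks if sda and sdb trade places
#/dev/sda1  /  ext4  defaults  0  1
# By UUID: follows the filesystem wherever the disk lands
UUID=00000000-0000-0000-0000-000000000000  /  ext4  defaults  0  1
# By label (set with e2label or at mkfs time): same idea, but readable
#LABEL=ice-root  /  ext4  defaults  0  1
```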

     I will be rebooting some of the shell servers and other non-physical hosts tonight to upgrade the kernels on them.  I’m also going to try to find the script that is checking for mail and eliminate it on the physical hosts so I can reliably get into them if mail goes down again.

System Issues Today

     At some point libvirtd on Igloo, the machine which hosts mail and a number of shell servers, failed.  Libvirtd is the server-side virtualization management daemon; it is responsible for starting and stopping kvm/qemu guests and for arranging their networking, storage, and system resources (it also handles xen, but we aren’t using xen here).

     This affected a number of machines including mail, and because every server NFS mounts the mail spool from mail, it affected the rest indirectly.

     The message that Igloo gave in syslog relating to libvirt was:

        libvirtd[2271]: internal error: wrong nlmsg len

     The “nlmsg” refers to Netlink, so it would appear something went wrong in networking and libvirtd didn’t know how to handle it and crashed.

     I don’t know exactly how long and how deep the outage was, since it was a gradual deterioration after libvirtd crashed.  I was going to add an automatic restart for libvirtd in systemd to prevent this specific failure in the future, but found one was already in place, though incompletely specified, so perhaps systemd choked.  I have corrected that.
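     For anyone wanting to do the same on their own machine, an automatic restart in systemd generally looks something like the drop-in below.  This is a sketch under assumptions, not a copy of what is on Igloo: the file path is the usual drop-in location that "systemctl edit libvirtd" creates, and the numbers are arbitrary.

```ini
# /etc/systemd/system/libvirtd.service.d/override.conf (assumed path)
[Unit]
# Stop retrying if it fails more than 5 times within 5 minutes
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
# Restart after a crash or non-zero exit, but not after a clean stop
Restart=on-failure
# Wait 5 seconds between restart attempts
RestartSec=5
```

     A hand-edited drop-in takes effect after "systemctl daemon-reload".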

     I received about eight tickets on this issue, and I really appreciate that the ticket system is being used, but with outages of this magnitude a phone call would also be good, because if I’m not actively at the terminal I may not be aware of issues.

Rust

     Rust is a new compiled programming language that uses a new memory management scheme.

     I first learned several assembly languages and then learned C, and because I learned assembly first and thus really think in terms of what the hardware does, I have not had issues with array bounds or de-referenced pointers but a lot of people have. In fact this tends to be what causes the majority of privilege escalation exploits.

     Many languages (Java, Python, Perl, BASIC, etc.) solved this by using a memory management technique known as garbage collection, but this method has severe performance issues. First, it can be difficult for the language to determine if a particular variable will ever be accessed again, thus memory release may be very delayed, resulting in wasted memory. But more significant is that garbage collection causes periodic halts in execution that can be very annoying.

     Enter Rust.  It introduced a new method of memory management in which you declare to the compiler how memory is used, in what contexts and time frames, and this enables the compiler to manage memory much as you would by hand, without the human error component.

     This makes Rust an ideal replacement for C, for those who are less disciplined, and for critical tasks, because, like C, it can approach assembler in efficiency, doesn’t introduce the periodic lags of garbage collection, and yet protects you against buffer overruns and pointer de-reference errors.  Now if they will only invent a text editor that corrects run-on sentences.
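     To make that a little more concrete, here is a minimal sketch of the ideas described above: ownership (who frees the memory), borrowing (lending it out without giving it up), and bounds-checked access.  The function names are made up for illustration.

```rust
/// Borrows the data read-only; the caller keeps ownership.
fn sum(values: &[u32]) -> u32 {
    values.iter().sum()
}

/// Bounds-checked read: returns None instead of reading past the end.
fn checked_read(values: &[u32], index: usize) -> Option<u32> {
    values.get(index).copied()
}

/// Takes ownership of the vector; its memory is freed when this
/// function returns, with no garbage collector involved.
fn consume(values: Vec<u32>) -> usize {
    values.len()
}

fn main() {
    let data: Vec<u32> = vec![10, 20, 30];

    println!("sum = {}", sum(&data)); // borrow: data is still usable afterwards
    println!("{:?}", checked_read(&data, 1)); // in bounds
    println!("{:?}", checked_read(&data, 99)); // out of bounds, no overrun

    let n = consume(data); // move: ownership passes to consume()
    println!("len = {}", n);

    // Uncommenting the next line is a compile-time error, because data was
    // moved above and no longer exists here:
    // println!("{:?}", data);
}
```

     The use-after-move mistake in the last comment is caught by the compiler rather than showing up later as a crash or an exploit, which is the point of the whole scheme.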

     I’ve installed the Rust compiler, rustc, on all of the shell servers and am working on installing it on the other machines, as it will be necessary in the future for kernel compilation.

     The newest version, 1.80, is on Fedora and Rocky8; Ubuntu and Zorin have the slightly older 1.75; and Debian and MxLinux have the even older 1.65.

Brief Web Outage 14:24-14:27 July 17th

     The brief web outage today, lasting approximately two minutes, was to apply a security update to the Apache server, bringing it up to 2.4.62, owing to vulnerabilities found in the previous version.

     At the same time, I also upgraded the kernel to 6.10.0.  The 6.10 kernel has some improvements that speed up encryption.

Fedora and Rocky8 Info

     Some update pushed on Rocky8 and Fedora broke rwho and ruptime on those machines.  They will still provide user status to other servers but are no longer pulling other servers for rwhod.

     Further, ruptime requires the ‘daemon’ command which is no longer available on these machines.

     At some point NIS will disappear from Fedora, and when it does we will be forced to retire that machine.  Because we don’t know when this will happen, we cannot predict the date.  Therefore we recommend NOT relying on that machine for anything.  If you have cron jobs on it, please move them to rocky8 or, if you do not need a Red Hat environment, one of the other servers.

Don’t Buy Epson!

I am a little bit more than pissed off at what Epson pulled. I had a WF2950 all-in-wonder inkjet printer / scanner / fax. It isn’t officially supported under Linux but it mostly works. The mostly being the need to boot Windows for firmware updates. Today it just stopped working; the scanner wasn’t seen anymore in Windows or Linux, but there was a firmware upgrade available, so I installed it. After doing so the scanner was seen again in Windows but not Linux. However, after the firmware upgrade it refused to do anything except complain about the non-Epson ink cartridges I have in it. These worked just fine prior to the upgrade. If they think I am going to pay as much as the friggin printer for ink that lasts about ten pages they got another thought coming. Previously I have gone with HP, and although they always tell you a non-HP cartridge is installed, they always have worked fine in spite of complaining. However, mechanically they aren’t the best built, with frequent issues with scanner parts breaking. But hell if I’m going to be held hostage by Epson. Once you pay for something you don’t expect them to take away functionality with firmware upgrades, but that is exactly what they did, so EPSON, FUCK YOU! I bought another HP!

Roundcube Is Back

     Roundcube mail is back.  It is in the WebApps->Mail->Roundcube menu or the direct path is https://roundcube.eskimo.com/.

     This is running on the new server and will be down between Wednesday evening and possibly Saturday, though I may be able to get it back up sooner.  I am replacing the motherboard, AGAIN, as the new motherboard is stable BUT has two dead memory slots, so the machine is only seeing 192GB of the 256GB of installed memory.

     Nonetheless, it is now an 18 core machine, 36 threads, at 4.8GHz all cores.

     The way I finally got roundcube working is that I installed Ubuntu 20.04 on a virtual machine on the new server.  I have this virtual machine listening on a non-routable IP address so that it cannot be reached directly from the outside world, which provides some degree of security for ancient software.  Then I have it proxied through the main web server on the new machine, which listens on both a public IP address and a private one that it can use to talk to internal servers.  Something about 24.04 breaks roundcube; I know not what.
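     For the curious, a proxy arrangement like this looks roughly like the sketch below, assuming the front-end web server is Apache with mod_proxy and mod_proxy_http loaded.  The 10.0.0.20 address is a made-up placeholder for the virtual machine’s private address, and the TLS directives are left out.

```apache
# Sketch only: the back-end address is a placeholder, TLS setup omitted
<VirtualHost *:443>
    ServerName roundcube.eskimo.com

    # Pass the original Host header through to the back end
    ProxyPreserveHost On

    # Forward everything to the virtual machine on the private network
    ProxyPass        / http://10.0.0.20/
    ProxyPassReverse / http://10.0.0.20/
</VirtualHost>
```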

New Shell Servers Rocky8 and PopOS

     As previously noted, centos7, scientific7, and now centos-stream are going away: centos7 and scientific7 at the end of this month, and centos-stream in a few hours, as Red Hat has already dropped the repositories for it ahead of schedule.

     I’ve brought two new distros online.  The first, rocky8.eskimo.com, is primarily a replacement for centos-stream, centos7, and scientific7.  It is much like CentOS WAS before Red Hat took it over.  Please let me know if there are applications that are presently missing or not working.

     The other new shell server I brought online is popos.eskimo.com.  Popos is a disguised Ubuntu (like Zorin, it is Ubuntu with some of their own stuff placed on top).  Because of the popularity of the release with the Linux community at large and the fact that Ubuntu-based systems are relatively easy to maintain, I’ve added it.

     If you have any other distros or services you would like to see here please contact nanook@eskimo.com.

Mail Lists and Some Aliases Out Of Service

Mail lists and some aliases are currently out of service owing to the aliases file on the client mail server somehow being truncated, with about half of it missing.  I am restoring from backups.  While this is in progress mail lists will not work, as ALL aliases for smartlist are gone and many users’ aliases are gone at the moment.

(Update: the aliases file is restored and lists are back in service.)

In other news, while waiting for the new CPU chip to get here I had the machine sitting in the BIOS configuration screen, and after about three days it shut itself down spontaneously.

Since I had already replaced the motherboard (some bent pins on the CPU socket necessitated this), I went ahead and replaced the power supply.  Prior to replacing it, the machine would not run at all above 4.3GHz and would not run stably even at 4GHz; now it is running at 4.7GHz, so there was some issue with the old supply: either a ripple problem or it was just not capable of supplying enough power on the 12V line.  Since this CPU can use as much as 540 watts, it requires a very robust supply.