I’m having difficulty accessing my Shoreline Post Office Box because they have taken to locking the building after hours. I’ve had a PO Box there for around three decades, but this has only recently become an issue. So if you are making a payment near your expiration date, please use a debit or credit card rather than mailing a check, because there may be a significant delay before I can pick up the check.
More Extended Maintenance
More extended maintenance is planned for the server that hosts /home directories and web service this Friday, March 18th, starting at around 11 PM and running through around 4 AM Saturday, Pacific Daylight Time (GMT-7). This time frame is approximate at best.
The mate to the drive that failed developed two bad sectors right after the failed drive was replaced, so I guess there is some value to the recommendation that you not buy the drives for a RAID array from the same place, since they will likely have been manufactured on the same date and thus be prone to simultaneous failure.
There also seems to be a firmware bug with these particular drives: they have two spare tracks for sector re-allocation, but neither drive automatically re-allocated its failed sectors. I don’t have a spare on hand, so I have one on order, but it probably won’t arrive in time for Friday’s maintenance.
This Friday, unless the spare arrives, the primary maintenance will be installing the new flash drive and copying the existing data over to it. I don’t know how long this copy will take, which is the main reason the time window is so uncertain.
If the replacement drive arrives by then, I’m going to change out the drive as well, even though it only has two flawed sectors. The firmware is supposed to handle re-assignment automatically, but two drives of the same model have failed to do so, so I assume this is a firmware bug. The new drives also have a 4x larger cache, so the swap is worth it from a performance standpoint anyway.
If the drive does not arrive in time for replacement Friday, then when it does arrive I’m going to attempt to manually force re-assignment, but I don’t want to do this until I have a spare, just in case I brick the drive. At any rate, I will replace it the next time it is convenient to do maintenance on a Friday night.
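For the curious, forcing a re-assignment usually amounts to writing the flawed sector so the firmware has a chance to remap it to a spare, then checking SMART to confirm. Below is a minimal sketch of that idea, assuming smartctl and hdparm are installed; the device path and LBA list are placeholders, not the actual values on this drive, and it should only be run against a drive that has been pulled from the array.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: overwrite known-bad sectors so the firmware can
re-allocate them, then confirm via SMART.  DEVICE and BAD_LBAS are
placeholders; only run this on a drive that is NOT an active RAID member."""

import subprocess

DEVICE = "/dev/sdb"          # placeholder: the drive with the flawed sectors
BAD_LBAS = [123456, 123457]  # placeholder: LBAs reported by the kernel or smartctl

def smart_attr(device: str, name: str) -> str:
    """Return the raw value of a SMART attribute, e.g. Current_Pending_Sector."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if name in line:
            return line.split()[-1]
    return "n/a"

print("pending before:", smart_attr(DEVICE, "Current_Pending_Sector"))
print("reallocated before:", smart_attr(DEVICE, "Reallocated_Sector_Ct"))

for lba in BAD_LBAS:
    # Writing the sector (hdparm fills it with zeros) gives the firmware a
    # chance to remap it to a spare; the data at that LBA is destroyed.
    subprocess.run(["hdparm", "--write-sector", str(lba),
                    "--yes-i-know-what-i-am-doing", DEVICE], check=True)

print("pending after:", smart_attr(DEVICE, "Current_Pending_Sector"))
print("reallocated after:", smart_attr(DEVICE, "Reallocated_Sector_Ct"))
```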
Only Accomplished Part of What I Planned
Replaced the failed drive, but did not install the new flash drive because the mounting screw was missing from the motherboard and there is a card in the way. So instead of using the socket on the motherboard, I’m going to get a PCIe adapter card for a whole $9 and put it in another PCIe slot at a later date. The failed drive has been replaced with a brand new drive that has 4x as much cache memory as the old one, so that should help performance a tad.
Eskimo North Extended Maintenance Outage 11PM March 12th – 4AM March 13th
Tonight I am going to perform hardware surgery on the machine that hosts home directories. As a result, ALL services except virtual private servers will be down for a number of hours.
The server which hosts the /home directory partition has an ailing drive in the RAID array for that partition. It has about seven bad sectors, which, if they were HARD failed, would not be a big deal: the drive would re-map them and life would go on. But they aren’t. Instead, if you write them and read them back immediately they pass, but sometime in the week or so following, reads will fail again.
If the mate to this drive in the RAID array were to fail, this would result in data corruption, so I’m going to replace this drive tonight.
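For reference, replacing a mirror member under Linux md is a short sequence: fail the ailing member, remove it, add the new one, and let md resync from the healthy drive. Here is a rough sketch of that sequence, assuming mdadm; the array and partition names are placeholders rather than the actual devices on this server.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of swapping a failing mirror member with mdadm.
ARRAY, FAILING, and REPLACEMENT are placeholders; run as root, with the
new drive already partitioned to match the old one."""

import subprocess

ARRAY = "/dev/md0"          # placeholder RAID-1 array holding /home
FAILING = "/dev/sdb1"       # placeholder: member with the soft-failing sectors
REPLACEMENT = "/dev/sdc1"   # placeholder: partition on the new drive

def mdadm(*args):
    subprocess.run(["mdadm", ARRAY, *args], check=True)

# Mark the ailing member failed and pull it from the array.
mdadm("--fail", FAILING)
mdadm("--remove", FAILING)

# (Physically swap the drive here if the replacement is not already installed.)

# Add the new member; md will resync it from the healthy mirror.
mdadm("--add", REPLACEMENT)

# Show the array state; the resync progress is visible in /proc/mdstat.
detail = subprocess.run(["mdadm", "--detail", ARRAY],
                        capture_output=True, text=True, check=True).stdout
print(detail)
print("Follow the resync with: watch cat /proc/mdstat")
```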
The other issue: when I tried to upgrade the kernel on this machine last night, the drivers for the Fusion I/O flash drive would not compile under the 5.16 kernel. Earlier kernels have a bug that can result in either data corruption or privilege escalation, either of which is undesirable. The Fusion I/O folks tell me it may be a while before the drivers are fixed, as there were extensive changes to the kernel.
So I am going to replace the Fusion I/O drive with a Western Digital Black 1TB drive, which is natively supported by the Linux kernel. The new drive is much larger, so I am going to put the root file system, the boot block, and the database on it. It will take some time to copy all this data and move the boot block to the new drive. The database copy should go fast since it is flash-to-flash, but the rest will take several hours. The new drive also does not conflict with the Broadcom NIC, so once this is completed and the Fusion I/O drive is removed, I can restore the Broadcom NIC, which handles hardware offloading properly.
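For those wondering what “copy all this data and change the boot block” involves, it is roughly an rsync of the root filesystem followed by reinstalling GRUB on the new drive. The sketch below shows the general shape, assuming a Debian/Ubuntu-style system with grub-install and update-grub; the device and mount-point names are placeholders, and /etc/fstab would still have to be edited by hand for the new UUIDs.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of migrating the root filesystem and boot loader to
the new drive.  NEW_DISK and NEW_ROOT are placeholders; the real partition
layout, fstab entries, and database paths will differ."""

import subprocess

NEW_DISK = "/dev/nvme0n1"   # placeholder: the WD Black drive
NEW_ROOT = "/mnt/newroot"   # placeholder: new root partition, already formatted and mounted

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Copy the existing root filesystem, preserving permissions, ACLs, xattrs,
# and hard links, while skipping pseudo-filesystems and the target itself.
run("rsync", "-aAXH", "--info=progress2",
    "--exclude=/proc/*", "--exclude=/sys/*", "--exclude=/dev/*",
    "--exclude=/run/*", "--exclude=/tmp/*", f"--exclude={NEW_ROOT}/*",
    "/", f"{NEW_ROOT}/")

# Install the boot loader on the new drive and regenerate its config
# from inside the copied system.
for fs in ("/dev", "/proc", "/sys"):
    run("mount", "--bind", fs, NEW_ROOT + fs)
run("chroot", NEW_ROOT, "grub-install", NEW_DISK)
run("chroot", NEW_ROOT, "update-grub")
```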
This will affect ALL services EXCEPT virtual private servers, which do not depend upon the site-wide /home directories. It will affect all web services, including virtual domains and hosting packages. I had hoped to put the new server in place, transfer these services over to it, and then fix the old one, but the security flaw being publicized in the Linux kernel no longer affords me that luxury.
This will also affect https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and our main website https://www.eskimo.com/.
Kernel Upgrades
Kernel upgrades are done except for four machines. One of them is just borked; somehow the image got corrupted, so it’s being restored from backups. I could not get the drivers for the Fusion I/O drive to compile under 5.16, so I’ve ordered a more mainstream, conventional SSD; it won’t provide quite the same I/O rate, but it should still be adequate. So the main physical server is still on 5.13.19 for now. UUCP also won’t work with a modern kernel; there is an issue with CentOS’s startup routines wanting a feature that was deprecated. Zorin is borked. And Manjaro had to be restored from backups but hasn’t been brought totally current with upgrades yet, as I am having some issues with the AUR processing.
Oops
I was doing half a dozen things at once, got into the wrong terminal, and accidentally rebooted one of the physical servers before I had intended to and without stopping the virtual servers first. Under this circumstance it takes forever and a day to reboot, so it will probably be 15-30 minutes before the mail system and many of the shell servers are available again. My apologies, but I was trying to get kernel upgrades in place to address a potential security issue, and two older machines required a lot of modification to run a modern kernel.
Fedora Broken
Something broke networking on Fedora. I thought it was the new kernel, since it failed after rebooting, but when booting off the old kernel, networking also does not start. So I am restoring from a backup, and then I will apply the kernel upgrade and the updates to bring it forward again.
Kernel Upgrades Between Wed March 9th and Saturday March 12th
Because a very nasty exploit has become known that could lead to privilege escalation, a kernel upgrade will be applied to all servers as soon as possible.
Because the exploit requires some sort of inside access to utilize, shell servers will be done first, then virtual private servers, and then the rest. The physical machines will be done Friday evening between 11pm and midnight. Most machines will take about five minutes, but the web server could take much longer, as it requires compiling special drivers for the SSD device we have on that machine for fast database access.
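Mechanically, the rollout is just the same upgrade-and-reboot pass run group by group in the order above. A rough sketch, assuming Debian/Ubuntu-style hosts reachable over ssh as root; the hostnames are placeholders, and machines like the CentOS box would need their own package commands.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of rolling the kernel upgrade out group by group:
shell servers first, then virtual private servers, then the rest.
Hostnames are placeholders, and the commands assume apt-based hosts."""

import subprocess

GROUPS = [
    ["shell1.example.com", "shell2.example.com"],   # placeholder shell servers
    ["vps1.example.com", "vps2.example.com"],       # placeholder virtual private servers
    ["web.example.com", "mail.example.com"],        # placeholder: the rest
]

def ssh(host, command):
    print(f"[{host}] {command}")
    return subprocess.run(["ssh", f"root@{host}", command]).returncode

for group in GROUPS:
    for host in group:
        ssh(host, "apt-get update && apt-get -y dist-upgrade")
        ssh(host, "reboot")   # the ssh session drops when the host goes down; that's expected
```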
This will affect all of our servers and services, but with the exception of the web server, all outages will be brief. Most of the work will be done during the evenings 10pm-2am, especially shell servers and virtual private servers. This will impact all eskimo.com services including https://friendica.eskimo.com/, https://hubzilla.eskimo.com, https://nextcloud.eskimo.com, e-mail, shell servers, private virtual servers, and virtual domains.
Unscheduled Reboot
Around 12:35 I needed to reboot the server that hosts the home directories, web server, Ubuntu, Debian, and Mint because something went wrong with the quota system that I was unable to identify and correct without a reboot.
Scheduled Maintenance March 5th Cancelled
The downtime for March 5th is cancelled. The drive in question has only a single failed sector, and SMART estimates more than six years of life remaining. It would not automatically re-allocate the failed sector on a read; apparently it can only do this on a write, which is odd since, as part of a RAID array, it could have gotten the data from the mirror drive. Oh well. Instead of physically replacing it, I’m going to fail it (stop the RAID from using it), overwrite that sector to force a re-allocation, then restore it to service, which will cause the RAID software to mirror it back from the operational drive. All of this can be done without interrupting service.
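In mdadm terms the plan above looks roughly like the sketch below: fail and remove the member, overwrite the flawed sector so the firmware re-allocates it, then add the member back and let md re-mirror it from the good drive. The array name, member partition, disk, LBA, and sector size are placeholders, not the real values on this server.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the fail / overwrite / restore sequence described
above.  All device names and numbers below are placeholders."""

import subprocess

ARRAY = "/dev/md0"       # placeholder RAID-1 array
DISK = "/dev/sdb"        # placeholder: the disk with the single flawed sector
MEMBER = "/dev/sdb1"     # placeholder: that disk's member partition in the array
BAD_LBA = 123456         # placeholder: disk-level LBA reported by the kernel
SECTOR_SIZE = 512        # placeholder: logical sector size of the disk

def run(*cmd):
    subprocess.run(cmd, check=True)

# 1. Stop the RAID from using the drive.
run("mdadm", ARRAY, "--fail", MEMBER)
run("mdadm", ARRAY, "--remove", MEMBER)

# 2. Overwrite the flawed sector; a write gives the firmware the chance to
#    re-allocate it to a spare.  The old contents are lost, which is fine
#    because the mirror drive still has good data.
run("dd", "if=/dev/zero", f"of={DISK}", f"bs={SECTOR_SIZE}",
    f"seek={BAD_LBA}", "count=1", "oflag=direct")

# 3. Restore the member to service; md re-mirrors it from the operational drive.
run("mdadm", ARRAY, "--add", MEMBER)
```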