Kernel upgrades have been completed. All servers are back in service. All NFS and NIS relationships are re-established. All services are up.
We will be upgrading to kernel 5.19.4 this Friday at 11PM. I don’t know whether this will address the CPU stalls. The number of changes to the kernel since 5.19.3 is so large that I don’t have time to review them all to see whether any of them affect this problem. I do know that the issue is apparently not in the KVM code itself, as it has also occurred on the physical hosts and in at least four different parts of the kernel, so the cause is probably a function all of these are using or something doing wild writes in memory. The ticket I opened is still not resolved. But since we know that this problem exists in at least 5.19.0 through 5.19.3, and going back to 5.17 is not an option because it’s past EOL and there are some serious exploits not fixed in that kernel, moving forward is the only viable option at this point.
The expected interval is approximately 1/2 hour with no single service being down for more than about 10-15 minutes.
This will affect ALL of Eskimo’s host services including virtual private servers, web and e-mail hosting, and our fediverse services https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com/, and https://yacy.eskimo.com.
Because of a severe mailbomb I experienced over the last few days, it was necessary to block incoming mail from several hundred .au domains with improperly configured mail servers. These servers bounce mail sent to non-existent addresses by storing the e-mail, looking up the incoming address, and sending it back, rather than refusing it up front. At the same time, they don’t bother to check SPF, DMARC, or DKIM records to make sure the address they are bouncing to isn’t forged.
So if you are expecting mail from Australia and do not receive it, generate a ticket at https://www.eskimo.com/support/osTicket/ and let me know, and I’ll check to see if it’s on the blocked list. This ban is not intended to be permanent, just until the person initiating the mailbomb gets bored and moves on.
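The difference between refusing mail up front and bouncing it later can be sketched in a few lines of Python. This is only an illustration, not the configuration of any real mail server: the function names, the mailbox set, and the addresses are all hypothetical. It shows why a server that accepts everything and bounces afterwards ends up mailing whatever sender address was claimed, forged or not.

```python
from typing import Optional, Set, Tuple


def rcpt_response(recipient: str, known_mailboxes: Set[str]) -> Tuple[int, str]:
    """Reply to an SMTP RCPT TO command (hypothetical helper).

    An unknown recipient is refused during the SMTP session with a 5xx
    reply, so this server never has to generate a bounce message itself.
    """
    if recipient.lower() in known_mailboxes:
        return 250, "OK"
    return 550, "5.1.1 No such user"


def bounce_target(recipient: str, envelope_sender: str,
                  known_mailboxes: Set[str]) -> Optional[str]:
    """Model a server that accepts first and checks the mailbox later.

    If the mailbox turns out not to exist, the bounce goes to the claimed
    envelope sender. When that sender is forged, an innocent third party
    receives the bounce (backscatter).
    """
    if recipient.lower() in known_mailboxes:
        return None  # delivered normally, no bounce generated
    return envelope_sender


# Assumed example data, for illustration only.
mailboxes = {"alice@example.com"}
print(rcpt_response("nobody@example.com", mailboxes))
print(bounce_target("nobody@example.com", "victim@example.org", mailboxes))
```

In the first case the sending server is told immediately that the address is bad; in the second, the forged sender address receives the returned message.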
The mail system maintenance is complete.
I forgot the most important part of this message: I will be taking mail down for about ten minutes at 11pm tonight to make another backup to capture the corrected NFS configuration.
Created attachment 301614: The configuration file used to compile this kernel.

This behavior has persisted across 5.19.0, 5.19.1, and 5.19.2. While the kernel I am taking this example from is tainted (owing to using Intel development drivers for GPU virtualization), it is also occurring on non-tainted kernels on servers with no development or third-party modules installed.

INFO: task CPU 2/KVM:2343 blocked for more than 1228 seconds.
[207177.050049] Tainted: G U I 5.19.2 #1
[207177.050050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[207177.050051] task:CPU 2/KVM state:D stack: 0 pid: 2343 ppid: 1 flags:0x00000002
[207177.050054] Call Trace:
[207177.050055] <TASK>
[207177.050056] __schedule+0x359/0x1400
[207177.050060] ? kvm_mmu_page_fault+0x1ee/0x980
[207177.050062] ? kvm_set_msr_common+0x31f/0x1060
[207177.050065] schedule+0x5f/0x100
[207177.050066] schedule_preempt_disabled+0x15/0x30
[207177.050068] __mutex_lock.constprop.0+0x4e2/0x750
[207177.050070] ? aa_file_perm+0x124/0x4f0
[207177.050071] __mutex_lock_slowpath+0x13/0x20
[207177.050072] mutex_lock+0x25/0x30
[207177.050075] intel_vgpu_emulate_mmio_read+0x5d/0x3b0 [kvmgt]
[207177.050084] intel_vgpu_rw+0xb8/0x1c0 [kvmgt]
[207177.050091] intel_vgpu_read+0x20d/0x250 [kvmgt]
[207177.050097] vfio_device_fops_read+0x1f/0x40
[207177.050100] vfs_read+0x9b/0x160
[207177.050102] __x64_sys_pread64+0x93/0xd0
[207177.050104] do_syscall_64+0x58/0x80
[207177.050106] ? kvm_on_user_return+0x84/0xe0
[207177.050107] ? fire_user_return_notifiers+0x37/0x70
[207177.050109] ? exit_to_user_mode_prepare+0x41/0x200
[207177.050111] ? syscall_exit_to_user_mode+0x1b/0x40
[207177.050112] ? do_syscall_64+0x67/0x80
[207177.050114] ? irqentry_exit+0x54/0x70
[207177.050115] ? sysvec_call_function_single+0x4b/0xa0
[207177.050116] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[207177.050118] RIP: 0033:0x7ff51131293f
[207177.050119] RSP: 002b:00007ff4ddffa260 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[207177.050121] RAX: ffffffffffffffda RBX: 00005599a6835420 RCX: 00007ff51131293f
[207177.050122] RDX: 0000000000000004 RSI: 00007ff4ddffa2a8 RDI: 0000000000000027
[207177.050123] RBP: 0000000000000004 R08: 0000000000000000 R09: 00000000ffffffff
[207177.050124] R10: 0000000000065f10 R11: 0000000000000293 R12: 0000000000065f10
[207177.050124] R13: 00005599a6835330 R14: 0000000000000004 R15: 0000000000065f10
[207177.050126] </TASK>

I am seeing this on Intel i7-6700k, i7-6850k, and i7-9700k platforms. This did not happen on 5.17 kernels, and 5.18 kernels never ran stably enough on my platforms to run them for more than a few minutes. Likewise, 6.0-rc1 has not been stable enough to run in production. After less than three hours running on my workstation, it locked hard, with even the magic sys-request key being unresponsive, and only power cycling the machine got it back.

The operating system in use for the host on all machines is Ubuntu 22.04. Guests vary, with Ubuntu 22.04 being the most common, but Mint, Debian, Manjaro, CentOS, Fedora, ScientificLinux, Zorin, and Windows are also in use. I see the same issue manifest on platforms running only Ubuntu guests as on those with guests of varying operating systems. The configuration file I used to compile this kernel is attached. I compiled it with gcc 12.1.0.

This behavior does not manifest itself instantly; typically the machine needs to be running 3-7 days before it does. Once it does, guests keep stalling, and restarting libvirtd does not help. The only thing that seems to help is a hard reboot of the physical host. For this reason I believe the issue lies strictly with the host and not the guests. I have listed it as a severity of high since it is completely service interrupting.
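Because the stall only shows up after days of uptime, it can help to scan saved kernel log text for the hung-task message shown in the report. The following is a minimal sketch in Python, assuming the log text has already been captured into a string; the `HungTask` type and `find_hung_tasks` function are hypothetical names for illustration, and the pattern is based only on the "blocked for more than" line quoted above.

```python
import re
from typing import List, NamedTuple


class HungTask(NamedTuple):
    """One hung-task report extracted from kernel log text."""
    comm: str      # task name, e.g. "CPU 2/KVM"
    pid: int
    seconds: int   # how long the kernel says the task has been blocked


# Matches lines such as:
#   INFO: task CPU 2/KVM:2343 blocked for more than 1228 seconds.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text: str) -> List[HungTask]:
    """Return every hung-task report found in the given log text."""
    return [
        HungTask(m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


# Sample line taken from the report above.
sample = "INFO: task CPU 2/KVM:2343 blocked for more than 1228 seconds."
for task in find_hung_tasks(sample):
    print(f"{task.comm} (pid {task.pid}) blocked for {task.seconds}s")
```

A script like this could be run periodically against captured log output to flag a host that has started stalling before guests become unusable.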
I had to reboot igloo, one of our physical hosts, which hosts the mail server among others. It was necessary because, after making a backup of mail, it would not complete a boot, with the CPU tasks hanging. Something was corrupted with the KVM hypervisor on the physical host. I did discover that it was missing some of the Linux firmware and installed what was missing. I kind of doubt this was the issue, but you never know.
Meanwhile, 5.19.2 has not tested well on my server, so at this point kernel upgrades next Friday are off UNLESS 5.19.3 comes out between now and then.
I am going to take the mail server down again at around 11pm for about 20 minutes to make a current backup as the existing backup is approximately 300 updates behind and half of those are security related.
The client mail server was returned to service at approximately 8:10PM; however, there will be a brief interruption in approximately 15 minutes once updates are applied to bring it back current from its backed-up state and a new kernel is installed (it is running on a kernel from May).
The client mail server crashed this evening at approximately 7:25 PM Pacific Daylight Time. It will not boot; something has corrupted the initramfs file system, and I am working on repairing it. Estimated downtime is 30-90 minutes.