Created attachment 301614
The configuration file used to compile this kernel.
This behavior has persisted across 5.19.0, 5.19.1, and 5.19.2. While the kernel I am taking this example from is tainted (owing to the Intel development drivers used for GPU virtualization), it also occurs on non-tainted kernels on servers with no development or third-party modules installed.
INFO: task CPU 2/KVM:2343 blocked for more than 1228 seconds.
[207177.050049] Tainted: G U I 5.19.2 #1
[207177.050050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[207177.050051] task:CPU 2/KVM state:D stack: 0 pid: 2343 ppid: 1 flags:0x00000002
[207177.050054] Call Trace:
[207177.050055] <TASK>
[207177.050056] __schedule+0x359/0x1400
[207177.050060] ? kvm_mmu_page_fault+0x1ee/0x980
[207177.050062] ? kvm_set_msr_common+0x31f/0x1060
[207177.050065] schedule+0x5f/0x100
[207177.050066] schedule_preempt_disabled+0x15/0x30
[207177.050068] __mutex_lock.constprop.0+0x4e2/0x750
[207177.050070] ? aa_file_perm+0x124/0x4f0
[207177.050071] __mutex_lock_slowpath+0x13/0x20
[207177.050072] mutex_lock+0x25/0x30
[207177.050075] intel_vgpu_emulate_mmio_read+0x5d/0x3b0 [kvmgt]
[207177.050084] intel_vgpu_rw+0xb8/0x1c0 [kvmgt]
[207177.050091] intel_vgpu_read+0x20d/0x250 [kvmgt]
[207177.050097] vfio_device_fops_read+0x1f/0x40
[207177.050100] vfs_read+0x9b/0x160
[207177.050102] __x64_sys_pread64+0x93/0xd0
[207177.050104] do_syscall_64+0x58/0x80
[207177.050106] ? kvm_on_user_return+0x84/0xe0
[207177.050107] ? fire_user_return_notifiers+0x37/0x70
[207177.050109] ? exit_to_user_mode_prepare+0x41/0x200
[207177.050111] ? syscall_exit_to_user_mode+0x1b/0x40
[207177.050112] ? do_syscall_64+0x67/0x80
[207177.050114] ? irqentry_exit+0x54/0x70
[207177.050115] ? sysvec_call_function_single+0x4b/0xa0
[207177.050116] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[207177.050118] RIP: 0033:0x7ff51131293f
[207177.050119] RSP: 002b:00007ff4ddffa260 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
[207177.050121] RAX: ffffffffffffffda RBX: 00005599a6835420 RCX: 00007ff51131293f
[207177.050122] RDX: 0000000000000004 RSI: 00007ff4ddffa2a8 RDI: 0000000000000027
[207177.050123] RBP: 0000000000000004 R08: 0000000000000000 R09: 00000000ffffffff
[207177.050124] R10: 0000000000065f10 R11: 0000000000000293 R12: 0000000000065f10
[207177.050124] R13: 00005599a6835330 R14: 0000000000000004 R15: 0000000000065f10
[207177.050126] </TASK>
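For anyone triaging similar reports, the trace above can be summarized mechanically. The following is a minimal sketch, assuming Python 3 and the log layout shown above; `parse_hung_task` and `first_module_frame` are hypothetical helper names, not part of any kernel tooling. It extracts the blocked task, how long it was blocked, and the first reliable frame that belongs to a loadable module (frames prefixed with `?` are treated as unreliable guesses by the unwinder). The sample is a shortened excerpt of the trace in this report.

```python
import re

# Header line of a hung-task report, e.g.
# "INFO: task CPU 2/KVM:2343 blocked for more than 1228 seconds."
HEADER_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One stack frame, e.g. "mutex_lock+0x25/0x30" or "? aa_file_perm+0x124/0x4f0",
# optionally followed by the owning module in brackets, e.g. "[kvmgt]".
FRAME_RE = re.compile(
    r"(?P<guess>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)


def parse_hung_task(report):
    """Return (header, frames) for the first hung-task report in the text."""
    header = None
    frames = []
    in_trace = False
    for line in report.splitlines():
        if header is None:
            match = HEADER_RE.search(line)
            if match:
                header = {
                    "comm": match["comm"],
                    "pid": int(match["pid"]),
                    "blocked_secs": int(match["secs"]),
                }
            continue
        if "</TASK>" in line:
            break
        if "<TASK>" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME_RE.search(line)
            if match:
                frames.append({
                    "func": match["func"],
                    "module": match["module"],  # None for built-in code
                    "reliable": match["guess"] is None,
                })
    return header, frames


def first_module_frame(frames):
    """Return the first reliable frame that belongs to a module, or None."""
    for frame in frames:
        if frame["reliable"] and frame["module"]:
            return frame
    return None


# Shortened excerpt of the trace in this report.
SAMPLE = """\
INFO: task CPU 2/KVM:2343 blocked for more than 1228 seconds.
[207177.050054] Call Trace:
[207177.050055] <TASK>
[207177.050056] __schedule+0x359/0x1400
[207177.050060] ? kvm_mmu_page_fault+0x1ee/0x980
[207177.050072] mutex_lock+0x25/0x30
[207177.050075] intel_vgpu_emulate_mmio_read+0x5d/0x3b0 [kvmgt]
[207177.050084] intel_vgpu_rw+0xb8/0x1c0 [kvmgt]
[207177.050126] </TASK>
"""

header, frames = parse_hung_task(SAMPLE)
waiter = first_module_frame(frames)
print(header)
print(waiter["func"], waiter["module"])
```

On this excerpt the first module frame is the one that called `mutex_lock`, which matches a manual reading of the trace: the vCPU thread is waiting on a mutex taken inside the kvmgt MMIO read path.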
I am seeing this on Intel i7-6700k, i7-6850k, and i7-9700k platforms.
This did not happen on 5.17 kernels, and 5.18 kernels were never stable enough on my platforms to run for more than a few minutes.
Likewise, 6.0-rc1 has not been stable enough to run in production. After less than three hours on my workstation it locked up hard; even the magic SysRq key was unresponsive, and only power cycling the machine brought it back.
The operating system in use for the host on all machines is Ubuntu 22.04.
Guests vary, with Ubuntu 22.04 being the most common, but Mint, Debian, Manjaro, CentOS, Fedora, Scientific Linux, Zorin, and Windows are also in use.
The issue manifests on hosts running only Ubuntu guests just as it does on hosts running guests of varying operating systems.
The configuration file I used to compile this kernel is attached. I compiled it with gcc 12.1.0.
This behavior does not manifest itself instantly; typically the machine needs to be running for 3-7 days before it does. Once it does, guests keep stalling, and restarting libvirtd does not help. The only thing that seems to help is a hard reboot of the physical host. For this reason I believe the issue lies strictly with the host and not the guests.
I have listed it as a severity of high since it is completely service interrupting.
Mail – Unannounced Boot of Some Machines
I had to reboot igloo, one of our physical hosts, which hosts the mail server among others. This was necessary because, after I made a backup of mail, the mail server would not complete a boot; its CPU tasks kept hanging. Something was corrupted with the KVM hypervisor on the physical host. I also discovered that some of the Linux firmware was missing and installed it. I kind of doubt this was the issue, but you never know.
Meanwhile, 5.19.2 has not tested well on my server, so at this point the kernel upgrades next Friday are off UNLESS 5.19.3 comes out between now and then.
Mail Server Maintenance
I am going to take the mail server down again at around 11pm for about 20 minutes to make a current backup, as the existing backup is approximately 300 updates behind and half of those are security-related.
Mail Server Returned To Service
The client mail server was returned to service at approximately 8:10PM; however, there will be a brief interruption in approximately 15 minutes once updates are applied to bring it current from its backed-up state and a new kernel is installed (it is running on a kernel from May).
Mail Server
The client mail server crashed this evening at approximately 7:25 PM Pacific Daylight Time. It will not boot; something has corrupted the initramfs file system, and I am working on repairing it. Estimated downtime is 30-90 minutes.
Kernel Upgrade Date Correction – August 26th 11PM Friday
It was brought to my attention that I erred in my announcement of the changed date of the kernel upgrade. I posted that it would be Friday, August 24th, but the correct date is Friday, August 26th. A customer felt this was ambiguous, since the 24th falls on a Wednesday, not a Friday, and asked me to clarify.
For future reference, kernel upgrades and most major service-interrupting maintenance take place on Fridays, usually starting at 11PM. The reason for this is to give the maximum amount of time before the business week to recover in the event of severe problems with the procedure in question.
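The Friday 11PM rule is easy to check mechanically, which would have caught the date mix-up above. This is a minimal sketch, assuming Python 3 and naive local times (no timezone handling); the function name `next_maintenance_window` is hypothetical, and the year 2022 in the example is an assumption, since the announcements do not state a year.

```python
from datetime import datetime, timedelta

FRIDAY = 4        # datetime.weekday(): Monday is 0, so Friday is 4
WINDOW_HOUR = 23  # maintenance usually starts at 11PM


def next_maintenance_window(now):
    """Return the next Friday 11PM start time strictly after `now`."""
    days_ahead = (FRIDAY - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=WINDOW_HOUR, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # It is already Friday at or past 11PM, so use next week's window.
        candidate += timedelta(days=7)
    return candidate


# Assumed year: 2022. The 24th is a Wednesday, so the window is the 26th.
announced = datetime(2022, 8, 24, 12, 0)
print(announced.strftime("%A"))
print(next_maintenance_window(announced))
```

Deriving the date from the rule, rather than typing it by hand, keeps the weekday and the day of the month from disagreeing in an announcement.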
Kernel Upgrade Postponed
A kernel upgrade for all of eskimo.com’s services scheduled for August 19th at 11pm PDT has been postponed until August 24th because kernel 5.19.2 came out this Wednesday, after I had already built and partially deployed 5.19.1. 5.19.1 contained only three very small fixes and so did not require extensive testing, but 5.19.2 contains many more changes, and far more significant ones, so there is a much greater potential for instability to be introduced. One day is really too tight to deploy to all the machines, so I am going to shoot for next Friday instead, to give the new kernel time to soak and to deploy it to all the machines. 5.19.0 has both performed extremely well and been stable, so there is no real incentive to get this upgrade pushed out quickly.
Nextcloud Repaired
Nextcloud (https://nextcloud.eskimo.com) is repaired and up to the current stable release of 24.0.2.4.
There were three apps that were broken but not listed as incompatible.
Nextcloud Is Borked
Nextcloud is down. Something went horribly wrong during an update last night. I tried to roll back to the previous version and it’s not working either. It appears to be an issue with apps, but I haven’t been able to isolate which ones yet, and this is a painful and slow process.
Kernel Upgrades Friday August 19th 11PM Pacific Daylight Time
I am planning a kernel upgrade this Friday, August 19th, at 11pm Pacific Daylight Time (GMT-0700).
This will affect all Eskimo North services, shell servers, e-mail, web hosting, other hosting, https://friendica.eskimo.com/, https://hubzilla.eskimo.com/, https://nextcloud.eskimo.com, https://yacy.eskimo.com/, and https://www.eskimo.com/.
I do not expect the downtime for any one service to exceed ten minutes, and the whole process should be completed by approximately 11:30PM. I expect this to go reasonably smoothly, as 5.19.0 was very smooth and 5.19.1 contained only three very small fixes: one where a function was missing a return, so that on some conditional paths it could return random results, and two bounds checks in the QEMU-KVM system that would not come into play unless something else went wrong.