aboutsummaryrefslogtreecommitdiffstats
path: root/virt/kvm/kvm_main.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-12-08KVM: Use "new" memslot's address space ID instead of dedicated paramSean Christopherson1-13/+9
Now that the address space ID is stored in every slot, including fake slots used for deletion, use the slot's as_id instead of passing in the redundant information as a param to kvm_set_memslot(). This will greatly simplify future memslot work by avoiding passing a large number of variables around purely to honor @as_id. Drop a comment in the DELETE path about new->as_id being provided purely for debug, as that's now a lie. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <03189577be214ab8530a4b3a3ee3ed1c2f9e5815.1638817639.git.maciej.szmigiero@oracle.com>
2021-12-08KVM: Resync only arch fields when slots_arch_lock gets reacquiredMaciej S. Szmigiero1-20/+25
There is no need to copy the whole memslot data after releasing slots_arch_lock for a moment to install temporary memslots copy in kvm_set_memslot() since this lock only protects the arch field of each memslot. Just resync this particular field after reacquiring slots_arch_lock. Note, this also eliminates the need to manually clear the INVALID flag when restoring memslots; the "setting" of the INVALID flag was an unwanted side effect of copying the entire memslots. Since kvm_copy_memslots() has just one caller remaining now open-code it instead. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> [sean: tweak shortlog, note INVALID flag in changelog, revert comment] Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <b63035d114707792e9042f074478337f770dff6a.1638817638.git.maciej.szmigiero@oracle.com>
2021-12-08KVM: Open code kvm_delete_memslot() into its only callerSean Christopherson1-25/+17
Fold kvm_delete_memslot() into __kvm_set_memory_region() to free up the "kvm_delete_memslot()" name for use in a future helper. The delete logic isn't so complex/long that it truly needs a helper, and it will be simplified a wee bit further in upcoming commits. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <2887631c31a82947faa488ab72f55f8c68b7c194.1638817638.git.maciej.szmigiero@oracle.com>
2021-12-08KVM: Require total number of memslot pages to fit in an unsigned longSean Christopherson1-0/+19
Explicitly disallow creating more memslot pages than can fit in an unsigned long, KVM doesn't correctly handle a total number of memslot pages that doesn't fit in an unsigned long and remedying that would be a waste of time. For a 64-bit kernel, this is a nop as memslots are not allowed to overlap in the gfn address space. With a 32-bit kernel, userspace can at most address 3gb of virtual memory, whereas wrapping the total number of pages would require 4tb+ of guest physical memory. Even with x86's second address space for SMM, userspace would need to alias all of guest memory more than one _thousand_ times. And on older x86 hardware with MAXPHYADDR < 43, the guest couldn't actually access any of those aliases even if userspace lied about guest.MAXPHYADDR. On 390 and arm64, this is a nop as they don't support 32-bit hosts. On x86, practically speaking this is simply acknowledging reality as the existing kvm_mmu_calculate_default_mmu_pages() assumes the total number of pages fits in an "unsigned long". On PPC, this is likely a nop as every flavor of PPC KVM assumes gfns (and gpas!) fit in unsigned long. arch/powerpc/kvm/book3s_32_mmu_host.c goes a step further and fails the build if CONFIG_PTE_64BIT=y, which presumably means that it does't support 64-bit physical addresses. On MIPS, this is also likely a nop as the core MMU helpers assume gpas fit in unsigned long, e.g. see kvm_mips_##name##_pte. And finally, RISC-V is a "don't care" as it doesn't exist in any release, i.e. there is no established ABI to break. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <1c2c91baf8e78acccd4dad38da591002e61c013c.1638817638.git.maciej.szmigiero@oracle.com>
2021-12-08KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s indexMarc Zyngier1-6/+7
Everywhere we use kvm_for_each_vpcu(), we use an int as the vcpu index. Unfortunately, we're about to move rework the iterator, which requires this to be upgrade to an unsigned long. Let's bite the bullet and repaint all of it in one go. Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-7-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: Convert the kvm->vcpus array to a xarrayMarc Zyngier1-6/+9
At least on arm64 and x86, the vcpus array is pretty huge (up to 1024 entries on x86) and is mostly empty in the majority of the cases (running 1k vcpu VMs is not that common). This mean that we end-up with a 4kB block of unused memory in the middle of the kvm structure. Instead of wasting away this memory, let's use an xarray instead, which gives us almost the same flexibility as a normal array, but with a reduced memory usage with smaller VMs. Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-6-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08KVM: Move wiping of the kvm->vcpus array to common codeMarc Zyngier1-2/+15
All architectures have similar loops iterating over the vcpus, freeing one vcpu at a time, and eventually wiping the reference off the vcpus array. They are also inconsistently taking the kvm->lock mutex when wiping the references from the array. Make this code common, which will simplify further changes. The locking is dropped altogether, as this should only be called when there is no further references on the kvm structure. Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Message-Id: <20211116160403.4074052-2-maz@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26KVM: downgrade two BUG_ONs to WARN_ON_ONCEPaolo Bonzini1-2/+4
This is not an unrecoverable situation. Users of kvm_read_guest_offset_cached and kvm_write_guest_offset_cached must expect the read/write to fail, and therefore it is possible to just return early with an error value. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18KVM: Disallow user memslot with size that exceeds "unsigned long"Sean Christopherson1-1/+2
Reject userspace memslots whose size exceeds the storage capacity of an "unsigned long". KVM's uAPI takes the size as u64 to support large slots on 64-bit hosts, but does not account for the size being truncated on 32-bit hosts in various flows. The access_ok() check on the userspace virtual address in particular casts the size to "unsigned long" and will check the wrong number of bytes. KVM doesn't actually support slots whose size doesn't fit in an "unsigned long", e.g. KVM's internal kvm_memory_slot.npages is an "unsigned long", not a "u64", and misc arch specific code follows that behavior. Fixes: fa3d315a4ce2 ("KVM: Validate userspace_addr of memslot when registered") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <20211104002531.1176691-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18KVM: Ensure local memslot copies operate on up-to-date arch-specific dataSean Christopherson1-16/+31
When modifying memslots, snapshot the "old" memslot and copy it to the "new" memslot's arch data after (re)acquiring slots_arch_lock. x86 can change a memslot's arch data while memslot updates are in-progress so long as it holds slots_arch_lock, thus snapshotting a memslot without holding the lock can result in the consumption of stale data. Fixes: b10a038e84d1 ("KVM: mmu: Add slots_arch_lock for memslot arch fields") Cc: stable@vger.kernel.org Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211104002531.1176691-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18Merge branch 'kvm-5.16-fixes' into kvm-masterPaolo Bonzini1-89/+11
* Fixes for Xen emulation * Kill kvm_map_gfn() / kvm_unmap_gfn() and broken gfn_to_pfn_cache * Fixes for migration of 32-bit nested guests on 64-bit hypervisor * Compilation fixes * More SEV cleanups
2021-11-18KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cacheDavid Woodhouse1-89/+11
In commit 7e2175ebd695 ("KVM: x86: Fix recording of guest steal time / preempted status") I removed the only user of these functions because it was basically impossible to use them safely. There are two stages to the GFN->PFN mapping; first through the KVM memslots to a userspace HVA and then through the page tables to translate that HVA to an underlying PFN. Invalidations of the former were being handled correctly, but no attempt was made to use the MMU notifiers to invalidate the cache when the HVA->GFN mapping changed. As a prelude to reinventing the gfn_to_pfn_cache with more usable semantics, rip it out entirely and untangle the implementation of the unsafe kvm_vcpu_map()/kvm_vcpu_unmap() functions from it. All current users of kvm_vcpu_map() also look broken right now, and will be dealt with separately. They broadly fall into two classes: * Those which map, access the data and immediately unmap. This is mostly gratuitous and could just as well use the existing user HVA, and could probably benefit from a gfn_to_hva_cache as they do so. * Those which keep the mapping around for a longer time, perhaps even using the PFN directly from the guest. These will need to be converted to the new gfn_to_pfn_cache and then kvm_vcpu_map() can be removed too. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211115165030.7422-8-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-17KVM: Move x86's perf guest info callbacks to generic KVMSean Christopherson1-0/+44
Move x86's perf guest callbacks into common KVM, as they are semantically identical to arm64's callbacks (the only other such KVM callbacks). arm64 will convert to the common versions in a future patch. Implement the necessary arm64 arch hooks now to avoid having to provide stubs or a temporary #define (from x86) to avoid arm64 compilation errors when CONFIG_GUEST_PERF_EVENTS=y. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211111020738.2512932-13-seanjc@google.com
2021-11-11KVM: generalize "bugged" VM to "dead" VMPaolo Bonzini1-5/+5
Generalize KVM_REQ_VM_BUGGED so that it can be called even in cases where it is by design that the VM cannot be operated upon. In this case any KVM_BUG_ON should still warn, so introduce a new flag kvm->vm_dead that is separate from kvm->vm_bugged. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDSJuergen Gross1-1/+1
KVM_MAX_VCPU_ID is not specifying the highest allowed vcpu-id, but the number of allowed vcpu-ids. This has already led to confusion, so rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics more clear Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210913135745.13944-3-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30KVM: Make kvm_make_vcpus_request_mask() use pre-allocated cpu_kick_maskVitaly Kuznetsov1-20/+9
kvm_make_vcpus_request_mask() already disables preemption so just like kvm_make_all_cpus_request_except() it can be switched to using pre-allocated per-cpu cpumasks. This allows for improvements for both users of the function: in Hyper-V emulation code 'tlb_flush' can now be dropped from 'struct kvm_vcpu_hv' and kvm_make_scan_ioapic_request_mask() gets rid of dynamic allocation. cpumask_available() checks in kvm_make_vcpu_request() and kvm_kick_many_cpus() can now be dropped as they checks for an impossible condition: kvm_init() makes sure per-cpu masks are allocated. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-9-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()Vitaly Kuznetsov1-6/+23
Allocating cpumask dynamically in zalloc_cpumask_var() is not ideal. Allocation is somewhat slow and can (in theory and when CPUMASK_OFFSTACK) fail. kvm_make_all_cpus_request_except() already disables preemption so we can use pre-allocated per-cpu cpumasks instead. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30KVM: Drop 'except' parameter from kvm_make_vcpus_request_mask()Vitaly Kuznetsov1-2/+1
Both remaining callers of kvm_make_vcpus_request_mask() pass 'NULL' for 'except' parameter so it can just be dropped. No functional change intended ©. Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30KVM: Optimize kvm_make_vcpus_request_mask() a bitVitaly Kuznetsov1-35/+53
Iterating over set bits in 'vcpu_bitmap' should be faster than going through all vCPUs, especially when just a few bits are set. Drop kvm_make_vcpus_request_mask() call from kvm_make_all_cpus_request_except() to avoid handling the special case when 'vcpu_bitmap' is NULL, move the code to kvm_make_all_cpus_request_except() itself. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-30KVM: use vma_pages() helperYang Li1-1/+1
Use vma_pages function on vma object instead of explicit computation. Fix the following coccicheck warning: ./virt/kvm/kvm_main.c:3526:29-35: WARNING: Consider using vma_pages helper on vma Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Message-Id: <1632900526-119643-1-git-send-email-yang.lee@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-23KVM: Remove tlbs_dirtyLai Jiangshan1-8/+2
There is no user of tlbs_dirty. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210918005636.3675-4-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22KVM: KVM: Use cpumask_available() to check for NULL cpumask when kicking vCPUsSean Christopherson1-3/+15
Check for a NULL cpumask_var_t when kicking multiple vCPUs via cpumask_available(), which performs a !NULL check if and only if cpumasks are configured to be allocated off-stack. This is a meaningless optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more importantly helps document that the NULL check is necessary even though all callers pass in a local variable. No functional change intended. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210827092516.1027264-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22KVM: Clean up benign vcpu->cpu data races when kicking vCPUsSean Christopherson1-8/+28
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu is read exactly once, and by ensuring the vCPU is booted from guest mode if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if kvm_request_needs_ipi() returns true. Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or kvm_request_needs_ipi()) means the target vCPU could get migrated (change vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpud and reading vcpu->mode. If that happens, the kick/IPI will be sent to the old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs. Although failing to kick the vCPU is not exactly ideal, practically speaking it cannot cause a functional issue unless there is also a bug in the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s behavior. The purpose of sending an IPI is purely to get a vCPU into the host (or out of reading SPTEs) so that the vCPU can recognize a change in state, e.g. a KVM_REQ_* request. If vCPU's handling of the state change is required for correctness, KVM must ensure either the vCPU sees the change before entering the guest, or that the sender sees the vCPU as running in guest mode. All architectures handle this by (a) sending the request before calling kvm_vcpu_kick() and (b) checking for requests _after_ setting vcpu->mode. x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to ensure it kicks and waits for vCPUs that started reading SPTEs _before_ MMU changes were finalized, but any vCPU that starts reading after MMU changes were finalized will see the new state and can continue on uninterrupted. For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g. x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied upon for functional correctness, e.g. in the dirty log case, userspace cannot assume it has a 100% complete log if vCPUs are still running. All that said, eliminate the benign race since the cost of doing so is an "extra" atomic cmpxchg() in the case where the target vCPU is loaded by the current pCPU or is not loaded at all. I.e. the kick will be skipped due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as opposed to the kick being skipped because of the cpu checks. Keep the "cpu != me" checks even though they appear useless/impossible at first glance. x86 processes guest IPI writes in a fast path that runs in IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or READING_SHADOW_PAGE_TABLES is perfectly reasonable. Note, a race with the cpu_online() check in kvm_vcpu_kick() likely persists, e.g. the vCPU could exit guest mode and get offlined between the cpu_online() check and the sending of smp_send_reschedule(). But, the online check appears to exist only to avoid a WARN in x86's native_smp_send_reschedule() that fires if the target CPU is not online. The reschedule WARN exists because CPU offlining takes the CPU out of the scheduling pool, i.e. the WARN is intended to detect the case where the kernel attempts to schedule a task on an offline CPU. The actual sending of the IPI is a non-issue as at worst it will simpy be dropped on the floor. In other words, KVM's usurping of the reschedule IPI could theoretically trigger a WARN if the stars align, but there will be no loss of functionality. [*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a Cc: Venkatesh Srinivas <venkateshs@google.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel") Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210827092516.1027264-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-22KVM: do not shrink halt_poll_ns below grow_startSergey Senozhatsky1-1/+5
grow_halt_poll_ns() ignores values between 0 and halt_poll_ns_grow_start (10000 by default). However, when we shrink halt_poll_ns we may fall way below halt_poll_ns_grow_start and endup with halt_poll_ns values that don't make a lot of sense: like 1 or 9, or 19. VCPU1 trace (halt_poll_ns_shrink equals 2): VCPU1 grow 10000 VCPU1 shrink 5000 VCPU1 shrink 2500 VCPU1 shrink 1250 VCPU1 shrink 625 VCPU1 shrink 312 VCPU1 shrink 156 VCPU1 shrink 78 VCPU1 shrink 39 VCPU1 shrink 19 VCPU1 shrink 9 VCPU1 shrink 4 Mirror what grow_halt_poll_ns() does and set halt_poll_ns to 0 as soon as new shrink-ed halt_poll_ns value falls below halt_poll_ns_grow_start. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210902031100.252080-1-senozhatsky@chromium.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06Merge tag 'kvmarm-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEADPaolo Bonzini1-18/+1
KVM/arm64 updates for 5.15 - Page ownership tracking between host EL1 and EL2 - Rely on userspace page tables to create large stage-2 mappings - Fix incompatibility between pKVM and kmemleak - Fix the PMU reset state, and improve the performance of the virtual PMU - Move over to the generic KVM entry code - Address PSCI reset issues w.r.t. save/restore - Preliminary rework for the upcoming pKVM fixed feature - A bunch of MM cleanups - a vGIC fix for timer spurious interrupts - Various cleanups
2021-09-06KVM: stats: Add VM stat for remote tlb flush requestsJing Zhang1-0/+1
Add a new stat that counts the number of times a remote TLB flush is requested, regardless of whether it kicks vCPUs out of guest mode. This allows us to look at how often flushes are initiated. Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based TLB flush implementation, so apply it there too. Original-by: David Matlack <dmatlack@google.com> Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210817002639.3856694-1-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-09-06KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()Sean Christopherson1-3/+0
Don't export KVM's MMU notifier count helpers, under no circumstance should any downstream module, including x86's vendor code, have a legitimate reason to piggyback KVM's MMU notifier logic. E.g in the x86 case, only KVM's MMU should be elevating the notifier count, and that code is always built into the core kvm.ko module. Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range") Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210902175951.1387989-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: stats: Add halt polling related histogram statsJing Zhang1-0/+12
Add three log histogram stats to record the distribution of time spent on successful polling, failed polling and VCPU wait. halt_poll_success_hist: Distribution of spent time for a successful poll. halt_poll_fail_hist: Distribution of spent time for a failed poll. halt_wait_hist: Distribution of time a VCPU has spent on waiting. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-6-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: stats: Add halt_wait_ns stats for all architecturesJing Zhang1-0/+4
Add simple stats halt_wait_ns to record the time a VCPU has spent on waiting for all architectures (not just powerpc). Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-5-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-20KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_rangeMaxim Levitsky1-2/+5
This together with previous patch, ensures that kvm_zap_gfn_range doesn't race with page fault running on another vcpu, and will make this page fault code retry instead. This is based on a patch suggested by Sean Christopherson: https://lkml.org/lkml/2021/7/22/1025 Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210810205251.424103-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-13KVM: Allow to have arch-specific per-vm debugfs filesPeter Xu1-1/+19
Allow archs to create arch-specific nodes under kvm->debugfs_dentry directory besides the stats fields. The new interface kvm_arch_create_vm_debugfs() is defined but not yet used. It's called after kvm->debugfs_dentry is created, so it can be referenced directly in kvm_arch_create_vm_debugfs(). Arch should define their own versions when they want to create extra debugfs nodes. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220455.26054-2-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-10Merge branch 'kvm-vmx-secctl' into HEADPaolo Bonzini1-2/+16
Merge common topic branch for 5.14-rc6 and 5.15 merge window.
2021-08-06KVM: Cache the last used slot index per vCPUDavid Matlack1-1/+21
The memslot for a given gfn is looked up multiple times during page fault handling. Avoid binary searching for it multiple times by caching the most recently used slot. There is an existing VM-wide last_used_slot but that does not work well for cases where vCPUs are accessing memory in different slots (see performance data below). Another benefit of caching the most recently use slot (versus looking up the slot once and passing around a pointer) is speeding up memslot lookups *across* faults and during spte prefetching. To measure the performance of this change I ran dirty_log_perf_test with 64 vCPUs and 64 memslots and measured "Populate memory time" and "Iteration 2 dirty memory time". Tests were ran with eptad=N to force dirty logging to use fast_page_fault so its performance could be measured. Config | Metric | Before | After ---------- | ----------------------------- | ------ | ------ tdp_mmu=Y | Populate memory time | 6.76s | 5.47s tdp_mmu=Y | Iteration 2 dirty memory time | 2.83s | 0.31s tdp_mmu=N | Populate memory time | 20.4s | 18.7s tdp_mmu=N | Iteration 2 dirty memory time | 2.65s | 0.30s The "Iteration 2 dirty memory time" results are especially compelling because they are equivalent to running the same test with a single memslot. In other words, fast_page_fault performance no longer scales with the number of memslots. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-06KVM: Rename lru_slot to last_used_slotDavid Matlack1-2/+2
lru_slot is used to keep track of the index of the most-recently used memslot. The correct acronym would be "mru" but that is not a common acronym. So call it last_used_slot which is a bit more obvious. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210804222844.1419481-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-04KVM: Do not leak memory for duplicate debugfs directoriesPaolo Bonzini1-2/+16
KVM creates a debugfs directory for each VM in order to store statistics about the virtual machine. The directory name is built from the process pid and a VM fd. While generally unique, it is possible to keep a file descriptor alive in a way that causes duplicate directories, which manifests as these messages: [ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present! Even though this should not happen in practice, it is more or less expected in the case of KVM for testcases that call KVM_CREATE_VM and close the resulting file descriptor repeatedly and in parallel. When this happens, debugfs_create_dir() returns an error but kvm_create_vm_debugfs() goes on to allocate stat data structs which are later leaked. The slow memory leak was spotted by syzkaller, where it caused OOM reports. Since the issue only affects debugfs, do a lookup before calling debugfs_create_dir, so that the message is downgraded and rate-limited. While at it, ensure kvm->debugfs_dentry is NULL rather than an error if it is not created. This fixes kvm_destroy_vm_debugfs, which was not checking IS_ERR_OR_NULL correctly. Cc: stable@vger.kernel.org Fixes: 536a6f88c49d ("KVM: Create debugfs dir and stat files for each VM") Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03KVM: Don't take mmu_lock for range invalidation unless necessaryPaolo Bonzini1-13/+12
Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications that are unrelated to KVM. This is possible now that memslot updates are blocked from range_start() to range_end(); that ensures that lock elision happens in both or none, and therefore that mmu_notifier_count updates (which must occur while holding mmu_lock for write) are always paired across start->end. Based on patches originally written by Ben Gardon. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-03KVM: Block memslot updates across range_start() and range_end()Paolo Bonzini1-4/+54
We would like to avoid taking mmu_lock for .invalidate_range_{start,end}() notifications that are unrelated to KVM. Because mmu_notifier_count must be modified while holding mmu_lock for write, and must always be paired across start->end to stay balanced, lock elision must happen in both or none. Therefore, in preparation for this change, this patch prevents memslot updates across range_start() and range_end(). Note, technically flag-only memslot updates could be allowed in parallel, but stalling a memslot update for a relatively short amount of time is not a scalability issue, and this is all more than complex enough. A long note on the locking: a previous version of the patch used an rwsem to block the memslot update while the MMU notifier run, but this resulted in the following deadlock involving the pseudo-lock tagged as "mmu_notifier_invalidate_range_start". ====================================================== WARNING: possible circular locking dependency detected 5.12.0-rc3+ #6 Tainted: G OE ------------------------------------------------------ qemu-system-x86/3069 is trying to acquire lock: ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190 but task is already holding lock: ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm] which lock already depends on the new lock. This corresponds to the following MMU notifier logic: invalidate_range_start take pseudo lock down_read() (*) release pseudo lock invalidate_range_end take pseudo lock (**) up_read() release pseudo lock At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock; at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock. This could cause a deadlock (ignoring for a second that the pseudo lock is not a lock): - invalidate_range_start waits on down_read(), because the rwsem is held by install_new_memslots - install_new_memslots waits on down_write(), because the rwsem is held till (another) invalidate_range_end finishes - invalidate_range_end sits waits on the pseudo lock, held by invalidate_range_start. Removing the fairness of the rwsem breaks the cycle (in lockdep terms, it would change the *shared* rwsem readers into *shared recursive* readers), so open-code the wait using a readers count and a spinlock. This also allows handling blockable and non-blockable critical section in the same way. Losing the rwsem fairness does theoretically allow MMU notifiers to block install_new_memslots forever. Note that mm/mmu_notifier.c's own retry scheme in mmu_interval_read_begin also uses wait/wake_up and is likewise not fair. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: Introduce kvm_get_kvm_safe()Peter Xu1-5/+15
Introduce this safe version of kvm_get_kvm() so that it can be called even during vm destruction. Use it in kvm_debugfs_open() and remove the verbose comment. Prepare to be used elsewhere. Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210625153214.43106-3-peterx@redhat.com> [Preserve the comment in kvm_debugfs_open. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: Add infrastructure and macro to mark VM as buggedSean Christopherson1-5/+5
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <3a0998645c328bf0895f1290e61821b70f048549.1625186503.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: Get rid of kvm_get_pfn()Marc Zyngier1-8/+1
Nobody is using kvm_get_pfn() anymore. Get rid of it. Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210726153552.1535838-7-maz@kernel.org
2021-08-02KVM: Remove kvm_is_transparent_hugepage() and PageTransCompoundMap()Marc Zyngier1-10/+0
Now that arm64 has stopped using kvm_is_transparent_hugepage(), we can remove it, as well as PageTransCompoundMap() which was only used by the former. Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210726153552.1535838-5-maz@kernel.org
2021-07-27KVM: add missing compat KVM_CLEAR_DIRTY_LOGPaolo Bonzini1-0/+28
The arguments to the KVM_CLEAR_DIRTY_LOG ioctl include a pointer, therefore it needs a compat ioctl implementation. Otherwise, 32-bit userspace fails to invoke it on 64-bit kernels; for x86 it might work fine by chance if the padding is zero, but not on big-endian architectures. Reported-by: Thomas Sattler Cc: stable@vger.kernel.org Fixes: 2a31b9db1535 ("kvm: introduce manual dirty log reprotect") Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-27KVM: use cpu_relax when halt pollingLi RongQing1-0/+1
SMT siblings share caches and other hardware, and busy halt polling will degrade its sibling performance if its sibling is working Sean Christopherson suggested as below: "Rather than disallowing halt-polling entirely, on x86 it should be sufficient to simply have the hardware thread yield to its sibling(s) via PAUSE. It probably won't get back all performance, but I would expect it to be close. This compiles on all KVM architectures, and AFAICT the intended usage of cpu_relax() is identical for all architectures." Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <20210727111247.55510-1-lirongqing@baidu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-1/+1
Pull kvm fixes from Paolo Bonzini: - Allow again loading KVM on 32-bit non-PAE builds - Fixes for host SMIs on AMD - Fixes for guest SMIs on AMD - Fixes for selftests on s390 and ARM - Fix memory leak - Enforce no-instrumentation area on vmentry when hardware breakpoints are in use. * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits) KVM: selftests: smm_test: Test SMM enter from L2 KVM: nSVM: Restore nested control upon leaving SMM KVM: nSVM: Fix L1 state corruption upon return from SMM KVM: nSVM: Introduce svm_copy_vmrun_state() KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails KVM: SVM: add module param to control the #SMI interception KVM: SVM: remove INIT intercept handler KVM: SVM: #SMI interception must not skip the instruction KVM: VMX: Remove vmx_msr_index from vmx.h KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run() KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc kvm: debugfs: fix memory leak in kvm_create_vm_debugfs KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR ...
2021-07-15kvm: debugfs: fix memory leak in kvm_create_vm_debugfsPavel Skripkin1-1/+1
In commit bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors") loop for filling debugfs_stat_data was copy-pasted 2 times, but in the second loop pointers are saved over pointers allocated in the first loop. All this causes is a memory leak, fix it. Fixes: bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors") Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Reviewed-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210701195500.27097-1-paskripkin@gmail.com> Reviewed-by: Jing Zhang <jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-07-14Merge tag 'kvm-s390-master-5.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEADPaolo Bonzini1-2/+17
KVM: selftests: Fixes - provide memory model for IBM z196 and zEC12 - do not require 64GB of memory
2021-06-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+1
Merge misc updates from Andrew Morton: "191 patches. Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab, slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap, mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization, pagealloc, and memory-failure)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (191 commits) mm,hwpoison: make get_hwpoison_page() call get_any_page() mm,hwpoison: send SIGBUS with error virutal address mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes mm/page_alloc: allow high-order pages to be stored on the per-cpu lists mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA docs: remove description of DISCONTIGMEM arch, mm: remove stale mentions of DISCONIGMEM mm: remove CONFIG_DISCONTIGMEM m68k: remove support for DISCONTIGMEM arc: remove support for DISCONTIGMEM arc: update comment about HIGHMEM implementation alpha: remove DISCONTIGMEM and NUMA mm/page_alloc: move free_the_page mm/page_alloc: fix counting of managed_pages mm/page_alloc: improve memmap_pages dbg msg mm: drop SECTION_SHIFT in code comments mm/page_alloc: introduce vm.percpu_pagelist_high_fraction mm/page_alloc: limit the number of pages on PCP lists when reclaim is active mm/page_alloc: scale the number of pages that are batch freed ...
2021-06-29virt/kvm: use vma_lookup() instead of find_vma_intersection()Liam Howlett1-1/+1
vma_lookup() finds the vma of a specific address with a cleaner interface and is more readable. Link: https://lkml.kernel.org/r/20210521174745.2219620-11-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-49/+274
Pull kvm updates from Paolo Bonzini: "This covers all architectures (except MIPS) so I don't expect any other feature pull requests this merge window. ARM: - Add MTE support in guests, complete with tag save/restore interface - Reduce the impact of CMOs by moving them in the page-table code - Allow device block mappings at stage-2 - Reduce the footprint of the vmemmap in protected mode - Support the vGIC on dumb systems such as the Apple M1 - Add selftest infrastructure to support multiple configuration and apply that to PMU/non-PMU setups - Add selftests for the debug architecture - The usual crop of PMU fixes PPC: - Support for the H_RPT_INVALIDATE hypercall - Conversion of Book3S entry/exit to C - Bug fixes S390: - new HW facilities for guests - make inline assembly more robust with KASAN and co x86: - Allow userspace to handle emulation errors (unknown instructions) - Lazy allocation of the rmap (host physical -> guest physical address) - Support for virtualizing TSC scaling on VMX machines - Optimizations to avoid shattering huge pages at the beginning of live migration - Support for initializing the PDPTRs without loading them from memory - Many TLB flushing cleanups - Refuse to load if two-stage paging is available but NX is not (this has been a requirement in practice for over a year) - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from CR0/CR4/EFER, using the MMU mode everywhere once it is computed from the CPU registers - Use PM notifier to notify the guest about host suspend or hibernate - Support for passing arguments to Hyper-V hypercalls using XMM registers - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap on AMD processors - Hide Hyper-V hypercalls that are not included in the guest CPUID - Fixes for live migration of virtual machines that use the Hyper-V "enlightened VMCS" optimization of nested virtualization - Bugfixes (not many) Generic: - Support for retrieving statistics without debugfs - Cleanups for the KVM selftests API" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits) KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled kvm: x86: disable the narrow guest module parameter on unload selftests: kvm: Allows userspace to handle emulation errors. kvm: x86: Allow userspace to handle emulation errors KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic KVM: x86: Enhance comments for MMU roles and nested transition trickiness KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU KVM: x86/mmu: Use MMU's role to determine PTTYPE KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers KVM: x86/mmu: Add a helper to calculate root from role_regs KVM: x86/mmu: Add helper to update paging metadata KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0 KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper KVM: x86/mmu: Get nested MMU's root level from the MMU's role ...
2021-06-28Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull scheduler udpates from Ingo Molnar: - Changes to core scheduling facilities: - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables coordinated scheduling across SMT siblings. This is a much requested feature for cloud computing platforms, to allow the flexible utilization of SMT siblings, without exposing untrusted domains to information leaks & side channels, plus to ensure more deterministic computing performance on SMT systems used by heterogenous workloads. There are new prctls to set core scheduling groups, which allows more flexible management of workloads that can share siblings. - Fix task->state access anti-patterns that may result in missed wakeups and rename it to ->__state in the process to catch new abuses. - Load-balancing changes: - Tweak newidle_balance for fair-sched, to improve 'memcache'-like workloads. - "Age" (decay) average idle time, to better track & improve workloads such as 'tbench'. - Fix & improve energy-aware (EAS) balancing logic & metrics. - Fix & improve the uclamp metrics. - Fix task migration (taskset) corner case on !CONFIG_CPUSET. - Fix RT and deadline utilization tracking across policy changes - Introduce a "burstable" CFS controller via cgroups, which allows bursty CPU-bound workloads to borrow a bit against their future quota to improve overall latencies & batching. Can be tweaked via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us. - Rework assymetric topology/capacity detection & handling. - Scheduler statistics & tooling: - Disable delayacct by default, but add a sysctl to enable it at runtime if tooling needs it. Use static keys and other optimizations to make it more palatable. - Use sched_clock() in delayacct, instead of ktime_get_ns(). - Misc cleanups and fixes. * tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/doc: Update the CPU capacity asymmetry bits sched/topology: Rework CPU capacity asymmetry detection sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag psi: Fix race between psi_trigger_create/destroy sched/fair: Introduce the burstable CFS controller sched/uclamp: Fix uclamp_tg_restrict() sched/rt: Fix Deadline utilization tracking during policy change sched/rt: Fix RT utilization tracking during policy change sched: Change task_struct::state sched,arch: Remove unused TASK_STATE offsets sched,timer: Use __set_current_state() sched: Add get_current_state() sched,perf,kvm: Fix preemption condition sched: Introduce task_is_running() sched: Unbreak wakeups sched/fair: Age the average idle time sched/cpufreq: Consider reduced CPU capacity in energy calculation sched/fair: Take thermal pressure into account while estimating energy thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure sched/fair: Return early from update_tg_cfs_load() if delta == 0 ...