aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel/kvm.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-09-06x86/kvm: Don't enable IRQ when IRQ enabled in kvm_waitLai Jiangshan1-2/+3
Commit f4e61f0c9add3 ("x86/kvm: Fix broken irq restoration in kvm_wait") replaced "local_irq_restore() when IRQ enabled" with "local_irq_enable() when IRQ enabled" to suppress a warnning. Although there is no similar debugging warnning for doing local_irq_enable() when IRQ enabled as doing local_irq_restore() in the same IRQ situation. But doing local_irq_enable() when IRQ enabled is no less broken as doing local_irq_restore() and we'd better avoid it. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210814035129.154242-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()Vitaly Kuznetsov1-25/+17
Simplify the code by making PV features shutdown happen in one place. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Disable all PV features on crashVitaly Kuznetsov1-12/+32
Crash shutdown handler only disables kvmclock and steal time, other PV features remain active so we risk corrupting memory or getting some side-effects in kdump kernel. Move crash handler to kvm.c and unify with CPU offline. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Disable kvmclock on all CPUs on shutdownVitaly Kuznetsov1-0/+1
Currenly, we disable kvmclock from machine_shutdown() hook and this only happens for boot CPU. We need to disable it for all CPUs to guard against memory corruption e.g. on restore from hibernate. Note, writing '0' to kvmclock MSR doesn't clear memory location, it just prevents hypervisor from updating the location so for the short while after write and while CPU is still alive, the clock remains usable and correct so we don't need to switch to some other clocksource. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Teardown PV features on boot CPU as wellVitaly Kuznetsov1-16/+40
Various PV features (Async PF, PV EOI, steal time) work through memory shared with hypervisor and when we restore from hibernation we must properly teardown all these features to make sure hypervisor doesn't write to stale locations after we jump to the previously hibernated kernel (which can try to place anything there). For secondary CPUs the job is already done by kvm_cpu_down_prepare(), register syscore ops to do the same for boot CPU. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-05x86/kvm: Fix pr_info() for async PF setup/teardownVitaly Kuznetsov1-2/+2
'pr_fmt' already has 'kvm-guest: ' so 'KVM' prefix is redundant. "Unregister pv shared memory" is very ambiguous, it's hard to say which particular PV feature it relates to. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-68/+60
Pull kvm updates from Paolo Bonzini: "This is a large update by KVM standards, including AMD PSP (Platform Security Processor, aka "AMD Secure Technology") and ARM CoreSight (debug and trace) changes. ARM: - CoreSight: Add support for ETE and TRBE - Stage-2 isolation for the host kernel when running in protected mode - Guest SVE support when running in nVHE mode - Force W^X hypervisor mappings in nVHE mode - ITS save/restore for guests using direct injection with GICv4.1 - nVHE panics now produce readable backtraces - Guest support for PTP using the ptp_kvm driver - Performance improvements in the S2 fault handler x86: - AMD PSP driver changes - Optimizations and cleanup of nested SVM code - AMD: Support for virtual SPEC_CTRL - Optimizations of the new MMU code: fast invalidation, zap under read lock, enable/disably dirty page logging under read lock - /dev/kvm API for AMD SEV live migration (guest API coming soon) - support SEV virtual machines sharing the same encryption context - support SGX in virtual machines - add a few more statistics - improved directed yield heuristics - Lots and lots of cleanups Generic: - Rework of MMU notifier interface, simplifying and optimizing the architecture-specific code - a handful of "Get rid of oprofile leftovers" patches - Some selftests improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits) KVM: selftests: Speed up set_memory_region_test selftests: kvm: Fix the check of return value KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt() KVM: SVM: Skip SEV cache flush if no ASIDs have been used KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids() KVM: SVM: Drop redundant svm_sev_enabled() helper KVM: SVM: Move SEV VMCB tracking allocation to sev.c KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup() KVM: SVM: Unconditionally invoke sev_hardware_teardown() KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported) KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features KVM: SVM: Move SEV module params/variables to sev.c KVM: SVM: Disable SEV/SEV-ES if NPT is disabled KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails KVM: SVM: Zero out the VMCB array used to track SEV ASID association x86/sev: Drop redundant and potentially misleading 'sev_enabled' KVM: x86: Move reverse CPUID helpers to separate header file KVM: x86: Rename GPR accessors to make mode-aware variants the defaults ...
2021-04-29Merge tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-3/+8
Pull x86 tlb updates from Ingo Molnar: "The x86 MM changes in this cycle were: - Implement concurrent TLB flushes, which overlaps the local TLB flush with the remote TLB flush. In testing this improved sysbench performance measurably by a couple of percentage points, especially if TLB-heavy security mitigations are active. - Further micro-optimizations to improve the performance of TLB flushes" * tag 'x86-mm-2021-04-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Micro-optimize smp_call_function_many_cond() smp: Inline on_each_cpu_cond() and on_each_cpu() x86/mm/tlb: Remove unnecessary uses of the inline keyword cpumask: Mark functions as pure x86/mm/tlb: Do not make is_lazy dirty for no reason x86/mm/tlb: Privatize cpu_tlbstate x86/mm/tlb: Flush remote and local TLBs concurrently x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote() smp: Run functions concurrently in smp_call_function_many_cond()
2021-04-26Merge tag 'x86_alternatives_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull x86 alternatives/paravirt updates from Borislav Petkov: "First big cleanup to the paravirt infra to use alternatives and thus eliminate custom code patching. For that, the alternatives infrastructure is extended to accomodate paravirt's needs and, as a result, a lot of paravirt patching code goes away, leading to a sizeable cleanup and simplification. Work by Juergen Gross" * tag 'x86_alternatives_for_v5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/paravirt: Have only one paravirt patch function x86/paravirt: Switch functions with custom code to ALTERNATIVE x86/paravirt: Add new PVOP_ALT* macros to support pvops in ALTERNATIVEs x86/paravirt: Switch iret pvops to ALTERNATIVE x86/paravirt: Simplify paravirt macros x86/paravirt: Remove no longer needed 32-bit pvops cruft x86/paravirt: Add new features for paravirt patching x86/alternative: Use ALTERNATIVE_TERNARY() in _static_cpu_has() x86/alternative: Support ALTERNATIVE_TERNARY x86/alternative: Support not-feature x86/paravirt: Switch time pvops functions to use static_call() static_call: Add function to query current function static_call: Move struct static_call_key definition to static_call_types.h x86/alternative: Merge include files x86/alternative: Drop unused feature parameter from ALTINSTR_REPLACEMENT()
2021-04-22Merge branch 'kvm-sev-cgroup' into HEADPaolo Bonzini1-13/+10
2021-04-19x86/kvm: Don't bother __pv_cpu_mask when !CONFIG_SMPWanpeng Li1-63/+55
Enable PV TLB shootdown when !CONFIG_SMP doesn't make sense. Let's move it inside CONFIG_SMP. In addition, we can avoid define and alloc __pv_cpu_mask when !CONFIG_SMP and get rid of 'alloc' variable in kvm_alloc_cpumask. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1617941911-5338-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-18x86/kvm: Fix broken irq restoration in kvm_waitWanpeng Li1-13/+10
After commit 997acaf6b4b59c (lockdep: report broken irq restoration), the guest splatting below during boot: raw_local_irq_restore() called with IRQs enabled WARNING: CPU: 1 PID: 169 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x26/0x30 Modules linked in: hid_generic usbhid hid CPU: 1 PID: 169 Comm: systemd-udevd Not tainted 5.11.0+ #25 RIP: 0010:warn_bogus_irq_restore+0x26/0x30 Call Trace: kvm_wait+0x76/0x90 __pv_queued_spin_lock_slowpath+0x285/0x2e0 do_raw_spin_lock+0xc9/0xd0 _raw_spin_lock+0x59/0x70 lockref_get_not_dead+0xf/0x50 __legitimize_path+0x31/0x60 legitimize_root+0x37/0x50 try_to_unlazy_next+0x7f/0x1d0 lookup_fast+0xb0/0x170 path_openat+0x165/0x9b0 do_filp_open+0x99/0x110 do_sys_openat2+0x1f1/0x2e0 do_sys_open+0x5c/0x80 __x64_sys_open+0x21/0x30 do_syscall_64+0x32/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xae The new consistency checking, expects local_irq_save() and local_irq_restore() to be paired and sanely nested, and therefore expects local_irq_restore() to be called with irqs disabled. The irqflags handling in kvm_wait() which ends up doing: local_irq_save(flags); safe_halt(); local_irq_restore(flags); instead triggers it. This patch fixes it by using local_irq_disable()/enable() directly. Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1615791328-2735-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-03-11x86/paravirt: Switch time pvops functions to use static_call()Juergen Gross1-1/+1
The time pvops functions are the only ones left which might be used in 32-bit mode and which return a 64-bit value. Switch them to use the static_call() mechanism instead of pvops, as this allows quite some simplification of the pvops implementation. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210311142319.4723-5-jgross@suse.com
2021-03-06x86/mm/tlb: Flush remote and local TLBs concurrentlyNadav Amit1-3/+8
To improve TLB shootdown performance, flush the remote and local TLBs concurrently. Introduce flush_tlb_multi() that does so. Introduce paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen and hyper-v are only compile-tested). While the updated smp infrastructure is capable of running a function on a single local core, it is not optimized for this case. The multiple function calls and the indirect branch introduce some overhead, and might make local TLB flushes slower than they were before the recent changes. Before calling the SMP infrastructure, check if only a local TLB flush is needed to restore the lost performance in this common case. This requires to check mm_cpumask() one more time, but unless this mask is updated very frequently, this should impact performance negatively. Signed-off-by: Nadav Amit <namit@vmware.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Michael Kelley <mikelley@microsoft.com> # Hyper-v parts Reviewed-by: Juergen Gross <jgross@suse.com> # Xen and paravirt parts Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20210220231712.2475218-5-namit@vmware.com
2020-10-28x86/kvm: Enable 15-bit extension when KVM_FEATURE_MSI_EXT_DEST_ID detectedDavid Woodhouse1-0/+6
This allows the host to indicate that MSI emulation supports 15-bit destination IDs, allowing up to 32768 CPUs without interrupt remapping. cf. https://patchwork.kernel.org/patch/11816693/ for qemu Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20201024213535.443185-36-dwmw2@infradead.org
2020-10-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-1/+1
Pull KVM updates from Paolo Bonzini: "For x86, there is a new alternative and (in the future) more scalable implementation of extended page tables that does not need a reverse map from guest physical addresses to host physical addresses. For now it is disabled by default because it is still lacking a few of the existing MMU's bells and whistles. However it is a very solid piece of work and it is already available for people to hammer on it. Other updates: ARM: - New page table code for both hypervisor and guest stage-2 - Introduction of a new EL2-private host context - Allow EL2 to have its own private per-CPU variables - Support of PMU event filtering - Complete rework of the Spectre mitigation PPC: - Fix for running nested guests with in-kernel IRQ chip - Fix race condition causing occasional host hard lockup - Minor cleanups and bugfixes x86: - allow trapping unknown MSRs to userspace - allow userspace to force #GP on specific MSRs - INVPCID support on AMD - nested AMD cleanup, on demand allocation of nested SVM state - hide PV MSRs and hypercalls for features not enabled in CPUID - new test for MSR_IA32_TSC writes from host and guest - cleanups: MMU, CPUID, shared MSRs - LAPIC latency optimizations ad bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits) kvm: x86/mmu: NX largepage recovery for TDP MMU kvm: x86/mmu: Don't clear write flooding count for direct roots kvm: x86/mmu: Support MMIO in the TDP MMU kvm: x86/mmu: Support write protection for nesting in tdp MMU kvm: x86/mmu: Support disabling dirty logging for the tdp MMU kvm: x86/mmu: Support dirty logging for the TDP MMU kvm: x86/mmu: Support changed pte notifier in tdp MMU kvm: x86/mmu: Add access tracking for tdp_mmu kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU kvm: x86/mmu: Add TDP MMU PF handler kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg kvm: x86/mmu: Support zapping SPTEs in the TDP MMU KVM: Cache as_id in kvm_memory_slot kvm: x86/mmu: Add functions to handle changed TDP SPTEs kvm: x86/mmu: Allocate and free TDP MMU roots kvm: x86/mmu: Init / Uninit the TDP MMU kvm: x86/mmu: Introduce tdp_iter KVM: mmu: extract spte.h and spte.c KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp ...
2020-10-14Merge tag 'x86_seves_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-6/+29
Pull x86 SEV-ES support from Borislav Petkov: "SEV-ES enhances the current guest memory encryption support called SEV by also encrypting the guest register state, making the registers inaccessible to the hypervisor by en-/decrypting them on world switches. Thus, it adds additional protection to Linux guests against exfiltration, control flow and rollback attacks. With SEV-ES, the guest is in full control of what registers the hypervisor can access. This is provided by a guest-host exchange mechanism based on a new exception vector called VMM Communication Exception (#VC), a new instruction called VMGEXIT and a shared Guest-Host Communication Block which is a decrypted page shared between the guest and the hypervisor. Intercepts to the hypervisor become #VC exceptions in an SEV-ES guest so in order for that exception mechanism to work, the early x86 init code needed to be made able to handle exceptions, which, in itself, brings a bunch of very nice cleanups and improvements to the early boot code like an early page fault handler, allowing for on-demand building of the identity mapping. With that, !KASLR configurations do not use the EFI page table anymore but switch to a kernel-controlled one. The main part of this series adds the support for that new exchange mechanism. The goal has been to keep this as much as possibly separate from the core x86 code by concentrating the machinery in two SEV-ES-specific files: arch/x86/kernel/sev-es-shared.c arch/x86/kernel/sev-es.c Other interaction with core x86 code has been kept at minimum and behind static keys to minimize the performance impact on !SEV-ES setups. Work by Joerg Roedel and Thomas Lendacky and others" * tag 'x86_seves_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits) x86/sev-es: Use GHCB accessor for setting the MMIO scratch buffer x86/sev-es: Check required CPU features for SEV-ES x86/efi: Add GHCB mappings when SEV-ES is active x86/sev-es: Handle NMI State x86/sev-es: Support CPU offline/online x86/head/64: Don't call verify_cpu() on starting APs x86/smpboot: Load TSS and getcpu GDT entry before loading IDT x86/realmode: Setup AP jump table x86/realmode: Add SEV-ES specific trampoline entry point x86/vmware: Add VMware-specific handling for VMMCALL under SEV-ES x86/kvm: Add KVM-specific VMMCALL handling under SEV-ES x86/paravirt: Allow hypervisor-specific VMMCALL handling under SEV-ES x86/sev-es: Handle #DB Events x86/sev-es: Handle #AC Events x86/sev-es: Handle VMMCALL Events x86/sev-es: Handle MWAIT/MWAITX Events x86/sev-es: Handle MONITOR/MONITORX Events x86/sev-es: Handle INVD Events x86/sev-es: Handle RDPMC Events x86/sev-es: Handle RDTSC(P) Events ...
2020-09-28cpuidle-haltpoll: fix error comments in arch_haltpoll_disableLi Qiang1-1/+1
The 'arch_haltpoll_disable' is used to disable guest halt poll. Correct the comments. Fixes: a1c4423b02b21 ("cpuidle-haltpoll: disable host side polling when kvm virtualized") Signed-off-by: Li Qiang <liq3ea@163.com> Message-Id: <20200924155800.4939-1-liq3ea@163.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-20Revert "KVM: Check the allocation of pv cpu mask"Vitaly Kuznetsov1-19/+3
The commit 0f990222108d ("KVM: Check the allocation of pv cpu mask") we have in 5.9-rc5 has two issue: 1) Compilation fails for !CONFIG_SMP, see: https://bugzilla.kernel.org/show_bug.cgi?id=209285 2) This commit completely disables PV TLB flush, see https://lore.kernel.org/kvm/87y2lrnnyf.fsf@vitty.brq.redhat.com/ The allocation problem is likely a theoretical one, if we don't have memory that early in boot process we're likely doomed anyway. Let's solve it properly later. This reverts commit 0f990222108d214a0924d920e6095b58107d7b59. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-12x86/kvm: don't forget to ACK async PF IRQVitaly Kuznetsov1-0/+2
Merge commit 26d05b368a5c0 ("Merge branch 'kvm-async-pf-int' into HEAD") tried to adapt the new interrupt based async PF mechanism to the newly introduced IDTENTRY magic but unfortunately it missed the fact that DEFINE_IDTENTRY_SYSVEC() doesn't call ack_APIC_irq() on its own and all DEFINE_IDTENTRY_SYSVEC() users have to call it manually. As the result all multi-CPU KVM guest hang on boot when KVM_FEATURE_ASYNC_PF_INT is present. The breakage went unnoticed because no KVM userspace (e.g. QEMU) currently set it (and thus async PF mechanism is currently disabled) but we're about to change that. Fixes: 26d05b368a5c0 ("Merge branch 'kvm-async-pf-int' into HEAD") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200908135350.355053-3-vkuznets@redhat.com> Tested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-12x86/kvm: properly use DEFINE_IDTENTRY_SYSVEC() macroVitaly Kuznetsov1-4/+0
DEFINE_IDTENTRY_SYSVEC() already contains irqentry_enter()/ irqentry_exit(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200908135350.355053-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-11KVM: Check the allocation of pv cpu maskHaiwei Li1-3/+19
check the allocation of per-cpu __pv_cpu_mask. Initialize ops only when successful. Signed-off-by: Haiwei Li <lihaiwei@tencent.com> Message-Id: <d59f05df-e6d3-3d31-a036-cc25a2b2f33f@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-09x86/kvm: Add KVM-specific VMMCALL handling under SEV-ESTom Lendacky1-6/+29
Implement the callbacks to copy the processor state required by KVM to the GHCB. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> [ jroedel@suse.de: - Split out of a larger patch - Adapt to different callback functions ] Co-developed-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200907131613.12703-64-joro@8bytes.org
2020-08-06Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-37/+81
Pull KVM updates from Paolo Bonzini: "s390: - implement diag318 x86: - Report last CPU for debugging - Emulate smaller MAXPHYADDR in the guest than in the host - .noinstr and tracing fixes from Thomas - nested SVM page table switching optimization and fixes Generic: - Unify shadow MMU cache data structures across architectures" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) KVM: SVM: Fix sev_pin_memory() error handling KVM: LAPIC: Set the TDCR settable bits KVM: x86: Specify max TDP level via kvm_configure_mmu() KVM: x86/mmu: Rename max_page_level to max_huge_page_level KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch KVM: x86: Pull the PGD's level from the MMU instead of recalculating it KVM: VMX: Make vmx_load_mmu_pgd() static KVM: x86/mmu: Add separate helper for shadow NPT root page role calc KVM: VMX: Drop a duplicate declaration of construct_eptp() KVM: nSVM: Correctly set the shadow NPT root level in its MMU role KVM: Using macros instead of magic values MIPS: KVM: Fix build error caused by 'kvm_run' cleanup KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support KVM: VMX: optimize #PF injection when MAXPHYADDR does not match KVM: VMX: Add guest physical address check in EPT violation and misconfig KVM: VMX: introduce vmx_need_pf_intercept KVM: x86: update exception bitmap on CPUID changes KVM: x86: rename update_bp_intercept to update_exception_bitmap ...
2020-07-24x86/entry: Cleanup idtentry_enter/exitThomas Gleixner1-3/+3
Remove the temporary defines and fixup all references. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20200722220520.855839271@linutronix.de
2020-07-08x86/kvm: Add "nopvspin" parameter to disable PV spinlocksZhenzhong Duan1-7/+32
There are cases where a guest tries to switch spinlocks to bare metal behavior (e.g. by setting "xen_nopvspin" on XEN platform and "hv_nopvspin" on HYPER_V). That feature is missed on KVM, add a new parameter "nopvspin" to disable PV spinlocks for KVM guest. The new 'nopvspin' parameter will also replace Xen and Hyper-V specific parameters in future patches. Define variable nopvsin as global because it will be used in future patches as above. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08x86/kvm: Change print code to use pr_*() formatZhenzhong Duan1-9/+13
pr_*() is preferred than printk(KERN_* ...), after change all the print in arch/x86/kernel/kvm.c will have "kvm-guest: xxx" style. No functional change. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08Revert "KVM: X86: Fix setup the virt_spin_lock_key before static key get initialized"Zhenzhong Duan1-9/+3
This reverts commit 34226b6b70980a8f81fff3c09a2c889f77edeeff. Commit 8990cac6e5ea ("x86/jump_label: Initialize static branching early") adds jump_label_init() call in setup_arch() to make static keys initialized early, so we could use the original simpler code again. The similar change for XEN is in commit 090d54bcbc54 ("Revert "x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized"") Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-08Merge branch 'kvm-async-pf-int' into HEADPaolo Bonzini1-13/+34
2020-07-06x86/entry: Rename idtentry_enter/exit_cond_rcu() to idtentry_enter/exit()Andy Lutomirski1-3/+3
They were originally called _cond_rcu because they were special versions with conditional RCU handling. Now they're the standard entry and exit path, so the _cond_rcu part is just confusing. Drop it. Also change the signature to make them more extensible and more foolproof. No functional change -- it's pure refactoring. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/247fc67685263e0b673e1d7f808182d28ff80359.1593795633.git.luto@kernel.org
2020-06-15KVM: x86: Switch KVM guest to using interrupts for page ready APF deliveryVitaly Kuznetsov1-15/+33
KVM now supports using interrupt for 'page ready' APF event delivery and legacy mechanism was deprecated. Switch KVM guests to the new one. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-9-vkuznets@redhat.com> [Use HYPERVISOR_CALLBACK_VECTOR instead of a separate vector. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-13Merge tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-6/+9
Pull x86 entry updates from Thomas Gleixner: "The x86 entry, exception and interrupt code rework This all started about 6 month ago with the attempt to move the Posix CPU timer heavy lifting out of the timer interrupt code and just have lockless quick checks in that code path. Trivial 5 patches. This unearthed an inconsistency in the KVM handling of task work and the review requested to move all of this into generic code so other architectures can share. Valid request and solved with another 25 patches but those unearthed inconsistencies vs. RCU and instrumentation. Digging into this made it obvious that there are quite some inconsistencies vs. instrumentation in general. The int3 text poke handling in particular was completely unprotected and with the batched update of trace events even more likely to expose to endless int3 recursion. In parallel the RCU implications of instrumenting fragile entry code came up in several discussions. The conclusion of the x86 maintainer team was to go all the way and make the protection against any form of instrumentation of fragile and dangerous code pathes enforcable and verifiable by tooling. A first batch of preparatory work hit mainline with commit d5f744f9a2ac ("Pull x86 entry code updates from Thomas Gleixner") That (almost) full solution introduced a new code section '.noinstr.text' into which all code which needs to be protected from instrumentation of all sorts goes into. Any call into instrumentable code out of this section has to be annotated. objtool has support to validate this. Kprobes now excludes this section fully which also prevents BPF from fiddling with it and all 'noinstr' annotated functions also keep ftrace off. The section, kprobes and objtool changes are already merged. The major changes coming with this are: - Preparatory cleanups - Annotating of relevant functions to move them into the noinstr.text section or enforcing inlining by marking them __always_inline so the compiler cannot misplace or instrument them. - Splitting and simplifying the idtentry macro maze so that it is now clearly separated into simple exception entries and the more interesting ones which use interrupt stacks and have the paranoid handling vs. CR3 and GS. - Move quite some of the low level ASM functionality into C code: - enter_from and exit to user space handling. The ASM code now calls into C after doing the really necessary ASM handling and the return path goes back out without bells and whistels in ASM. - exception entry/exit got the equivivalent treatment - move all IRQ tracepoints from ASM to C so they can be placed as appropriate which is especially important for the int3 recursion issue. - Consolidate the declaration and definition of entry points between 32 and 64 bit. They share a common header and macros now. - Remove the extra device interrupt entry maze and just use the regular exception entry code. - All ASM entry points except NMI are now generated from the shared header file and the corresponding macros in the 32 and 64 bit entry ASM. - The C code entry points are consolidated as well with the help of DEFINE_IDTENTRY*() macros. This allows to ensure at one central point that all corresponding entry points share the same semantics. The actual function body for most entry points is in an instrumentable and sane state. There are special macros for the more sensitive entry points, e.g. INT3 and of course the nasty paranoid #NMI, #MCE, #DB and #DF. They allow to put the whole entry instrumentation and RCU handling into safe places instead of the previous pray that it is correct approach. - The INT3 text poke handling is now completely isolated and the recursion issue banned. Aside of the entry rework this required other isolation work, e.g. the ability to force inline bsearch. - Prevent #DB on fragile entry code, entry relevant memory and disable it on NMI, #MC entry, which allowed to get rid of the nested #DB IST stack shifting hackery. - A few other cleanups and enhancements which have been made possible through this and already merged changes, e.g. consolidating and further restricting the IDT code so the IDT table becomes RO after init which removes yet another popular attack vector - About 680 lines of ASM maze are gone. There are a few open issues: - An escape out of the noinstr section in the MCE handler which needs some more thought but under the aspect that MCE is a complete trainwreck by design and the propability to survive it is low, this was not high on the priority list. - Paravirtualization When PV is enabled then objtool complains about a bunch of indirect calls out of the noinstr section. There are a few straight forward ways to fix this, but the other issues vs. general correctness were more pressing than parawitz. - KVM KVM is inconsistent as well. Patches have been posted, but they have not yet been commented on or picked up by the KVM folks. - IDLE Pretty much the same problems can be found in the low level idle code especially the parts where RCU stopped watching. This was beyond the scope of the more obvious and exposable problems and is on the todo list. The lesson learned from this brain melting exercise to morph the evolved code base into something which can be validated and understood is that once again the violation of the most important engineering principle "correctness first" has caused quite a few people to spend valuable time on problems which could have been avoided in the first place. The "features first" tinkering mindset really has to stop. With that I want to say thanks to everyone involved in contributing to this effort. Special thanks go to the following people (alphabetical order): Alexandre Chartre, Andy Lutomirski, Borislav Petkov, Brian Gerst, Frederic Weisbecker, Josh Poimboeuf, Juergen Gross, Lai Jiangshan, Macro Elver, Paolo Bonzin,i Paul McKenney, Peter Zijlstra, Vitaly Kuznetsov, and Will Deacon" * tag 'x86-entry-2020-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (142 commits) x86/entry: Force rcu_irq_enter() when in idle task x86/entry: Make NMI use IDTENTRY_RAW x86/entry: Treat BUG/WARN as NMI-like entries x86/entry: Unbreak __irqentry_text_start/end magic x86/entry: __always_inline CR2 for noinstr lockdep: __always_inline more for noinstr x86/entry: Re-order #DB handler to avoid *SAN instrumentation x86/entry: __always_inline arch_atomic_* for noinstr x86/entry: __always_inline irqflags for noinstr x86/entry: __always_inline debugreg for noinstr x86/idt: Consolidate idt functionality x86/idt: Cleanup trap_init() x86/idt: Use proper constants for table size x86/idt: Add comments about early #PF handling x86/idt: Mark init only functions __init x86/entry: Rename trace_hardirqs_off_prepare() x86/entry: Clarify irq_{enter,exit}_rcu() x86/entry: Remove DBn stacks x86/entry: Remove debug IDT frobbing x86/entry: Optimize local_db_save() for virt ...
2020-06-11x86/entry: Switch page fault exception to IDTENTRY_RAWThomas Gleixner1-6/+9
Convert page fault exceptions to IDTENTRY_RAW: - Implement the C entry point with DEFINE_IDTENTRY_RAW - Add the CR2 read into the exception handler - Add the idtentry_enter/exit_cond_rcu() invocations in in the regular page fault handler and in the async PF part. - Emit the ASM stub with DECLARE_IDTENTRY_RAW - Remove the ASM idtentry in 64-bit - Remove the CR2 read from 64-bit - Remove the open coded ASM entry code in 32-bit - Fix up the XEN/PV code - Remove the old prototypes No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20200521202118.238455120@linutronix.de
2020-06-04x86/kvm: Remove defunct KVM_DEBUG_FS KconfigSean Christopherson1-1/+0
Remove KVM_DEBUG_FS, which can easily be misconstrued as controlling KVM-as-a-host. The sole user of CONFIG_KVM_DEBUG_FS was removed by commit cfd8983f03c7b ("x86, locking/spinlocks: Remove ticket (spin)lock implementation"). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200528031121.28904-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: extend struct kvm_vcpu_pv_apf_data with token infoVitaly Kuznetsov1-8/+8
Currently, APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extent the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While on it, rename 'reason' to 'flags'. This doesn't change the semantics as we only have reasons '1' and '2' and these can be treated as bit flags but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery making 'reason' name misleading. The newly introduced apf_put_user_ready() temporary puts both flags and token information, this will be changed to put token only when we switch to interrupt based notifications. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-19x86/kvm: Restrict ASYNC_PF to user spaceThomas Gleixner1-93/+7
The async page fault injection into kernel space creates more problems than it solves. The host has absolutely no knowledge about the state of the guest if the fault happens in CPL0. The only restriction for the host is interrupt disabled state. If interrupts are enabled in the guest then the exception can hit arbitrary code. The HALT based wait in non-preemotible code is a hacky replacement for a proper hypercall. For the ongoing work to restrict instrumentation and make the RCU idle interaction well defined the required extra work for supporting async pagefault in CPL0 is just not justified and creates complexity for a dubious benefit. The CPL3 injection is well defined and does not cause any issues as it is more or less the same as a regular page fault from CPL3. Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.369802541@linutronix.de
2020-05-19x86/kvm: Sanitize kvm_async_pf_task_wait()Thomas Gleixner1-60/+141
While working on the entry consolidation I stumbled over the KVM async page fault handler and kvm_async_pf_task_wait() in particular. It took me a while to realize that the randomly sprinkled around rcu_irq_enter()/exit() invocations are just cargo cult programming. Several patches "fixed" RCU splats by curing the symptoms without noticing that the code is flawed from a design perspective. The main problem is that this async injection is not based on a proper handshake mechanism and only respects the minimal requirement, i.e. the guest is not in a state where it has interrupts disabled. Aside of that the actual code is a convoluted one fits it all swiss army knife. It is invoked from different places with different RCU constraints: 1) Host side: vcpu_enter_guest() kvm_x86_ops->handle_exit() kvm_handle_page_fault() kvm_async_pf_task_wait() The invocation happens from fully preemptible context. 2) Guest side: The async page fault interrupted: a) user space b) preemptible kernel code which is not in a RCU read side critical section c) non-preemtible kernel code or a RCU read side critical section or kernel code with CONFIG_PREEMPTION=n which allows not to differentiate between #2b and #2c. RCU is watching for: #1 The vCPU exited and current is definitely not the idle task #2a The #PF entry code on the guest went through enter_from_user_mode() which reactivates RCU #2b There is no preemptible, interrupts enabled code in the kernel which can run with RCU looking away. (The idle task is always non preemptible). I.e. all schedulable states (#1, #2a, #2b) do not need any of this RCU voodoo at all. In #2c RCU is eventually not watching, but as that state cannot schedule anyway there is no point to worry about it so it has to invoke rcu_irq_enter() before running that code. This can be optimized, but this will be done as an extra step in course of the entry code consolidation work. So the proper solution for this is to: - Split kvm_async_pf_task_wait() into schedule and halt based waiting interfaces which share the enqueueing code. - Add comments (condensed form of this changelog) to spare others the time waste and pain of reverse engineering all of this with the help of uncomprehensible changelogs and code history. - Invoke kvm_async_pf_task_wait_schedule() from kvm_handle_page_fault(), user mode and schedulable kernel side async page faults (#1, #2a, #2b) - Invoke kvm_async_pf_task_wait_halt() for the non schedulable kernel case (#2c). For this case also remove the rcu_irq_exit()/enter() pair around the halt as it is just a pointless exercise: - vCPUs can VMEXIT at any random point and can be scheduled out for an arbitrary amount of time by the host and this is not any different except that it voluntary triggers the exit via halt. - The interrupted context could have RCU watching already. So the rcu_irq_exit() before the halt is not gaining anything aside of confusing the reader. Claiming that this might prevent RCU stalls is just an illusion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.262701431@linutronix.de
2020-05-19x86/kvm: Handle async page faults directly through do_page_fault()Andy Lutomirski1-18/+21
KVM overloads #PF to indicate two types of not-actually-page-fault events. Right now, the KVM guest code intercepts them by modifying the IDT and hooking the #PF vector. This makes the already fragile fault code even harder to understand, and it also pollutes call traces with async_page_fault and do_async_page_fault for normal page faults. Clean it up by moving the logic into do_page_fault() using a static branch. This gets rid of the platform trap_init override mechanism completely. [ tglx: Fixed up 32bit, removed error code from the async functions and massaged coding style ] Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200505134059.169270470@linutronix.de
2020-02-28KVM: Pre-allocate 1 cpumask variable per cpu for both pv tlb and pv ipisWanpeng Li1-12/+21
Nick Desaulniers Reported: When building with: $ make CC=clang arch/x86/ CFLAGS=-Wframe-larger-than=1000 The following warning is observed: arch/x86/kernel/kvm.c:494:13: warning: stack frame size of 1064 bytes in function 'kvm_send_ipi_mask_allbutself' [-Wframe-larger-than=] static void kvm_send_ipi_mask_allbutself(const struct cpumask *mask, int vector) ^ Debugging with: https://github.com/ClangBuiltLinux/frame-larger-than via: $ python3 frame_larger_than.py arch/x86/kernel/kvm.o \ kvm_send_ipi_mask_allbutself points to the stack allocated `struct cpumask newmask` in `kvm_send_ipi_mask_allbutself`. The size of a `struct cpumask` is potentially large, as it's CONFIG_NR_CPUS divided by BITS_PER_LONG for the target architecture. CONFIG_NR_CPUS for X86_64 can be as high as 8192, making a single instance of a `struct cpumask` 1024 B. This patch fixes it by pre-allocate 1 cpumask variable per cpu and use it for both pv tlb and pv ipis.. Reported-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-28KVM: Introduce pv check helpersWanpeng Li1-10/+24
Introduce some pv check helpers for consistency. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-02-05x86/kvm: do not setup pv tlb flush when not paravirtualizedThadeu Lima de Souza Cascardo1-0/+3
kvm_setup_pv_tlb_flush will waste memory and print a misguiding message when KVM paravirtualization is not available. Intel SDM says that the when cpuid is used with EAX higher than the maximum supported value for basic of extended function, the data for the highest supported basic function will be returned. So, in some systems, kvm_arch_para_features will return bogus data, causing kvm_setup_pv_tlb_flush to detect support for pv tlb flush. Testing for kvm_para_available will work as it checks for the hypervisor signature. Besides, when the "nopv" command line parameter is used, it should not continue as well, as kvm_guest_init will no be called in that case. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-01-07x86/context-tracking: Remove exception_enter/exit() from KVM_PV_REASON_PAGE_NOT_PRESENT async page faultFrederic Weisbecker1-4/+0
This is a leftover. Page faults, just like most other exceptions, are protected inside user_exit() / user_enter() calls in x86 entry code when we fault from userspace. So this pair of calls is now superfluous. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Link: https://lkml.kernel.org/r/20191227163612.10039-3-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-25x86/kvm: Fix -Wmissing-prototypes warningsYi Wang1-0/+1
We get two warning when build kernel with W=1: arch/x86/kernel/kvm.c:872:6: warning: no previous prototype for ‘arch_haltpoll_enable’ [-Wmissing-prototypes] arch/x86/kernel/kvm.c:885:6: warning: no previous prototype for ‘arch_haltpoll_disable’ [-Wmissing-prototypes] Including the missing head file can fix this. Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-18Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-12/+0
Pull KVM updates from Paolo Bonzini: "s390: - ioctl hardening - selftests ARM: - ITS translation cache - support for 512 vCPUs - various cleanups and bugfixes PPC: - various minor fixes and preparation x86: - bugfixes all over the place (posted interrupts, SVM, emulation corner cases, blocked INIT) - some IPI optimizations" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (75 commits) KVM: X86: Use IPI shorthands in kvm guest when support KVM: x86: Fix INIT signal handling in various CPU states KVM: VMX: Introduce exit reason for receiving INIT signal on guest-mode KVM: VMX: Stop the preemption timer during vCPU reset KVM: LAPIC: Micro optimize IPI latency kvm: Nested KVM MMUs need PAE root too KVM: x86: set ctxt->have_exception in x86_decode_insn() KVM: x86: always stop emulation on page fault KVM: nVMX: trace nested VM-Enter failures detected by H/W KVM: nVMX: add tracepoint for failed nested VM-Enter x86: KVM: svm: Fix a check in nested_svm_vmrun() KVM: x86: Return to userspace with internal error on unexpected exit reason KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM code KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers doc: kvm: Fix return description of KVM_SET_MSRS KVM: X86: Tune PLE Window tracepoint KVM: VMX: Change ple_window type to unsigned int KVM: X86: Remove tailing newline for tracepoints KVM: X86: Trace vcpu_id for vmexit KVM: x86: Manually calculate reserved bits when loading PDPTRS ...
2019-09-17Merge tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds1-0/+37
Pull power management updates from Rafael Wysocki: "These include a rework of the main suspend-to-idle code flow (related to the handling of spurious wakeups), a switch over of several users of cpufreq notifiers to QoS-based limits, a new devfreq driver for Tegra20, a new cpuidle driver and governor for virtualized guests, an extension of the wakeup sources framework to expose wakeup sources as device objects in sysfs, and more. Specifics: - Rework the main suspend-to-idle control flow to avoid repeating "noirq" device resume and suspend operations in case of spurious wakeups from the ACPI EC and decouple the ACPI EC wakeups support from the LPS0 _DSM support (Rafael Wysocki). - Extend the wakeup sources framework to expose wakeup sources as device objects in sysfs (Tri Vo, Stephen Boyd). - Expose system suspend statistics in sysfs (Kalesh Singh). - Introduce a new haltpoll cpuidle driver and a new matching governor for virtualized guests wanting to do guest-side polling in the idle loop (Marcelo Tosatti, Joao Martins, Wanpeng Li, Stephen Rothwell). - Fix the menu and teo cpuidle governors to allow the scheduler tick to be stopped if PM QoS is used to limit the CPU idle state exit latency in some cases (Rafael Wysocki). - Increase the resolution of the play_idle() argument to microseconds for more fine-grained injection of CPU idle cycles (Daniel Lezcano). - Switch over some users of cpuidle notifiers to the new QoS-based frequency limits and drop the CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events (Viresh Kumar). - Add new cpufreq driver based on nvmem for sun50i (Yangtao Li). - Add support for MT8183 and MT8516 to the mediatek cpufreq driver (Andrew-sh.Cheng, Fabien Parent). - Add i.MX8MN support to the imx-cpufreq-dt cpufreq driver (Anson Huang). - Add qcs404 to cpufreq-dt-platdev blacklist (Jorge Ramirez-Ortiz). - Update the qcom cpufreq driver (among other things, to make it easier to extend and to use kryo cpufreq for other nvmem-based SoCs) and add qcs404 support to it (Niklas Cassel, Douglas RAILLARD, Sibi Sankar, Sricharan R). - Fix assorted issues and make assorted minor improvements in the cpufreq code (Colin Ian King, Douglas RAILLARD, Florian Fainelli, Gustavo Silva, Hariprasad Kelam). - Add new devfreq driver for NVidia Tegra20 (Dmitry Osipenko, Arnd Bergmann). - Add new Exynos PPMU events to devfreq events and extend that mechanism (Lukasz Luba). - Fix and clean up the exynos-bus devfreq driver (Kamil Konieczny). - Improve devfreq documentation and governor code, fix spelling typos in devfreq (Ezequiel Garcia, Krzysztof Kozlowski, Leonard Crestez, MyungJoo Ham, Gaël PORTAY). - Add regulators enable and disable to the OPP (operating performance points) framework (Kamil Konieczny). - Update the OPP framework to support multiple opp-suspend properties (Anson Huang). - Fix assorted issues and make assorted minor improvements in the OPP code (Niklas Cassel, Viresh Kumar, Yue Hu). - Clean up the generic power domains (genpd) framework (Ulf Hansson). - Clean up assorted pieces of power management code and documentation (Akinobu Mita, Amit Kucheria, Chuhong Yuan). - Update the pm-graph tool to version 5.5 including multiple fixes and improvements (Todd Brandt). - Update the cpupower utility (Benjamin Weis, Geert Uytterhoeven, Sébastien Szymanski)" * tag 'pm-5.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (126 commits) cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available cpuidle-haltpoll: do not set an owner to allow modunload cpuidle-haltpoll: return -ENODEV on modinit failure cpuidle-haltpoll: set haltpoll as preferred governor cpuidle: allow governor switch on cpuidle_register_driver() PM: runtime: Documentation: add runtime_status ABI document pm-graph: make setVal unbuffered again for python2 and python3 powercap: idle_inject: Use higher resolution for idle injection cpuidle: play_idle: Increase the resolution to usec cpuidle-haltpoll: vcpu hotplug support cpufreq: Add qcs404 to cpufreq-dt-platdev blacklist cpufreq: qcom: Add support for qcs404 on nvmem driver cpufreq: qcom: Refactor the driver to make it easier to extend cpufreq: qcom: Re-organise kryo cpufreq to use it for other nvmem based qcom socs dt-bindings: opp: Add qcom-opp bindings with properties needed for CPR dt-bindings: opp: qcom-nvmem: Support pstates provided by a power domain Documentation: cpufreq: Update policy notifier documentation cpufreq: Remove CPUFREQ_ADJUST and CPUFREQ_NOTIFY policy notifier events PM / Domains: Verify PM domain type in dev_pm_genpd_set_performance_state() PM / Domains: Simplify genpd_lookup_dev() ...
2019-09-17Merge branch 'pm-cpuidle'Rafael J. Wysocki1-0/+37
* pm-cpuidle: cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are available cpuidle-haltpoll: do not set an owner to allow modunload cpuidle-haltpoll: return -ENODEV on modinit failure cpuidle-haltpoll: set haltpoll as preferred governor cpuidle: allow governor switch on cpuidle_register_driver() powercap: idle_inject: Use higher resolution for idle injection cpuidle: play_idle: Increase the resolution to usec cpuidle-haltpoll: vcpu hotplug support cpuidle: teo: Get rid of redundant check in teo_update() cpuidle: teo: Allow tick to be stopped if PM QoS is used cpuidle: menu: Allow tick to be stopped if PM QoS is used cpuidle: header file stubs must be "static inline" cpuidle-haltpoll: disable host side polling when kvm virtualized cpuidle: add haltpoll governor governors: unify last_state_idx cpuidle: add poll_limit_ns to cpuidle_device structure add cpuidle-haltpoll driver
2019-09-16Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull scheduler updates from Ingo Molnar: - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann, Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers. As perf and the scheduler is getting bigger and more complex, document the status quo of current responsibilities and interests, and spread the review pain^H^H^H^H fun via an increase in the Cc: linecount generated by scripts/get_maintainer.pl. :-) - Add another series of patches that brings the -rt (PREEMPT_RT) tree closer to mainline: split the monolithic CONFIG_PREEMPT dependencies into a new CONFIG_PREEMPTION category that will allow the eventual introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches to go though. - Extend the CPU cgroup controller with uclamp.min and uclamp.max to allow the finer shaping of CPU bandwidth usage. - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS). - Improve the behavior of high CPU count, high thread count applications running under cpu.cfs_quota_us constraints. - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present. - Improve CPU isolation housekeeping CPU allocation NUMA locality. - Fix deadline scheduler bandwidth calculations and logic when cpusets rebuilds the topology, or when it gets deadline-throttled while it's being offlined. - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from setscheduler() system calls without creating global serialization. Add new synchronization between cpuset topology-changing events and the deadline acceptance tests in setscheduler(), which were broken before. - Rework the active_mm state machine to be less confusing and more optimal. - Rework (simplify) the pick_next_task() slowpath. - Improve load-balancing on AMD EPYC systems. - ... and misc cleanups, smaller fixes and improvements - please see the Git log for more details. * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) sched/psi: Correct overly pessimistic size calculation sched/fair: Speed-up energy-aware wake-ups sched/uclamp: Always use 'enum uclamp_id' for clamp_id values sched/uclamp: Update CPU's refcount on TG's clamp changes sched/uclamp: Use TG's clamps to restrict TASK's clamps sched/uclamp: Propagate system defaults to the root group sched/uclamp: Propagate parent clamps sched/uclamp: Extend CPU's cgroup controller sched/topology: Improve load balancing on AMD EPYC systems arch, ia64: Make NUMA select SMP sched, perf: MAINTAINERS update, add submaintainers and reviewers sched/fair: Use rq_lock/unlock in online_fair_sched_group cpufreq: schedutil: fix equation in comment sched: Rework pick_next_task() slow-path sched: Allow put_prev_task() to drop rq->lock sched/fair: Expose newidle_balance() sched: Add task_struct pointer to sched_class::set_curr_task sched: Rework CPU hotplug task selection sched/{rt,deadline}: Fix set_next_task vs pick_next_task sched: Fix kerneldoc comment for ia64_set_curr_task ...
2019-09-14KVM: X86: Use IPI shorthands in kvm guest when supportWanpeng Li1-12/+0
IPI shorthand is supported now by linux apic/x2apic driver, switch to IPI shorthand for all excluding self and all including self destination shorthand in kvm guest, to avoid splitting the target mask into several PV IPI hypercalls. This patch removes the kvm_send_ipi_all() and kvm_send_ipi_allbutself() since the callers in APIC codes have already taken care of apic_use_ipi_shorthand and fallback to ->send_IPI_mask and ->send_IPI_mask_allbutself if it is false. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Nadav Amit <namit@vmware.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are availableWanpeng Li1-0/+1
The downside of guest side polling is that polling is performed even with other runnable tasks in the host. However, even if poll in kvm can aware whether or not other runnable tasks in the same pCPU, it can still incur extra overhead in over-subscribe scenario. Now we can just enable guest polling when dedicated pCPUs are available. Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-09-03cpuidle-haltpoll: vcpu hotplug supportJoao Martins1-12/+6
When cpus != maxcpus cpuidle-haltpoll will fail to register all vcpus past the online ones and thus fail to register the idle driver. This is because cpuidle_add_sysfs() will return with -ENODEV as a consequence from get_cpu_device() return no device for a non-existing CPU. Instead switch to cpuidle_register_driver() and manually register each of the present cpus through cpuhp_setup_state() callbacks and future ones that get onlined or offlined. This mimmics similar logic that intel_idle does. Fixes: fa86ee90eb11 ("add cpuidle-haltpoll driver") Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>