aboutsummaryrefslogtreecommitdiffstats
path: root/arch/x86/kernel
AgeCommit message (Collapse)AuthorFilesLines
2022-03-08x86/module: Fix the paravirt vs alternative orderPeter Zijlstra1-5/+8
Ever since commit 4e6292114c74 ("x86/paravirt: Add new features for paravirt patching") there is an ordering dependency between patching paravirt ops and patching alternatives, the module loader still violates this. Fixes: 4e6292114c74 ("x86/paravirt: Add new features for paravirt patching") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220303112825.068773913@infradead.org
2022-03-07Merge tag 'x86_bugs_for_v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-52/+160
Pull x86 spectre fixes from Borislav Petkov: - Mitigate Spectre v2-type Branch History Buffer attacks on machines which support eIBRS, i.e., the hardware-assisted speculation restriction after it has been shown that such machines are vulnerable even with the hardware mitigation. - Do not use the default LFENCE-based Spectre v2 mitigation on AMD as it is insufficient to mitigate such attacks. Instead, switch to retpolines on all AMD by default. - Update the docs and add some warnings for the obviously vulnerable cmdline configurations. * tag 'x86_bugs_for_v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMT x86/speculation: Warn about Spectre v2 LFENCE mitigation x86/speculation: Update link to AMD speculation whitepaper x86/speculation: Use generic retpoline by default on AMD x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting Documentation/hw-vuln: Update spectre doc x86/speculation: Add eIBRS + Retpoline options x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE
2022-03-06Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-1/+6
Pull kvm fixes from Paolo Bonzini: "x86 guest: - Tweaks to the paravirtualization code, to avoid using them when they're pointless or harmful x86 host: - Fix for SRCU lockdep splat - Brown paper bag fix for the propagation of errno" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots() KVM: x86: Yield to IPI target vCPU only if it is busy x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64 x86/kvm: Don't waste memory if kvmclock is disabled x86/kvm: Don't use PV TLB/yield when mwait is advertised
2022-03-05x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMTJosh Poimboeuf1-2/+25
The commit 44a3918c8245 ("x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting") added a warning for the "eIBRS + unprivileged eBPF" combination, which has been shown to be vulnerable against Spectre v2 BHB-based attacks. However, there's no warning about the "eIBRS + LFENCE retpoline + unprivileged eBPF" combo. The LFENCE adds more protection by shortening the speculation window after a mispredicted branch. That makes an attack significantly more difficult, even with unprivileged eBPF. So at least for now the logic doesn't warn about that combination. But if you then add SMT into the mix, the SMT attack angle weakens the effectiveness of the LFENCE considerably. So extend the "eIBRS + unprivileged eBPF" warning to also include the "eIBRS + LFENCE + unprivileged eBPF + SMT" case. [ bp: Massage commit message. ] Suggested-by: Alyssa Milburn <alyssa.milburn@linux.intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-03-05x86/speculation: Warn about Spectre v2 LFENCE mitigationJosh Poimboeuf1-0/+5
With: f8a66d608a3e ("x86,bugs: Unconditionally allow spectre_v2=retpoline,amd") it became possible to enable the LFENCE "retpoline" on Intel. However, Intel doesn't recommend it, as it has some weaknesses compared to retpoline. Now AMD doesn't recommend it either. It can still be left available as a cmdline option. It's faster than retpoline but is weaker in certain scenarios -- particularly SMT, but even non-SMT may be vulnerable in some cases. So just unconditionally warn if the user requests it on the cmdline. [ bp: Massage commit message. ] Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-03-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-4/+10
net/batman-adv/hard-interface.c commit 690bb6fb64f5 ("batman-adv: Request iflink once in batadv-on-batadv check") commit 6ee3c393eeb7 ("batman-adv: Demote batadv-on-batadv skip error message") https://lore.kernel.org/all/20220302163049.101957-1-sw@simonwunderlich.de/ net/smc/af_smc.c commit 4d08b7b57ece ("net/smc: Fix cleanup when register ULP fails") commit 462791bbfa35 ("net/smc: add sysctl interface for SMC") https://lore.kernel.org/all/20220302112209.355def40@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-28x86/speculation: Use generic retpoline by default on AMDKim Phillips1-9/+0
AMD retpoline may be susceptible to speculation. The speculation execution window for an incorrect indirect branch prediction using LFENCE/JMP sequence may potentially be large enough to allow exploitation using Spectre V2. By default, don't use retpoline,lfence on AMD. Instead, use the generic retpoline. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-02-28Merge 5.17-rc6 into driver-core-nextGreg Kroah-Hartman9-43/+22
We need the driver core fix in here as well for future changes. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-28Backmerge tag 'v5.17-rc6' into drm-nextDave Airlie5-20/+17
This backmerges v5.17-rc6 so I can merge some amdgpu and some tegra changes on top. Signed-off-by: Dave Airlie <airlied@redhat.com>
2022-02-25KVM: x86: Yield to IPI target vCPU only if it is busyLi RongQing1-1/+1
When sending a call-function IPI-many to vCPUs, yield to the IPI target vCPU which is marked as preempted. but when emulating HLT, an idling vCPU will be voluntarily scheduled out and mark as preempted from the guest kernel perspective. yielding to idle vCPU is pointless and increase unnecessary vmexit, maybe miss the true preempted vCPU so yield to IPI target vCPU only if vCPU is busy and preempted Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-Id: <1644380201-29423-1-git-send-email-lirongqing@baidu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64Dexuan Cui1-1/+1
When Linux runs as an Isolated VM on Hyper-V, it supports AMD SEV-SNP but it's partially enlightened, i.e. cc_platform_has( CC_ATTR_GUEST_MEM_ENCRYPT) is true but sev_active() is false. Commit 4d96f9109109 per se is good, but with it now kvm_setup_vsyscall_timeinfo() -> kvmclock_init_mem() calls set_memory_decrypted(), and later gets stuck when trying to zere out the pages pointed by 'hvclock_mem', if Linux runs as an Isolated VM on Hyper-V. The cause is that here now the Linux VM should no longer access the original guest physical addrss (GPA); instead the VM should do memremap() and access the original GPA + ms_hyperv.shared_gpa_boundary: see the example code in drivers/hv/connection.c: vmbus_connect() or drivers/hv/ring_buffer.c: hv_ringbuffer_init(). If the VM tries to access the original GPA, it keepts getting injected a fault by Hyper-V and gets stuck there. Here the issue happens only when the VM has >=65 vCPUs, because the global static array hv_clock_boot[] can hold 64 "struct pvclock_vsyscall_time_info" (the sizeof of the struct is 64 bytes), so kvmclock_init_mem() only allocates memory in the case of vCPUs > 64. Since the 'hvclock_mem' pages are only useful when the kvm clock is supported by the underlying hypervisor, fix the issue by returning early when Linux VM runs on Hyper-V, which doesn't support kvm clock. Fixes: 4d96f9109109 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()") Tested-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Signed-off-by: Dexuan Cui <decui@microsoft.com> Message-Id: <20220225084600.17817-1-decui@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25x86/kvm: Don't waste memory if kvmclock is disabledWanpeng Li1-0/+3
Even if "no-kvmclock" is passed in cmdline parameter, the guest kernel still allocates hvclock_mem which is scaled by the number of vCPUs, let's check kvmclock enable in advance to avoid this memory waste. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1645520523-30814-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25x86/kvm: Don't use PV TLB/yield when mwait is advertisedWanpeng Li1-0/+2
MWAIT is advertised in host is not overcommitted scenario, however, PV TLB/sched yield should be enabled in host overcommitted scenario. Let's add the MWAIT checking when enabling PV TLB/sched yield. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1645777780-2581-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25x86: remove __range_not_ok()Arnd Bergmann2-7/+1
The __range_not_ok() helper is an x86 (and sparc64) specific interface that does roughly the same thing as __access_ok(), but with different calling conventions. Change this to use the normal interface in order for consistency as we clean up all access_ok() implementations. This changes the limit from TASK_SIZE to TASK_SIZE_MAX, which Al points out is the right thing do do here anyway. The callers have to use __access_ok() instead of the normal access_ok() though, because on x86 that contains a WARN_ON_IN_IRQ() check that cannot be used inside of NMI context while tracing. The check in copy_code() is not needed any more, because this one is already done by copy_from_user_nmi(). Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/lkml/YgsUKcXGR7r4nINj@zeniv-ca.linux.org.uk/ Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2022-02-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski3-16/+7
tools/testing/selftests/net/mptcp/mptcp_join.sh 34aa6e3bccd8 ("selftests: mptcp: add ip mptcp wrappers") 857898eb4b28 ("selftests: mptcp: add missing join check") 6ef84b1517e0 ("selftests: mptcp: more robust signal race test") https://lore.kernel.org/all/20220221131842.468893-1-broonie@kernel.org/ drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/act.h drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c fb7e76ea3f3b6 ("net/mlx5e: TC, Skip redundant ct clear actions") c63741b426e11 ("net/mlx5e: Fix MPLSoUDP encap to use MPLS action information") 09bf97923224f ("net/mlx5e: TC, Move pedit_headers_action to parse_attr") 84ba8062e383 ("net/mlx5e: Test CT and SAMPLE on flow attr") efe6f961cd2e ("net/mlx5e: CT, Don't set flow flag CT for ct clear flow") 3b49a7edec1d ("net/mlx5e: TC, Reject rules with multiple CT actions") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-24Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-4/+10
Pull kvm fixes from Paolo Bonzini: "x86 host: - Expose KVM_CAP_ENABLE_CAP since it is supported - Disable KVM_HC_CLOCK_PAIRING in TSC catchup mode - Ensure async page fault token is nonzero - Fix lockdep false negative - Fix FPU migration regression from the AMX changes x86 guest: - Don't use PV TLB/IPI/yield on uniprocessor guests PPC: - reserve capability id (topic branch for ppc/kvm)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: nSVM: disallow userspace setting of MSR_AMD64_TSC_RATIO to non default value when tsc scaling disabled KVM: x86/mmu: make apf token non-zero to fix bug KVM: PPC: reserve capability 210 for KVM_CAP_PPC_AIL_MODE_3 x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPU x86/kvm: Fix compilation warning in non-x86_64 builds x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0 x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0 kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode KVM: Fix lockdep false negative during host resume KVM: x86: Add KVM_CAP_ENABLE_CAP to x86
2022-02-23x86/mm/cpa: Generalize __set_memory_enc_pgtable()Brijesh Singh1-2/+14
The kernel provides infrastructure to set or clear the encryption mask from the pages for AMD SEV, but TDX requires few tweaks. - TDX and SEV have different requirements to the cache and TLB flushing. - TDX has own routine to notify VMM about page encryption status change. Modify __set_memory_enc_pgtable() and make it flexible enough to cover both AMD SEV and Intel TDX. The AMD-specific behavior is isolated in the callbacks under x86_platform.guest. TDX will provide own version of said callbacks. [ bp: Beat into submission. ] Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/20220223043528.2093214-1-brijesh.singh@amd.com
2022-02-23x86/coco: Explicitly declare type of confidential computing platformKirill A. Shutemov1-0/+6
The kernel derives the confidential computing platform type it is running as from sme_me_mask on AMD or by using hv_is_isolation_supported() on HyperV isolation VMs. This detection process will be more complicated as more platforms get added. Declare a confidential computing vendor variable explicitly and set it via cc_set_vendor() on the respective platform. [ bp: Massage commit message, fixup HyperV check. ] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20220222185740.26228-4-kirill.shutemov@linux.intel.com
2022-02-23x86/cc: Move arch/x86/{kernel/cc_platform.c => coco/core.c}Kirill A. Shutemov2-90/+0
Move cc_platform.c to arch/x86/coco/. The directory is going to be the home space for code related to confidential computing. Intel TDX code will land here. AMD SEV code will also eventually be moved there. No functional changes. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20220222185740.26228-3-kirill.shutemov@linux.intel.com
2022-02-23kernfs: move struct kernfs_root out of the public view.Greg Kroah-Hartman1-2/+2
There is no need to have struct kernfs_root be part of kernfs.h for the whole kernel to see and poke around it. Move it internal to kernfs code and provide a helper function, kernfs_root_to_node(), to handle the one field that kernfs users were directly accessing from the structure. Cc: Imran Khan <imran.f.khan@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20220222070713.3517679-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-23x86/mce: Remove the tolerance level controlBorislav Petkov3-47/+30
This is pretty much unused and not really useful. What is more, all relevant MCA hardware has recoverable machine checks support so there's no real need to tweak MCA tolerance levels in order to *maybe* extend machine lifetime. So rip it out. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/YcDq8PxvKtTENl/e@zn.tnic
2022-02-21Merge tag 'v5.17-rc5' into sched/core, to resolve conflictsIngo Molnar7-39/+12
New conflicts in sched/core due to the following upstream fixes: 44585f7bc0cb ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n") a06247c6804f ("psi: Fix uaf issue when psi trigger is destroyed while being polled") Conflicts: include/linux/psi_types.h kernel/sched/psi.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2022-02-21x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reportingJosh Poimboeuf1-6/+29
With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable to Spectre v2 BHB-based attacks. When both are enabled, print a warning message and report it in the 'spectre_v2' sysfs vulnerabilities file. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2022-02-21x86/speculation: Add eIBRS + Retpoline optionsPeter Zijlstra1-37/+96
Thanks to the chaps at VUsec it is now clear that eIBRS is not sufficient, therefore allow enabling of retpolines along with eIBRS. Add spectre_v2=eibrs, spectre_v2=eibrs,lfence and spectre_v2=eibrs,retpoline options to explicitly pick your preferred means of mitigation. Since there's new mitigations there's also user visible changes in /sys/devices/system/cpu/vulnerabilities/spectre_v2 to reflect these new mitigations. [ bp: Massage commit message, trim error messages, do more precise eIBRS mode checking. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2022-02-21x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCEPeter Zijlstra (Intel)2-15/+22
The RETPOLINE_AMD name is unfortunate since it isn't necessarily AMD only, in fact Hygon also uses it. Furthermore it will likely be sufficient for some Intel processors. Therefore rename the thing to RETPOLINE_LFENCE to better describe what it is. Add the spectre_v2=retpoline,lfence option as an alias to spectre_v2=retpoline,amd to preserve existing setups. However, the output of /sys/devices/system/cpu/vulnerabilities/spectre_v2 will be changed. [ bp: Fix typos, massage. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2022-02-20Merge tag 'x86_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds3-16/+7
Pull x86 fixes from Borislav Petkov: - Fix the ptrace regset xfpregs_set() callback to behave according to the ABI - Handle poisoned pages properly in the SGX reclaimer code * tag 'x86_urgent_for_v5.17_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ptrace: Fix xfpregs_set()'s incorrect xmm clearing x86/sgx: Fix missing poison handling in reclaimer
2022-02-19x86/mce: Work around an erratum on fast string copy instructionsJue Wang2-1/+68
A rare kernel panic scenario can happen when the following conditions are met due to an erratum on fast string copy instructions: 1) An uncorrected error. 2) That error must be in first cache line of a page. 3) Kernel must execute page_copy from the page immediately before that page. The fast string copy instructions ("REP; MOVS*") could consume an uncorrectable memory error in the cache line _right after_ the desired region to copy and raise an MCE. Bit 0 of MSR_IA32_MISC_ENABLE can be cleared to disable fast string copy and will avoid such spurious machine checks. However, that is less preferable due to the permanent performance impact. Considering memory poison is rare, it's desirable to keep fast string copy enabled until an MCE is seen. Intel has confirmed the following: 1. The CPU erratum of fast string copy only applies to Skylake, Cascade Lake and Cooper Lake generations. Directly return from the MCE handler: 2. Will result in complete execution of the "REP; MOVS*" with no data loss or corruption. 3. Will not result in another MCE firing on the next poisoned cache line due to "REP; MOVS*". 4. Will resume execution from a correct point in code. 5. Will result in the same instruction that triggered the MCE firing a second MCE immediately for any other software recoverable data fetch errors. 6. Is not safe without disabling the fast string copy, as the next fast string copy of the same buffer on the same CPU would result in a PANIC MCE. This should mitigate the erratum completely with the only caveat that the fast string copy is disabled on the affected hyper thread thus performance degradation. This is still better than the OS crashing on MCEs raised on an irrelevant process due to "REP; MOVS*' accesses in a kernel context, e.g., copy_page. Tested: Injected errors on 1st cache line of 8 anonymous pages of process 'proc1' and observed MCE consumption from 'proc2' with no panic (directly returned). Without the fix, the host panicked within a few minutes on a random 'proc2' process due to kernel access from copy_page. [ bp: Fix comment style + touch ups, zap an unlikely(), improve the quirk function's readability. ] Signed-off-by: Jue Wang <juew@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20220218013209.2436006-1-juew@google.com
2022-02-18x86/ptrace: Fix xfpregs_set()'s incorrect xmm clearingAndy Lutomirski2-7/+6
xfpregs_set() handles 32-bit REGSET_XFP and 64-bit REGSET_FP. The actual code treats these regsets as modern FX state (i.e. the beginning part of XSTATE). The declarations of the regsets thought they were the legacy i387 format. The code thought they were the 32-bit (no xmm8..15) variant of XSTATE and, for good measure, made the high bits disappear by zeroing the wrong part of the buffer. The latter broke ptrace, and everything else confused anyone trying to understand the code. In particular, the nonsense definitions of the regsets confused me when I wrote this code. Clean this all up. Change the declarations to match reality (which shouldn't change the generated code, let alone the ABI) and fix xfpregs_set() to clear the correct bits and to only do so for 32-bit callers. Fixes: 6164331d15f7 ("x86/fpu: Rewrite xfpregs_set()") Reported-by: Luís Ferreira <contact@lsferreira.net> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215524 Link: https://lore.kernel.org/r/YgpFnZpF01WwR8wU@zn.tnic
2022-02-18x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPUWanpeng Li1-3/+6
Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1645171838-2855-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-22/+3
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-17x86/sgx: Fix missing poison handling in reclaimerReinette Chatre1-9/+1
The SGX reclaimer code lacks page poison handling in its main free path. This can lead to avoidable machine checks if a poisoned page is freed and reallocated instead of being isolated. A troublesome scenario is: 1. Machine check (#MC) occurs (asynchronous, !MF_ACTION_REQUIRED) 2. arch_memory_failure() is eventually called 3. (SGX) page->poison set to 1 4. Page is reclaimed 5. Page added to normal free lists by sgx_reclaim_pages() ^ This is the bug (poison pages should be isolated on the sgx_poison_page_list instead) 6. Page is reallocated by some innocent enclave, a second (synchronous) in-kernel #MC is induced, probably during EADD instruction. ^ This is the fallout from the bug (6) is unfortunate and can be avoided by replacing the open coded enclave page freeing code in the reclaimer with sgx_free_epc_page() to obtain support for poison page handling that includes placing the poisoned page on the correct list. Fixes: d6d261bded8a ("x86/sgx: Add new sgx_epc_page flag bit to mark free pages") Fixes: 992801ae9243 ("x86/sgx: Initial poison handling for dirty and free pages") Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/dcc95eb2aaefb042527ac50d0a50738c7c160dac.1643830353.git.reinette.chatre@intel.com
2022-02-17x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0Leonardo Bras1-1/+4
During host/guest switch (like in kvm_arch_vcpu_ioctl_run()), the kernel swaps the fpu between host/guest contexts, by using fpu_swap_kvm_fpstate(). When xsave feature is available, the fpu swap is done by: - xsave(s) instruction, with guest's fpstate->xfeatures as mask, is used to store the current state of the fpu registers to a buffer. - xrstor(s) instruction, with (fpu_kernel_cfg.max_features & XFEATURE_MASK_FPSTATE) as mask, is used to put the buffer into fpu regs. For xsave(s) the mask is used to limit what parts of the fpu regs will be copied to the buffer. Likewise on xrstor(s), the mask is used to limit what parts of the fpu regs will be changed. The mask for xsave(s), the guest's fpstate->xfeatures, is defined on kvm_arch_vcpu_create(), which (in summary) sets it to all features supported by the cpu which are enabled on kernel config. This means that xsave(s) will save to guest buffer all the fpu regs contents the cpu has enabled when the guest is paused, even if they are not used. This would not be an issue, if xrstor(s) would also do that. xrstor(s)'s mask for host/guest swap is basically every valid feature contained in kernel config, except XFEATURE_MASK_PKRU. Accordingto kernel src, it is instead switched in switch_to() and flush_thread(). Then, the following happens with a host supporting PKRU starts a guest that does not support it: 1 - Host has XFEATURE_MASK_PKRU set. 1st switch to guest, 2 - xsave(s) fpu regs to host fpustate (buffer has XFEATURE_MASK_PKRU) 3 - xrstor(s) guest fpustate to fpu regs (fpu regs have XFEATURE_MASK_PKRU) 4 - guest runs, then switch back to host, 5 - xsave(s) fpu regs to guest fpstate (buffer now have XFEATURE_MASK_PKRU) 6 - xrstor(s) host fpstate to fpu regs. 7 - kvm_vcpu_ioctl_x86_get_xsave() copy guest fpstate to userspace (with XFEATURE_MASK_PKRU, which should not be supported by guest vcpu) On 5, even though the guest does not support PKRU, it does have the flag set on guest fpstate, which is transferred to userspace via vcpu ioctl KVM_GET_XSAVE. This becomes a problem when the user decides on migrating the above guest to another machine that does not support PKRU: the new host restores guest's fpu regs to as they were before (xrstor(s)), but since the new host don't support PKRU, a general-protection exception ocurs in xrstor(s) and that crashes the guest. This can be solved by making the guest's fpstate->user_xfeatures hold a copy of guest_supported_xcr0. This way, on 7 the only flags copied to userspace will be the ones compatible to guest requirements, and thus there will be no issue during migration. As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2. Also, since kvm_vcpu_after_set_cpuid() now sets fpstate->user_xfeatures, there is not need to set it in kvm_check_cpuid(). So, change fpstate_realloc() so it does not touch fpstate->user_xfeatures if a non-NULL guest_fpu is passed, which is the case when kvm_check_cpuid() calls it. Signed-off-by: Leonardo Bras <leobras@redhat.com> Message-Id: <20220217053028.96432-2-leobras@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-16x86/cpu: Clear SME feature flag when not in useMario Limonciello2-1/+9
Currently, the SME CPU feature flag is reflective of whether the CPU supports the feature but not whether it has been activated by the kernel. Change this around to clear the SME feature flag if the kernel is not using it so userspace can determine if it is available and in use from /proc/cpuinfo. As the feature flag is cleared on systems where SME isn't active, use CPUID 0x8000001f to confirm SME availability before calling native_wbinvd(). Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20220216034446.2430634-1-mario.limonciello@amd.com
2022-02-16sched/isolation: Use single feature type while referring to housekeeping cpumaskFrederic Weisbecker1-3/+3
Refer to housekeeping APIs using single feature types instead of flags. This prevents from passing multiple isolation features at once to housekeeping interfaces, which soon won't be possible anymore as each isolation features will have their own cpumask. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220207155910.527133-5-frederic@kernel.org
2022-02-15x86/traps: Demand-populate PASID MSR via #GPFenghua Yu1-0/+55
All tasks start with PASID state disabled. This means that the first time they execute an ENQCMD instruction they will take a #GP fault. Modify the #GP fault handler to check if the "mm" for the task has already been allocated a PASID. If so, try to fix the #GP fault by loading the IA32_PASID MSR. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220207230254.3342514-9-fenghua.yu@intel.com
2022-02-15x86/fpu: Clear PASID when copying fpstateFenghua Yu1-0/+7
The kernel must allocate a Process Address Space ID (PASID) on behalf of each process which will use ENQCMD and program it into the new MSR to communicate the process identity to platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests from this process. The PASID state must be cleared on fork() since fork creates a new address space. For clone(), it would be functionally OK to copy the PASID. However, clearing it is _also_ functionally OK since any PASID use will trigger the #GP handler to populate the MSR. Copying the PASID state has two main downsides: * It requires differentiating fork() and clone() in the code, both in the FPU code and keeping tsk->pasid_activated consistent. * It guarantees that the PASID is out of its init state, which incurs small but non-zero cost on every XSAVE/XRSTOR. The main downside of clearing the PASID at fpstate copy is the future, one-time #GP for the thread. Use the simplest approach: clear the PASID state both on clone() and fork(). Rely on the #GP handler for MSR population in children. Also, just clear the PASID bit from xfeatures if XSAVE is supported. This will have no effect on systems that do not have PASID support. It is virtually zero overhead because 'dst_fpu' was just written and the whole thing is cache hot. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20220207230254.3342514-7-fenghua.yu@intel.com
2022-02-14Backmerge tag 'v5.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-nextDave Airlie2-22/+3
Daniel asked for this for some intel deps, so let's do it now. Signed-off-by: Dave Airlie <airlied@redhat.com>
2022-02-13x86/mce: Use arch atomic and bit helpersBorislav Petkov3-42/+41
The arch helpers do not have explicit KASAN instrumentation. Use them in noinstr code. Inline a couple more functions with single call sites, while at it: mce_severity_amd_smca() has a single call-site which is noinstr so force the inlining and fix: vmlinux.o: warning: objtool: mce_severity_amd.constprop.0()+0xca: call to \ mce_severity_amd_smca() leaves .noinstr.text section Always inline mca_msr_reg(): text data bss dec hex filename 16065240 128031326 36405368 180501934 ac23dae vmlinux.before 16065240 128031294 36405368 180501902 ac23d8e vmlinux.after and mce_no_way_out() as the latter one is used only once, to fix: vmlinux.o: warning: objtool: mce_read_aux()+0x53: call to mca_msr_reg() leaves .noinstr.text section vmlinux.o: warning: objtool: do_machine_check()+0xc9: call to mce_no_way_out() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/r/20220204083015.17317-4-bp@alien8.de
2022-02-13Merge tag 'x86_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+2
Pull x86 fix from Borislav Petkov: "Prevent softlockups when tearing down large SGX enclaves" * tag 'x86_urgent_for_v5.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sgx: Silence softlockup detection when releasing large enclaves
2022-02-12x86/head64: Add missing __head annotation to sme_postprocess_startup()Marco Bonelli1-1/+1
This function was previously part of __startup_64() which is marked __head, and is currently only called from there. Mark it __head too. Signed-off-by: Marco Bonelli <marco@mebeim.net> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220211162350.11780-1-marco@mebeim.net
2022-02-11Merge branch 'acpi-x86'Rafael J. Wysocki1-22/+1
Merge a revert of a problematic commit for 5.17-rc4. * acpi-x86: x86/PCI: revert "Ignore E820 reservations for bridge windows on newer systems"
2022-02-10x86/sgx: Silence softlockup detection when releasing large enclavesReinette Chatre1-0/+2
Vijay reported that the "unclobbered_vdso_oversubscribed" selftest triggers the softlockup detector. Actual SGX systems have 128GB of enclave memory or more. The "unclobbered_vdso_oversubscribed" selftest creates one enclave which consumes all of the enclave memory on the system. Tearing down such a large enclave takes around a minute, most of it in the loop where the EREMOVE instruction is applied to each individual 4k enclave page. Spending one minute in a loop triggers the softlockup detector. Add a cond_resched() to give other tasks a chance to run and placate the softlockup detector. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> (kselftest as sanity check) Link: https://lkml.kernel.org/r/ced01cac1e75f900251b0a4ae1150aa8ebd295ec.1644345232.git.reinette.chatre@intel.com
2022-02-09Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski1-0/+34
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-02-09 We've added 126 non-merge commits during the last 16 day(s) which contain a total of 201 files changed, 4049 insertions(+), 2215 deletions(-). The main changes are: 1) Add custom BPF allocator for JITs that pack multiple programs into a huge page to reduce iTLB pressure, from Song Liu. 2) Add __user tagging support in vmlinux BTF and utilize it from BPF verifier when generating loads, from Yonghong Song. 3) Add per-socket fast path check guarding from cgroup/BPF overhead when used by only some sockets, from Pavel Begunkov. 4) Continued libbpf deprecation work of APIs/features and removal of their usage from samples, selftests, libbpf & bpftool, from Andrii Nakryiko and various others. 5) Improve BPF instruction set documentation by adding byte swap instructions and cleaning up load/store section, from Christoph Hellwig. 6) Switch BPF preload infra to light skeleton and remove libbpf dependency from it, from Alexei Starovoitov. 7) Fix architecture-agnostic macros in libbpf for accessing syscall arguments from BPF progs for non-x86 architectures, from Ilya Leoshkevich. 8) Rework port members in struct bpf_sk_lookup and struct bpf_sock to be of 16-bit field with anonymous zero padding, from Jakub Sitnicki. 9) Add new bpf_copy_from_user_task() helper to read memory from a different task than current. Add ability to create sleepable BPF iterator progs, from Kenny Yu. 10) Implement XSK batching for ice's zero-copy driver used by AF_XDP and utilize TX batching API from XSK buffer pool, from Maciej Fijalkowski. 11) Generate temporary netns names for BPF selftests to avoid naming collisions, from Hangbin Liu. 12) Implement bpf_core_types_are_compat() with limited recursion for in-kernel usage, from Matteo Croce. 13) Simplify pahole version detection and finally enable CONFIG_DEBUG_INFO_DWARF5 to be selected with CONFIG_DEBUG_INFO_BTF, from Nathan Chancellor. 14) Misc minor fixes to libbpf and selftests from various folks. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (126 commits) selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide libbpf: Fix compilation warning due to mismatched printf format selftests/bpf: Test BPF_KPROBE_SYSCALL macro libbpf: Add BPF_KPROBE_SYSCALL macro libbpf: Fix accessing the first syscall argument on s390 libbpf: Fix accessing the first syscall argument on arm64 libbpf: Allow overriding PT_REGS_PARM1{_CORE}_SYSCALL selftests/bpf: Skip test_bpf_syscall_macro's syscall_arg1 on arm64 and s390 libbpf: Fix accessing syscall arguments on riscv libbpf: Fix riscv register names libbpf: Fix accessing syscall arguments on powerpc selftests/bpf: Use PT_REGS_SYSCALL_REGS in bpf_syscall_macro libbpf: Add PT_REGS_SYSCALL_REGS macro selftests/bpf: Fix an endianness issue in bpf_syscall_macro test bpf: Fix bpf_prog_pack build HPAGE_PMD_SIZE bpf: Fix leftover header->pages in sparc and powerpc code. libbpf: Fix signedness bug in btf_dump_array_data() selftests/bpf: Do not export subtest as standalone test bpf, x86_64: Fail gracefully on bpf_jit_binary_pack_finalize failures ... ==================== Link: https://lore.kernel.org/r/20220209210050.8425-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09x86/PCI: revert "Ignore E820 reservations for bridge windows on newer systems"Hans de Goede1-22/+1
Commit 7f7b4236f204 ("x86/PCI: Ignore E820 reservations for bridge windows on newer systems") fixes the touchpad not working on laptops like the Lenovo IdeaPad 3 15IIL05 and the Lenovo IdeaPad 5 14IIL05, as well as fixing thunderbolt hotplug issues on the Lenovo Yoga C940. Unfortunately it turns out that this is causing issues with suspend/resume on Lenovo ThinkPad X1 Carbon Gen 2 laptops. So, per the no regressions policy, rever this. Note I'm looking into another fix for the issues this fixed. Fixes: 7f7b4236f204 ("x86/PCI: Ignore E820 reservations for bridge windows on newer systems") BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=2029207 Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-02-07x86/alternative: Introduce text_poke_copySong Liu1-0/+34
This will be used by BPF jit compiler to dump JITed binary to a RX huge page, and thus allow multiple BPF programs sharing the a huge (2MB) page. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/bpf/20220204185742.271030-6-song@kernel.org
2022-02-01x86/cpu: Read/save PPIN MSR during initializationTony Luck2-6/+5
Currently, the PPIN (Protected Processor Inventory Number) MSR is read by every CPU that processes a machine check, CMCI, or just polls machine check banks from a periodic timer. This is not a "fast" MSR, so this adds to overhead of processing errors. Add a new "ppin" field to the cpuinfo_x86 structure. Read and save the PPIN during initialization. Use this copy in mce_setup() instead of reading the MSR. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220131230111.2004669-4-tony.luck@intel.com
2022-02-01x86/cpu: X86_FEATURE_INTEL_PPIN finally has a CPUID bitTony Luck2-0/+2
After nine generations of adding to model specific list of CPUs that support PPIN (Protected Processor Inventory Number) Intel allocated a CPUID bit to enumerate the MSRs. CPUID(EAX=7, ECX=1).EBX bit 0 enumerates presence of MSR_PPIN_CTL and MSR_PPIN. Add it to the "scattered" CPUID bits and add an entry to the ppin_cpuids[] x86_match_cpu() array to catch Intel CPUs that implement it. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220131230111.2004669-3-tony.luck@intel.com
2022-02-01x86/cpu: Merge Intel and AMD ppin_init() functionsTony Luck3-72/+74
The code to decide whether a system supports the PPIN (Protected Processor Inventory Number) MSR was cloned from the Intel implementation. Apart from the X86_FEATURE bit and the MSR numbers it is identical. Merge the two functions into common x86 code, but use x86_match_cpu() instead of the switch (c->x86_model) that was used by the old Intel code. No functional change. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220131230111.2004669-2-tony.luck@intel.com
2022-02-01x86/CPU/AMD: Use default_groups in kobj_typeGreg Kroah-Hartman1-3/+4
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the AMD mce sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that the obsolete default_attrs field can be removed soon. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20220106103537.3663852-1-gregkh@linuxfoundation.org
2022-01-31Merge drm/drm-next into drm-intel-nextRodrigo Vivi60-862/+1116
Catch-up with 5.17-rc2 and trying to align with drm-intel-gt-next for a possible topic branch for merging the split of i915_regs... Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>