linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2022-06-08	KVM: x86: do not set st->preempted when going back to user space	Paolo Bonzini	2	-14/+18
	Similar to the Xen path, only change the vCPU's reported state if the vCPU was actually preempted. The reason for KVM's behavior is that for example optimistic spinning might not be a good idea if the guest is doing repeated exits to userspace; however, it is confusing and unlikely to make a difference, because well-tuned guests will hardly ever exit KVM_RUN in the first place. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07	KVM: SVM: fix tsc scaling cache logic	Maxim Levitsky	3	-15/+23
	SVM uses a per-cpu variable to cache the current value of the tsc scaling multiplier msr on each cpu. Commit 1ab9287add5e2 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier") broke this caching logic. Refactor the code so that all TSC scaling multiplier writes go through a single function which checks and updates the cache. This fixes the following scenario: 1. A CPU runs a guest with some tsc scaling ratio. 2. New guest with different tsc scaling ratio starts on this CPU and terminates almost immediately. This ensures that the short running guest had set the tsc scaling ratio just once when it was set via KVM_SET_TSC_KHZ. Due to the bug, the per-cpu cache is not updated. 3. The original guest continues to run, it doesn't restore the msr value back to its own value, because the cache matches, and thus continues to run with a wrong tsc scaling ratio. Fixes: 1ab9287add5e2 ("KVM: X86: Add vendor callbacks for writing the TSC multiplier") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07	KVM: selftests: Make hyperv_clock selftest more stable	Vitaly Kuznetsov	1	-3/+7
	hyperv_clock doesn't always give a stable test result, especially with AMD CPUs. The test compares Hyper-V MSR clocksource (acquired either with rdmsr() from within the guest or KVM_GET_MSRS from the host) against rdtsc(). To increase the accuracy, increase the measured delay (done with nop loop) by two orders of magnitude and take the mean rdtsc() value before and after rdmsr()/KVM_GET_MSRS. Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220601144322.1968742-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07	KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging	Ben Gardon	3	-6/+42
	Currently disabling dirty logging with the TDP MMU is extremely slow. On a 96 vCPU / 96G VM backed with gigabyte pages, it takes ~200 seconds to disable dirty logging with the TDP MMU, as opposed to ~4 seconds with the shadow MMU. When disabling dirty logging, zap non-leaf parent entries to allow replacement with huge pages instead of recursing and zapping all of the child, leaf entries. This reduces the number of TLB flushes required. and reduces the disable dirty log time with the TDP MMU to ~3 seconds. Opportunistically add a WARN() to catch GFNs that are mapped at a higher level than their max level. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20220525230904.1584480-1-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07	x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm()	Jan Beulich	1	-1/+1
	As noted (and fixed) a couple of times in the past, "=@cc<cond>" outputs and clobbering of "cc" don't work well together. The compiler appears to mean to reject such, but doesn't - in its upstream form - quite manage to yet for "cc". Furthermore two similar macros don't clobber "cc", and clobbering "cc" is pointless in asm()-s for x86 anyway - the compiler always assumes status flags to be clobbered there. Fixes: 989b5db215a2 ("x86/uaccess: Implement macros for CMPXCHG on user addresses") Signed-off-by: Jan Beulich <jbeulich@suse.com> Message-Id: <485c0c0b-a3a7-0b7c-5264-7d00c01de032@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07	KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()	Shaoqin Huang	1	-1/+1
	When freeing obsolete previous roots, check prev_roots as intended, not the current root. Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com> Fixes: 527d5cd7eece ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped") Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07	entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set	Seth Forshee	1	-6/+0
	A livepatch transition may stall indefinitely when a kvm vCPU is heavily loaded. To the host, the vCPU task is a user thread which is spending a very long time in the ioctl(KVM_RUN) syscall. During livepatch transition, set_notify_signal() will be called on such tasks to interrupt the syscall so that the task can be transitioned. This interrupts guest execution, but when xfer_to_guest_mode_work() sees that TIF_NOTIFY_SIGNAL is set but not TIF_SIGPENDING it concludes that an exit to user mode is unnecessary, and guest execution is resumed without transitioning the task for the livepatch. This handling of TIF_NOTIFY_SIGNAL is incorrect, as set_notify_signal() is expected to break tasks out of interruptible kernel loops and cause them to return to userspace. Change xfer_to_guest_mode_work() to handle TIF_NOTIFY_SIGNAL the same as TIF_SIGPENDING, signaling to the vCPU run loop that an exit to userpsace is needed. Any pending task_work will be run when get_signal() is called from exit_to_user_mode_loop(), so there is no longer any need to run task work from xfer_to_guest_mode_work(). Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Seth Forshee <sforshee@digitalocean.com> Message-Id: <20220504180840.2907296-1-sforshee@digitalocean.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-07	KVM: Don't null dereference ops->destroy	Alexey Kardashevskiy	1	-1/+4
	A KVM device cleanup happens in either of two callbacks: 1) destroy() which is called when the VM is being destroyed; 2) release() which is called when a device fd is closed. Most KVM devices use 1) but Book3s's interrupt controller KVM devices (XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during the machine execution. The error handling in kvm_ioctl_create_device() assumes destroy() is always defined which leads to NULL dereference as discovered by Syzkaller. This adds a checks for destroy!=NULL and adds a missing release(). This is not changing kvm_destroy_devices() as devices with defined release() should have been removed from the KVM devices list by then. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest	Yanfei Xu	1	-1/+1
	When kernel handles the vm-exit caused by external interrupts and NMI, it always sets kvm_intr_type to tell if it's dealing an IRQ or NMI. For the PMI scenario, it could be IRQ or NMI. However, intel_pt PMIs are only generated for HARDWARE perf events, and HARDWARE events are always configured to generate NMIs. Use kvm_handling_nmi_from_guest() to precisely identify if the intel_pt PMI came from the guest; this avoids false positives if an intel_pt PMI/NMI arrives while the host is handling an unrelated IRQ VM-Exit. Fixes: db215756ae59 ("KVM: x86: More precisely identify NMI from guest when handling PMI") Signed-off-by: Yanfei Xu <yanfei.xu@intel.com> Message-Id: <20220523140821.1345605-1-yanfei.xu@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	KVM: selftests: x86: Sync the new name of the test case to .gitignore	Like Xu	1	-1/+1
	Fixing side effect of the so-called opportunistic change in the commit. Fixes: dc8a9febbab0 ("KVM: selftests: x86: Fix test failure on arch lbr capable platforms") Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20220518170118.66263-2-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND	Paolo Bonzini	1	-26/+26
	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	x86, kvm: use correct GFP flags for preemption disabled	Paolo Bonzini	1	-1/+1
	Commit ddd7ed842627 ("x86/kvm: Alloc dummy async #PF token outside of raw spinlock") leads to the following Smatch static checker warning: arch/x86/kernel/kvm.c:212 kvm_async_pf_task_wake() warn: sleeping in atomic context arch/x86/kernel/kvm.c 202 raw_spin_lock(&b->lock); 203 n = _find_apf_task(b, token); 204 if (!n) { 205 /* 206 * Async #PF not yet handled, add a dummy entry for the token. 207 * Allocating the token must be down outside of the raw lock 208 * as the allocator is preemptible on PREEMPT_RT kernels. 209 / 210 if (!dummy) { 211 raw_spin_unlock(&b->lock); --> 212 dummy = kzalloc(sizeof(dummy), GFP_KERNEL); ^^^^^^^^^^ Smatch thinks the caller has preempt disabled. The `smdb.py preempt kvm_async_pf_task_wake` output call tree is: sysvec_kvm_asyncpf_interrupt() <- disables preempt -> __sysvec_kvm_asyncpf_interrupt() -> kvm_async_pf_task_wake() The caller is this: arch/x86/kernel/kvm.c 290 DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt) 291 { 292 struct pt_regs *old_regs = set_irq_regs(regs); 293 u32 token; 294 295 ack_APIC_irq(); 296 297 inc_irq_stat(irq_hv_callback_count); 298 299 if (__this_cpu_read(apf_reason.enabled)) { 300 token = __this_cpu_read(apf_reason.token); 301 kvm_async_pf_task_wake(token); 302 __this_cpu_write(apf_reason.token, 0); 303 wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1); 304 } 305 306 set_irq_regs(old_regs); 307 } The DEFINE_IDTENTRY_SYSVEC() is a wrapper that calls this function from the call_on_irqstack_cond(). It's inside the call_on_irqstack_cond() where preempt is disabled (unless it's already disabled). The irq_enter/exit_rcu() functions disable/enable preempt. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer	Wanpeng Li	1	-0/+1
	The timer is disarmed when switching between TSC deadline and other modes; however, the pending timer is still in-flight, so let's accurately remove any traces of the previous mode. Fixes: 4427593258 ("KVM: x86: thoroughly disarm LAPIC timer around TSC deadline switch") Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	x86/kvm: Alloc dummy async #PF token outside of raw spinlock	Sean Christopherson	1	-14/+27
	Drop the raw spinlock in kvm_async_pf_task_wake() before allocating the the dummy async #PF token, the allocator is preemptible on PREEMPT_RT kernels and must not be called from truly atomic contexts. Opportunistically document why it's ok to loop on allocation failure, i.e. why the function won't get stuck in an infinite loop. Reported-by: Yajun Deng <yajun.deng@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	KVM: x86: avoid calling x86 emulator without a decoded instruction	Sean Christopherson	1	-12/+19
	Whenever x86_decode_emulated_instruction() detects a breakpoint, it returns the value that kvm_vcpu_check_breakpoint() writes into its pass-by-reference second argument. Unfortunately this is completely bogus because the expected outcome of x86_decode_emulated_instruction is an EMULATION_* value. Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK and x86_emulate_instruction() is called without having decoded the instruction. This causes various havoc from running with a stale emulation context. The fix is to move the call to kvm_vcpu_check_breakpoint() where it was before commit 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction emulation with decoding") introduced x86_decode_emulated_instruction(). The other caller of the function does not need breakpoint checks, because it is invoked as part of a vmexit and the processor has already checked those before executing the instruction that #GP'd. This fixes CVE-2022-1852. Reported-by: Qiuhao Li <qiuhao@sysec.org> Reported-by: Gaoning Pan <pgn@zju.edu.cn> Reported-by: Yongkang Jia <kangel@zju.edu.cn> Fixes: 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction emulation with decoding") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311032801.3467418-2-seanjc@google.com> [Rewrote commit message according to Qiuhao's report, since a patch already existed to fix the bug. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak	Ashish Kalra	1	-6/+6
	For some sev ioctl interfaces, the length parameter that is passed maybe less than or equal to SEV_FW_BLOB_MAX_SIZE, but larger than the data that PSP firmware returns. In this case, kmalloc will allocate memory that is the size of the input rather than the size of the data. Since PSP firmware doesn't fully overwrite the allocated buffer, these sev ioctl interface may return uninitialized kernel slab memory. Reported-by: Andy Nguyen <theflow@google.com> Suggested-by: David Rientjes <rientjes@google.com> Suggested-by: Peter Gonda <pgonda@google.com> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: eaf78265a4ab3 ("KVM: SVM: Move SEV code to separate file") Fixes: 2c07ded06427d ("KVM: SVM: add support for SEV attestation command") Fixes: 4cfdd47d6d95a ("KVM: SVM: Add KVM_SEV SEND_START command") Fixes: d3d1af85e2c75 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command") Fixes: eba04b20e4861 ("KVM: x86: Account a variety of miscellaneous allocations") Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Reviewed-by: Peter Gonda <pgonda@google.com> Message-Id: <20220516154310.3685678-1-Ashish.Kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave)	Sean Christopherson	1	-1/+16
	Set the starting uABI size of KVM's guest FPU to 'struct kvm_xsave', i.e. to KVM's historical uABI size. When saving FPU state for usersapce, KVM (well, now the FPU) sets the FP+SSE bits in the XSAVE header even if the host doesn't support XSAVE. Setting the XSAVE header allows the VM to be migrated to a host that does support XSAVE without the new host having to handle FPU state that may or may not be compatible with XSAVE. Setting the uABI size to the host's default size results in out-of-bounds writes (setting the FP+SSE bits) and data corruption (that is thankfully caught by KASAN) when running on hosts without XSAVE, e.g. on Core2 CPUs. WARN if the default size is larger than KVM's historical uABI size; all features that can push the FPU size beyond the historical size must be opt-in. ================================================================== BUG: KASAN: slab-out-of-bounds in fpu_copy_uabi_to_guest_fpstate+0x86/0x130 Read of size 8 at addr ffff888011e33a00 by task qemu-build/681 CPU: 1 PID: 681 Comm: qemu-build Not tainted 5.18.0-rc5-KASAN-amd64 #1 Hardware name: /DG35EC, BIOS ECG3510M.86A.0118.2010.0113.1426 01/13/2010 Call Trace: <TASK> dump_stack_lvl+0x34/0x45 print_report.cold+0x45/0x575 kasan_report+0x9b/0xd0 fpu_copy_uabi_to_guest_fpstate+0x86/0x130 kvm_arch_vcpu_ioctl+0x72a/0x1c50 [kvm] kvm_vcpu_ioctl+0x47f/0x7b0 [kvm] __x64_sys_ioctl+0x5de/0xc90 do_syscall_64+0x31/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Allocated by task 0: (stack is not available) The buggy address belongs to the object at ffff888011e33800 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 0 bytes to the right of 512-byte region [ffff888011e33800, ffff888011e33a00) The buggy address belongs to the physical page: page:0000000089cd4adb refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11e30 head:0000000089cd4adb order:2 compound_mapcount:0 compound_pincount:0 flags: 0x4000000000010200(slab\|head\|zone=1) raw: 4000000000010200 dead000000000100 dead000000000122 ffff888001041c80 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888011e33900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff888011e33980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff888011e33a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffff888011e33a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888011e33b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Disabling lock debugging due to kernel taint Fixes: be50b2065dfa ("kvm: x86: Add support for getting/setting expanded xstate buffer") Fixes: c60427dd50ba ("x86/fpu: Add uabi_size to guest_fpu") Reported-by: Zdenek Kaspar <zkaspar82@gmail.com> Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Zdenek Kaspar <zkaspar82@gmail.com> Message-Id: <20220504001219.983513-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	s390/uv_uapi: depend on CONFIG_S390	Paolo Bonzini	1	-0/+1
	Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	KVM: selftests: x86: Fix test failure on arch lbr capable platforms	Yang Weijiang	2	-9/+11
	On Arch LBR capable platforms, LBR_FMT in perf capability msr is 0x3f, so the last format test will fail. Use a true invalid format(0x30) for the test if it's running on these platforms. Opportunistically change the file name to reflect the tests actually carried out. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Message-Id: <20220512084046.105479-1-weijiang.yang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25	KVM: LAPIC: Trace LAPIC timer expiration on every vmentry	Wanpeng Li	3	-11/+2
	In commit ec0671d5684a ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire was moved after guest_exit_irqoff() because invoking tracepoints within kvm_guest_enter/kvm_guest_exit caused a lockdep splat. These days this is not necessary, because commit 87fa7f3e98a1 ("x86/kvm: Move context tracking where it belongs", 2020-07-09) restricted the RCU extended quiescent state to be closer to vmentry/vmexit. Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate, because it will be reported even if vcpu_enter_guest causes multiple vmentries via the IPI/Timer fast paths, and it allows the removal of advance_expire_delta. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-20	KVM: s390: selftest: Test suppression indication on key prot exception	Janis Schoetterl-Glausch	1	-1/+45
	Check that suppression is not indicated on injection of a key checked protection exception caused by a memop after it already modified guest memory, as that violates the definition of suppression. Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20220512131019.2594948-3-scgl@linux.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-05-20	KVM: s390: Don't indicate suppression on dirtying, failing memop	Janis Schoetterl-Glausch	2	-4/+24
	If user space uses a memop to emulate an instruction and that memop fails, the execution of the instruction ends. Instruction execution can end in different ways, one of which is suppression, which requires that the instruction execute like a no-op. A writing memop that spans multiple pages and fails due to key protection may have modified guest memory, as a result, the likely correct ending is termination. Therefore, do not indicate a suppressing instruction ending in this case. Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Link: https://lore.kernel.org/r/20220512131019.2594948-2-scgl@linux.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-05-20	selftests: drivers/s390x: Add uvdevice tests	Steffen Eiden	6	-0/+302
	Adds some selftests to test ioctl error paths of the uv-uapi. The Kconfig S390_UV_UAPI must be selected and the Ultravisor facility must be available. The test can be executed by non-root, however, the uvdevice special file /dev/uv must be accessible for reading and writing which may imply root privileges. ./test-uv-device TAP version 13 1..6 # Starting 6 tests from 3 test cases. # RUN uvio_fixture.att.fault_ioctl_arg ... # OK uvio_fixture.att.fault_ioctl_arg ok 1 uvio_fixture.att.fault_ioctl_arg # RUN uvio_fixture.att.fault_uvio_arg ... # OK uvio_fixture.att.fault_uvio_arg ok 2 uvio_fixture.att.fault_uvio_arg # RUN uvio_fixture.att.inval_ioctl_cb ... # OK uvio_fixture.att.inval_ioctl_cb ok 3 uvio_fixture.att.inval_ioctl_cb # RUN uvio_fixture.att.inval_ioctl_cmd ... # OK uvio_fixture.att.inval_ioctl_cmd ok 4 uvio_fixture.att.inval_ioctl_cmd # RUN attest_fixture.att_inval_request ... # OK attest_fixture.att_inval_request ok 5 attest_fixture.att_inval_request # RUN attest_fixture.att_inval_addr ... # OK attest_fixture.att_inval_addr ok 6 attest_fixture.att_inval_addr # PASSED: 6 / 6 tests passed. # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Message-Id: <20220510144724.3321985-3-seiden@linux.ibm.com> Link: https://lore.kernel.org/kvm/20220510144724.3321985-3-seiden@linux.ibm.com/ Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
2022-05-20	drivers/s390/char: Add Ultravisor io device	Steffen Eiden	6	-1/+343
	This patch adds a new miscdevice to expose some Ultravisor functions to userspace. Userspace can send IOCTLs to the uvdevice that will then emit a corresponding Ultravisor Call and hands the result over to userspace. The uvdevice is available if the Ultravisor Call facility is present. Userspace can call the Retrieve Attestation Measurement Ultravisor Call using IOCTLs on the uvdevice. The uvdevice will do some sanity checks first. Then, copy the request data to kernel space, build the UVCB, perform the UV call, and copy the result back to userspace. Signed-off-by: Steffen Eiden <seiden@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Link: https://lore.kernel.org/kvm/20220516113335.338212-1-seiden@linux.ibm.com/ Message-Id: <20220516113335.338212-1-seiden@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com> (whitespace and tristate fixes, pick)
2022-05-20	MAINTAINERS: Update KVM RISC-V entry to cover selftests support	Anup Patel	1	-0/+2
	We update KVM RISC-V maintainers entry to include appropriate KVM selftests directories so that RISC-V related KVM selftests patches are CC'ed to KVM RISC-V mailing list. Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	RISC-V: KVM: Introduce ISA extension register	Atish Patra	2	-0/+119
	Currently, there is no provision for vmm (qemu-kvm or kvmtool) to query about multiple-letter ISA extensions. The config register is only used for base single letter ISA extensions. A new ISA extension register is added that will allow the vmm to query about any ISA extension one at a time. It is enabled for both single letter or multi-letter ISA extensions. The ISA extension register is useful to if the vmm requires to retrieve/set single extension while the config register should be used if all the base ISA extension required to retrieve or set. For any multi-letter ISA extensions, the new register interface must be used. Signed-off-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	RISC-V: KVM: Cleanup stale TLB entries when host CPU changes	Anup Patel	3	-0/+39
	On RISC-V platforms with hardware VMID support, we share same VMID for all VCPUs of a particular Guest/VM. This means we might have stale G-stage TLB entries on the current Host CPU due to some other VCPU of the same Guest which ran previously on the current Host CPU. To cleanup stale TLB entries, we simply flush all G-stage TLB entries by VMID whenever underlying Host CPU changes for a VCPU. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	RISC-V: KVM: Add remote HFENCE functions based on VCPU requests	Anup Patel	7	-53/+369
	The generic KVM has support for VCPU requests which can be used to do arch-specific work in the run-loop. We introduce remote HFENCE functions which will internally use VCPU requests instead of host SBI calls. Advantages of doing remote HFENCEs as VCPU requests are: 1) Multiple VCPUs of a Guest may be running on different Host CPUs so it is not always possible to determine the Host CPU mask for doing Host SBI call. For example, when VCPU X wants to do HFENCE on VCPU Y, it is possible that VCPU Y is blocked or in user-space (i.e. vcpu->cpu < 0). 2) To support nested virtualization, we will be having a separate shadow G-stage for each VCPU and a common host G-stage for the entire Guest/VM. The VCPU requests based remote HFENCEs helps us easily synchronize the common host G-stage and shadow G-stage of each VCPU without any additional IPI calls. This is also a preparatory patch for upcoming nested virtualization support where we will be having a shadow G-stage page table for each Guest VCPU. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	RISC-V: KVM: Reduce KVM_MAX_VCPUS value	Anup Patel	1	-2/+1
	Currently, the KVM_MAX_VCPUS value is 16384 for RV64 and 128 for RV32. The KVM_MAX_VCPUS value is too high for RV64 and too low for RV32 compared to other architectures (e.g. x86 sets it to 1024 and ARM64 sets it to 512). The too high value of KVM_MAX_VCPUS on RV64 also leads to VCPU mask on stack consuming 2KB. We set KVM_MAX_VCPUS to 1024 for both RV64 and RV32 to be aligned other architectures. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	RISC-V: KVM: Introduce range based local HFENCE functions	Anup Patel	6	-83/+237
	Various __kvm_riscv_hfence_xyz() functions implemented in the kvm/tlb.S are equivalent to corresponding HFENCE.GVMA instructions and we don't have range based local HFENCE functions. This patch provides complete set of local HFENCE functions which supports range based TLB invalidation and supports HFENCE.VVMA based functions. This is also a preparatory patch for upcoming Svinval support in KVM RISC-V. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	RISC-V: KVM: Treat SBI HFENCE calls as NOPs	Anup Patel	1	-1/+5
	We should treat SBI HFENCE calls as NOPs until nested virtualization is supported by KVM RISC-V. This will help us test booting a hypervisor under KVM RISC-V. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	RISC-V: KVM: Add Sv57x4 mode support for G-stage	Anup Patel	3	-1/+14
	Latest QEMU supports G-stage Sv57x4 mode so this patch extends KVM RISC-V G-stage handling to detect and use Sv57x4 mode when available. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	RISC-V: KVM: Use G-stage name for hypervisor page table	Anup Patel	7	-151/+151
	The two-stage address translation defined by the RISC-V privileged specification defines: VS-stage (guest virtual address to guest physical address) programmed by the Guest OS and G-stage (guest physical addree to host physical address) programmed by the hypervisor. To align with above terminology, we replace "stage2" with "gstage" and "Stage2" with "G-stage" name everywhere in KVM RISC-V sources. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	KVM: selftests: riscv: Remove unneeded semicolon	Jiapeng Chong	1	-1/+1
	Fix the following coccicheck warnings: ./tools/testing/selftests/kvm/lib/riscv/processor.c:353:3-4: Unneeded semicolon. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-20	KVM: selftests: riscv: Improve unexpected guest trap handling	Anup Patel	3	-17/+31
	Currently, we simply hang using "while (1) ;" upon any unexpected guest traps because the default guest trap handler is guest_hang(). The above approach is not useful to anyone because KVM selftests users will only see a hung application upon any unexpected guest trap. This patch improves unexpected guest trap handling for KVM RISC-V selftests by doing the following: 1) Return to host user-space 2) Dump VCPU registers 3) Die using TEST_ASSERT(0, ...) Signed-off-by: Anup Patel <apatel@ventanamicro.com> Tested-by: Mayuresh Chitale <mchitale@ventanamicro.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-05-16	KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run	Marc Zyngier	1	-1/+2
	We generally want to disallow hypercall bitmaps being changed once vcpus have already run. But we must allow the write if the written value is unchanged so that userspace can rewrite the register file on reboot, for example. Without this, a QEMU-based VM will fail to reboot correctly. The original code was correct, and it is me that introduced the regression. Fixes: 05714cab7d63 ("KVM: arm64: Setup a framework for hypercall bitmap firmware registers") Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-05-16	KVM: arm64: vgic: Undo work in failed ITS restores	Ricardo Koller	1	-2/+13
	Failed ITS restores should clean up all state restored until the failure. There is some cleanup already present when failing to restore some tables, but it's not complete. Add the missing cleanup. Note that this changes the behavior in case of a failed restore of the device tables. restore ioctl: 1. restore collection tables 2. restore device tables With this commit, failures in 2. clean up everything created so far, including state created by 1. Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220510001633.552496-5-ricarkol@google.com
2022-05-16	KVM: arm64: vgic: Do not ignore vgic_its_restore_cte failures	Ricardo Koller	1	-4/+23
	Restoring a corrupted collection entry (like an out of range ID) is being ignored and treated as success. More specifically, a vgic_its_restore_cte failure is treated as success by vgic_its_restore_collection_table. vgic_its_restore_cte uses positive and negative numbers to return error, and +1 to return success. The caller then uses "ret > 0" to check for success. Fix this by having vgic_its_restore_cte only return negative numbers on error. Do this by changing alloc_collection return codes to only return negative numbers on error. Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220510001633.552496-4-ricarkol@google.com
2022-05-16	KVM: arm64: vgic: Add more checks when restoring ITS tables	Ricardo Koller	1	-0/+7
	Try to improve the predictability of ITS save/restores (and debuggability of failed ITS saves) by failing early on restore when trying to read corrupted tables. Restoring the ITS tables does some checks for corrupted tables, but not as many as in a save: an overflowing device ID will be detected on save but not on restore. The consequence is that restoring a corrupted table won't be detected until the next save; including the ITS not working as expected after the restore. As an example, if the guest sets tables overlapping each other, which would most likely result in some corrupted table, this is what we would see from the host point of view: guest sets base addresses that overlap each other save ioctl restore ioctl save ioctl (fails) Ideally, we would like the first save to fail, but overlapping tables could actually be intended by the guest. So, let's at least fail on the restore with some checks: like checking that device and event IDs don't overflow their tables. Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220510001633.552496-3-ricarkol@google.com
2022-05-16	KVM: arm64: vgic: Check that new ITEs could be saved in guest memory	Ricardo Koller	1	-12/+35
	Try to improve the predictability of ITS save/restores by failing commands that would lead to failed saves. More specifically, fail any command that adds an entry into an ITS table that is not in guest memory, which would otherwise lead to a failed ITS save ioctl. There are already checks for collection and device entries, but not for ITEs. Add the corresponding check for the ITT when adding ITEs. Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220510001633.552496-2-ricarkol@google.com
2022-05-16	KVM: arm64: pmu: Restore compilation when HW_PERF_EVENTS isn't selected	Marc Zyngier	5	-21/+31
	Moving kvm_pmu_events into the vcpu (and refering to it) broke the somewhat unusual case where the kernel has no support for a PMU at all. In order to solve this, move things around a bit so that we can easily avoid refering to the pmu structure outside of PMU-aware code. As a bonus, pmu.c isn't compiled in when HW_PERF_EVENTS isn't selected. Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/202205161814.KQHpOzsJ-lkp@intel.com
2022-05-15	Linux 5.18-rc7	Linus Torvalds	1	-1/+1

2022-05-15	KVM: arm64: Hide KVM_REG_ARM_*_BMAP_BIT_COUNT from userspace	Marc Zyngier	1	-0/+6
	These constants will change over time, and userspace has no business knowing about them. Hide them behind __KERNEL__. Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-05-15	KVM: arm64: Reenable pmu in Protected Mode	Fuad Tabba	1	-2/+1
	Now that the pmu code does not access hyp data, reenable it in protected mode. Once fully supported, protected VMs will not have pmu support, since that could leak information. However, non-protected VMs in protected mode should have pmu support if available. Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220510095710.148178-5-tabba@google.com
2022-05-15	KVM: arm64: Pass pmu events to hyp via vcpu	Fuad Tabba	5	-28/+32
	Instead of the host accessing hyp data directly, pass the pmu events of the current cpu to hyp via the vcpu. This adds 64 bits (in two fields) to the vcpu that need to be synced before every vcpu run in nvhe and protected modes. However, it isolates the hypervisor from the host, which allows us to use pmu in protected mode in a subsequent patch. No visible side effects in behavior intended. Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220510095710.148178-4-tabba@google.com
2022-05-15	KVM: arm64: Repack struct kvm_pmu to reduce size	Fuad Tabba	1	-2/+2
	struct kvm_pmu has 2 holes using 10 bytes. This is instantiated in all vcpus, so it adds up. Repack the structures to remove the holes. No functional change intended. Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220510095710.148178-3-tabba@google.com
2022-05-15	KVM: arm64: Wrapper for getting pmu_events	Fuad Tabba	1	-16/+26
	Eases migrating away from using hyp data and simplifies the code. No functional change intended. Reviewed-by: Oliver Upton <oupton@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220510095710.148178-2-tabba@google.com
2022-05-15	KVM: arm64: vgic-v3: List M1 Pro/Max as requiring the SEIS workaround	Marc Zyngier	2	-0/+12
	Unsusprisingly, Apple M1 Pro/Max have the exact same defect as the original M1 and generate random SErrors in the host when a guest tickles the GICv3 CPU interface the wrong way. Add the part numbers for both the CPU types found in these two new implementations, and add them to the hall of shame. This also applies to the Ultra version, as it is composed of 2 Max SoCs. Signed-off-by: Marc Zyngier <maz@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20220514102524.3188730-1-maz@kernel.org
2022-05-13	gfs2: Stop using glock holder auto-demotion for now	Andreas Gruenbacher	1	-32/+14
	We're having unresolved issues with the glock holder auto-demotion mechanism introduced in commit dc732906c245. This mechanism was assumed to be essential for avoiding frequent short reads and writes until commit 296abc0d91d8 ("gfs2: No short reads or writes upon glock contention"). Since then, when the inode glock is lost, it is simply re-acquired and the operation is resumed. This means that apart from the performance penalty, we might as well drop the inode glock before faulting in pages, and re-acquire it afterwards. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-05-13	gfs2: buffered write prefaulting	Andreas Gruenbacher	1	-12/+16
	In gfs2_file_buffered_write, to increase the likelihood that all the user memory we're trying to write will be resident in memory, carry out the write in chunks and fault in each chunk of user memory before trying to write it. Otherwise, some workloads will trigger frequent short "internal" writes, causing filesystem blocks to be allocated and then partially deallocated again when writing into holes, which is wasteful and breaks reservations. Neither the chunked writes nor any of the short "internal" writes are user visible. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>