aboutsummaryrefslogtreecommitdiffstats
path: root/arch/powerpc/kvm (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 266Thomas Gleixner27-327/+27
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation 51 franklin street fifth floor boston ma 02110 1301 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 67 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141333.953658117@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds8-115/+143
Pull KVM fixes from Paolo Bonzini: "Fixes for PPC and s390" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry() KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9 KVM: PPC: Book3S HV: XIVE: Fix page offset when clearing ESB pages KVM: PPC: Book3S HV: XIVE: Take the srcu read lock when accessing memslots KVM: PPC: Book3S HV: XIVE: Do not clear IRQ data of passthrough interrupts KVM: PPC: Book3S HV: XIVE: Introduce a new mutex for the XIVE device KVM: PPC: Book3S HV: XIVE: Fix the enforced limit on the vCPU identifier KVM: PPC: Book3S HV: XIVE: Do not test the EQ flag validity when resetting KVM: PPC: Book3S HV: XIVE: Clear file mapping when device is released KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list KVM: PPC: Book3S HV: Use new mutex to synchronize MMU setup KVM: PPC: Book3S HV: Avoid touching arch.mmu_ready in XIVE release functions KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID kvm: fix compile on s390 part 2
2019-06-01Merge tag 'kvm-s390-master-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-masterPaolo Bonzini1-0/+3
KVM: s390: Fixes - fix compilation for !CONFIG_PCI - fix the output of KVM_CAP_MAX_VCPU_ID
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 174Thomas Gleixner2-17/+2
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 655 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070034.575739538@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner1-6/+1
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry()Suraj Jitindar Singh1-0/+1
The sprgs are a set of 4 general purpose sprs provided for software use. SPRG3 is special in that it can also be read from userspace. Thus it is used on linux to store the cpu and numa id of the process to speed up syscall access to this information. This register is overwritten with the guest value on kvm guest entry, and so needs to be restored on exit again. Thus restore the value on the guest exit path in kvmhv_p9_guest_entry(). Cc: stable@vger.kernel.org # v4.20+ Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-30KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9Paul Mackerras1-2/+5
Commit 3309bec85e60 ("KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest") moved calls to trace_hardirqs_{on,off} in the entry path used for HPT guests. Similar code exists in the new streamlined entry path used for radix guests on POWER9. This makes the same change there, so as to avoid lockdep warnings such as this: [ 228.686461] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) [ 228.686480] WARNING: CPU: 116 PID: 3803 at ../kernel/locking/lockdep.c:4219 check_flags.part.23+0x21c/0x270 [ 228.686544] Modules linked in: vhost_net vhost xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat +xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter +ebtables ip6table_filter ip6_tables iptable_filter fuse kvm_hv kvm at24 ipmi_powernv regmap_i2c ipmi_devintf +uio_pdrv_genirq ofpart ipmi_msghandler uio powernv_flash mtd ibmpowernv opal_prd ip_tables ext4 mbcache jbd2 btrfs +zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor +raid6_pq raid1 raid0 ses sd_mod enclosure scsi_transport_sas ast i2c_opal i2c_algo_bit drm_kms_helper syscopyarea +sysfillrect sysimgblt fb_sys_fops ttm drm i40e e1000e cxl aacraid tg3 drm_panel_orientation_quirks i2c_core [ 228.686859] CPU: 116 PID: 3803 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc1-xive+ #42 [ 228.686911] NIP: c0000000001b394c LR: c0000000001b3948 CTR: c000000000bfad20 [ 228.686963] REGS: c000200cdb50f570 TRAP: 0700 Not tainted (5.2.0-rc1-xive+) [ 228.687001] MSR: 9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 48222222 XER: 20040000 [ 228.687060] CFAR: c000000000116db0 IRQMASK: 1 [ 228.687060] GPR00: c0000000001b3948 c000200cdb50f800 c0000000015e7600 000000000000002e [ 228.687060] GPR04: 0000000000000001 c0000000001c71a0 000000006e655f73 72727563284e4f5f [ 228.687060] GPR08: 0000200e60680000 0000000000000000 c000200cdb486180 0000000000000000 [ 228.687060] GPR12: 0000000000002000 c000200fff61a680 0000000000000000 00007fffb75c0000 [ 228.687060] GPR16: 0000000000000000 0000000000000000 c0000000017d6900 c000000001124900 [ 228.687060] GPR20: 0000000000000074 c008000006916f68 0000000000000074 0000000000000074 [ 228.687060] GPR24: ffffffffffffffff ffffffffffffffff 0000000000000003 c000200d4b600000 [ 228.687060] GPR28: c000000001627e58 c000000001489908 c000000001627e58 c000000002304de0 [ 228.687377] NIP [c0000000001b394c] check_flags.part.23+0x21c/0x270 [ 228.687415] LR [c0000000001b3948] check_flags.part.23+0x218/0x270 [ 228.687466] Call Trace: [ 228.687488] [c000200cdb50f800] [c0000000001b3948] check_flags.part.23+0x218/0x270 (unreliable) [ 228.687542] [c000200cdb50f870] [c0000000001b6548] lock_is_held_type+0x188/0x1c0 [ 228.687595] [c000200cdb50f8d0] [c0000000001d939c] rcu_read_lock_sched_held+0xdc/0x100 [ 228.687646] [c000200cdb50f900] [c0000000001dd704] rcu_note_context_switch+0x304/0x340 [ 228.687701] [c000200cdb50f940] [c0080000068fcc58] kvmhv_run_single_vcpu+0xdb0/0x1120 [kvm_hv] [ 228.687756] [c000200cdb50fa20] [c0080000068fd5b0] kvmppc_vcpu_run_hv+0x5e8/0xe40 [kvm_hv] [ 228.687816] [c000200cdb50faf0] [c0080000071797dc] kvmppc_vcpu_run+0x34/0x48 [kvm] [ 228.687863] [c000200cdb50fb10] [c0080000071755dc] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] [ 228.687916] [c000200cdb50fba0] [c008000007165ccc] kvm_vcpu_ioctl+0x424/0x838 [kvm] [ 228.687957] [c000200cdb50fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0 [ 228.687995] [c000200cdb50fdb0] [c000000000434724] ksys_ioctl+0x104/0x120 [ 228.688033] [c000200cdb50fe00] [c000000000434768] sys_ioctl+0x28/0x80 [ 228.688072] [c000200cdb50fe20] [c00000000000b888] system_call+0x5c/0x70 [ 228.688109] Instruction dump: [ 228.688142] 4bf6342d 60000000 0fe00000 e8010080 7c0803a6 4bfffe60 3c82ff87 3c62ff87 [ 228.688196] 388472d0 3863d738 4bf63405 60000000 <0fe00000> 4bffff4c 3c82ff87 3c62ff87 [ 228.688251] irq event stamp: 205 [ 228.688287] hardirqs last enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv] [ 228.688344] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv] [ 228.688412] softirqs last enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4 [ 228.688464] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210 [ 228.688513] ---[ end trace eb16f6260022a812 ]--- [ 228.688548] possible reason: unannotated irqs-off. [ 228.688571] irq event stamp: 205 [ 228.688607] hardirqs last enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv] [ 228.688664] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv] [ 228.688719] softirqs last enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4 [ 228.688758] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210 Cc: stable@vger.kernel.org # v4.20+ Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Tested-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-30KVM: PPC: Book3S HV: XIVE: Fix page offset when clearing ESB pagesCédric Le Goater1-1/+2
Under XIVE, the ESB pages of an interrupt are used for interrupt management (EOI) and triggering. They are made available to guests through a mapping of the XIVE KVM device. When a device is passed-through, the passthru_irq helpers, kvmppc_xive_set_mapped() and kvmppc_xive_clr_mapped(), clear the ESB pages of the guest IRQ number being mapped and let the VM fault handler repopulate with the correct page. The ESB pages are mapped at offset 4 (KVM_XIVE_ESB_PAGE_OFFSET) in the KVM device mapping. Unfortunately, this offset was not taken into account when clearing the pages. This lead to issues with the passthrough devices for which the interrupts were not functional under some guest configuration (tg3 and single CPU) or in any configuration (e1000e adapter). Reviewed-by: Greg Kurz <groug@kaod.org> Tested-by: Greg Kurz <groug@kaod.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-30KVM: PPC: Book3S HV: XIVE: Take the srcu read lock when accessing memslotsCédric Le Goater1-0/+8
According to Documentation/virtual/kvm/locking.txt, the srcu read lock should be taken when accessing the memslots of the VM. The XIVE KVM device needs to do so when configuring the page of the OS event queue of vCPU for a given priority and when marking the same page dirty before migration. This avoids warnings such as : [ 208.224882] ============================= [ 208.224884] WARNING: suspicious RCU usage [ 208.224889] 5.2.0-rc2-xive+ #47 Not tainted [ 208.224890] ----------------------------- [ 208.224894] ../include/linux/kvm_host.h:633 suspicious rcu_dereference_check() usage! [ 208.224896] other info that might help us debug this: [ 208.224898] rcu_scheduler_active = 2, debug_locks = 1 [ 208.224901] no locks held by qemu-system-ppc/3923. [ 208.224902] stack backtrace: [ 208.224907] CPU: 64 PID: 3923 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc2-xive+ #47 [ 208.224909] Call Trace: [ 208.224918] [c000200cdd98fa30] [c000000000be1934] dump_stack+0xe8/0x164 (unreliable) [ 208.224924] [c000200cdd98fa80] [c0000000001aec80] lockdep_rcu_suspicious+0x110/0x180 [ 208.224935] [c000200cdd98fb00] [c0080000075933a0] gfn_to_memslot+0x1c8/0x200 [kvm] [ 208.224943] [c000200cdd98fb40] [c008000007599600] gfn_to_pfn+0x28/0x60 [kvm] [ 208.224951] [c000200cdd98fb70] [c008000007599658] gfn_to_page+0x20/0x40 [kvm] [ 208.224959] [c000200cdd98fb90] [c0080000075b495c] kvmppc_xive_native_set_attr+0x8b4/0x1480 [kvm] [ 208.224967] [c000200cdd98fca0] [c00800000759261c] kvm_device_ioctl_attr+0x64/0xb0 [kvm] [ 208.224974] [c000200cdd98fcf0] [c008000007592730] kvm_device_ioctl+0xc8/0x110 [kvm] [ 208.224979] [c000200cdd98fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0 [ 208.224981] [c000200cdd98fdb0] [c000000000434724] ksys_ioctl+0x104/0x120 [ 208.224984] [c000200cdd98fe00] [c000000000434768] sys_ioctl+0x28/0x80 [ 208.224988] [c000200cdd98fe20] [c00000000000b888] system_call+0x5c/0x70 legoater@boss01:~$ Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration") Fixes: e6714bd1671d ("KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pages") Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-30KVM: PPC: Book3S HV: XIVE: Do not clear IRQ data of passthrough interruptsCédric Le Goater1-2/+2
The passthrough interrupts are defined at the host level and their IRQ data should not be cleared unless specifically deconfigured (shutdown) by the host. They differ from the IPI interrupts which are allocated by the XIVE KVM device and reserved to the guest usage only. This fixes a host crash when destroying a VM in which a PCI adapter was passed-through. In this case, the interrupt is cleared and freed by the KVM device and then shutdown by vfio at the host level. [ 1007.360265] BUG: Kernel NULL pointer dereference at 0x00000d00 [ 1007.360285] Faulting instruction address: 0xc00000000009da34 [ 1007.360296] Oops: Kernel access of bad area, sig: 7 [#1] [ 1007.360303] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV [ 1007.360314] Modules linked in: vhost_net vhost iptable_mangle ipt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc kvm_hv kvm xt_tcpudp iptable_filter squashfs fuse binfmt_misc vmx_crypto ib_iser rdma_cm iw_cm ib_cm libiscsi scsi_transport_iscsi nfsd ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath mlx5_ib ib_uverbs ib_core crc32c_vpmsum mlx5_core [ 1007.360425] CPU: 9 PID: 15576 Comm: CPU 18/KVM Kdump: loaded Not tainted 5.1.0-gad7e7d0ef #4 [ 1007.360454] NIP: c00000000009da34 LR: c00000000009e50c CTR: c00000000009e5d0 [ 1007.360482] REGS: c000007f24ccf330 TRAP: 0300 Not tainted (5.1.0-gad7e7d0ef) [ 1007.360500] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002484 XER: 00000000 [ 1007.360532] CFAR: c00000000009da10 DAR: 0000000000000d00 DSISR: 00080000 IRQMASK: 1 [ 1007.360532] GPR00: c00000000009e62c c000007f24ccf5c0 c000000001510600 c000007fe7f947c0 [ 1007.360532] GPR04: 0000000000000d00 0000000000000000 0000000000000000 c000005eff02d200 [ 1007.360532] GPR08: 0000000000400000 0000000000000000 0000000000000000 fffffffffffffffd [ 1007.360532] GPR12: c00000000009e5d0 c000007fffff7b00 0000000000000031 000000012c345718 [ 1007.360532] GPR16: 0000000000000000 0000000000000008 0000000000418004 0000000000040100 [ 1007.360532] GPR20: 0000000000000000 0000000008430000 00000000003c0000 0000000000000027 [ 1007.360532] GPR24: 00000000000000ff 0000000000000000 00000000000000ff c000007faa90d98c [ 1007.360532] GPR28: c000007faa90da40 00000000000fe040 ffffffffffffffff c000007fe7f947c0 [ 1007.360689] NIP [c00000000009da34] xive_esb_read+0x34/0x120 [ 1007.360706] LR [c00000000009e50c] xive_do_source_set_mask.part.0+0x2c/0x50 [ 1007.360732] Call Trace: [ 1007.360738] [c000007f24ccf5c0] [c000000000a6383c] snooze_loop+0x15c/0x270 (unreliable) [ 1007.360775] [c000007f24ccf5f0] [c00000000009e62c] xive_irq_shutdown+0x5c/0xe0 [ 1007.360795] [c000007f24ccf630] [c00000000019e4a0] irq_shutdown+0x60/0xe0 [ 1007.360813] [c000007f24ccf660] [c000000000198c44] __free_irq+0x3a4/0x420 [ 1007.360831] [c000007f24ccf700] [c000000000198dc8] free_irq+0x78/0xe0 [ 1007.360849] [c000007f24ccf730] [c00000000096c5a8] vfio_msi_set_vector_signal+0xa8/0x350 [ 1007.360878] [c000007f24ccf7f0] [c00000000096c938] vfio_msi_set_block+0xe8/0x1e0 [ 1007.360899] [c000007f24ccf850] [c00000000096cae0] vfio_msi_disable+0xb0/0x110 [ 1007.360912] [c000007f24ccf8a0] [c00000000096cd04] vfio_pci_set_msi_trigger+0x1c4/0x3d0 [ 1007.360922] [c000007f24ccf910] [c00000000096d910] vfio_pci_set_irqs_ioctl+0xa0/0x170 [ 1007.360941] [c000007f24ccf930] [c00000000096b400] vfio_pci_disable+0x80/0x5e0 [ 1007.360963] [c000007f24ccfa10] [c00000000096b9bc] vfio_pci_release+0x5c/0x90 [ 1007.360991] [c000007f24ccfa40] [c000000000963a9c] vfio_device_fops_release+0x3c/0x70 [ 1007.361012] [c000007f24ccfa70] [c0000000003b5668] __fput+0xc8/0x2b0 [ 1007.361040] [c000007f24ccfac0] [c0000000001409b0] task_work_run+0x140/0x1b0 [ 1007.361059] [c000007f24ccfb20] [c000000000118f8c] do_exit+0x3ac/0xd00 [ 1007.361076] [c000007f24ccfc00] [c0000000001199b0] do_group_exit+0x60/0x100 [ 1007.361094] [c000007f24ccfc40] [c00000000012b514] get_signal+0x1a4/0x8f0 [ 1007.361112] [c000007f24ccfd30] [c000000000021cc8] do_notify_resume+0x1a8/0x430 [ 1007.361141] [c000007f24ccfe20] [c00000000000e444] ret_from_except_lite+0x70/0x74 [ 1007.361159] Instruction dump: [ 1007.361175] 38422c00 e9230000 712a0004 41820010 548a2036 7d442378 78840020 71290020 [ 1007.361194] 4082004c e9230010 7c892214 7c0004ac <e9240000> 0c090000 4c00012c 792a0022 Cc: stable@vger.kernel.org # v4.12+ Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-30KVM: PPC: Book3S HV: XIVE: Introduce a new mutex for the XIVE deviceCédric Le Goater3-17/+22
The XICS-on-XIVE KVM device needs to allocate XIVE event queues when a priority is used by the OS. This is referred as EQ provisioning and it is done under the hood when : 1. a CPU is hot-plugged in the VM 2. the "set-xive" is called at VM startup 3. sources are restored at VM restore The kvm->lock mutex is used to protect the different XIVE structures being modified but in some contexts, kvm->lock is taken under the vcpu->mutex which is not permitted by the KVM locking rules. Introduce a new mutex 'lock' for the KVM devices for them to synchronize accesses to the XIVE device structures. Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-29KVM: PPC: Book3S HV: XIVE: Fix the enforced limit on the vCPU identifierCédric Le Goater1-1/+1
When a vCPU is connected to the KVM device, it is done using its vCPU identifier in the guest. Fix the enforced limit on the vCPU identifier by taking into account the SMT mode. Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-29KVM: PPC: Book3S HV: XIVE: Do not test the EQ flag validity when resettingCédric Le Goater1-18/+18
When a CPU is hot-unplugged, the EQ is deconfigured using a zero size and a zero address. In this case, there is no need to check the flag and queue size validity. Move the checks after the queue reset code section to fix CPU hot-unplug. Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-29KVM: PPC: Book3S HV: XIVE: Clear file mapping when device is releasedCédric Le Goater1-0/+8
Improve the release of the XIVE KVM device by clearing the file address_space, which is used to unmap the interrupt ESB pages when a device is passed-through. Suggested-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-29KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpuPaul Mackerras1-8/+1
Currently the HV KVM code takes the kvm->lock around calls to kvm_for_each_vcpu() and kvm_get_vcpu_by_id() (which can call kvm_for_each_vcpu() internally). However, that leads to a lock order inversion problem, because these are called in contexts where the vcpu mutex is held, but the vcpu mutexes nest within kvm->lock according to Documentation/virtual/kvm/locking.txt. Hence there is a possibility of deadlock. To fix this, we simply don't take the kvm->lock mutex around these calls. This is safe because the implementations of kvm_for_each_vcpu() and kvm_get_vcpu_by_id() have been designed to be able to be called locklessly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-29KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token listPaul Mackerras2-8/+7
Currently the Book 3S KVM code uses kvm->lock to synchronize access to the kvm->arch.rtas_tokens list. Because this list is scanned inside kvmppc_rtas_hcall(), which is called with the vcpu mutex held, taking kvm->lock cause a lock inversion problem, which could lead to a deadlock. To fix this, we add a new mutex, kvm->arch.rtas_token_lock, which nests inside the vcpu mutexes, and use that instead of kvm->lock when accessing the rtas token list. This removes the lockdep_assert_held() in kvmppc_rtas_tokens_free(). At this point we don't hold the new mutex, but that is OK because kvmppc_rtas_tokens_free() is only called when the whole VM is being destroyed, and at that point nothing can be looking up a token in the list. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-29KVM: PPC: Book3S HV: Use new mutex to synchronize MMU setupPaul Mackerras2-26/+41
Currently the HV KVM code uses kvm->lock in conjunction with a flag, kvm->arch.mmu_ready, to synchronize MMU setup and hold off vcpu execution until the MMU-related data structures are ready. However, this means that kvm->lock is being taken inside vcpu->mutex, which is contrary to Documentation/virtual/kvm/locking.txt and results in lockdep warnings. To fix this, we add a new mutex, kvm->arch.mmu_setup_lock, which nests inside the vcpu mutexes, and is taken in the places where kvm->lock was taken that are related to MMU setup. Additionally we take the new mutex in the vcpu creation code at the point where we are creating a new vcore, in order to provide mutual exclusion with kvmppc_update_lpcr() and ensure that an update to kvm->arch.lpcr doesn't get missed, which could otherwise lead to a stale vcore->lpcr value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-29KVM: PPC: Book3S HV: Avoid touching arch.mmu_ready in XIVE release functionsPaul Mackerras2-32/+24
Currently, kvmppc_xive_release() and kvmppc_xive_native_release() clear kvm->arch.mmu_ready and call kick_all_cpus_sync() as a way of ensuring that no vcpus are executing in the guest. However, future patches will change the mutex associated with kvm->arch.mmu_ready to a new mutex that nests inside the vcpu mutexes, making it difficult to continue to use this method. In fact, taking the vcpu mutex for a vcpu excludes execution of that vcpu, and we already take the vcpu mutex around the call to kvmppc_xive_[native_]cleanup_vcpu(). Once the cleanup function is done and we release the vcpu mutex, the vcpu can execute once again, but because we have cleared vcpu->arch.xive_vcpu, vcpu->arch.irq_type, vcpu->arch.xive_esc_vaddr and vcpu->arch.xive_esc_raddr, that vcpu will not be going into XIVE code any more. Thus, once we have cleaned up all of the vcpus, we are safe to clean up the rest of the XIVE state, and we don't need to use kvm->arch.mmu_ready to hold off vcpu execution. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-28KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_IDThomas Huth1-0/+3
KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all architectures. However, on s390x, the amount of usable CPUs is determined during runtime - it is depending on the features of the machine the code is running on. Since we are using the vcpu_id as an index into the SCA structures that are defined by the hardware (see e.g. the sca_add_vcpu() function), it is not only the amount of CPUs that is limited by the hard- ware, but also the range of IDs that we can use. Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too. So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common code into the architecture specific code, and on s390x we have to return the same value here as for KVM_CAP_MAX_VCPUS. This problem has been discovered with the kvm_create_max_vcpus selftest. With this change applied, the selftest now passes on s390x, too. Reviewed-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20190523164309.13345-9-thuth@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2019-05-24treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114Thomas Gleixner1-13/+1
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 8 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190523091650.663497195@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98Thomas Gleixner1-6/+1
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your optional any later version of the license extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190520075212.713472955@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds13-269/+2069
Pull KVM updates from Paolo Bonzini: "ARM: - support for SVE and Pointer Authentication in guests - PMU improvements POWER: - support for direct access to the POWER9 XIVE interrupt controller - memory and performance optimizations x86: - support for accessing memory not backed by struct page - fixes and refactoring Generic: - dirty page tracking improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits) kvm: fix compilation on aarch64 Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU" kvm: x86: Fix L1TF mitigation for shadow MMU KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing" KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete tests: kvm: Add tests for KVM_SET_NESTED_STATE KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID tests: kvm: Add tests to .gitignore KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one KVM: Fix the bitmap range to copy during clear dirty KVM: arm64: Fix ptrauth ID register masking logic KVM: x86: use direct accessors for RIP and RSP KVM: VMX: Use accessors for GPRs outside of dedicated caching logic KVM: x86: Omit caching logic for always-available GPRs kvm, x86: Properly check whether a pfn is an MMIO or not ...
2019-05-15Merge tag 'kvm-ppc-next-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEADPaolo Bonzini13-281/+2090
PPC KVM update for 5.2 * Support for guests to access the new POWER9 XIVE interrupt controller hardware directly, reducing interrupt latency and overhead for guests. * In-kernel implementation of the H_PAGE_INIT hypercall. * Reduce memory usage of sparsely-populated IOMMU tables. * Several bug fixes. Second PPC KVM update for 5.2 * Fix a bug, fix a spelling mistake, remove some useless code.
2019-05-14mm/gup: change GUP fast to use flags rather than a write 'bool'Ira Weiny2-3/+3
To facilitate additional options to get_user_pages_fast() change the singular write parameter to be gup_flags. This patch does not change any functionality. New functionality will follow in subsequent patches. Some of the get_user_pages_fast() call sites were unchanged because they already passed FOLL_WRITE or 0 for the write parameter. NOTE: It was suggested to change the ordering of the get_user_pages_fast() arguments to ensure that callers were converted. This breaks the current GUP call site convention of having the returned pages be the final parameter. So the suggestion was rejected. Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marshall <hubcap@omnibond.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"Colin Ian King1-1/+1
There is a spelling mistake in a pr_err message, fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-14KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUsPaul Mackerras1-6/+0
Commit 70ea13f6e609 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads", 2019-04-29) aimed to make radix guests that are using the real-mode entry path load the LPID register and flush the TLB in the same place where those things are done for HPT guests. However, it omitted to remove a branch which branches around that code for radix guests. The result is that with indep_thread_mode = N, radix guests don't run correctly. (With indep_threads_mode = Y, which is the default, radix guests use a different entry path.) This removes the offending branch, and also the load and compare that the branch depends on, since the cr7 setting is now unused. Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Fixes: 70ea13f6e609 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-05-10Merge tag 'powerpc-5.2-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds3-58/+88
Pull powerpc updates from Michael Ellerman: "Slightly delayed due to the issue with printk() calling probe_kernel_read() interacting with our new user access prevention stuff, but all fixed now. The only out-of-area changes are the addition of a cpuhp_state, small additions to Documentation and MAINTAINERS updates. Highlights: - Support for Kernel Userspace Access/Execution Prevention (like SMAP/SMEP/PAN/PXN) on some 64-bit and 32-bit CPUs. This prevents the kernel from accidentally accessing userspace outside copy_to/from_user(), or ever executing userspace. - KASAN support on 32-bit. - Rework of where we map the kernel, vmalloc, etc. on 64-bit hash to use the same address ranges we use with the Radix MMU. - A rewrite into C of large parts of our idle handling code for 64-bit Book3S (ie. power8 & power9). - A fast path entry for syscalls on 32-bit CPUs, for a 12-17% speedup in the null_syscall benchmark. - On 64-bit bare metal we have support for recovering from errors with the time base (our clocksource), however if that fails currently we hang in __delay() and never crash. We now have support for detecting that case and short circuiting __delay() so we at least panic() and reboot. - Add support for optionally enabling the DAWR on Power9, which had to be disabled by default due to a hardware erratum. This has the effect of enabling hardware breakpoints for GDB, the downside is a badly behaved program could crash the machine by pointing the DAWR at cache inhibited memory. This is opt-in obviously. - xmon, our crash handler, gets support for a read only mode where operations that could change memory or otherwise disturb the system are disabled. Plus many clean-ups, reworks and minor fixes etc. Thanks to: Christophe Leroy, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Ben Hutchings, Bo YU, Breno Leitao, Cédric Le Goater, Christopher M. Riedl, Christoph Hellwig, Colin Ian King, David Gibson, Ganesh Goudar, Gautham R. Shenoy, George Spelvin, Greg Kroah-Hartman, Greg Kurz, Horia Geantă, Jagadeesh Pagadala, Joel Stanley, Joe Perches, Julia Lawall, Laurentiu Tudor, Laurent Vivier, Lukas Bulwahn, Madhavan Srinivasan, Mahesh Salgaonkar, Mathieu Malaterre, Michael Neuling, Mukesh Ojha, Nathan Fontenot, Nathan Lynch, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Peng Hao, Qian Cai, Ravi Bangoria, Rick Lindsley, Russell Currey, Sachin Sant, Stewart Smith, Sukadev Bhattiprolu, Thomas Huth, Tobin C. Harding, Tyrel Datwyler, Valentin Schneider, Wei Yongjun, Wen Yang, YueHaibing" * tag 'powerpc-5.2-1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (205 commits) powerpc/64s: Use early_mmu_has_feature() in set_kuap() powerpc/book3s/64: check for NULL pointer in pgd_alloc() powerpc/mm: Fix hugetlb page initialization ocxl: Fix return value check in afu_ioctl() powerpc/mm: fix section mismatch for setup_kup() powerpc/mm: fix redundant inclusion of pgtable-frag.o in Makefile powerpc/mm: Fix makefile for KASAN powerpc/kasan: add missing/lost Makefile selftests/powerpc: Add a signal fuzzer selftest powerpc/booke64: set RI in default MSR ocxl: Provide global MMIO accessors for external drivers ocxl: move event_fd handling to frontend ocxl: afu_irq only deals with IRQ IDs, not offsets ocxl: Allow external drivers to use OpenCAPI contexts ocxl: Create a clear delineation between ocxl backend & frontend ocxl: Don't pass pci_dev around ocxl: Split pci.c ocxl: Remove some unused exported symbols ocxl: Remove superfluous 'extern' from headers ocxl: read_pasid never returns an error, so make it void ...
2019-05-07Merge tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linuxLinus Torvalds1-1/+0
Pull pidfd updates from Christian Brauner: "This patchset makes it possible to retrieve pidfds at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. After a thorough review from Oleg CLONE_PIDFD returns pidfds in the parent_tidptr argument. This means we can give back the associated pid and the pidfd at the same time. Access to process metadata information thus becomes rather trivial. As has been agreed, CLONE_PIDFD creates file descriptors based on anonymous inodes similar to the new mount api. They are made unconditional by this patchset as they are now needed by core kernel code (vfs, pidfd) even more than they already were before (timerfd, signalfd, io_uring, epoll etc.). The core patchset is rather small. The bulky looking changelist is caused by David's very simple changes to Kconfig to make anon inodes unconditional. A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the callers pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d". To remove worries about missing metadata access this patchset comes with a sample/test program that illustrates how a combination of CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/<pid>. Further work based on this patchset has been done by Joel. His work makes pidfds pollable. It finished too late for this merge window. I would prefer to have it sitting in linux-next for a while and send it for inclusion during the 5.3 merge window" * tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: samples: show race-free pidfd metadata access signal: support CLONE_PIDFD with pidfd_send_signal clone: add CLONE_PIDFD Make anon_inodes unconditional
2019-04-30Merge branch 'topic/ppc-kvm' into nextMichael Ellerman2-57/+87
Merge our topic branch shared with KVM. In particular this includes the rewrite of the idle code into C.
2019-04-30powerpc/64s: Reimplement book3s idle code in CNicholas Piggin1-46/+72
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation speific HV idle code to the powernv platform code. Book3S assembly stubs are kept in common code and used only to save the stack frame and non-volatile GPRs before executing architected idle instructions, and restoring the stack and reloading GPRs then returning to C after waking from idle. The complex logic dealing with threads and subcores, locking, SPRs, HMIs, timebase resync, etc., is all done in C which makes it more maintainable. This is not a strict translation to C code, there are some significant differences: - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs, but saves and restores them itself. - The optimisation where EC=ESL=0 idle modes did not have to save GPRs or change MSR is restored, because it's now simple to do. ESL=1 sleeps that do not lose GPRs can use this optimization too. - KVM secondary entry and cede is now more of a call/return style rather than branchy. nap_state_lost is not required because KVM always returns via NVGPR restoring path. - KVM secondary wakeup from offline sequence is moved entirely into the offline wakeup, which avoids a hwsync in the normal idle wakeup path. Performance measured with context switch ping-pong on different threads or cores, is possibly improved a small amount, 1-3% depending on stop state and core vs thread test for shallow states. Deep states it's in the noise compared with other latencies. KVM improvements: - Idle sleepers now always return to caller rather than branch out to KVM first. - This allows optimisations like very fast return to caller when no state has been lost. - KVM no longer requires nap_state_lost because it controls NVGPR save/restore itself on the way in and out. - The heavy idle wakeup KVM request check can be moved out of the normal host idle code and into the not-performance-critical offline code. - KVM nap code now returns from where it is called, which makes the flow a bit easier to follow. Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Squash the KVM changes in] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2019-04-30KVM: PPC: Book3S HV: XIVE: Clear escalation interrupt pointers on device closePaul Mackerras1-0/+15
This adds code to ensure that after a XIVE or XICS-on-XIVE KVM device is closed, KVM will not try to enable or disable any of the escalation interrupts for the VCPUs. We don't have to worry about races between clearing the pointers and use of the pointers by the XIVE context push/pull code, because the callers hold the vcpu->mutex, which is also taken by the KVM_RUN code. Therefore the vcpu cannot be entering or exiting the guest concurrently. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Prevent races when releasing devicePaul Mackerras2-16/+78
Now that we have the possibility of a XIVE or XICS-on-XIVE device being released while the VM is still running, we need to be careful about races and potential use-after-free bugs. Although the kvmppc_xive struct is not freed, but kept around for re-use, the kvmppc_xive_vcpu structs are freed, and they are used extensively in both the XIVE native and XICS-on-XIVE code. There are various ways in which XIVE code gets invoked: - VCPU entry and exit, which do push and pull operations on the XIVE hardware - one_reg get and set functions (vcpu->mutex is held) - XICS hypercalls (but only inside guest execution, not from kvmppc_pseries_do_hcall) - device creation calls (kvm->lock is held) - device callbacks - get/set attribute, mmap, pagefault, release/destroy - set_mapped/clr_mapped calls (kvm->lock is held) - connect_vcpu calls - debugfs file read callbacks Inside a device release function, we know that userspace cannot have an open file descriptor referring to the device, nor can it have any mmapped regions from the device. Therefore the device callbacks are excluded, as are the connect_vcpu calls (since they need a fd for the device). Further, since the caller holds the kvm->lock mutex, no other device creation calls or set/clr_mapped calls can be executing concurrently. To exclude VCPU execution and XICS hypercalls, we temporarily set kvm->arch.mmu_ready to 0. This forces any VCPU task that is trying to enter the guest to take the kvm->lock mutex, which is held by the caller of the release function. Then, sending an IPI to all other CPUs forces any VCPU currently executing in the guest to exit. Finally, we take the vcpu->mutex for each VCPU around the process of cleaning up and freeing its XIVE data structures, in order to exclude any one_reg get/set calls. To exclude the debugfs read callbacks, we just need to ensure that debugfs_remove is called before freeing any data structures. Once it returns we know that no CPU can be executing the callbacks (for our kvmppc_xive instance). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' methodCédric Le Goater4-11/+96
When a P9 sPAPR VM boots, the CAS negotiation process determines which interrupt mode to use (XICS legacy or XIVE native) and invokes a machine reset to activate the chosen mode. We introduce 'release' methods for the XICS-on-XIVE and the XIVE native KVM devices which are called when the file descriptor of the device is closed after the TIMA and ESB pages have been unmapped. They perform the necessary cleanups : clear the vCPU interrupt presenters that could be attached and then destroy the device. The 'release' methods replace the 'destroy' methods as 'destroy' is not called anymore once 'release' is. Compatibility with older QEMU is nevertheless maintained. This is not considered as a safe operation as the vCPUs are still running and could be referencing the KVM device through their presenters. To protect the system from any breakage, the kvmppc_xive objects representing both KVM devices are now stored in an array under the VM. Allocation is performed on first usage and memory is freed only when the VM exits. [paulus@ozlabs.org - Moved freeing of xive structures to book3s.c, put it under #ifdef CONFIG_KVM_XICS.] Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation modeCédric Le Goater1-3/+4
Full support for the XIVE native exploitation mode is now available, advertise the capability KVM_CAP_PPC_IRQ_XIVE for guests running on PowerNV KVM Hypervisors only. Support for nested guests (pseries KVM Hypervisor) is not yet available. XIVE should also have been activated which is default setting on POWER9 systems running a recent Linux kernel. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Add passthrough supportCédric Le Goater3-0/+65
The KVM XICS-over-XIVE device and the proposed KVM XIVE native device implement an IRQ space for the guest using the generic IPI interrupts of the XIVE IC controller. These interrupts are allocated at the OPAL level and "mapped" into the guest IRQ number space in the range 0-0x1FFF. Interrupt management is performed in the XIVE way: using loads and stores on the addresses of the XIVE IPI interrupt ESB pages. Both KVM devices share the same internal structure caching information on the interrupts, among which the xive_irq_data struct containing the addresses of the IPI ESB pages and an extra one in case of pass-through. The later contains the addresses of the ESB pages of the underlying HW controller interrupts, PHB4 in all cases for now. A guest, when running in the XICS legacy interrupt mode, lets the KVM XICS-over-XIVE device "handle" interrupt management, that is to perform the loads and stores on the addresses of the ESB pages of the guest interrupts. However, when running in XIVE native exploitation mode, the KVM XIVE native device exposes the interrupt ESB pages to the guest and lets the guest perform directly the loads and stores. The VMA exposing the ESB pages make use of a custom VM fault handler which role is to populate the VMA with appropriate pages. When a fault occurs, the guest IRQ number is deduced from the offset, and the ESB pages of associated XIVE IPI interrupt are inserted in the VMA (using the internal structure caching information on the interrupts). Supporting device passthrough in the guest running in XIVE native exploitation mode adds some extra refinements because the ESB pages of a different HW controller (PHB4) need to be exposed to the guest along with the initial IPI ESB pages of the XIVE IC controller. But the overall mechanic is the same. When the device HW irqs are mapped into or unmapped from the guest IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped() and kvmppc_xive_clr_mapped(), are called to record or clear the passthrough interrupt information and to perform the switch. The approach taken by this patch is to clear the ESB pages of the guest IRQ number being mapped and let the VM fault handler repopulate. The handler will insert the ESB page corresponding to the HW interrupt of the device being passed-through or the initial IPI ESB page if the device is being removed. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Add a mapping for the source ESB pagesCédric Le Goater1-0/+57
Each source is associated with an Event State Buffer (ESB) with a even/odd pair of pages which provides commands to manage the source: to trigger, to EOI, to turn off the source for instance. The custom VM fault handler will deduce the guest IRQ number from the offset of the fault, and the ESB page of the associated XIVE interrupt will be inserted into the VMA using the internal structure caching information on the interrupts. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Add a TIMA mappingCédric Le Goater1-0/+39
Each thread has an associated Thread Interrupt Management context composed of a set of registers. These registers let the thread handle priority management and interrupt acknowledgment. The most important are : - Interrupt Pending Buffer (IPB) - Current Processor Priority (CPPR) - Notification Source Register (NSR) They are exposed to software in four different pages each proposing a view with a different privilege. The first page is for the physical thread context and the second for the hypervisor. Only the third (operating system) and the fourth (user level) are exposed the guest. A custom VM fault handler will populate the VMA with the appropriate pages, which should only be the OS page for now. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE stateCédric Le Goater2-0/+100
The state of the thread interrupt management registers needs to be collected for migration. These registers are cached under the 'xive_saved_state.w01' field of the VCPU when the VPCU context is pulled from the HW thread. An OPAL call retrieves the backup of the IPB register in the underlying XIVE NVT structure and merges it in the KVM state. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pagesCédric Le Goater1-0/+85
When migration of a VM is initiated, a first copy of the RAM is transferred to the destination before the VM is stopped, but there is no guarantee that the EQ pages in which the event notifications are queued have not been modified. To make sure migration will capture a consistent memory state, the XIVE device should perform a XIVE quiesce sequence to stop the flow of event notifications and stabilize the EQs. This is the purpose of the KVM_DEV_XIVE_EQ_SYNC control which will also marks the EQ pages dirty to force their transfer. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Add a control to sync the sourcesCédric Le Goater1-0/+36
This control will be used by the H_INT_SYNC hcall from QEMU to flush event notifications on the XIVE IC owning the source. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Add a global reset controlCédric Le Goater1-0/+85
This control is to be used by the H_INT_RESET hcall from QEMU. Its purpose is to clear all configuration of the sources and EQs. This is necessary in case of a kexec (for a kdump kernel for instance) to make sure that no remaining configuration is left from the previous boot setup so that the new kernel can start safely from a clean state. The queue 7 is ignored when the XIVE device is configured to run in single escalation mode. Prio 7 is used by escalations. The XIVE VP is kept enabled as the vCPU is still active and connected to the XIVE device. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configurationCédric Le Goater3-6/+260
These controls will be used by the H_INT_SET_QUEUE_CONFIG and H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying Event Queue in the XIVE IC. They will also be used to restore the configuration of the XIVE EQs and to capture the internal run-time state of the EQs. Both 'get' and 'set' rely on an OPAL call to access the EQ toggle bit and EQ index which are updated by the XIVE IC when event notifications are enqueued in the EQ. The value of the guest physical address of the event queue is saved in the XIVE internal xive_q structure for later use. That is when migration needs to mark the EQ pages dirty to capture a consistent memory state of the VM. To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra OPAL call setting the EQ toggle bit and EQ index to configure the EQ, but restoring the EQ state will. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Add a control to configure a sourceCédric Le Goater3-2/+104
This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from QEMU to configure the target of a source and also to restore the configuration of a source when migrating the VM. The XIVE source interrupt structure is extended with the value of the Effective Interrupt Source Number. The EISN is the interrupt number pushed in the event queue that the guest OS will use to dispatch events internally. Caching the EISN value in KVM eases the test when checking if a reconfiguration is indeed needed. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: add a control to initialize a sourceCédric Le Goater3-4/+120
The XIVE KVM device maintains a list of interrupt sources for the VM which are allocated in the pool of generic interrupts (IPIs) of the main XIVE IC controller. These are used for the CPU IPIs as well as for virtual device interrupts. The IRQ number space is defined by QEMU. The XIVE device reuses the source structures of the XICS-on-XIVE device for the source blocks (2-level tree) and for the source interrupts. Under XIVE native, the source interrupt caches mostly configuration information and is less used than under the XICS-on-XIVE device in which hcalls are still necessary at run-time. When a source is initialized in KVM, an IPI interrupt source is simply allocated at the OPAL level and then MASKED. KVM only needs to know about its type: LSI or MSI. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: XIVE: Introduce a new capability KVM_CAP_PPC_IRQ_XIVECédric Le Goater4-41/+244
The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to let QEMU connect the vCPU presenters to the XIVE KVM device if required. The capability is not advertised for now as the full support for the XIVE native exploitation mode is not yet available. When this is case, the capability will be advertised on PowerNV Hypervisors only. Nested guests (pseries KVM Hypervisor) are not supported. Internally, the interface to the new KVM device is protected with a new interrupt mode: KVMPPC_IRQ_XIVE. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation modeCédric Le Goater3-2/+186
This is the basic framework for the new KVM device supporting the XIVE native exploitation mode. The user interface exposes a new KVM device to be created by QEMU, only available when running on a L0 hypervisor. Support for nested guests is not available yet. The XIVE device reuses the device structure of the XICS-on-XIVE device as they have a lot in common. That could possibly change in the future if the need arise. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras2-11/+15
This merges in the ppc-kvm topic branch from the powerpc tree to get patches which touch both general powerpc code and KVM code, one of which is a prerequisite for following patches. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry()Suraj Jitindar Singh1-0/+2
On POWER9 and later processors where the host can schedule vcpus on a per thread basis, there is a streamlined entry path used when the guest is radix. This entry path saves/restores the fp and vr state in kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and load_[fp/vr]_state(). This is the same as the old entry path however the old entry path also saved/restored the VRSAVE register, which isn't done in the new entry path. This means that the vrsave register is now volatile across guest exit, which is an incorrect change in behaviour. Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry(). This restores the old, correct, behaviour. Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: Flush TLB on secondary radix threadsPaul Mackerras3-62/+51
When running on POWER9 with kvm_hv.indep_threads_mode = N and the host in SMT1 mode, KVM will run guest VCPUs on offline secondary threads. If those guests are in radix mode, we fail to load the LPID and flush the TLB if necessary, leading to the guest crashing with an unsupported MMU fault. This arises from commit 9a4506e11b97 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on", 2018-05-17), which didn't consider the case where indep_threads_mode = N. For simplicity, this makes the real-mode guest entry path flush the TLB in the same place for both radix and hash guests, as we did before 9a4506e11b97, though the code is now C code rather than assembly code. We also have the radix TLB flush open-coded rather than calling radix__local_flush_tlb_lpid_guest(), because the TLB flush can be called in real mode, and in real mode we don't want to invoke the tracepoint code. Fixes: 9a4506e11b97 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-04-30KVM: PPC: Book3S HV: Move HPT guest TLB flushing to C codePaul Mackerras2-34/+33
This replaces assembler code in book3s_hv_rmhandlers.S that checks the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush with C code in book3s_hv_builtin.c. Note that unlike the radix version, the hash version doesn't do an explicit ERAT invalidation because we will invalidate and load up the SLB before entering the guest, and that will invalidate the ERAT. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>