2024-12-20  KVM: arm64: Introduce the EL1 pKVM MMU  (Quentin Perret; 4 files, -9/+242)
Introduce a set of helper functions for manipulating the pKVM guest stage-2 page-tables from EL1 using pKVM's HVC interface. Each helper has an exact one-to-one correspondence with the traditional kvm_pgtable_stage2_*() functions from pgtable.c, with a strictly matching prototype. This will ease plumbing later on in mmu.c. These callbacks track the gfn->pfn mappings in a simple rb-tree indexed by IPA in lieu of a page-table. This rb-tree is kept in sync with pKVM's state and is protected by the mmu_lock like a traditional stage-2 page-table. Signed-off-by: Quentin Perret <qperret@google.com> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-18-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
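A minimal sketch of the tracking structure described above, using the kernel's rb-tree API; the node layout and helper name are illustrative, not the series' exact code:

#include <linux/rbtree.h>

struct pkvm_mapping {
	struct rb_node node;
	u64 gfn;	/* tree key: the IPA, in page-frame units */
	u64 pfn;
};

/* Caller must hold the mmu_lock, mirroring a stage-2 walk. */
static struct pkvm_mapping *pkvm_mapping_find(struct rb_root *root, u64 gfn)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct pkvm_mapping *m = rb_entry(n, struct pkvm_mapping, node);

		if (gfn < m->gfn)
			n = n->rb_left;
		else if (gfn > m->gfn)
			n = n->rb_right;
		else
			return m;
	}
	return NULL;
}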
2024-12-20  KVM: arm64: Introduce __pkvm_tlb_flush_vmid()  (Quentin Perret; 2 files, -0/+18)
Introduce a new hypercall to flush the TLBs of non-protected guests. The host kernel will be responsible for issuing this hypercall after changing stage-2 permissions using the __pkvm_host_relax_guest_perms() or __pkvm_host_wrprotect_guest() paths. This is left under the host's responsibility for performance reasons. Note however that the TLB maintenance for all *unmap* operations still remains entirely under the hypervisor's responsibility for security reasons -- an unmapped page may be donated to another entity, so a stale TLB entry could be used to leak private data. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-17-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
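A sketch of the resulting host-side contract, assuming the hypercalls are invoked through the usual kvm_call_hyp_nvhe() mechanism (the argument lists shown are illustrative):

/* After a permission change, TLB maintenance is the host's job: */
ret = kvm_call_hyp_nvhe(__pkvm_host_wrprotect_guest, handle, gfn);
if (!ret)
	kvm_call_hyp_nvhe(__pkvm_tlb_flush_vmid, handle);

/* Unmap paths need no such call: the hypervisor flushes on unmap itself. */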
2024-12-20  KVM: arm64: Introduce __pkvm_host_mkyoung_guest()  (Quentin Perret; 4 files, -0/+41)
Plumb the kvm_pgtable_stage2_mkyoung() callback into pKVM for non-protected guests. It will be called later from the fault handling path. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-16-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Introduce __pkvm_host_test_clear_young_guest()  (Quentin Perret; 4 files, -0/+43)
Plumb the kvm_stage2_test_clear_young() callback into pKVM for non-protected guests. It will later be called from MMU notifiers. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-15-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Introduce __pkvm_host_wrprotect_guest()  (Quentin Perret; 4 files, -0/+42)
Introduce a new hypercall to remove the write permission from a non-protected guest stage-2 mapping. This will be used for e.g. enabling dirty logging. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-14-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Introduce __pkvm_host_relax_guest_perms()  (Quentin Perret; 4 files, -0/+45)
Introduce a new hypercall allowing the host to relax the stage-2 permissions of mappings in a non-protected guest page-table. It will be used later once we start allowing RO memslots and dirty logging. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-13-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Introduce __pkvm_host_unshare_guest()  (Quentin Perret; 6 files, -0/+108)
In preparation for letting the host unmap pages from non-protected guests, introduce a new hypercall implementing the host-unshare-guest transition. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-12-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Introduce __pkvm_host_share_guest()  (Quentin Perret; 7 files, -1/+123)
In preparation for handling guest stage-2 mappings at EL2, introduce a new pKVM hypercall allowing the host to share pages with non-protected guests. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-11-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Introduce __pkvm_vcpu_{load,put}()  (Marc Zyngier; 6 files, -12/+93)
Rather than look up the hyp vCPU on every run hypercall at EL2, introduce a per-CPU 'loaded_hyp_vcpu' tracking variable which is updated by a pair of load/put hypercalls called directly from kvm_arch_vcpu_{load,put}() when pKVM is enabled. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-10-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
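The rough shape of the per-CPU tracking, as a sketch only (names simplified, locking omitted):

static DEFINE_PER_CPU(struct pkvm_hyp_vcpu *, loaded_hyp_vcpu);

static void handle___pkvm_vcpu_load(struct pkvm_hyp_vcpu *hyp_vcpu)
{
	*this_cpu_ptr(&loaded_hyp_vcpu) = hyp_vcpu;	/* from kvm_arch_vcpu_load() */
}

static void handle___pkvm_vcpu_put(void)
{
	*this_cpu_ptr(&loaded_hyp_vcpu) = NULL;		/* from kvm_arch_vcpu_put() */
}

static struct pkvm_hyp_vcpu *pkvm_get_loaded_hyp_vcpu(void)
{
	/* The run hypercall reads this instead of doing a table lookup. */
	return *this_cpu_ptr(&loaded_hyp_vcpu);
}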
2024-12-20  KVM: arm64: Add {get,put}_pkvm_hyp_vm() helpers  (Quentin Perret; 2 files, -0/+23)
In preparation for accessing pkvm_hyp_vm structures at EL2 in a context where we can't always expect a vCPU to be loaded (e.g. MMU notifiers), introduce get/put helpers to get temporary references to hyp VMs from any context. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-9-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
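A hedged sketch of the get half of that pattern; the lock, table lookup, and refcount field below are assumed from context, not copied from the patch:

struct pkvm_hyp_vm *get_pkvm_hyp_vm(pkvm_handle_t handle)
{
	struct pkvm_hyp_vm *hyp_vm;

	hyp_spin_lock(&vm_table_lock);
	hyp_vm = get_vm_by_handle(handle);
	if (hyp_vm)
		hyp_vm->refcount++;	/* VM can't be torn down while held */
	hyp_spin_unlock(&vm_table_lock);

	return hyp_vm;
}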
2024-12-20  KVM: arm64: Make kvm_pgtable_stage2_init() a static inline function  (Quentin Perret; 1 file, -2/+5)
Turn kvm_pgtable_stage2_init() into a static inline function instead of a macro. This will allow the usage of typeof() on it later on. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-8-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
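In miniature, why the conversion helps, assuming the macro simply forwarded to __kvm_pgtable_stage2_init() with default trailing arguments:

static inline int kvm_pgtable_stage2_init(struct kvm_pgtable *pgt,
					  struct kvm_s2_mmu *mmu,
					  struct kvm_pgtable_mm_ops *mm_ops)
{
	return __kvm_pgtable_stage2_init(pgt, mmu, mm_ops, 0, NULL);
}

/* Legal only for a real function, not a function-like macro: */
typeof(kvm_pgtable_stage2_init) *init_fn = &kvm_pgtable_stage2_init;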
2024-12-20  KVM: arm64: Pass walk flags to kvm_pgtable_stage2_relax_perms  (Quentin Perret; 3 files, -9/+8)
kvm_pgtable_stage2_relax_perms currently assumes that it is being called from a 'shared' walker, which will not be true once called from pKVM. To allow for the re-use of that function, make the walk flags one of its parameters. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-7-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Pass walk flags to kvm_pgtable_stage2_mkyoung  (Quentin Perret; 3 files, -6/+8)
kvm_pgtable_stage2_mkyoung currently assumes that it is being called from a 'shared' walker, which will not be true once called from pKVM. To allow for the re-use of that function, make the walk flags one of its parameters. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-6-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
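The shape of the signature change, sketched; existing shared-walker callers keep their previous behaviour by passing the flag explicitly:

kvm_pte_t kvm_pgtable_stage2_mkyoung(struct kvm_pgtable *pgt, u64 addr,
				     enum kvm_pgtable_walk_flags flags);

/* Existing caller (sketch): */
pte = kvm_pgtable_stage2_mkyoung(pgt, fault_ipa, KVM_PGTABLE_WALK_SHARED);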
2024-12-20  KVM: arm64: Move host page ownership tracking to the hyp vmemmap  (Quentin Perret; 3 files, -37/+84)
We currently store part of the page-tracking state in PTE software bits for the host, guests and the hypervisor. This is sub-optimal when e.g. sharing pages, as this forces us to break block mappings purely to support this software tracking. This causes an unnecessarily fragmented stage-2 page-table for the host, in particular when it shares pages with Secure, which can lead to measurable regressions. Moreover, having this state stored in the page-table forces us to do multiple costly walks on the page transition path, hence causing overhead. In order to work around these problems, move the host-side page-tracking logic from SW bits in its stage-2 PTEs to the hypervisor's vmemmap. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-5-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Make hyp_page::order a u8  (Quentin Perret; 3 files, -12/+13)
We don't need 16 bits to store the hyp page order, and we'll need some bits to store page ownership data soon, so let's reduce the order member. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-4-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
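An illustrative after-picture of the structure (field names approximate; page orders stay far below 255, so a u8 is plenty):

struct hyp_page {
	unsigned short refcount;
	u8 order;	/* was 16 bits wide */
	u8 reserved;	/* byte freed up for the ownership state added later */
};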
2024-12-20  KVM: arm64: Move enum pkvm_page_state to memory.h  (Quentin Perret; 2 files, -33/+34)
In order to prepare the way for storing page-tracking information in pKVM's vmemmap, move the enum pkvm_page_state definition to nvhe/memory.h. No functional changes intended. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-3-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Change the layout of enum pkvm_page_state  (Quentin Perret; 1 file, -7/+9)
The 'concrete' (a.k.a non-meta) page states are currently encoded using software bits in PTEs. For performance reasons, the abstract pkvm_page_state enum uses the same bits to encode these states as that makes conversions from and to PTEs easy. In order to prepare the ground for moving the 'concrete' state storage to the hyp vmemmap, re-arrange the enum to use bits 0 and 1 for this purpose. No functional changes intended. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Link: https://lore.kernel.org/r/20241218194059.3670226-2-qperret@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
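A sketch of the re-arranged encoding, with the concrete states confined to bits 0 and 1 so that exactly two bits of per-page storage can hold them (values illustrative, not guaranteed to match the final upstream enum):

enum pkvm_page_state {
	PKVM_PAGE_OWNED			= 0,
	PKVM_PAGE_SHARED_OWNED		= BIT(0),
	PKVM_PAGE_SHARED_BORROWED	= BIT(1),
	__PKVM_PAGE_RESERVED		= BIT(0) | BIT(1),
	/* 'meta' states continue to live in higher bits, PTE-only */
};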
2024-12-20  KVM: arm64: Promote guest ownership for DBGxVR/DBGxCR reads  (Oliver Upton; 1 file, -2/+1)
Only yielding control of the debug registers for writes is a bit silly, unless of course you're a fan of pointless traps. Give control of the debug registers to the guest upon the first access, regardless of direction. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-20-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Fold DBGxVR/DBGxCR accessors into common set  (Oliver Upton; 1 file, -128/+69)
There is a nauseating amount of boilerplate for accessing the breakpoint and watchpoint registers. Fold everything together into a single set of accessors and select the right storage based on the sysreg encoding. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-19-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Avoid reading ID_AA64DFR0_EL1 for debug save/restore  (Oliver Upton; 3 files, -13/+11)
Similar to other per-CPU profiling/debug features we handle, store the number of breakpoints/watchpoints in kvm_host_data to avoid reading the ID register 4 times on every guest entry/exit. And if you're in the nested virt business that's quite a few avoidable exits to the L0 hypervisor. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-18-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
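Sketched with assumed kvm_host_data field names, the caching step amounts to one ID-register read per CPU rather than four per entry/exit:

u64 dfr0 = read_sysreg(id_aa64dfr0_el1);

host_data->nr_brps = SYS_FIELD_GET(ID_AA64DFR0_EL1, BRPs, dfr0) + 1;
host_data->nr_wrps = SYS_FIELD_GET(ID_AA64DFR0_EL1, WRPs, dfr0) + 1;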
2024-12-20  KVM: arm64: nv: Honor MDCR_EL2.TDE routing for debug exceptions  (Oliver Upton; 3 files, -4/+23)
Inject debug exceptions into vEL2 if MDCR_EL2.TDE is set. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-17-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Manage software step state at load/put  (Marc Zyngier; 5 files, -129/+48)
KVM takes over the guest's software step state machine if the VMM is debugging the guest, but it does the save/restore fiddling for every guest entry. Note that the only constraint on host usage of software step is that the guest's configuration remains visible to userspace via the ONE_REG ioctls. So, we can cut down on the amount of fiddling by doing this at load/put instead. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-16-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Don't hijack guest context MDSCR_EL1  (Oliver Upton; 3 files, -52/+64)
Stealing MDSCR_EL1 in the guest's kvm_cpu_context for external debugging is rather gross. Just add a field for this instead and let the context switch code pick the correct one based on the debug owner. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-15-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Compute MDCR_EL2 at vcpu_load()  (Oliver Upton; 3 files, -19/+3)
KVM has picked up several hacks to cope with vcpu->arch.mdcr_el2 needing to be prepared before vcpu_load(), which is when it gets programmed into hardware on VHE. Now that the flows for reprogramming MDCR_EL2 have been simplified, move that computation to vcpu_load(). Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-14-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Reload vCPU for accesses to OSLAR_EL1  (Oliver Upton; 3 files, -8/+15)
KVM takes ownership of the debug regs if the guest enables the OS lock, as it needs to use MDSCR_EL1 to mask debug exceptions. Just reload the vCPU if the guest toggles the OS lock, relying on kvm_vcpu_load_debug() to update the debug owner and get the right trap configuration in place. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-13-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Use debug_owner to track if debug regs need save/restore  (Oliver Upton; 5 files, -55/+7)
Use the debug owner to determine if the debug regs are in use instead of keeping around the DEBUG_DIRTY flag. Debug registers are now saved/restored after the first trap, regardless of whether it was a read or a write. This also shifts the point at which KVM becomes lazy to vcpu_put() rather than the next exception taken from the guest. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-12-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Remove vestiges of debug_ptr  (Oliver Upton; 4 files, -38/+1)
Delete the remnants of debug_ptr now that debug registers are selected based on the debug owner instead. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-11-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Remove debug tracepoints  (Oliver Upton; 3 files, -120/+0)
The debug tracepoints are a useless firehose of information that track implementation detail rather than well-defined events. These are going to be rather difficult to uphold now that the implementation is getting redone, so throw them out instead of bending over backwards. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-10-oliver.upton@linux.dev [maz: fixed compilation after trace-ectomy] Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Select debug state to save/restore based on debug owner  (Oliver Upton; 2 files, -4/+21)
Select the set of debug registers to use based on the owner rather than relying on debug_ptr. Besides the code cleanup, this allows us to eliminate a couple of instances of kern_hyp_va() as well. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-9-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Clean up KVM_SET_GUEST_DEBUG handler  (Oliver Upton; 1 file, -18/+11)
No particular reason other than it isn't nice to look at. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-8-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Evaluate debug owner at vcpu_load()  (Oliver Upton; 4 files, -0/+60)
In preparation for tossing the debug_ptr mess, introduce an enumeration to track the ownership of the debug registers while in the guest. Update the owner at vcpu_load() based on whether the host needs to steal the guest's debug context or if breakpoints/watchpoints are actively in use. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-7-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Write MDCR_EL2 directly from kvm_arm_setup_mdcr_el2()  (Oliver Upton; 1 file, -5/+9)
Expecting the callee to know when MDCR_EL2 needs to be written to hardware is asking for trouble. Do the deed from kvm_arm_setup_mdcr_el2() instead. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-6-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Move host SME/SVE tracking flags to host data  (Oliver Upton; 2 files, -18/+16)
The SME/SVE state tracking flags have no business in the vCPU. Move them to kvm_host_data. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-5-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Track presence of SPE/TRBE in kvm_host_data instead of vCPU  (Oliver Upton; 4 files, -42/+24)
Add flags to kvm_host_data to track if SPE/TRBE is present + programmable on a per-CPU basis. Set the flags up at init rather than vcpu_load() as the programmability of these buffers is unlikely to change. Reviewed-by: James Clark <james.clark@linaro.org> Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-4-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-20  KVM: arm64: Get rid of __kvm_get_mdcr_el2() and related warts  (Oliver Upton; 7 files, -41/+18)
KVM caches MDCR_EL2 on a per-CPU basis in order to preserve the configuration of MDCR_EL2.HPMN while running a guest. This is a bit gross, since we're relying on some baked configuration rather than the hardware definition of implemented counters. Discover the number of implemented counters by reading PMCR_EL0.N instead. This works because: - In VHE the kernel runs at EL2, and N always returns the number of counters implemented in hardware - In {n,h}VHE, the EL2 setup code programs MDCR_EL2.HPMN with the EL2 view of PMCR_EL0.N for the host Lastly, avoid traps under nested virtualization by saving PMCR_EL0.N in host data. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-3-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
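The discovery step in miniature, as a sketch; this assumes the GENMASK-style ARMV8_PMU_PMCR_N field definition, and the real code additionally saves the value in host data for nested virt:

u64 pmcr = read_sysreg(pmcr_el0);
u8 nr_counters = FIELD_GET(ARMV8_PMU_PMCR_N, pmcr);

/* VHE: this is the EL2 view directly; {n,h}VHE: EL2 setup mirrored N
 * into MDCR_EL2.HPMN, so the host's view is equivalent. */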
2024-12-20  KVM: arm64: Drop MDSCR_EL1_DEBUG_MASK  (Oliver Upton; 1 file, -5/+0)
Nothing is using this macro, get rid of it. Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241219224116.3941496-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-12-15  Linux 6.13-rc3  (Linus Torvalds; 1 file, -1/+1)
2024-12-14  bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs  (Priya Bala Govindasamy; 1 file, -0/+6)
BPF program types like kprobe and fentry can cause deadlocks in certain situations. If a function takes a lock and one of these bpf programs is hooked to some point in the function's critical section, and if the bpf program tries to call the same function and take the same lock, it will lead to deadlock. These situations have been reported in the following bug reports. In percpu_freelist - Link: https://lore.kernel.org/bpf/CAADnVQLAHwsa+2C6j9+UC6ScrDaN9Fjqv1WjB1pP9AzJLhKuLQ@mail.gmail.com/T/ Link: https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@mail.gmail.com/T/ In bpf_lru_list - Link: https://lore.kernel.org/bpf/CAPPBnEajj+DMfiR_WRWU5=6A7KKULdB5Rob_NJopFLWF+i9gCA@mail.gmail.com/T/ Link: https://lore.kernel.org/bpf/CAPPBnEZQDVN6VqnQXvVqGoB+ukOtHGZ9b9U0OLJJYvRoSsMY_g@mail.gmail.com/T/ Link: https://lore.kernel.org/bpf/CAPPBnEaCB1rFAYU7Wf8UxqcqOWKmRPU1Nuzk3_oLk6qXR7LBOA@mail.gmail.com/T/ Similar bugs have been reported by syzbot. In queue_stack_maps - Link: https://lore.kernel.org/lkml/0000000000004c3fc90615f37756@google.com/ Link: https://lore.kernel.org/all/20240418230932.2689-1-hdanton@sina.com/T/ In lpm_trie - Link: https://lore.kernel.org/linux-kernel/00000000000035168a061a47fa38@google.com/T/ In ringbuf - Link: https://lore.kernel.org/bpf/20240313121345.2292-1-hdanton@sina.com/T/ Prevent kprobe and fentry bpf programs from attaching to these critical sections by removing CC_FLAGS_FTRACE for the percpu_freelist.o, bpf_lru_list.o, queue_stack_maps.o, lpm_trie.o, and ringbuf.o files. The bugs reported by syzbot are due to tracepoint bpf programs being called in the critical sections. This patch does not aim to fix deadlocks caused by tracepoint programs. However, it does prevent deadlocks from occurring in similar situations due to kprobe and fentry programs. Signed-off-by: Priya Bala Govindasamy <pgovind2@uci.edu> Link: https://lore.kernel.org/r/CAPPBnEZpjGnsuA26Mf9kYibSaGLm=oF6=12L21X1GEQdqjLnzQ@mail.gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13  selftests/bpf: Add tests for raw_tp NULL args  (Kumar Kartikeya Dwivedi; 2 files, -0/+27)
Add tests to ensure that arguments are correctly marked based on their specified positions, and whether they get marked correctly as maybe null. For modules, all tracepoint parameters should be marked PTR_MAYBE_NULL by default. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241213221929.3495062-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13  bpf: Augment raw_tp arguments with PTR_MAYBE_NULL  (Kumar Kartikeya Dwivedi; 2 files, -10/+147)
Arguments to a raw tracepoint are tagged as trusted, which carries the semantics that the pointer will be non-NULL. However, in certain cases, a raw tracepoint argument may end up being NULL. More context about this issue is available in [0]. Thus, there is a discrepancy between the reality, that raw_tp arguments can actually be NULL, and the verifier's knowledge, that they are never NULL, causing the explicit NULL-check branch to be dead-code eliminated. A previous attempt [1], i.e. the second fixed commit, was made to simulate symbolic execution as if in most accesses, the argument is a non-NULL raw_tp, except for conditional jumps. This tried to suppress branch prediction while preserving compatibility, but surfaced issues with production programs that were difficult to solve without increasing verifier complexity. A more complete discussion of issues and fixes is available at [2]. Fix this by maintaining an explicit list of tracepoints where the arguments are known to be NULL, and mark the positional arguments as PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments are known to be ERR_PTR, and mark these arguments as scalar values to prevent potential dereference. Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2), shifted by the zero-indexed argument number times 4. This can be represented as follows: 1st arg: 0x1 2nd arg: 0x10 3rd arg: 0x100 ... and so on (likewise for the ERR_PTR case). In the future, an automated pass will be used to produce such a list, or insert __nullable annotations automatically for tracepoints. Each compilation unit will be analyzed and results will be collated to find whether a tracepoint pointer is definitely not null, maybe null, or in an unknown state where the verifier conservatively marks it PTR_MAYBE_NULL. A proof of concept of this tool from Eduard is available at [3]. Note that in case we don't find a specification in the raw_tp_null_args array and the tracepoint belongs to a kernel module, we will conservatively mark the arguments as PTR_MAYBE_NULL. This is because unlike for in-tree modules, out-of-tree module tracepoints may pass NULL freely to the tracepoint. We don't protect against such tracepoints passing ERR_PTR (which is uncommon anyway), lest we mark all such arguments as SCALAR_VALUE. While we are at it, let's adjust the test raw_tp_null to not perform a dereference of skb->mark, as that won't be allowed anymore, and make it more robust by using inline assembly to test the dead code elimination behavior, which should still stay the same. [0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb [1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com [2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com [3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
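How that per-tracepoint specification decodes, as a self-contained sketch; the macro and helper names here are illustrative, not the verifier's:

#define ARG_MAYBE_NULL	0x1ULL
#define ARG_MAYBE_ERR	0x2ULL

static bool raw_tp_arg_maybe_null(u64 spec, int arg)	/* arg is zero-indexed */
{
	return (spec >> (arg * 4)) & ARG_MAYBE_NULL;
}

/* e.g. spec 0x100 flags only the third argument as possibly NULL,
 * matching the "3rd arg: 0x100" encoding described above. */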
2024-12-13  bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"  (Kumar Kartikeya Dwivedi; 4 files, -87/+9)
This patch reverts commit cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The patch was well-intended and meant as a stop-gap to fix branch prediction when the pointer may actually be NULL at runtime. Eventually, it was supposed to be replaced by an automated script or compiler pass detecting possibly NULL arguments and marking them accordingly. However, it caused two main issues observed for production programs and failed to preserve backwards compatibility. First, programs relied on the verifier not exploring the == NULL branch when the pointer is not NULL, thus they started failing with a 'dereference of scalar' error. Next, allowing raw_tp arguments to be modified surfaced the warning in the verifier that warns against reg->off when PTR_MAYBE_NULL is set. More information, context, and discussion on both problems is available in [0]. Overall, this approach had several shortcomings, and the fixes would further complicate the verifier's logic, and the entire masking scheme would have to be removed eventually anyway. Hence, revert the patch in preparation for a better fix that avoids these issues, which will replace this commit. [0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com Reported-by: Manu Bretelle <chantra@meta.com> Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-13  ARC: build: Try to guess GCC variant of cross compiler  (Leon Romanovsky; 1 file, -1/+1)
The ARC GCC compiler has been packaged since Fedora 39 [1], and the GCC variant of the cross-compile tools has the arc-linux-gnu- prefix rather than arc-linux-. This causes the CROSS_COMPILE variable to be left unset. This change allows builds without the need to supply the CROSS_COMPILE argument when the distro package is used. Before this change: $ make -j 128 ARCH=arc W=1 drivers/infiniband/hw/mlx4/ gcc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead gcc: error: unrecognized command-line option ‘-mmedium-calls’ gcc: error: unrecognized command-line option ‘-mlock’ gcc: error: unrecognized command-line option ‘-munaligned-access’ [1] https://packages.fedoraproject.org/pkgs/cross-gcc/gcc-arc-linux-gnu/index.html Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Vineet Gupta <vgupta@kernel.org>
2024-12-13  KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init  (Sean Christopherson; 3 files, -5/+29)
Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initialization to avoid the overhead of CPUID during runtime. The offset, size, and metadata for CPUID.0xD.[1..n] sub-leaves do not depend on XCR0 or XSS values, i.e. they are constant for a given CPU, and thus can be cached during module load. On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where recomputing XSAVE offsets and sizes results in a 4x increase in latency of nested VM-Enter and VM-Exit (nested transitions can trigger xstate_required_size() multiple times per transition), relative to using cached values. The issue is easily visible by running `perf top` while triggering nested transitions: kvm_update_cpuid_runtime() shows up at a whopping 50%. As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald Rapids (EMR) takes: SKX 11650 ICX 22350 EMR 28850 Using cached values, the latency drops to: SKX 6850 ICX 9000 EMR 7900 The underlying issue is that CPUID itself is slow on ICX, and comically slow on EMR. The problem is exacerbated on CPUs which support XSAVES and/or XSAVEC, as KVM invokes xstate_required_size() twice on each runtime CPUID update, and because there are more supported XSAVE features (CPUID for supported XSAVE feature sub-leaves is significantly slower). SKX: CPUID.0xD.2 = 348 cycles CPUID.0xD.3 = 400 cycles CPUID.0xD.4 = 276 cycles CPUID.0xD.5 = 236 cycles <other sub-leaves are similar> EMR: CPUID.0xD.2 = 1138 cycles CPUID.0xD.3 = 1362 cycles CPUID.0xD.4 = 1068 cycles CPUID.0xD.5 = 910 cycles CPUID.0xD.6 = 914 cycles CPUID.0xD.7 = 1350 cycles CPUID.0xD.8 = 734 cycles CPUID.0xD.9 = 766 cycles CPUID.0xD.10 = 732 cycles CPUID.0xD.11 = 718 cycles CPUID.0xD.12 = 734 cycles CPUID.0xD.13 = 1700 cycles CPUID.0xD.14 = 1126 cycles CPUID.0xD.15 = 898 cycles CPUID.0xD.16 = 716 cycles CPUID.0xD.17 = 748 cycles CPUID.0xD.18 = 776 cycles Note, updating runtime CPUID information multiple times per nested transition is itself a flaw, especially since CPUID is a mandatory intercept on both Intel and AMD. E.g. KVM doesn't need to ensure emulated CPUID state is up-to-date while running L2. That flaw will be fixed in a future patch, as deferring runtime CPUID updates is more subtle than it appears at first glance, the benefits aren't super critical to have once the XSAVE issue is resolved, and caching CPUID output is desirable even if KVM's updates are deferred. Cc: Jim Mattson <jmattson@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241211013302.1347853-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
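The caching scheme, sketched with illustrative names (not KVM's actual ones): snapshot each sub-leaf once at module init so xstate_required_size() never has to execute CPUID on a hot path:

static u32 xstate_sizes[XFEATURE_MAX], xstate_offsets[XFEATURE_MAX];

static void __init snapshot_xstate_leaves(void)
{
	u32 eax, ebx, ecx, edx;
	int i;

	for (i = XFEATURE_YMM; i < XFEATURE_MAX; i++) {
		cpuid_count(0xD, i, &eax, &ebx, &ecx, &edx);
		xstate_sizes[i]   = eax; /* byte size of component i */
		xstate_offsets[i] = ebx; /* offset in non-compacted XSAVE */
	}
}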
2024-12-13  block: Fix potential deadlock while freezing queue and acquiring sysfs_lock  (Nilay Shroff; 3 files, -23/+26)
For storing a value to a queue attribute, the queue_attr_store function first freezes the queue (->q_usage_counter(io)) and then acquires ->sysfs_lock. This seems incorrect, as the usual ordering should be to acquire ->sysfs_lock before freezing the queue. This incorrect ordering causes the following lockdep splat, which we can reliably reproduce simply by listing /sys/kernel/debug with the ls command: [ 57.597146] WARNING: possible circular locking dependency detected [ 57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G W [ 57.597162] ------------------------------------------------------ [ 57.597168] ls/4605 is trying to acquire lock: [ 57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0 [ 57.597200] but task is already holding lock: [ 57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 [ 57.597226] which lock already depends on the new lock. [ 57.597233] the existing dependency chain (in reverse order) is: [ 57.597241] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}: [ 57.597255] down_write+0x6c/0x18c [ 57.597264] start_creating+0xb4/0x24c [ 57.597274] debugfs_create_dir+0x2c/0x1e8 [ 57.597283] blk_register_queue+0xec/0x294 [ 57.597292] add_disk_fwnode+0x2e4/0x548 [ 57.597302] brd_alloc+0x2c8/0x338 [ 57.597309] brd_init+0x100/0x178 [ 57.597317] do_one_initcall+0x88/0x3e4 [ 57.597326] kernel_init_freeable+0x3cc/0x6e0 [ 57.597334] kernel_init+0x34/0x1cc [ 57.597342] ret_from_kernel_user_thread+0x14/0x1c [ 57.597350] -> #4 (&q->debugfs_mutex){+.+.}-{4:4}: [ 57.597362] __mutex_lock+0xfc/0x12a0 [ 57.597370] blk_register_queue+0xd4/0x294 [ 57.597379] add_disk_fwnode+0x2e4/0x548 [ 57.597388] brd_alloc+0x2c8/0x338 [ 57.597395] brd_init+0x100/0x178 [ 57.597402] do_one_initcall+0x88/0x3e4 [ 57.597410] kernel_init_freeable+0x3cc/0x6e0 [ 57.597418] kernel_init+0x34/0x1cc [ 57.597426] ret_from_kernel_user_thread+0x14/0x1c [ 57.597434] -> #3 (&q->sysfs_lock){+.+.}-{4:4}: [ 57.597446] __mutex_lock+0xfc/0x12a0 [ 57.597454] queue_attr_store+0x9c/0x110 [ 57.597462] sysfs_kf_write+0x70/0xb0 [ 57.597471] kernfs_fop_write_iter+0x1b0/0x2ac [ 57.597480] vfs_write+0x3dc/0x6e8 [ 57.597488] ksys_write+0x84/0x140 [ 57.597495] system_call_exception+0x130/0x360 [ 57.597504] system_call_common+0x160/0x2c4 [ 57.597516] -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}: [ 57.597530] __submit_bio+0x5ec/0x828 [ 57.597538] submit_bio_noacct_nocheck+0x1e4/0x4f0 [ 57.597547] iomap_readahead+0x2a0/0x448 [ 57.597556] xfs_vm_readahead+0x28/0x3c [ 57.597564] read_pages+0x88/0x41c [ 57.597571] page_cache_ra_unbounded+0x1ac/0x2d8 [ 57.597580] filemap_get_pages+0x188/0x984 [ 57.597588] filemap_read+0x13c/0x4bc [ 57.597596] xfs_file_buffered_read+0x88/0x17c [ 57.597605] xfs_file_read_iter+0xac/0x158 [ 57.597614] vfs_read+0x2d4/0x3b4 [ 57.597622] ksys_read+0x84/0x144 [ 57.597629] system_call_exception+0x130/0x360 [ 57.597637] system_call_common+0x160/0x2c4 [ 57.597647] -> #1 (mapping.invalidate_lock#2){++++}-{4:4}: [ 57.597661] down_read+0x6c/0x220 [ 57.597669] filemap_fault+0x870/0x100c [ 57.597677] xfs_filemap_fault+0xc4/0x18c [ 57.597684] __do_fault+0x64/0x164 [ 57.597693] __handle_mm_fault+0x1274/0x1dac [ 57.597702] handle_mm_fault+0x248/0x484 [ 57.597711] ___do_page_fault+0x428/0xc0c [ 57.597719] hash__do_page_fault+0x30/0x68 [ 57.597727] do_hash_fault+0x90/0x35c [ 57.597736] data_access_common_virt+0x210/0x220 [ 57.597745] _copy_from_user+0xf8/0x19c [ 57.597754] sel_write_load+0x178/0xd54 [ 57.597762] vfs_write+0x108/0x6e8 [ 57.597769] ksys_write+0x84/0x140 [ 57.597777] system_call_exception+0x130/0x360 [ 57.597785] system_call_common+0x160/0x2c4 [ 57.597794] -> #0 (&mm->mmap_lock){++++}-{4:4}: [ 57.597806] __lock_acquire+0x17cc/0x2330 [ 57.597814] lock_acquire+0x138/0x400 [ 57.597822] __might_fault+0x7c/0xc0 [ 57.597830] filldir64+0xe8/0x390 [ 57.597839] dcache_readdir+0x80/0x2d4 [ 57.597846] iterate_dir+0xd8/0x1d4 [ 57.597855] sys_getdents64+0x88/0x2d4 [ 57.597864] system_call_exception+0x130/0x360 [ 57.597872] system_call_common+0x160/0x2c4 [ 57.597881] other info that might help us debug this: [ 57.597888] Chain exists of: &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3 [ 57.597905] Possible unsafe locking scenario: [ 57.597911] CPU0 CPU1 [ 57.597917] ---- ---- [ 57.597922] rlock(&sb->s_type->i_mutex_key#3); [ 57.597932] lock(&q->debugfs_mutex); [ 57.597940] lock(&sb->s_type->i_mutex_key#3); [ 57.597950] rlock(&mm->mmap_lock); [ 57.597958] *** DEADLOCK *** [ 57.597965] 2 locks held by ls/4605: [ 57.597971] #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154 [ 57.597989] #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 Prevent the above lockdep warning by acquiring ->sysfs_lock before freezing the queue while storing a queue attribute in the queue_attr_store function. Later, we also found[1] another function, __blk_mq_update_nr_hw_queues, where we first freeze the queue and then acquire the ->sysfs_lock. So we've also updated the lock ordering in the __blk_mq_update_nr_hw_queues function and ensured that in all code paths we follow the correct lock ordering, i.e. acquire ->sysfs_lock before freezing the queue. [1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/ Reported-by: kjain@linux.ibm.com Fixes: af2814149883 ("block: freeze the queue in queue_attr_store") Tested-by: kjain@linux.ibm.com Cc: hch@lst.de Cc: axboe@kernel.dk Cc: ritesh.list@gmail.com Cc: ming.lei@redhat.com Cc: gjoyce@linux.ibm.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241210144222.1066229-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
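The corrected ordering, reduced to its essentials (a simplified sketch of queue_attr_store(); error handling omitted):

mutex_lock(&q->sysfs_lock);		/* first: the sysfs lock */
blk_mq_freeze_queue(q);			/* then: freeze ->q_usage_counter(io) */
res = entry->store(disk, page, length);
blk_mq_unfreeze_queue(q);
mutex_unlock(&q->sysfs_lock);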
2024-12-13  irqchip/gic-v3: Work around insecure GIC integrations  (Marc Zyngier; 1 file, -1/+16)
It appears that the relatively popular RK3399 SoC has been put together using a large amount of illicit substances, as experiments reveal that its integration of GIC500 exposes the *secure* programming interface to non-secure. This has some pretty bad effects on the way priorities are handled, and results in a dead machine if booting with pseudo-NMI enabled (irqchip.gicv3_pseudo_nmi=1) if the kernel contains 18fdb6348c480 ("arm64: irqchip/gic-v3: Select priorities at boot time"), which relies on the priorities being programmed using the NS view. Let's restore some sanity by going one step further and disable security altogether in this case. This is not any worse, and puts us in a mode where priorities actually make some sense. Huge thanks to Mark Kettenis who initially identified this issue on OpenBSD, and to Chen-Yu Tsai who reported the problem in Linux. Fixes: 18fdb6348c480 ("arm64: irqchip/gic-v3: Select priorities at boot time") Reported-by: Mark Kettenis <mark.kettenis@xs4all.nl> Reported-by: Chen-Yu Tsai <wens@csie.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Chen-Yu Tsai <wens@csie.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241213141037.3995049-1-maz@kernel.org
2024-12-13  irqchip/gic: Correct declaration of *percpu_base pointer in union gic_base  (Uros Bizjak; 1 file, -1/+1)
percpu_base is used in various percpu functions that expect variable in __percpu address space. Correct the declaration of percpu_base to void __iomem * __percpu *percpu_base; to declare the variable as __percpu pointer. The patch fixes several sparse warnings: irq-gic.c:1172:44: warning: incorrect type in assignment (different address spaces) irq-gic.c:1172:44: expected void [noderef] __percpu *[noderef] __iomem *percpu_base irq-gic.c:1172:44: got void [noderef] __iomem *[noderef] __percpu * ... irq-gic.c:1231:43: warning: incorrect type in argument 1 (different address spaces) irq-gic.c:1231:43: expected void [noderef] __percpu *__pdata irq-gic.c:1231:43: got void [noderef] __percpu *[noderef] __iomem *percpu_base There were no changes in the resulting object files. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/all/20241213145809.2918-2-ubizjak@gmail.com
2024-12-13  block: Fix queue_iostats_passthrough_show()  (Bart Van Assche; 1 file, -1/+1)
Make queue_iostats_passthrough_show() report 0/1 in sysfs instead of 0/4. This patch fixes the following sparse warning: block/blk-sysfs.c:266:31: warning: incorrect type in argument 1 (different base types) block/blk-sysfs.c:266:31: expected unsigned long var block/blk-sysfs.c:266:31: got restricted blk_flags_t Cc: Keith Busch <kbusch@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: 110234da18ab ("block: enable passthrough command statistics") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241212212941.1268662-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
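The essence of the fix, with the flag-test helper name assumed rather than verified against the tree:

/* was: printing the raw flag bit (hence the 4); now: a proper boolean */
return sysfs_emit(page, "%d\n", !!blk_queue_passthrough_stat(q));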
2024-12-13  blk-mq: Clean up blk_mq_requeue_work()  (Bart Van Assche; 1 file, -6/+4)
Move a statement that occurs in both branches of an if-statement in front of the if-statement. Fix a typo in a source code comment. No functionality has been changed. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241212212941.1268662-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-13  mq-deadline: Remove a local variable  (Bart Van Assche; 1 file, -4/+1)
Since commit fde02699c242 ("block: mq-deadline: Remove support for zone write locking"), the local variable 'insert_before' is assigned once and is used once. Hence remove this local variable. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241212212941.1268662-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-13  kselftest/arm64: abi: fix SVCR detection  (Weizhao Ouyang; 1 file, -17/+15)
When using svcr_in to check ZA and Streaming Mode, we should make sure that the value in x2 is correct; otherwise it may trigger an illegal-instruction exception on systems with FEAT_SVE but without FEAT_SME. Fixes: 43e3f85523e4 ("kselftest/arm64: Add SME support to syscall ABI test") Signed-off-by: Weizhao Ouyang <o451686892@gmail.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241211111639.12344-1-o451686892@gmail.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>