aboutsummaryrefslogtreecommitdiffstats
path: root/arch/powerpc
AgeCommit message (Collapse)AuthorFilesLines
2021-10-27powerpc/64s/interrupt: Fix check_return_regs_valid() false positiveNicholas Piggin1-1/+1
The check_return_regs_valid() can cause a false positive if the return regs are marked as norestart and they are an HSRR type interrupt, because the low bit in the bottom of regs->trap causes interrupt type matching to fail. This can occcur for example on bare metal with a HV privileged doorbell interrupt that causes a signal, but do_signal returns early because get_signal() fails, and takes the "No signal to deliver" path. In this case no signal was delivered so the return location is not changed so return SRRs are not invalidated, yet set_trap_norestart is called, which messes up the match. Building go-1.16.6 is known to reproduce this. Fix it by using the TRAP() accessor which masks out the low bit. Fixes: 6eaaf9de3599 ("powerpc/64s/interrupt: Check and fix srr_valid without crashing") Cc: stable@vger.kernel.org # v5.14+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211026122531.3599918-1-npiggin@gmail.com
2021-10-27powerpc/boot: Set LC_ALL=C in wrapper scriptChristophe Leroy1-0/+2
While trying to build a simple Image for ACADIA platform, I got the following error: WRAP arch/powerpc/boot/simpleImage.acadia INFO: Uncompressed kernel (size 0x6ae7d0) overlaps the address of the wrapper(0x400000) INFO: Fixing the link_address of wrapper to (0x700000) powerpc64-linux-gnu-ld : mode d'émulation non reconnu : -T Émulations prises en charge : elf64ppc elf32ppc elf32ppclinux elf32ppcsim elf64lppc elf32lppc elf32lppclinux elf32lppcsim make[1]: *** [arch/powerpc/boot/Makefile:424 : arch/powerpc/boot/simpleImage.acadia] Erreur 1 make: *** [arch/powerpc/Makefile:285 : simpleImage.acadia] Erreur 2 Trying again with V=1 shows the following command powerpc64-linux-gnu-ld -m -T arch/powerpc/boot/zImage.lds -Ttext 0x700000 --no-dynamic-linker -o arch/powerpc/boot/simpleImage.acadia -Map wrapper.map arch/powerpc/boot/fixed-head.o arch/powerpc/boot/simpleboot.o ./zImage.3278022.o arch/powerpc/boot/wrapper.a The argument of '-m' is missing. This is due to the wrapper script calling 'objdump -p vmlinux' and looking for 'file format', whereas the output of objdump is: vmlinux: format de fichier elf32-powerpc En-tête de programme: LOAD off 0x00010000 vaddr 0xc0000000 paddr 0x00000000 align 2**16 filesz 0x0069e1d4 memsz 0x006c128c flags rwx NOTE off 0x0064591c vaddr 0xc063591c paddr 0x0063591c align 2**2 filesz 0x00000054 memsz 0x00000054 flags --- Add LC_ALL=C at the beginning of the wrapper script in order to get the output expected by the script: vmlinux: file format elf32-powerpc Program Header: LOAD off 0x00010000 vaddr 0xc0000000 paddr 0x00000000 align 2**16 filesz 0x0069e1d4 memsz 0x006c128c flags rwx NOTE off 0x0064591c vaddr 0xc063591c paddr 0x0063591c align 2**2 filesz 0x00000054 memsz 0x00000054 flags --- Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/a9ff3bc98035f63b122c051f02dc47c7aed10430.1635256089.git.christophe.leroy@csgroup.eu
2021-10-27powerpc/64s: Default to 64K pages for 64 bit book3sJoel Stanley11-6/+5
For 64-bit book3s the default should be 64K as that's what modern CPUs are designed for. The following defconfigs already set CONFIG_PPC_64K_PAGES: cell_defconfig pasemi_defconfig powernv_defconfig ppc64_defconfig pseries_defconfig skiroot_defconfig The have the option removed from the defconfig, as it is now the default. The defconfigs that now need to set CONFIG_PPC_4K_PAGES to maintain their existing behaviour are: g5_defconfig maple_defconfig microwatt_defconfig ps3_defconfig Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> BugLink: https://github.com/linuxppc/issues/issues/109 Link: https://lore.kernel.org/r/20211015001649.45591-1-joel@jms.id.au
2021-10-27Revert "powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC"Michael Ellerman5-8/+135
This reverts commit 566af8cda399c088763d07464463dc871c943b54. This caused some conflicts vs the audit tree, and the audit maintainers would prefer we postpone this to the next merge window so we have more time for testing. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2021-10-25signal/powerpc: On swapcontext failure force SIGSEGVEric W. Biederman2-5/+10
If the register state may be partial and corrupted instead of calling do_exit, call force_sigsegv(SIGSEGV). Which properly kills the process with SIGSEGV and does not let any more userspace code execute, instead of just killing one thread of the process and potentially confusing everything. Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: linuxppc-dev@lists.ozlabs.org History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Fixes: 756f1ae8a44e ("PPC32: Rework signal code and add a swapcontext system call.") Fixes: 04879b04bf50 ("[PATCH] ppc64: VMX (Altivec) support & signal32 rework, from Ben Herrenschmidt") Link: https://lkml.kernel.org/r/20211020174406.17889-7-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-10-25powerpc/pseries/iommu: Create huge DMA window if no MMIO32 is presentAlexey Kardashevskiy1-6/+6
The iommu_init_table() helper takes an address range to reserve in the IOMMU table being initialized to exclude MMIO addresses, this is useful if the window stretches far beyond 4GB (although wastes some TCEs). At the moment the code searches for such MMIO32 range and fails if none found which is considered a problem while it really is not: it is actually better as this says there is no MMIO32 to reserve and we can use usually wasted TCEs. Furthermore PHYP never actually allows creating windows starting at busaddress=0 so this MMIO32 range is never useful. This removes error exit and initializes the table with zero range if no MMIO32 is detected. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020132315.2287178-5-aik@ozlabs.ru
2021-10-25powerpc/pseries/iommu: Check if the default window in use before removing itAlexey Kardashevskiy1-6/+6
At the moment this check is performed after we remove the default window which is late and disallows to revert whatever changes enable_ddw() has made to DMA windows. This moves the check and error exit before removing the window. This raised the message severity from "debug" to "warning" as this should not happen in practice and cannot be triggered by the userspace. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020132315.2287178-4-aik@ozlabs.ru
2021-10-25powerpc/pseries/iommu: Use correct vfree for it_mapAlexey Kardashevskiy1-1/+2
The it_map array is vzalloc'ed so use vfree() for it when creating a huge DMA window failed for whatever reason. While at this, write zero to it_map. Fixes: 381ceda88c4c ("powerpc/pseries/iommu: Make use of DDW for indirect mapping") Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020132315.2287178-3-aik@ozlabs.ru
2021-10-24kbuild: use more subdir- for visiting subdirectories while cleaningMasahiro Yamada2-6/+4
Documentation/kbuild/makefiles.rst suggests to use "archclean" for cleaning arch/$(SRCARCH)/boot/, but it is not a hard requirement. Since commit d92cc4d51643 ("kbuild: require all architectures to have arch/$(SRCARCH)/Kbuild"), we can use the "subdir- += boot" trick for all architectures. This can take advantage of the parallel option (-j) for "make clean". I also cleaned up the comments in arch/$(SRCARCH)/Makefile. The "archdep" target no longer exists. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
2021-10-22powerpc/pseries/mobility: ignore ibm, platform-facilities updatesNathan Lynch1-0/+34
On VMs with NX encryption, compression, and/or RNG offload, these capabilities are described by nodes in the ibm,platform-facilities device tree hierarchy: $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/ /sys/firmware/devicetree/base/ibm,platform-facilities/ ├── ibm,compression-v1 ├── ibm,random-v1 └── ibm,sym-encryption-v1 3 directories The acceleration functions that these nodes describe are not disrupted by live migration, not even temporarily. But the post-migration ibm,update-nodes sequence firmware always sends "delete" messages for this hierarchy, followed by an "add" directive to reconstruct it via ibm,configure-connector (log with debugging statements enabled in mobility.c): mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285 mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284 mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283 mobility: removing node /ibm,platform-facilities:4294967286 ... mobility: added node /ibm,platform-facilities:4294967286 Note we receive a single "add" message for the entire hierarchy, and what we receive from the ibm,configure-connector sequence is the top-level platform-facilities node along with its three children. The debug message simply reports the parent node and not the whole subtree. Also, significantly, the nodes added are almost completely equivalent to the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in the leaf nodes is the only property I've observed to differ, and Linux does not use that. So in practice, the sum of update messages Linux receives for this hierarchy is equivalent to minor property updates. We succeed in removing the original hierarchy from the device tree. But the vio bus code is ignorant of this, and does not unbind or relinquish its references. The leaf nodes, still reachable through sysfs, of course still refer to the now-freed ibm,platform-facilities parent node, which makes use-after-free possible: refcount_t: addition on 0; use-after-free. WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0 refcount_warn_saturate+0x160/0x1f0 (unreliable) kobject_get+0xf0/0x100 of_node_get+0x30/0x50 of_get_parent+0x50/0xb0 of_fwnode_get_parent+0x54/0x90 fwnode_count_parents+0x50/0x150 fwnode_full_name_string+0x30/0x110 device_node_string+0x49c/0x790 vsnprintf+0x1c0/0x4c0 sprintf+0x44/0x60 devspec_show+0x34/0x50 dev_attr_show+0x40/0xa0 sysfs_kf_seq_show+0xbc/0x200 kernfs_seq_show+0x44/0x60 seq_read_iter+0x2a4/0x740 kernfs_fop_read_iter+0x254/0x2e0 new_sync_read+0x120/0x190 vfs_read+0x1d0/0x240 Moreover, the "new" replacement subtree is not correctly added to the device tree, resulting in ibm,platform-facilities parent node without the appropriate leaf nodes, and broken symlinks in the sysfs device hierarchy: $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/ /sys/firmware/devicetree/base/ibm,platform-facilities/ 0 directories $ cd /sys/devices/vio ; find . -xtype l -exec file {} + ./ibm,sym-encryption-v1/of_node: broken symbolic link to ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1 ./ibm,random-v1/of_node: broken symbolic link to ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1 ./ibm,compression-v1/of_node: broken symbolic link to ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1 This is because add_dt_node() -> dlpar_attach_node() attaches only the parent node returned from configure-connector, ignoring any children. This should be corrected for the general case, but fixing that won't help with the stale OF node references, which is the more urgent problem. One way to address that would be to make the drivers respond to node removal notifications, so that node references can be dropped appropriately. But this would likely force the drivers to disrupt active clients for no useful purpose: equivalent nodes are immediately re-added. And recall that the acceleration capabilities described by the nodes remain available throughout the whole process. The solution I believe to be robust for this situation is to convert remove+add of a node with an unchanged phandle to an update of the node's properties in the Linux device tree structure. That would involve changing and adding a fair amount of code, and may take several iterations to land. Until that can be realized we have a confirmed use-after-free and the possibility of memory corruption. So add a limited workaround that discriminates on the node type, ignoring adds and removes. This should be amenable to backporting in the meantime. Fixes: 410bccf97881 ("powerpc/pseries: Partition migration in the kernel") Cc: stable@vger.kernel.org Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
2021-10-22powerpc/32: Don't use a struct based type for pte_tChristophe Leroy3-3/+19
Long time ago we had a config item called STRICT_MM_TYPECHECKS to build the kernel with pte_t defined as a structure in order to perform additional build checks or build it with pte_t defined as a simple type in order to get simpler generated code. Commit 670eea924198 ("powerpc/mm: Always use STRICT_MM_TYPECHECKS") made the struct based definition the only one, considering that the generated code was similar in both cases. That's right on ppc64 because the ABI is such that the content of a struct having a single simple type element is passed as register, but on ppc32 such a structure is passed via the stack like any structure. Simple test function: pte_t test(pte_t pte) { return pte; } Before this patch we get c00108ec <test>: c00108ec: 81 24 00 00 lwz r9,0(r4) c00108f0: 91 23 00 00 stw r9,0(r3) c00108f4: 4e 80 00 20 blr So, for PPC32, restore the simple type behaviour we got before commit 670eea924198, but instead of adding a config option to activate type check, do it when __CHECKER__ is set so that type checking is performed by 'sparse' and provides feedback like: arch/powerpc/mm/pgtable.c:466:16: warning: incorrect type in return expression (different base types) arch/powerpc/mm/pgtable.c:466:16: expected unsigned long arch/powerpc/mm/pgtable.c:466:16: got struct pte_t [usertype] x With this patch we now get c0010890 <test>: c0010890: 4e 80 00 20 blr Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> [mpe: Define STRICT_MM_TYPECHECKS rather than repeating the condition] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/c904599f33aaf6bb7ee2836a9ff8368509e0d78d.1631887042.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/breakpoint: CleanupChristophe Leroy1-12/+3
cache_op_size() does exactly the same as l1_dcache_bytes(). Remove it. MSR_64BIT already exists, no need to enclode the check around #ifdef __powerpc64__ Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/6184b08088312a7d787d450eb902584e4ae77f7a.1632317816.git.christophe.leroy@csgroup.eu
2021-10-22powerpc: Activate CONFIG_STRICT_KERNEL_RWX by defaultChristophe Leroy1-0/+1
CONFIG_STRICT_KERNEL_RWX should be set by default on every architectures (See https://github.com/KSPP/linux/issues/4) On PPC32 we have to find a compromise between performance and/or memory wasting and selection of strict_kernel_rwx, because it implies either smaller memory chunks or larger alignment between RO memory and RW memory. For instance the 8xx maps memory with 8M pages. So either the limit between RO and RW must be 8M aligned or it falls back or 512k pages which implies more pressure on the TLB. book3s/32 maps memory with BATs as much as possible. BATS can have any power-of-two size between 128k and 256M but we have only 4 to 8 BATs so the alignment must be good enough to allow efficient use of the BATs and avoid falling back on standard page mapping which would kill performance. So let's go one step forward and make it the default but still allow users to unset it when wanted. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/057c40164084bfc7d77c0b2ff78d95dbf6a2a21b.1632503622.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/8xx: Simplify TLB handlingChristophe Leroy2-0/+17
In the old days, TLB handling for 8xx was using tlbie and tlbia instructions directly as much as possible. But commit f048aace29e0 ("powerpc/mm: Add SMP support to no-hash TLB handling") broke that by introducing out-of-line unnecessary complex functions for booke/smp which don't have tlbie/tlbia instructions and require more complex handling. Restore direct use of tlbie and tlbia for 8xx which is never SMP. With this patch we now get c00ecc68 <ptep_clear_flush>: c00ecc68: 39 00 00 00 li r8,0 c00ecc6c: 81 46 00 00 lwz r10,0(r6) c00ecc70: 91 06 00 00 stw r8,0(r6) c00ecc74: 7c 00 2a 64 tlbie r5,r0 c00ecc78: 7c 00 04 ac hwsync c00ecc7c: 91 43 00 00 stw r10,0(r3) c00ecc80: 4e 80 00 20 blr Before it was c0012880 <local_flush_tlb_page>: c0012880: 2c 03 00 00 cmpwi r3,0 c0012884: 41 82 00 54 beq c00128d8 <local_flush_tlb_page+0x58> c0012888: 81 22 00 00 lwz r9,0(r2) c001288c: 81 43 00 20 lwz r10,32(r3) c0012890: 39 29 00 01 addi r9,r9,1 c0012894: 91 22 00 00 stw r9,0(r2) c0012898: 2c 0a 00 00 cmpwi r10,0 c001289c: 41 82 00 10 beq c00128ac <local_flush_tlb_page+0x2c> c00128a0: 81 2a 01 dc lwz r9,476(r10) c00128a4: 2c 09 ff ff cmpwi r9,-1 c00128a8: 41 82 00 0c beq c00128b4 <local_flush_tlb_page+0x34> c00128ac: 7c 00 22 64 tlbie r4,r0 c00128b0: 7c 00 04 ac hwsync c00128b4: 81 22 00 00 lwz r9,0(r2) c00128b8: 39 29 ff ff addi r9,r9,-1 c00128bc: 2c 09 00 00 cmpwi r9,0 c00128c0: 91 22 00 00 stw r9,0(r2) c00128c4: 4c a2 00 20 bclr+ 4,eq c00128c8: 81 22 00 70 lwz r9,112(r2) c00128cc: 71 29 00 04 andi. r9,r9,4 c00128d0: 4d 82 00 20 beqlr c00128d4: 48 65 76 74 b c0669f48 <preempt_schedule> c00128d8: 81 22 00 00 lwz r9,0(r2) c00128dc: 39 29 00 01 addi r9,r9,1 c00128e0: 91 22 00 00 stw r9,0(r2) c00128e4: 4b ff ff c8 b c00128ac <local_flush_tlb_page+0x2c> ... c00ecdc8 <ptep_clear_flush>: c00ecdc8: 94 21 ff f0 stwu r1,-16(r1) c00ecdcc: 39 20 00 00 li r9,0 c00ecdd0: 93 c1 00 08 stw r30,8(r1) c00ecdd4: 83 c6 00 00 lwz r30,0(r6) c00ecdd8: 91 26 00 00 stw r9,0(r6) c00ecddc: 93 e1 00 0c stw r31,12(r1) c00ecde0: 7c 08 02 a6 mflr r0 c00ecde4: 7c 7f 1b 78 mr r31,r3 c00ecde8: 7c 83 23 78 mr r3,r4 c00ecdec: 7c a4 2b 78 mr r4,r5 c00ecdf0: 90 01 00 14 stw r0,20(r1) c00ecdf4: 4b f2 5a 8d bl c0012880 <local_flush_tlb_page> c00ecdf8: 93 df 00 00 stw r30,0(r31) c00ecdfc: 7f e3 fb 78 mr r3,r31 c00ece00: 80 01 00 14 lwz r0,20(r1) c00ece04: 83 c1 00 08 lwz r30,8(r1) c00ece08: 83 e1 00 0c lwz r31,12(r1) c00ece0c: 7c 08 03 a6 mtlr r0 c00ece10: 38 21 00 10 addi r1,r1,16 c00ece14: 4e 80 00 20 blr Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/fb324f1c8f2ddb57cf6aad1cea26329558f1c1c0.1631887021.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/lib/sstep: Don't use __{get/put}_user() on kernel addressesChristophe Leroy1-57/+140
In the old days, when we didn't have kernel userspace access protection and had set_fs(), it was wise to use __get_user() and friends to read kernel memory. Nowadays, get_user() and put_user() are granting userspace access and are exclusively for userspace access. Convert single step emulation functions to user_access_begin() and friends and use unsafe_get_user() and unsafe_put_user(). When addressing kernel addresses, there is no need to open userspace access. And for book3s/32 it is particularly important to no try and open userspace access on kernel address, because that would break the content of kernel space segment registers. No guard has been put against that risk in order to avoid degrading performance. copy_from_kernel_nofault() and copy_to_kernel_nofault() should be used but they are out-of-line functions which would degrade performance. Those two functions are making use of __get_kernel_nofault() and __put_kernel_nofault() macros. Those two macros are just wrappers behind __get_user_size_goto() and __put_user_size_goto(). unsafe_get_user() and unsafe_put_user() are also wrappers of __get_user_size_goto() and __put_user_size_goto(). Use them to access kernel space. That allows refactoring userspace and kernelspace access. Reported-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Depends-on: 4fe5cda9f89d ("powerpc/uaccess: Implement user_read_access_begin and user_write_access_begin") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/22831c9d17f948680a12c5292e7627288b15f713.1631817805.git.christophe.leroy@csgroup.eu
2021-10-22powerpc: warn on emulation of dcbz instruction in kernel modeChristophe Leroy1-0/+1
dcbz instruction shouldn't be used on non-cached memory. Using it on non-cached memory can result in alignment exception and implies a heavy handling. Instead of silentely emulating the instruction and resulting in high performance degradation, warn whenever an alignment exception is taken in kernel mode due to dcbz, so that the user is made aware that dcbz instruction has been used unexpectedly by the kernel. Reported-by: Stan Johnson <userm57@yahoo.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/2e3acfe63d289c6fba366e16973c9ab8369e8b75.1631803922.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/32: Add support for out-of-line static callsChristophe Leroy4-1/+67
Add support for out-of-line static calls on PPC32. This change improve performance of calls to global function pointers by using direct calls instead of indirect calls. The trampoline is initialy populated with a 'blr' or branch to target, followed by an unreachable long jump sequence. In order to cater with parallele execution, the trampoline needs to be updated in a way that ensures it remains consistent at all time. This means we can't use the traditional lis/addi to load r12 with the target address, otherwise there would be a window during which the first instruction contains the upper part of the new target address while the second instruction still contains the lower part of the old target address. To avoid that the target address is stored just after the 'bctr' and loaded from there with a single instruction. Then, depending on the target distance, arch_static_call_transform() will either replace the first instruction by a direct 'bl <target>' or 'nop' in order to have the trampoline fall through the long jump sequence. For the special case of __static_call_return0(), to avoid the risk of a far branch, a version of it is inlined at the end of the trampoline. Performancewise the long jump sequence is probably not better than the indirect calls set by GCC when we don't use static calls, but such calls are unlikely to be required on powerpc32: With most configurations the kernel size is far below 32 Mbytes so only modules may happen to be too far. And even modules are likely to be close enough as they are allocated below the kernel core and as close as possible of the kernel text. static_call selftest is running successfully with this change. With this patch, __do_irq() has the following sequence to trace irq entries: c0004a00 <__SCT__tp_func_irq_entry>: c0004a00: 48 00 00 e0 b c0004ae0 <__traceiter_irq_entry> c0004a04: 3d 80 c0 00 lis r12,-16384 c0004a08: 81 8c 4a 1c lwz r12,18972(r12) c0004a0c: 7d 89 03 a6 mtctr r12 c0004a10: 4e 80 04 20 bctr c0004a14: 38 60 00 00 li r3,0 c0004a18: 4e 80 00 20 blr c0004a1c: 00 00 00 00 .long 0x0 ... c0005654 <__do_irq>: ... c0005664: 7c 7f 1b 78 mr r31,r3 ... c00056a0: 81 22 00 00 lwz r9,0(r2) c00056a4: 39 29 00 01 addi r9,r9,1 c00056a8: 91 22 00 00 stw r9,0(r2) c00056ac: 3d 20 c0 af lis r9,-16209 c00056b0: 81 29 74 cc lwz r9,29900(r9) c00056b4: 2c 09 00 00 cmpwi r9,0 c00056b8: 41 82 00 10 beq c00056c8 <__do_irq+0x74> c00056bc: 80 69 00 04 lwz r3,4(r9) c00056c0: 7f e4 fb 78 mr r4,r31 c00056c4: 4b ff f3 3d bl c0004a00 <__SCT__tp_func_irq_entry> Before this patch, __do_irq() was doing the following to trace irq entries: c0005700 <__do_irq>: ... c0005710: 7c 7e 1b 78 mr r30,r3 ... c000574c: 93 e1 00 0c stw r31,12(r1) c0005750: 81 22 00 00 lwz r9,0(r2) c0005754: 39 29 00 01 addi r9,r9,1 c0005758: 91 22 00 00 stw r9,0(r2) c000575c: 3d 20 c0 af lis r9,-16209 c0005760: 83 e9 f4 cc lwz r31,-2868(r9) c0005764: 2c 1f 00 00 cmpwi r31,0 c0005768: 41 82 00 24 beq c000578c <__do_irq+0x8c> c000576c: 81 3f 00 00 lwz r9,0(r31) c0005770: 80 7f 00 04 lwz r3,4(r31) c0005774: 7d 29 03 a6 mtctr r9 c0005778: 7f c4 f3 78 mr r4,r30 c000577c: 4e 80 04 21 bctrl c0005780: 85 3f 00 0c lwzu r9,12(r31) c0005784: 2c 09 00 00 cmpwi r9,0 c0005788: 40 82 ff e4 bne c000576c <__do_irq+0x6c> Behind the fact of now using a direct 'bl' instead of a 'load/mtctr/bctr' sequence, we can also see that we get one less register on the stack. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/machdep: Remove stale functions from ppc_md structureChristophe Leroy9-44/+2
ppc_md.iommu_save() is not set anymore by any platform after commit c40785ad305b ("powerpc/dart: Use a cachable DART"). So iommu_save() has become a nop and can be removed. ppc_md.show_percpuinfo() is not set anymore by any platform after commit 4350147a816b ("[PATCH] ppc64: SMU based macs cpufreq support"). Last users of ppc_md.rtc_read_val() and ppc_md.rtc_write_val() were removed by commit 0f03a43b8f0f ("[POWERPC] Remove todc code from ARCH=powerpc") Last user of kgdb_map_scc() was removed by commit 17ce452f7ea3 ("kgdb, powerpc: arch specific powerpc kgdb support"). ppc.machine_kexec_prepare() has not been used since commit 8ee3e0d69623 ("powerpc: Remove the main legacy iSerie platform code"). This allows the removal of machine_kexec_prepare() and the rename of default_machine_kexec_prepare() into machine_kexec_prepare() Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Daniel Axtens <dja@axtens.net> [mpe: Drop prototype for default_machine_kexec_prepare() as noted by dja] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/24d4ca0ada683c9436a5f812a7aeb0a1362afa2b.1630398606.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/time: Remove generic_suspend_{dis/en}able_irqs()Christophe Leroy1-15/+7
Commit d75d68cfef49 ("powerpc: Clean up obsolete code relating to decrementer and timebase") made generic_suspend_enable_irqs() and generic_suspend_disable_irqs() static. Fold them into their only caller. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/c3f9ec9950394ef939014f7934268e6ee30ca04f.1630398566.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERICChristophe Leroy5-135/+8
Commit e65e1fc2d24b ("[PATCH] syscall class hookup for all normal targets") added generic support for AUDIT but that didn't include support for bi-arch like powerpc. Commit 4b58841149dc ("audit: Add generic compat syscall support") added generic support for bi-arch. Convert powerpc to that bi-arch generic audit support. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/a4b3951d1191d4183d92a07a6097566bde60d00a.1629812058.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regsChristophe Leroy1-2/+2
Instructions lmw/stmw are interesting for functions that are rarely used and not in the cache, because only one instruction is to be copied into the instruction cache instead of 19. However those instruction are less performant than 19x raw lwz/stw as they require synchronisation plus one additional cycle. SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are mostly in interrupts entries/exits and in task switch so they are likely already in the cache. Using standard lwz improves null_syscall selftest by: - 10 cycles on mpc832x. - 2 cycles on mpc8xx. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/316c543b8906712c108985c8463eec09c8db577b.1629732542.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/5200: dts: fix memory node unit nameAnatolij Gustschin12-12/+12
Fixes build warnings: Warning (unit_address_vs_reg): /memory: node has a reg or ranges property, but no unit name Signed-off-by: Anatolij Gustschin <agust@denx.de> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211013220532.24759-4-agust@denx.de
2021-10-22powerpc/5200: dts: fix pci ranges warningsAnatolij Gustschin10-30/+30
Fix ranges property warnings: pci@f0000d00:ranges: 'oneOf' conditional failed, one must be fixed: Signed-off-by: Anatolij Gustschin <agust@denx.de> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211013220532.24759-3-agust@denx.de
2021-10-22powerpc/5200: dts: add missing pci rangesAnatolij Gustschin1-1/+3
Add ranges property to fix build warnings: Warning (pci_bridge): /pci@f0000d00: missing ranges for PCI bridge (or not a bridge) Signed-off-by: Anatolij Gustschin <agust@denx.de> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211013220532.24759-2-agust@denx.de
2021-10-22powerpc/vas: Fix potential NULL pointer dereferenceGustavo A. R. Silva1-2/+2
(!ptr && !ptr->foo) strikes again. :) The expression (!ptr && !ptr->foo) is bogus and in case ptr is NULL, it leads to a NULL pointer dereference: ptr->foo. Fix this by converting && to || This issue was detected with the help of Coccinelle, and audited and fixed manually. Fixes: 1a0d0d5ed5e3 ("powerpc/vas: Add platform specific user window operations") Cc: stable@vger.kernel.org Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211015050345.GA1161918@embeddedor
2021-10-22powerpc/fsl_booke: Enable STRICT_KERNEL_RWXChristophe Leroy1-2/+9
Enable STRICT_KERNEL_RWX on fsl_booke. For that, we need additional TLBCAMs dedicated to linear mapping, based on the alignment of _sinittext. By default, up to 768 Mbytes of memory are mapped. It uses 3 TLBCAMs of size 256 Mbytes. With a data alignment of 16, we need up to 9 TLBCAMs: 16/16/16/16/64/64/64/256/256 With a data alignment of 4, we need up to 12 TLBCAMs: 4/4/4/4/16/16/16/64/64/64/256/256 With a data alignment of 1, we need up to 15 TLBCAMs: 1/1/1/1/4/4/4/16/16/16/64/64/64/256/256 By default, set a 16 Mbytes alignment as a compromise between memory usage and number of TLBCAMs. This can be adjusted manually when needed. For the time being, it doens't work when the base is randomised. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/29f9e5d2bbbc83ae9ca879265426a6278bf4d5bb.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/fsl_booke: Update of TLBCAMs after initChristophe Leroy2-5/+29
After init, set readonly memory as ROX and set readwrite memory as RWX, if STRICT_KERNEL_RWX is enabled. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/66bef0b9c273e1121706883f3cf5ad0a053d863f.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/fsl_booke: Allocate separate TLBCAMs for readonly memoryChristophe Leroy1-3/+22
Reorganise TLBCAM allocation so that when STRICT_KERNEL_RWX is enabled, TLBCAMs are allocated such that readonly memory uses different TLBCAMs. This results in an allocation looking like: Memory CAM mapping: 4/4/4/1/1/1/1/16/16/16/64/64/64/256/256 Mb, residual: 256Mb Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/8ca169bc288261a0e0558712f979023c3a960ebb.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/fsl_booke: Tell map_mem_in_cams() if init is doneChristophe Leroy4-10/+10
In order to be able to call map_mem_in_cams() once more after init for STRICT_KERNEL_RWX, add an argument. For now, map_mem_in_cams() is always called only during init. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/3b69a7e0b393b16984ade882a5eae5d727717459.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/fsl_booke: Enable reloading of TLBCAM without switching to AS1Christophe Leroy1-2/+6
Avoid switching to AS1 when reloading TLBCAM after init for STRICT_KERNEL_RWX. When we setup AS1 we expect the entire accessible memory to be mapped through one entry, this is not the case anymore at the end of init. We are not changing the size of TLBCAMs, only flags, so no need to switch to AS1. So change loadcam_multi() to not switch to AS1 when the given temporary tlb entry in 0. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/a9d517fbfbc940f56103c46b323f6eb8f4485571.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/fsl_booke: Take exec flag into account when setting TLBCAMsChristophe Leroy1-4/+6
Don't force MAS3_SX and MAS3_UX at all time. Take into account the exec flag. While at it, fix a couple of closeby style problems (indent with space and unnecessary parenthesis), it keeps more readability. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/5467044e59f27f9fcf709b9661779e3ce5f784f6.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/fsl_booke: Rename fsl_booke.c to fsl_book3e.cChristophe Leroy2-2/+2
We have a myriad of CONFIG symbols around different variants of BOOKEs, which would be worth tidying up one day. But at least, make file names and CONFIG option match: We have CONFIG_FSL_BOOKE and CONFIG_PPC_FSL_BOOK3E. fsl_booke.c is selected by and only by CONFIG_PPC_FSL_BOOK3E. So rename it fsl_book3e to reduce confusion. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/5dc871db1f67739319bec11f049ca450da1c13a2.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/booke: Disable STRICT_KERNEL_RWX, DEBUG_PAGEALLOC and KFENCEChristophe Leroy1-3/+3
fsl_booke and 44x are not able to map kernel linear memory with pages, so they can't support DEBUG_PAGEALLOC and KFENCE, and STRICT_KERNEL_RWX is also a problem for now. Enable those only on book3s (both 32 and 64 except KFENCE), 8xx and 40x. Fixes: 88df6e90fa97 ("[POWERPC] DEBUG_PAGEALLOC for 32-bit") Fixes: 95902e6c8864 ("powerpc/mm: Implement STRICT_KERNEL_RWX on PPC32") Fixes: 90cbac0e995d ("powerpc: Enable KFENCE for PPC32") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/d1ad9fdd9b27da3fdfa16510bb542ed51fa6e134.1634292136.git.christophe.leroy@csgroup.eu
2021-10-22powerpc/kexec_file: Add of_node_put() before gotoWan Jiabing1-0/+1
Fix following coccicheck warning: ./arch/powerpc/kexec/file_load_64.c:698:1-22: WARNING: Function for_each_node_by_type should have of_node_put() before goto Early exits from for_each_node_by_type should decrement the node reference counter. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211018015418.10182-1-wanjiabing@vivo.com
2021-10-22powerpc/pseries/iommu: Add of_node_put() before breakWan Jiabing1-1/+3
Fix following coccicheck warning: ./arch/powerpc/platforms/pseries/iommu.c:924:1-28: WARNING: Function for_each_node_with_property should have of_node_put() before break Early exits from for_each_node_with_property should decrement the node reference counter. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211014075624.16344-1-wanjiabing@vivo.com
2021-10-22powerpc/s64: Clarify that radix lacks DEBUG_PAGEALLOCJoel Stanley5-1/+23
The page_alloc.c code will call into __kernel_map_pages() when DEBUG_PAGEALLOC is configured and enabled. As the implementation assumes hash, this should crash spectacularly if not for a bit of luck in __kernel_map_pages(). In this function linear_map_hash_count is always zero, the for loop exits without doing any damage. There are no other platforms that determine if they support debug_pagealloc at runtime. Instead of adding code to mm/page_alloc.c to do that, this change turns the map/unmap into a noop when in radix mode and prints a warning once. Signed-off-by: Joel Stanley <joel@jms.id.au> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [mpe: Reformat if per Christophe's suggestion] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211013213438.675095-1-joel@jms.id.au
2021-10-21Merge tag 'powerpc-5.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds2-6/+6
Pull powerpc fixes from Michael Ellerman: - Fix a bug exposed by a previous fix, where running guests with certain SMT topologies could crash the host on Power8. - Fix atomic sleep warnings when re-onlining CPUs, when PREEMPT is enabled. Thanks to Nathan Lynch, Srikar Dronamraju, and Valentin Schneider. * tag 'powerpc-5.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/smp: do not decrement idle task preempt count in CPU offline powerpc/idle: Don't corrupt back chain when going idle
2021-10-20KVM: PPC: Replace zero-length array with flexible array memberLen Baker2-3/+2
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use "flexible array members" [1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. Also, make use of the struct_size() helper in kzalloc(). [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays Signed-off-by: Len Baker <len.baker@gmx.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20powerpc: Use of_get_cpu_hwid()Rob Herring1-6/+1
Replace open coded parsing of CPU nodes' 'reg' property with of_get_cpu_hwid(). Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211006164332.1981454-8-robh@kernel.org
2021-10-20powerpc/smp: do not decrement idle task preempt count in CPU offlineNathan Lynch1-2/+0
With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we get: BUG: scheduling while atomic: swapper/1/0/0x00000000 no locks held by swapper/1/0. CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100 Call Trace: dump_stack_lvl+0xac/0x108 __schedule_bug+0xac/0xe0 __schedule+0xcf8/0x10d0 schedule_idle+0x3c/0x70 do_idle+0x2d8/0x4a0 cpu_startup_entry+0x38/0x40 start_secondary+0x2ec/0x3a0 start_secondary_prolog+0x10/0x14 This is because powerpc's arch_cpu_idle_dead() decrements the idle task's preempt count, for reasons explained in commit a7c2bb8279d2 ("powerpc: Re-enable preemption before cpu_die()"), specifically "start_secondary() expects a preempt_count() of 0." However, since commit 2c669ef6979c ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") and commit f1a0a376ca0c ("sched/core: Initialize the idle task with preemption disabled"), that justification no longer holds. The idle task isn't supposed to re-enable preemption, so remove the vestigial preempt_enable() from the CPU offline path. Tested with pseries and powernv in qemu, and pseries on PowerVM. Fixes: 2c669ef6979c ("powerpc/preempt: Don't touch the idle task's preempt_count during hotplug") Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211015173902.2278118-1-nathanl@linux.ibm.com
2021-10-20powerpc/idle: Don't corrupt back chain when going idleMichael Ellerman1-4/+6
In isa206_idle_insn_mayloss() we store various registers into the stack red zone, which is allowed. However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again, to 0(r1), which corrupts the stack back chain. We used to do the same in isa206_idle_insn_mayloss() itself, but we fixed that in 73287caa9210 ("powerpc64/idle: Fix SP offsets when saving GPRs"), however we missed that the macro also corrupts the back chain. Corrupting the back chain is bad for debuggability but doesn't necessarily cause a bug. However we recently changed the stack handling in some KVM code, and it now relies on the stack back chain being valid when it returns. The corruption causes that code to return with r1 pointing somewhere in kernel data, at some point LR is restored from the stack and we branch to NULL or somewhere else invalid. Only affects Power8 hosts running KVM guests, with dynamic_mt_modes enabled (which it is by default). The fixes tag below points to the commit that changed the KVM stack handling, exposing this bug. The actual corruption of the back chain has always existed since 948cf67c4726 ("powerpc: Add NAP mode support on Power7 in HV mode"). Fixes: 9b4416c5095c ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211020094826.3222052-1-mpe@ellerman.id.au
2021-10-19powerpc/perf: Fix data source encodings for L2.1 and L3.1 accessesKajol Jain2-5/+23
Fix the data source encodings to represent L2.1/L3.1(another core's L2/L3 on the same node) accesses properly for power10 and older plaforms. Add new macros(LEVEL/REM) which can be used to add mem_lvl_num and remote field data inside perf_mem_data_src structure. Result in power9 system with patch changes: localhost:~/linux/tools/perf # ./perf mem report | grep Remote 0.01% 1 252 Remote core, same node L3 or L3 hit [.] 0x0000000000002dd0 producer_consumer [.] 0x00007fff7f25eb90 anon HitM N/A No N/A 0 0 0.01% 1 220 Remote core, same node L3 or L3 hit [.] 0x0000000000002dd0 producer_consumer [.] 0x00007fff77776d90 anon HitM N/A No N/A 0 0 0.01% 1 220 Remote core, same node L3 or L3 hit [.] 0x0000000000002dd0 producer_consumer [.] 0x00007fff817d9410 anon HitM N/A No N/A 0 0 Fixes: 79e96f8f930d ("powerpc/perf: Export memory hierarchy info to user space") Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20211006140654.298352-5-kjain@linux.ibm.com
2021-10-18gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}Andreas Gruenbacher3-4/+5
Turn fault_in_pages_{readable,writeable} into versions that return the number of bytes not faulted in, similar to copy_to_user, instead of returning a non-zero value when any of the requested pages couldn't be faulted in. This supports the existing users that require all pages to be faulted in as well as new users that are happy if any pages can be faulted in. Rename the functions to fault_in_{readable,writeable} to make sure this change doesn't silently break things. Neither of these functions is entirely trivial and it doesn't seem useful to inline them, so move them to mm/gup.c. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-18powerpc/kvm: Fix kvm_use_magic_pageAndreas Gruenbacher1-1/+1
When switching from __get_user to fault_in_pages_readable, commit 9f9eae5ce717 broke kvm_use_magic_page: like __get_user, fault_in_pages_readable returns 0 on success. Fixes: 9f9eae5ce717 ("powerpc/kvm: Prefer fault_in_pages_readable function") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2021-10-18powerpc/eeh: Use to_pci_driver() instead of pci_dev->driverUwe Kleine-König1-5/+5
Struct pci_driver contains a struct device_driver, so for PCI devices, it's easy to convert a device_driver * to a pci_driver * with to_pci_driver(). The device_driver * is in struct device, so we don't need to also keep track of the pci_driver * in struct pci_dev. Replace pdev->driver with to_pci_driver(). This is a step toward removing pci_dev->driver. [bhelgaas: split to separate patch] Link: https://lore.kernel.org/r/20211004125935.2300113-11-u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2021-10-18mm: don't include <linux/blk-cgroup.h> in <linux/backing-dev.h>Christoph Hellwig1-0/+1
There is no need to pull blk-cgroup.h and thus blkdev.h in here, so break the include chain. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-17Merge tag 'powerpc-5.15-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds2-12/+19
Pull powerpc fixes from Michael Ellerman: - Fix a bug where guests on P9 with interrupts passed through could get stuck in synchronize_irq(). - Fix a bug in KVM on P8 where secondary threads entering a guest would write outside their allocated stack. - Fix a bug in KVM on P8 where secondary threads could confuse the host offline code and cause the guest or host to crash. Thanks to Cédric Le Goater. * tag 'powerpc-5.15-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest() powerpc/xive: Discard disabled interrupts in get_irqchip_state()
2021-10-16KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guestMichael Ellerman1-2/+7
We call idle_kvm_start_guest() from power7_offline() if the thread has been requested to enter KVM. We pass it the SRR1 value that was returned from power7_idle_insn() which tells us what sort of wakeup we're processing. Depending on the SRR1 value we pass in, the KVM code might enter the guest, or it might return to us to do some host action if the wakeup requires it. If idle_kvm_start_guest() is able to handle the wakeup, and enter the guest it is supposed to indicate that by returning a zero SRR1 value to us. That was the behaviour prior to commit 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C"), however in that commit the handling of SRR1 was reworked, and the zeroing behaviour was lost. Returning from idle_kvm_start_guest() without zeroing the SRR1 value can confuse the host offline code, causing the guest to crash and other weirdness. Fixes: 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211015133929.832061-2-mpe@ellerman.id.au
2021-10-16KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()Michael Ellerman1-9/+10
In commit 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C") kvm_start_guest() became idle_kvm_start_guest(). The old code allocated a stack frame on the emergency stack, but didn't use the frame to store anything, and also didn't store anything in its caller's frame. idle_kvm_start_guest() on the other hand is written more like a normal C function, it creates a frame on entry, and also stores CR/LR into its callers frame (per the ABI). The problem is that there is no caller frame on the emergency stack. The emergency stack for a given CPU is allocated with: paca_ptrs[i]->emergency_sp = alloc_stack(limit, i) + THREAD_SIZE; So emergency_sp actually points to the first address above the emergency stack allocation for a given CPU, we must not store above it without first decrementing it to create a frame. This is different to the regular kernel stack, paca->kstack, which is initialised to point at an initial frame that is ready to use. idle_kvm_start_guest() stores the backchain, CR and LR all of which write outside the allocation for the emergency stack. It then creates a stack frame and saves the non-volatile registers. Unfortunately the frame it creates is not large enough to fit the non-volatiles, and so the saving of the non-volatile registers also writes outside the emergency stack allocation. The end result is that we corrupt whatever is at 0-24 bytes, and 112-248 bytes above the emergency stack allocation. In practice this has gone unnoticed because the memory immediately above the emergency stack happens to be used for other stack allocations, either another CPUs mc_emergency_sp or an IRQ stack. See the order of calls to irqstack_early_init() and emergency_stack_init(). The low addresses of another stack are the top of that stack, and so are only used if that stack is under extreme pressue, which essentially never happens in practice - and if it did there's a high likelyhood we'd crash due to that stack overflowing. Still, we shouldn't be corrupting someone else's stack, and it is purely luck that we aren't corrupting something else. To fix it we save CR/LR into the caller's frame using the existing r1 on entry, we then create a SWITCH_FRAME_SIZE frame (which has space for pt_regs) on the emergency stack with the backchain pointing to the existing stack, and then finally we switch to the new frame on the emergency stack. Fixes: 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20211015133929.832061-1-mpe@ellerman.id.au
2021-10-15sched: Add wrapper for get_wchan() to keep task blockedKees Cook2-7/+4
Having a stable wchan means the process must be blocked and for it to stay that way while performing stack unwinding. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm] Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64] Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org