aboutsummaryrefslogtreecommitdiffstats
path: root/fs (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2020-06-02powerpc: Add new HWCAP bitsAlistair Popple1-0/+2
POWER10 introduces two new architectural features - ISAv3.1 and matrix multiply assist (MMA) instructions. Userspace detects the presence of these features via two HWCAP bits introduced in this patch. These bits have been agreed to by the compiler and binutils team. According to ISAv3.1 MMA is an optional feature and software that makes use of it should first check for availability via this HWCAP bit and use alternate code paths if unavailable. Signed-off-by: Alistair Popple <alistair@popple.id.au> Tested-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200521014341.29095-2-alistair@popple.id.au
2020-06-02powerpc/64s: Don't set FSCR bits in INIT_THREADMichael Ellerman1-1/+0
Since the previous commit that saves the value of FSCR configured at boot into init_task.thread.fscr, the static initialisation in INIT_THREAD now no longer has any effect. So remove it. For non DT CPU features, the end result is the same, because __init_FSCR() is called on all CPUs that have an FSCR (Power8, Power9), and it sets FSCR_TAR & FSCR_EBB. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200527145843.2761782-4-mpe@ellerman.id.au
2020-06-02powerpc/64s: Save FSCR to init_task.thread.fscr after feature initMichael Ellerman1-0/+19
At boot the FSCR is initialised via one of two paths. On most systems it's set to a hard coded value in __init_FSCR(). On newer skiboot systems we use the device tree CPU features binding, where firmware can tell Linux what bits to set in FSCR (and HFSCR). In both cases the value that's configured at boot is not propagated into the init_task.thread.fscr value prior to the initial fork of init (pid 1), which means the value is not used by any processes other than swapper (the idle task). For the __init_FSCR() case this is OK, because the value in init_task.thread.fscr is initialised to something sensible. However it does mean that the value set in __init_FSCR() is not used other than for swapper, which is odd and confusing. The bigger problem is for the device tree CPU features case it prevents firmware from setting (or clearing) FSCR bits for use by user space. This means all existing kernels can not have features enabled/disabled by firmware if those features require setting/clearing FSCR bits. We can handle both cases by saving the FSCR value into init_task.thread.fscr after we have initialised it at boot. This fixes the bug for device tree CPU features, and will allow us to simplify the initialisation for the __init_FSCR() case in a future patch. Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200527145843.2761782-3-mpe@ellerman.id.au
2020-06-02powerpc/64s: Don't let DT CPU features set FSCR_DSCRMichael Ellerman1-0/+8
The device tree CPU features binding includes FSCR bit numbers which Linux is instructed to set by firmware. Whether that's a good idea or not, in the case of the DSCR the Linux implementation has a hard requirement that the FSCR_DSCR bit not be set by default. We use it to track when a process reads/writes to DSCR, so it must be clear to begin with. So if firmware tells us to set FSCR_DSCR we must ignore it. Currently this does not cause a bug in our DSCR handling because the value of FSCR that the device tree CPU features code establishes is only used by swapper. All other tasks use the value hard coded in init_task.thread.fscr. However we'd like to fix that in a future commit, at which point this will become necessary. Fixes: 5a61ef74f269 ("powerpc/64s: Support new device tree binding for discovering CPU features") Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200527145843.2761782-2-mpe@ellerman.id.au
2020-06-02powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()Michael Ellerman1-1/+1
__init_FSCR() was added originally in commit 2468dcf641e4 ("powerpc: Add support for context switching the TAR register") (Feb 2013), and only set FSCR_TAR. At that point FSCR (Facility Status and Control Register) was not context switched, so the setting was permanent after boot. Later we added initialisation of FSCR_DSCR to __init_FSCR(), in commit 54c9b2253d34 ("powerpc: Set DSCR bit in FSCR setup") (Mar 2013), again that was permanent after boot. Then commit 2517617e0de6 ("powerpc: Fix context switch DSCR on POWER8") (Aug 2013) added a limited context switch of FSCR, just the FSCR_DSCR bit was context switched based on thread.dscr_inherit. That commit said "This clears the H/FSCR DSCR bit initially", but it didn't, it left the initialisation of FSCR_DSCR in __init_FSCR(). However the initial context switch from init_task to pid 1 would clear FSCR_DSCR because thread.dscr_inherit was 0. That commit also introduced the requirement that FSCR_DSCR be clear for user processes, so that we can take the facility unavailable interrupt in order to manage dscr_inherit. Then in commit 152d523e6307 ("powerpc: Create context switch helpers save_sprs() and restore_sprs()") (Dec 2015) FSCR was added to thread_struct. However it still wasn't fully context switched, we just took the existing value and set FSCR_DSCR if the new thread had dscr_inherit set. FSCR was still initialised at boot to FSCR_DSCR | FSCR_TAR, but that value was not propagated into the thread_struct, so the initial context switch set FSCR_DSCR back to 0. Finally commit b57bd2de8c6c ("powerpc: Improve FSCR init and context switching") (Jun 2016) added a full context switch of the FSCR, and added an initialisation of init_task.thread.fscr to FSCR_TAR | FSCR_EBB, but omitted FSCR_DSCR. The end result is that swapper runs with FSCR_DSCR set because of the initialisation in __init_FSCR(), but no other processes do, they use the value from init_task.thread.fscr. Having FSCR_DSCR set for swapper allows it to access SPR 3 from userspace, but swapper never runs userspace, so it has no useful effect. It's also confusing to have the value initialised in two places to two different values. So remove FSCR_DSCR from __init_FSCR(), this at least gets us to the point where there's a single value of FSCR, even if it's still set in two places. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Alistair Popple <alistair@popple.id.au> Link: https://lore.kernel.org/r/20200527145843.2761782-1-mpe@ellerman.id.au
2020-06-02powerpc/32s: Fix another build failure with CONFIG_PPC_KUAP_DEBUGChristophe Leroy1-1/+2
'thread' doesn't exist in kuap_check() macro. Use 'current' instead. Fixes: a68c31fc01ef ("powerpc/32s: Implement Kernel Userspace Access Protection") Cc: stable@vger.kernel.org Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/b459e1600b969047a74e34251a84a3d6fdf1f312.1590858925.git.christophe.leroy@csgroup.eu
2020-06-02powerpc/module_64: Use special stub for _mcount() with -mprofile-kernelNaveen N. Rao1-118/+104
Since commit c55d7b5e64265f ("powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE"), powerpc kernels with -mprofile-kernel can crash in certain scenarios with a trace like below: BUG: Unable to handle kernel instruction fetch (NULL pointer?) Faulting instruction address: 0x00000000 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA PowerNV <snip> NIP [0000000000000000] 0x0 LR [c0080000102c0048] ext4_iomap_end+0x8/0x30 [ext4] Call Trace: iomap_apply+0x20c/0x920 (unreliable) iomap_bmap+0xfc/0x160 ext4_bmap+0xa4/0x180 [ext4] bmap+0x4c/0x80 jbd2_journal_init_inode+0x44/0x1a0 [jbd2] ext4_load_journal+0x440/0x860 [ext4] ext4_fill_super+0x342c/0x3ab0 [ext4] mount_bdev+0x25c/0x290 ext4_mount+0x28/0x50 [ext4] legacy_get_tree+0x4c/0xb0 vfs_get_tree+0x4c/0x130 do_mount+0xa18/0xc50 sys_mount+0x158/0x180 system_call+0x5c/0x68 The NIP points to NULL, or a random location (data even), while the LR always points to the LEP of a function (with an offset of 8), indicating that something went wrong with ftrace. However, ftrace is not necessarily active when such crashes occur. The kernel OOPS sometimes follows a warning from ftrace indicating that some module functions could not be patched with a nop. Other times, if a module is loaded early during boot, instruction patching can fail due to a separate bug, but the error is not reported due to missing error reporting. In all the above cases when instruction patching fails, ftrace will be disabled but certain kernel module functions will be left with default calls to _mcount(). This is not a problem with ELFv1. However, with -mprofile-kernel, the default stub is problematic since it depends on a valid module TOC in r2. If the kernel (or a different module) calls into a function that does not use the TOC, the function won't have a prologue to setup the module TOC. When that function calls into _mcount(), we will end up in the relocation stub that will use the previous TOC, and end up trying to jump into a random location. From the above trace: iomap_apply+0x20c/0x920 [kernel TOC] | V ext4_iomap_end+0x8/0x30 [no GEP == kernel TOC] | V _mcount() stub [uses kernel TOC -> random entry] To address this, let's change over to using the special stub that is used for ftrace_[regs_]caller() for _mcount(). This ensures that we are not dependent on a valid module TOC in r2 for default _mcount() handling. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Tested-by: Qian Cai <cai@lca.pw> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/8affd4298d22099bbd82544fab8185700a6222b1.1587488954.git.naveen.n.rao@linux.vnet.ibm.com
2020-06-02powerpc/module_64: Simplify check for -mprofile-kernel ftrace relocationsNaveen N. Rao1-14/+4
For -mprofile-kernel, we need special handling when generating stubs for ftrace calls such as _mcount(). To faciliate this, we check if a R_PPC64_REL24 relocation is for a symbol named "_mcount()" along with also checking the instruction sequence. The latter is not really required since "_mcount()" is an exported symbol and kernel modules cannot use it. As such, drop the additional checking and simplify the code. This helps unify stub creation for ftrace stubs with -mprofile-kernel and aids in code reuse. Also rename is_mprofile_mcount_callsite() to is_mprofile_ftrace_call() to reflect the checking being done. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/7d9c316adfa1fb787ad268bb4691e7e4059ff2d5.1587488954.git.naveen.n.rao@linux.vnet.ibm.com
2020-06-02powerpc/module_64: Consolidate ftrace codeNaveen N. Rao2-39/+33
module_trampoline_target() is only used by ftrace. Move the prototype within the appropriate #ifdef in the header. Also, move the function body to the end of module_64.c so as to consolidate all ftrace code in one place. No functional changes. Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/2527351f65c53c5866068ae130dc34c5d4ee8ad9.1587488954.git.naveen.n.rao@linux.vnet.ibm.com
2020-06-02powerpc/32: Disable KASAN with pages bigger than 16kChristophe Leroy1-2/+2
Mapping of early shadow area is implemented by using a single static page table having all entries pointing to the same early shadow page. The shadow area must therefore occupy full PGD entries. The shadow area has a size of 128MB starting at 0xf8000000. With 4k pages, a PGD entry is 4MB With 16k pages, a PGD entry is 64MB With 64k pages, a PGD entry is 1GB which is too big. Until we rework the early shadow mapping, disable KASAN when the page size is too big. Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support") Cc: stable@vger.kernel.org # v5.2+ Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/7195fcde7314ccbf7a081b356084a69d421b10d4.1590660977.git.christophe.leroy@csgroup.eu
2020-06-02powerpc/uaccess: Don't set KUEP by default on book3s/32Christophe Leroy1-1/+1
On book3s/32, KUEP is an heavy process as it requires to set/unset the NX bit in each of the 12 user segments everytime the kernel is entered/exited from/to user space. Don't select KUEP by default on book3s/32. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1492bb150c1aaa53d99a604b49992e60ea20cd5f.1586962582.git.christophe.leroy@c-s.fr
2020-06-02powerpc/uaccess: Don't set KUAP by default on book3s/32Christophe Leroy1-1/+1
On book3s/32, KUAP is an heavy process as it requires to determine which segments are impacted and unlock/lock each of them. And since the implementation of user_access_begin/end, it is even worth for the time being because unlike __get_user(), user_access_begin doesn't make difference between read and write and unlocks access also for read allthought that's unneeded on book3s/32. As shown by the size of a kernel built with KUAP and one without, the overhead is 64k bytes of code. As a comparison a similar build on an 8xx has an overhead of only 8k bytes of code. text data bss dec hex filename 7230416 1425868 837376 9493660 90dc9c vmlinux.kuap6xx 7165012 1425548 837376 9427936 8fdbe0 vmlinux.nokuap6xx 6519796 1960028 477464 8957288 88ad68 vmlinux.kuap8xx 6511664 1959864 477464 8948992 888d00 vmlinux.nokuap8xx Until a more optimised KUAP is implemented on book3s/32, don't select it by default. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/154a99399317b096ac1f04827b9f8d7a9179ddc1.1586962586.git.christophe.leroy@c-s.fr
2020-06-02powerpc/8xx: Reduce time spent in allow_user_access() and friendsChristophe Leroy1-8/+8
To enable/disable kernel access to user space, the 8xx has to modify the properties of access group 1. This is done by writing predefined values into SPRN_Mx_AP registers. As of today, a __put_user() gives: 00000d64 <my_test>: d64: 3d 20 4f ff lis r9,20479 d68: 61 29 ff ff ori r9,r9,65535 d6c: 7d 3a c3 a6 mtspr 794,r9 d70: 39 20 00 00 li r9,0 d74: 90 83 00 00 stw r4,0(r3) d78: 3d 20 6f ff lis r9,28671 d7c: 61 29 ff ff ori r9,r9,65535 d80: 7d 3a c3 a6 mtspr 794,r9 d84: 4e 80 00 20 blr Because only groups 0 and 1 are used, the definition of groups 2 to 15 doesn't matter. By setting unused bits to 0 instead on 1, one instruction is removed for each lock and unlock action: 00000d5c <my_test>: d5c: 3d 20 40 00 lis r9,16384 d60: 7d 3a c3 a6 mtspr 794,r9 d64: 39 20 00 00 li r9,0 d68: 90 83 00 00 stw r4,0(r3) d6c: 3d 20 60 00 lis r9,24576 d70: 7d 3a c3 a6 mtspr 794,r9 d74: 4e 80 00 20 blr Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/57425c33dd72f292b1a23570244b81419072a7aa.1586945153.git.christophe.leroy@c-s.fr
2020-06-02powerpc/entry32: Blacklist exception exit points for kprobe.Christophe Leroy1-1/+8
kprobe does not handle events happening in real mode. The very last part of exception exits cannot support a trap. Blacklist them from kprobe. While we are at it, remove exc_exit_start symbol which is not used to avoid having to blacklist it. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/098b0fd3f6299aa1bd692bd576bd7012c84608de.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/entry32: Blacklist syscall exit points for kprobe.Christophe Leroy1-0/+3
kprobe does not handle events happening in real mode. The very last part of syscall cannot support a trap. Add a symbol syscall_exit_finish to identify that part and blacklist it from kprobe. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/23eddf49abb03d1359fa0be4206998eb3800f42c.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/entry32: Blacklist exception entry points for kprobe.Christophe Leroy1-15/+22
kprobe does not handle events happening in real mode. As exception entry points are running with MMU disabled, blacklist them. The handling of TLF_NAPPING and TLF_SLEEPING is moved before the CONFIG_TRACE_IRQFLAGS which contains 'reenable_mmu' because from there kprobe will be possible as the kernel will run with MMU enabled. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/f61ac599855e674ebb592464d0ea32a3ba9c6644.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/32: Blacklist functions running with MMU disabled for kprobeChristophe Leroy10-0/+16
kprobe does not handle events happening in real mode, all functions running with MMU disabled have to be blacklisted. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/3bf57066d05518644dee0840af69d36ab5086729.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/rtas: Remove machine_check_in_rtas()Christophe Leroy2-7/+1
machine_check_in_rtas() is just a trap. Do the trap directly in the machine check exception handler. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/78899f40f89cb3c4f69bdff7f04eb6ec7cb753d5.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/32s: Blacklist functions running with MMU disabled for kprobeChristophe Leroy1-0/+6
kprobe does not handle events happening in real mode, all functions running with MMU disabled have to be blacklisted. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/dabed523c1b8955dd425152ce260b390053e727a.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/32s: Make local symbols non visible in hash_low.Christophe Leroy1-13/+13
In hash_low.S, a lot of named local symbols are used instead of numbers to ease code readability. However, they don't need to be visible. In order to ease blacklisting of functions running with MMU disabled for kprobe, rename the symbols to .Lsymbols in order to hide them as if they were numbered labels. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/90c430d9e0f7af772a58aaeaf17bcc6321265340.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/mem: Blacklist flush_dcache_icache_phys() for kprobeChristophe Leroy1-0/+2
kprobe does not handle events happening in real mode, all functions running with MMU disabled have to be blacklisted. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/eaab3bff961c3bfe149f1d0bd3593291ef939dcc.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/powermac: Blacklist functions running with MMU disabled for kprobeChristophe Leroy2-1/+6
kprobe does not handle events happening in real mode, all functions running with MMU disabled have to be blacklisted. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/6316e8883753499073f47301857e4e88b73c3ddd.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/83xx: Blacklist mpc83xx_deep_resume() for kprobeChristophe Leroy1-0/+1
kprobe does not handle events happening in real mode, all functions running with MMU disabled have to be blacklisted. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/3ac4ab8dd7008b9706d9228a60645a1756fa84bf.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/82xx: Blacklist pq2_restart() for kprobeChristophe Leroy1-0/+3
kprobe does not handle events happening in real mode, all functions running with MMU disabled have to be blacklisted. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/5dca36682383577a3c2b2bca4d577e8654944461.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/52xx: Blacklist functions running with MMU disabled for kprobeChristophe Leroy1-0/+2
kprobe does not handle events happening in real mode, all functions running with MMU disabled have to be blacklisted. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1ae02b6637b87fc5aaa1d5012c3e2cb30e62b4a3.1585670437.git.christophe.leroy@c-s.fr
2020-06-02powerpc/kprobes: Use probe_address() to read instructionsChristophe Leroy1-3/+7
In order to avoid Oopses, use probe_address() to read the instruction at the address where the trap happened. Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/7f24b5961a6839ff01df792816807f74ff236bf6.1582567319.git.christophe.leroy@c-s.fr
2020-06-02powerpc/configs: Add LIBNVDIMM to ppc64_defconfigMichael Neuling1-0/+1
This gives us OF_PMEM which is useful in mambo. This adds 153K to the text of ppc64le_defconfig which 0.8% of the total text. LIBNVDIMM text data bss dec hex Without 18574833 5518150 1539240 25632223 1871ddf With 18727834 5546206 1539368 25813408 189e1a0 Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200519043009.3081885-1-mikey@neuling.org
2020-06-02powerpc/rtas: Implement reentrant rtas callLeonardo Bras5-11/+98
Implement rtas_call_reentrant() for reentrant rtas-calls: "ibm,int-on", "ibm,int-off",ibm,get-xive" and "ibm,set-xive". On LoPAPR Version 1.1 (March 24, 2016), from 7.3.10.1 to 7.3.10.4, items 2 and 3 say: 2 - For the PowerPC External Interrupt option: The * call must be reentrant to the number of processors on the platform. 3 - For the PowerPC External Interrupt option: The * argument call buffer for each simultaneous call must be physically unique. So, these rtas-calls can be called in a lockless way, if using a different buffer for each cpu doing such rtas call. For this, it was suggested to add the buffer (struct rtas_args) in the PACA struct, so each cpu can have it's own buffer. The PACA struct received a pointer to rtas buffer, which is allocated in the memory range available to rtas 32-bit. Reentrant rtas calls are useful to avoid deadlocks in crashing, where rtas-calls are needed, but some other thread crashed holding the rtas.lock. This is a backtrace of a deadlock from a kdump testing environment: #0 arch_spin_lock #1 lock_rtas () #2 rtas_call (token=8204, nargs=1, nret=1, outputs=0x0) #3 ics_rtas_mask_real_irq (hw_irq=4100) #4 machine_kexec_mask_interrupts #5 default_machine_crash_shutdown #6 machine_crash_shutdown #7 __crash_kexec #8 crash_kexec #9 oops_end Signed-off-by: Leonardo Bras <leobras.c@gmail.com> [mpe: Move under #ifdef PSERIES to avoid build breakage] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200518234245.200672-3-leobras.c@gmail.com
2020-06-02powerpc/rtas: Move type/struct definitions from rtas.h into rtas-types.hLeonardo Bras2-117/+125
In order to get any rtas* struct into other headers, including rtas.h may cause a lot of errors, regarding include dependency needed for inline functions. Create rtas-types.h and move there all type/struct definitions from rtas.h, then include rtas-types.h into rtas.h. Also, as suggested by checkpath.pl, replace uint8_t for u8. Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200518234245.200672-2-leobras.c@gmail.com
2020-06-02powerpc/crash: Use NMI context for printk when starting to crashLeonardo Bras1-0/+3
Currently, if printk lock (logbuf_lock) is held by other thread during crash, there is a chance of deadlocking the crash on next printk, and blocking a possibly desired kdump. At the start of default_machine_crash_shutdown, make printk enter NMI context, as it will use per-cpu buffers to store the message, and avoid locking logbuf_lock. Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200512214533.93878-1-leobras.c@gmail.com
2020-06-02powerpc/kernel: Enables memory hot-remove after reboot on pseries guestsLeonardo Bras2-2/+8
While providing guests, it's desirable to resize it's memory on demand. By now, it's possible to do so by creating a guest with a small base memory, hot-plugging all the rest, and using 'movable_node' kernel command-line parameter, which puts all hot-plugged memory in ZONE_MOVABLE, allowing it to be removed whenever needed. But there is an issue regarding guest reboot: If memory is hot-plugged, and then the guest is rebooted, all hot-plugged memory goes to ZONE_NORMAL, which offers no guaranteed hot-removal. It usually prevents this memory to be hot-removed from the guest. It's possible to use device-tree information to fix that behavior, as it stores flags for LMB ranges on ibm,dynamic-memory-vN. It involves marking each memblock with the correct flags as hotpluggable memory, which mm/memblock.c puts in ZONE_MOVABLE during boot if 'movable_node' is passed. For carrying such information, the new flag DRCONF_MEM_HOTREMOVABLE was proposed and accepted into Power Architecture documentation. This flag should be: - true (b=1) if the hypervisor may want to hot-remove it later, and - false (b=0) if it does not care. During boot, guest kernel reads the device-tree, early_init_drmem_lmb() is called for every added LMBs. Here, checking for this new flag and marking memblocks as hotplugable memory is enough to get the desirable behavior. This should cause no change if 'movable_node' parameter is not passed in kernel command-line. Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com> Reviewed-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200402195156.626430-1-leonardo@linux.ibm.com
2020-06-02powerpc/xmon: Show task->thread.regs in process displayMichael Ellerman1-3/+3
Show the address of the tasks regs in the process listing in xmon. The regs should always be on the stack page that we also print the address of, but it's still helpful not to have to find them by hand. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200520111740.953679-1-mpe@ellerman.id.au
2020-06-02powerpc/configs/64s: Enable CONFIG_PRINTK_CALLERMichael Ellerman3-0/+3
This adds the CPU or thread number to printk messages. This helps a lot when deciphering concurrent oopses that have been interleaved. Example output, of PID1 (T1) triggering a warning: [ 1.581678][ T1] WARNING: CPU: 0 PID: 1 at crypto/rsa-pkcs1pad.c:539 pkcs1pad_verify+0x38/0x140 [ 1.581681][ T1] Modules linked in: [ 1.581693][ T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.5.0-rc5-gcc-8.2.0-00121-gf84c2e595927-dirty #1515 [ 1.581700][ T1] NIP: c000000000207d64 LR: c000000000207d3c CTR: c000000000207d2c [ 1.581708][ T1] REGS: c0000000fd2e7560 TRAP: 0700 Not tainted (5.5.0-rc5-gcc-8.2.0-00121-gf84c2e595927-dirty) [ 1.581712][ T1] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 44000222 XER: 00040000 Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200520121257.961112-1-mpe@ellerman.id.au
2020-06-02powerpc: Fix misleading small cores printMichael Neuling1-1/+1
Currently when we boot on a big core system, we get this print: [ 0.040500] Using small cores at SMT level This is misleading as we've actually detected big cores. This patch clears up the print to say we've detect big cores but are using small cores for scheduling. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200528230731.1235752-1-mikey@neuling.org
2020-06-02powerpc/fadump: Account for memory_limit while reserving memoryHari Bathini1-1/+1
If the memory chunk found for reserving memory overshoots the memory limit imposed, do not proceed with reserving memory. Default behavior was this until commit 140777a3d8df ("powerpc/fadump: consider reserved ranges while reserving memory") changed it unwittingly. Fixes: 140777a3d8df ("powerpc/fadump: consider reserved ranges while reserving memory") Cc: stable@vger.kernel.org Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/159057266320.22331.6571453892066907320.stgit@hbathini.in.ibm.com
2020-06-02powerpc/crashkernel: Take "mem=" option into accountPingfan Liu1-3/+5
'mem=" option is an easy way to put high pressure on memory during some test. Hence after applying the memory limit, instead of total mem, the actual usable memory should be considered when reserving mem for crashkernel. Otherwise the boot up may experience OOM issue. E.g. it would reserve 4G prior to the change and 512M afterward, if passing crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and mem=5G on a 256G machine. This issue is powerpc specific because it puts higher priority on fadump and kdump reservation than on "mem=". Referring the following code: if (fadump_reserve_mem() == 0) reserve_crashkernel(); ... /* Ensure that total memory size is page-aligned. */ limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE); memblock_enforce_memory_limit(limit); While on other arches, the effect of "mem=" takes a higher priority and pass through memblock_phys_mem_size() before calling reserve_crashkernel(). Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Reviewed-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1585749644-4148-1-git-send-email-kernelfans@gmail.com
2020-06-02hw-breakpoints: Fix build warnings with clangRavi Bangoria2-3/+4
kbuild test robot reported some build warnings in the hw_breakpoint code when compiled with clang[1]. Some of them were introduced by the recent powerpc change to add arch_reserve_bp_slot() and arch_release_bp_slot(). Fix them all. kernel/events/hw_breakpoint.c:71:12: warning: no previous prototype for function 'hw_breakpoint_weight' kernel/events/hw_breakpoint.c:216:12: warning: no previous prototype for function 'arch_reserve_bp_slot' kernel/events/hw_breakpoint.c:221:13: warning: no previous prototype for function 'arch_release_bp_slot' kernel/events/hw_breakpoint.c:228:13: warning: no previous prototype for function 'arch_unregister_hw_breakpoint' [1]: https://lore.kernel.org/linuxppc-dev/202005192233.oi9CjRtA%25lkp@intel.com/ Fixes: 29da4f91c0c1 ("powerpc/watchpoint: Don't allow concurrent perf and ptrace events") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> [mpe: Drop extern, flesh out change log, add Fixes tag] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200602041208.128913-1-ravi.bangoria@linux.ibm.com
2020-05-28powerpc/xive: Share the event-queue page with the Hypervisor.Ram Pai1-0/+7
XIVE interrupt controller uses an Event Queue (EQ) to enqueue event notifications when an exception occurs. The EQ is a single memory page provided by the O/S defining a circular buffer, one per server and priority couple. On baremetal, the EQ page is configured with an OPAL call. On pseries, an extra hop is necessary and the guest OS uses the hcall H_INT_SET_QUEUE_CONFIG to configure the XIVE interrupt controller. The XIVE controller being Hypervisor privileged, it will not be allowed to enqueue event notifications for a Secure VM unless the EQ pages are shared by the Secure VM. Hypervisor/Ultravisor still requires support for the TIMA and ESB page fault handlers. Until this is complete, QEMU can use the emulated XIVE device for Secure VMs, option "kernel_irqchip=off" on the QEMU pseries machine. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Reviewed-by: Cedric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200426020518.GC5853@oc0525413822.ibm.com
2020-05-28powerpc/pseries: Update hv-24x7 information after migrationKajol Jain1-0/+3
Function 'read_sys_info_pseries()' is added to get system parameter values like number of sockets and chips per socket. and it gets these details via rtas_call with token "PROCESSOR_MODULE_INFO". Incase lpar migrate from one system to another, system parameter details like chips per sockets or number of sockets might change. So, it needs to be re-initialized otherwise, these values corresponds to previous system values. This patch adds a call to 'read_sys_info_pseries()' from 'post-mobility_fixup()' to re-init the physsockets and physchips values Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200525104308.9814-6-kjain@linux.ibm.com
2020-05-28Documentation/ABI: Add ABI documentation for chips and socketsKajol Jain1-0/+21
Add documentation for the following sysfs files: /sys/devices/hv_24x7/interface/chipspersocket, /sys/devices/hv_24x7/interface/sockets, /sys/devices/hv_24x7/interface/coresperchip Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200525104308.9814-5-kjain@linux.ibm.com
2020-05-28powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show processor detailsKajol Jain1-0/+24
To expose the system dependent parameter like total number of sockets and numbers of chips per socket, patch adds two sysfs files. "sockets" and "chips" are added to /sys/devices/hv_24x7/interface/ of the "hv_24x7" pmu. Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200525104308.9814-4-kjain@linux.ibm.com
2020-05-28powerpc/hv-24x7: Add rtas call in hv-24x7 driver to get processor detailsKajol Jain2-0/+68
For hv_24x7 socket/chip level events, specific chip-id to which the data requested should be added as part of pmu events. But number of chips/socket in the system details are not exposed. Patch implements read_24x7_sys_info() to get system parameter values like number of sockets, cores per chip and chips per socket. Rtas_call with token "PROCESSOR_MODULE_INFO" is used to get these values. Subsequent patch exports these values via sysfs. Patch also make these parameters default to 1. Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200525104308.9814-3-kjain@linux.ibm.com
2020-05-28powerpc/perf/hv-24x7: Fix inconsistent output values incase multiple hv-24x7 events runKajol Jain1-10/+0
Commit 2b206ee6b0df ("powerpc/perf/hv-24x7: Display change in counter values")' added to print _change_ in the counter value rather then raw value for 24x7 counters. Incase of transactions, the event count is set to 0 at the beginning of the transaction. It also sets the event's prev_count to the raw value at the time of initialization. Because of setting event count to 0, we are seeing some weird behaviour, whenever we run multiple 24x7 events at a time. For example: command#: ./perf stat -e "{hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/, hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/}" -C 0 -I 1000 sleep 100 1.000121704 120 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ 1.000121704 5 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/ 2.000357733 8 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ 2.000357733 10 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/ 3.000495215 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ 3.000495215 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/ 4.000641884 56 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ 4.000641884 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/ 5.000791887 18,446,744,073,709,551,616 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ Getting these large values in case we do -I. As we are setting event_count to 0, for interval case, overall event_count is not coming in incremental order. As we may can get new delta lesser then previous count. Because of which when we print intervals, we are getting negative value which create these large values. This patch removes part where we set event_count to 0 in function 'h_24x7_event_read'. There won't be much impact as we do set event->hw.prev_count to the raw value at the time of initialization to print change value. With this patch In power9 platform command#: ./perf stat -e "{hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/, hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/}" -C 0 -I 1000 sleep 100 1.000117685 93 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ 1.000117685 1 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/ 2.000349331 98 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ 2.000349331 2 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/ 3.000495900 131 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ 3.000495900 4 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/ 4.000645920 204 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ 4.000645920 61 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=1/ 4.284169997 22 hv_24x7/PM_MCS01_128B_RD_DISP_PORT01,chip=0/ Suggested-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Tested-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200525104308.9814-2-kjain@linux.ibm.com
2020-05-28powerpc/powernv/pci: Sprinkle around some WARN_ON()sOliver O'Halloran1-2/+2
pnv_pci_ioda_configure_bus() should now only ever be called when a device is added to the bus so add a WARN_ON() to the empty bus check. Similarly, pnv_pci_ioda_setup_bus_PE() should only ever be called for an unconfigured PE, so add a WARN_ON() for that case too. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200417073508.30356-5-oohall@gmail.com
2020-05-28powerpc/powernv/pci: Reserve the root bus PE during initOliver O'Halloran2-18/+9
Doing it once during boot rather than doing it on the fly and drop the janky populated logic. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200417073508.30356-4-oohall@gmail.com
2020-05-28powerpc/powernv/pci: Re-work bus PE configurationOliver O'Halloran1-51/+30
For normal PHBs IODA PEs are handled on a per-bus basis so all the devices on that bus will share a PE. Which PE specificly is determined by the location of the MMIO BARs for the devices on the bus so we can't actually configure the bus PEs until after MMIO resources are allocated. As a result PEs are currently configured by pcibios_setup_bridge(), which is called just before the bridge windows are programmed into the bus' parent bridge. Configuring the bus PE here causes a few problems: 1. The root bus doesn't have a parent bridge so setting up the PE for the root bus requires some hacks. 2. The PELT-V isn't setup correctly because pnv_ioda_set_peltv() assumes that PEs will be configured in root-to-leaf order. This assumption is broken because resource assignment is performed depth-first so the leaf bridges are setup before their parents are. The hack mentioned in 1) results in the "correct" PELT-V for busses immediately below the root port, but not for devices below a switch. 3. It's possible to break the sysfs PCI rescan feature by removing all the devices on a bus. When the last device is removed from a PE its will be de-configured. Rescanning the devices on a bus does not cause the bridge to be reconfigured rendering the devices on that bus unusable. We can address most of these problems by moving the PE setup out of pcibios_setup_bridge() and into pcibios_bus_add_device(). This fixes 1) and 2) because pcibios_bus_add_device() is called on each device in root-to-leaf order so PEs for parent buses will always be configured before their children. It also fixes 3) by ensuring the PE is configured before initialising DMA for the device. In the event the PE was de-configured due to removing all the devices in that PE it will now be reconfigured when a new device is added since there's no dependecy on the bridge_setup() hook being called. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200417073508.30356-3-oohall@gmail.com
2020-05-28powerpc/powernv/pci: Add helper to find ioda_pe from BDFNOliver O'Halloran2-0/+11
For each PHB we maintain a reverse-map that can be used to find the PE that a BDFN is currently mapped to. Add a helper for doing this lookup so we can check if a PE has been configured without looking at pdn->pe_number. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200417073508.30356-2-oohall@gmail.com
2020-05-28powerpc/powernv/pci: Add an explaination for PNV_IODA_PE_BUS_ALLOliver O'Halloran1-0/+18
It's pretty obsecure and confused me for a long time so I figured it's worth documenting properly. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200414233502.758-1-oohall@gmail.com
2020-05-28powerpc/powernv: Add a print indicating when an IODA PE is releasedOliver O'Halloran1-0/+2
Quite useful to know in some cases. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200408112213.5549-1-oohall@gmail.com
2020-05-28powerpc/powernv/npu: Move IOMMU group setup into npu-dma.cOliver O'Halloran3-60/+60
The NVlink IOMMU group setup is only relevant to NVLink devices so move it into the NPU containment zone. This let us remove some prototypes in pci.h and staticfy some function definitions. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200406030745.24595-8-oohall@gmail.com