aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2021-04-26riscv: Prepare ptdump for vm layout dynamic addressesAlexandre Ghiti1-12/+61
This is a preparatory patch for sv48 support that will introduce dynamic PAGE_OFFSET. Dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...) whose addresses depend on PAGE_OFFSET become dynamic and can't be used to statically initialize the array used by ptdump to identify the different zones of the vm layout. Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26Documentation: riscv: Add documentation that describes the VM layoutAlexandre Ghiti2-0/+64
This new document presents the RISC-V virtual memory layout and is based one the x86 one: it describes the different limits of the different regions of the virtual address space. Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26riscv: Move kernel mapping outside of linear mappingAlexandre Ghiti12-36/+182
This is a preparatory patch for relocatable kernel and sv48 support. The kernel used to be linked at PAGE_OFFSET address therefore we could use the linear mapping for the kernel mapping. But the relocated kernel base address will be different from PAGE_OFFSET and since in the linear mapping, two different virtual addresses cannot point to the same physical address, the kernel mapping needs to lie outside the linear mapping so that we don't have to copy it at the same physical offset. The kernel mapping is moved to the last 2GB of the address space, BPF is now always after the kernel and modules use the 2GB memory range right before the kernel, so BPF and modules regions do not overlap. KASLR implementation will simply have to move the kernel in the last 2GB range and just take care of leaving enough space for BPF. In addition, by moving the kernel to the end of the address space, both sv39 and sv48 kernels will be exactly the same without needing to be relocated at runtime. Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> [Palmer: Squash the STRICT_RWX fix, and a !MMU fix] Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26samples/kprobes: Add riscv supportJisheng Zhang1-0/+8
Add riscv specific info dump in both handler_pre() and handler_post(). Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26riscv: Select HAVE_DYNAMIC_FTRACE when -fpatchable-function-entry is availableNathan Chancellor1-1/+1
clang prior to 13.0.0 does not support -fpatchable-function-entry for RISC-V. clang: error: unsupported option '-fpatchable-function-entry=8' for target 'riscv64-unknown-linux-gnu' To avoid this error, only select HAVE_DYNAMIC_FTRACE when this option is not available. Fixes: afc76b8b8011 ("riscv: Using PATCHABLE_FUNCTION_ENTRY instead of MCOUNT") Link: https://github.com/ClangBuiltLinux/linux/issues/1268 Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26riscv: Workaround mcount name prior to clang-13Nathan Chancellor3-8/+18
Prior to clang 13.0.0, the RISC-V name for the mcount symbol was "mcount", which differs from the GCC version of "_mcount", which results in the following errors: riscv64-linux-gnu-ld: init/main.o: in function `__traceiter_initcall_level': main.c:(.text+0xe): undefined reference to `mcount' riscv64-linux-gnu-ld: init/main.o: in function `__traceiter_initcall_start': main.c:(.text+0x4e): undefined reference to `mcount' riscv64-linux-gnu-ld: init/main.o: in function `__traceiter_initcall_finish': main.c:(.text+0x92): undefined reference to `mcount' riscv64-linux-gnu-ld: init/main.o: in function `.LBB32_28': main.c:(.text+0x30c): undefined reference to `mcount' riscv64-linux-gnu-ld: init/main.o: in function `free_initmem': main.c:(.text+0x54c): undefined reference to `mcount' This has been corrected in https://reviews.llvm.org/D98881 but the minimum supported clang version is 10.0.1. To avoid build errors and to gain a working function tracer, adjust the name of the mcount symbol for older versions of clang in mount.S and recordmcount.pl. Link: https://github.com/ClangBuiltLinux/linux/issues/1331 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26scripts/recordmcount.pl: Fix RISC-V regex for clangNathan Chancellor1-1/+1
Clang can generate R_RISCV_CALL_PLT relocations to _mcount: $ llvm-objdump -dr build/riscv/init/main.o | rg mcount 000000000000000e: R_RISCV_CALL_PLT _mcount 000000000000004e: R_RISCV_CALL_PLT _mcount After this, the __start_mcount_loc section is properly generated and function tracing still works. Link: https://github.com/ClangBuiltLinux/linux/issues/1331 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26riscv: Use $(LD) instead of $(CC) to link vDSONathan Chancellor1-8/+4
Currently, the VDSO is being linked through $(CC). This does not match how the rest of the kernel links objects, which is through the $(LD) variable. When linking with clang, there are a couple of warnings about flags that will not be used during the link: clang-12: warning: argument unused during compilation: '-no-pie' [-Wunused-command-line-argument] clang-12: warning: argument unused during compilation: '-pg' [-Wunused-command-line-argument] '-no-pie' was added in commit 85602bea297f ("RISC-V: build vdso-dummy.o with -no-pie") to override '-pie' getting added to the ld command from distribution versions of GCC that enable PIE by default. It is technically no longer needed after commit c2c81bb2f691 ("RISC-V: Fix the VDSO symbol generaton for binutils-2.35+"), which removed vdso-dummy.o in favor of generating vdso-syms.S from vdso.so with $(NM) but this also resolves the issue in case it ever comes back due to having full control over the $(LD) command. '-pg' is for function tracing, it is not used during linking as clang states. These flags could be removed/filtered to fix the warnings but it is easier to just match the rest of the kernel and use $(LD) directly for linking. See commits fe00e50b2db8 ("ARM: 8858/1: vdso: use $(LD) instead of $(CC) to link VDSO") 691efbedc60d ("arm64: vdso: use $(LD) instead of $(CC) to link VDSO") 2ff906994b6c ("MIPS: VDSO: Use $(LD) instead of $(CC) to link VDSO") 2b2a25845d53 ("s390/vdso: Use $(LD) instead of $(CC) to link vDSO") for more information. The flags are converted to linker flags and '--eh-frame-hdr' is added to match what is added by GCC implicitly, which can be seen by adding '-v' to GCC's invocation. Additionally, since this area is being modified, use the $(OBJCOPY) variable instead of an open coded $(CROSS_COMPILE)objcopy so that the user's choice of objcopy binary is respected. Link: https://github.com/ClangBuiltLinux/linux/issues/803 Link: https://github.com/ClangBuiltLinux/linux/issues/970 Signed-off-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26riscv: sifive: Apply errata "cip-1200" patchVincent Chen4-2/+40
For certain SiFive CPUs, "sfence.vma addr" cannot exactly flush addr from TLB in the particular cases. The details could be found here: https://sifive.cdn.prismic.io/sifive/167a1a56-03f4-4615-a79e-b2a86153148f_FU740_errata_20210205.pdf In order to ensure the functionality, this patch uses the Alternative scheme to replace all "sfence.vma addr" with "sfence.vma" at runtime. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26riscv: sifive: Apply errata "cip-453" patchVincent Chen6-3/+94
Add sign extension to the $badaddr before addressing the instruction page fault and instruction access fault to workaround the issue "cip-453". To avoid affecting the existing code sequence, this patch will creates two trampolines to add sign extension to the $badaddr. By the "alternative" mechanism, these two trampolines will replace the original exception handler of instruction page fault and instruction access fault in the excp_vect_table. In this case, only the specific SiFive CPU core jumps to the do_page_fault and do_trap_insn_fault through these two trampolines. Other CPUs are not affected. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26riscv: sifive: Add SiFive alternative portsVincent Chen7-0/+89
Add required ports of the Alternative scheme for SiFive. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26riscv: Introduce alternative mechanism to apply errata solutionVincent Chen14-0/+300
Introduce the "alternative" mechanism from ARM64 and x86 to apply the CPU vendors' errata solution at runtime. The main purpose of this patch is to provide a framework. Therefore, the implementation is quite basic for now so that some scenarios could not use this schemei, such as patching code to a module, relocating the patching code and heterogeneous CPU topology. Users could use the macro ALTERNATIVE to apply an errata to the existing code flow. In the macro ALTERNATIVE, users need to specify the manufacturer information(vendorid, archid, and impid) for this errata. Therefore, kernel will know this errata is suitable for which CPU core. During the booting procedure, kernel will select the errata required by the CPU core and then patch it. It means that the kernel only applies the errata to the specified CPU core. In this case, the vendor's errata does not affect each other at runtime. The above patching procedure only occurs during the booting phase, so we only take the overhead of the "alternative" mechanism once. This "alternative" mechanism is enabled by default to ensure that all required errata will be applied. However, users can disable this feature by the Kconfig "CONFIG_RISCV_ERRATA_ALTERNATIVE". Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-04-26riscv: Add 3 SBI wrapper functions to get cpu manufacturer informationVincent Chen2-0/+18
Add 3 wrapper functions to get vendor id, architecture id and implement id from M-mode Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-03-29kbuild: buildtar: add riscv supportCarlos de Paula1-0/+8
Make 'make tar-pkg' and 'tarbz2-pkg' work on riscv. Signed-off-by: Carlos de Paula <me@carlosedp.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-03-29riscv: Cleanup KASAN_VMALLOC supportAlexandre Ghiti1-41/+18
When KASAN vmalloc region is populated, there is no userspace process and the page table in use is swapper_pg_dir, so there is no need to read SATP. Then we can use the same scheme used by kasan_populate_p*d functions to go through the page table, which harmonizes the code. In addition, make use of set_pgd that goes through all unused page table levels, contrary to p*d_populate functions, which makes this function work whatever the number of page table levels. Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-03-29RISC-V: Don't print SBI version for all detected extensionsAnup Patel1-3/+3
The sbi_init() already prints SBI version before detecting various SBI extensions so we don't need to print SBI version for all detected SBI extensions. Signed-off-by: Anup Patel <anup.patel@wdc.com> Reviewed-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-03-16riscv: Enable generic clockevent broadcastGuo Ren2-0/+18
When percpu-timers are stopped by deep power saving mode, we need system timer help to broadcast IPI_TIMER. This is first introduced by broken x86 hardware, where the local apic timer stops in C3 state. But many other architectures(powerpc, mips, arm, hexagon, openrisc, sh) have supported the infrastructure to deal with Power Management issues. Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Reviewed-by: Anup Patel <anup@brainfault.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-03-09riscv: Add ARCH_HAS_FORTIFY_SOURCEKefeng Wang2-0/+6
FORTIFY_SOURCE could detect various overflows at compile and run time. ARCH_HAS_FORTIFY_SOURCE means that the architecture can be built and run with CONFIG_FORTIFY_SOURCE. Select it in RISCV. See more about this feature from commit 6974f0c4555e ("include/linux/string.h: add the option of fortified string.h functions"). Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-03-09riscv: Add support for memtestKefeng Wang2-3/+5
The riscv [rv32_]defconfig enabled CONFIG_MEMTEST, but memtest feature is not supported in RISCV. Add early_memtest() to support for memtest. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
2021-03-05Linux 5.12-rc2Linus Torvalds1-1/+1
2021-03-05RDMA/rxe: Fix errant WARN_ONCE in rxe_completer()Bob Pearson1-32/+23
In rxe_comp.c in rxe_completer() the function free_pkt() did not clear skb which triggered a warning at 'done:' and could possibly at 'exit:'. The WARN_ONCE() calls are not actually needed. The call to free_pkt() is moved to the end to clearly show that all skbs are freed. Fixes: 899aba891cab ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()") Link: https://lore.kernel.org/r/20210304192048.2958-1-rpearson@hpe.com Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-03-05RDMA/rxe: Fix extra deref in rxe_rcv_mcast_pkt()Bob Pearson1-24/+35
rxe_rcv_mcast_pkt() dropped a reference to ib_device when no error occurred causing an underflow on the reference counter. This code is cleaned up to be clearer and easier to read. Fixes: 899aba891cab ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()") Link: https://lore.kernel.org/r/20210304192048.2958-1-rpearson@hpe.com Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-03-05RDMA/rxe: Fix missed IB reference counting in loopbackBob Pearson1-1/+9
When the noted patch below extending the reference taken by rxe_get_dev_from_net() in rxe_udp_encap_recv() until each skb is freed it was not matched by a reference in the loopback path resulting in underflows. Fixes: 899aba891cab ("RDMA/rxe: Fix FIXME in rxe_udp_encap_recv()") Link: https://lore.kernel.org/r/20210304192048.2958-1-rpearson@hpe.com Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-03-05io_uring: don't restrict issue_flags for io_openatPavel Begunkov1-1/+1
45d189c606292 ("io_uring: replace force_nonblock with flags") did something strange for io_openat() slicing all issue_flags but IO_URING_F_NONBLOCK. Not a bug for now, but better to just forward the flags. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05io_uring: make SQPOLL thread parking sanerJens Axboe1-35/+30
We have this weird true/false return from parking, and then some of the callers decide to look at that. It can lead to unbalanced parks and sqd locking. Have the callers check the thread status once it's parked. We know we have the lock at that point, so it's either valid or it's NULL. Fix race with parking on thread exit. We need to be careful here with ordering of the sdq->lock and the IO_SQ_THREAD_SHOULD_PARK bit. Rename sqd->completion to sqd->parked to reflect that this is the only thing this completion event doesn. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05io-wq: kill hashed waitqueue before manager exitsJens Axboe1-4/+5
If we race with shutting down the io-wq context and someone queueing a hashed entry, then we can exit the manager with it armed. If it then triggers after the manager has exited, we can have a use-after-free where io_wqe_hash_wake() attempts to wake a now gone manager process. Move the killing of the hashed write queue into the manager itself, so that we know we've killed it before the task exits. Fixes: e941894eae31 ("io-wq: make buffered file write hashed work map per-ctx") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05io_uring: clear IOCB_WAITQ for non -EIOCBQUEUED returnJens Axboe1-0/+1
The callback can only be armed, if we get -EIOCBQUEUED returned. It's important that we clear the WAITQ bit for other cases, otherwise we can queue for async retry and filemap will assume that we're armed and return -EAGAIN instead of just blocking for the IO. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05io_uring: don't keep looping for more events if we can't flush overflowJens Axboe1-3/+12
It doesn't make sense to wait for more events to come in, if we can't even flush the overflow we already have to the ring. Return -EBUSY for that condition, just like we do for attempts to submit with overflow pending. Cc: stable@vger.kernel.org # 5.11 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05io_uring: move to using create_io_thread()Jens Axboe3-109/+54
This allows us to do task creation and setup without needing to use completions to try and synchronize with the starting thread. Get rid of the old io_wq_fork_thread() wrapper, and the 'wq' and 'worker' startup completion events - we can now do setup before the task is running. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-05nvmet: model_number must be immutable once setMax Gurtovoy4-45/+50
In case we have already established connection to nvmf target, it shouldn't be allowed to change the model_number. E.g. if someone will identify ctrl and get model_number of "my_model" later on will change the model_numbel via configfs to "my_new_model" this will break the NVMe specification for "Get Log Page – Persistent Event Log" that refers to Model Number as: "This field contains the same value as reported in the Model Number field of the Identify Controller data structure, bytes 63:24." Although it doesn't mentioned explicitly that this field can't be changed, we can assume it. So allow setting this field only once: using configfs or in the first identify ctrl operation. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-03-05nvme-fabrics: fix kato initializationMartin George1-1/+4
Currently kato is initialized to NVME_DEFAULT_KATO for both discovery & i/o controllers. This is a problem specifically for non-persistent discovery controllers since it always ends up with a non-zero kato value. Fix this by initializing kato to zero instead, and ensuring various controllers are assigned appropriate kato values as follows: non-persistent controllers - kato set to zero persistent controllers - kato set to NVMF_DEV_DISC_TMO (or any positive int via nvme-cli) i/o controllers - kato set to NVME_DEFAULT_KATO (or any positive int via nvme-cli) Signed-off-by: Martin George <marting@netapp.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-03-05nvme-hwmon: Return error code when registration failsDaniel Wagner1-0/+1
The hwmon pointer wont be NULL if the registration fails. Though the exit code path will assign it to ctrl->hwmon_device. Later nvme_hwmon_exit() will try to free the invalid pointer. Avoid this by returning the error code from hwmon_device_register_with_info(). Fixes: ed7770f66286 ("nvme/hwmon: rework to avoid devm allocation") Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-03-05nvme-pci: add quirks for Lexar 256GB SSDPascal Terjan1-0/+3
Add the NVME_QUIRK_NO_NS_DESC_LIST and NVME_QUIRK_IGNORE_DEV_SUBNQN quirks for this buggy device. Reported and tested in https://bugs.mageia.org/show_bug.cgi?id=28417 Signed-off-by: Pascal Terjan <pterjan@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-03-05nvme-pci: mark Kingston SKC2000 as not supporting the deepest power stateZoltán Böszörményi1-0/+2
My 2TB SKC2000 showed the exact same symptoms that were provided in 538e4a8c57 ("nvme-pci: avoid the deepest sleep state on Kingston A2000 SSDs"), i.e. a complete NVME lockup that needed cold boot to get it back. According to some sources, the A2000 is simply a rebadged SKC2000 with a slightly optimized firmware. Adding the SKC2000 PCI ID to the quirk list with the same workaround as the A2000 made my laptop survive a 5 hours long Yocto bootstrap buildfest which reliably triggered the SSD lockup previously. Signed-off-by: Zoltán Böszörményi <zboszor@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-03-05nvme-pci: mark Seagate Nytro XM1440 as QUIRK_NO_NS_DESC_LIST.Julian Einwag1-1/+2
The kernel fails to fully detect these SSDs, only the character devices are present: [ 10.785605] nvme nvme0: pci function 0000:04:00.0 [ 10.876787] nvme nvme1: pci function 0000:81:00.0 [ 13.198614] nvme nvme0: missing or invalid SUBNQN field. [ 13.198658] nvme nvme1: missing or invalid SUBNQN field. [ 13.206896] nvme nvme0: Shutdown timeout set to 20 seconds [ 13.215035] nvme nvme1: Shutdown timeout set to 20 seconds [ 13.225407] nvme nvme0: 16/0/0 default/read/poll queues [ 13.233602] nvme nvme1: 16/0/0 default/read/poll queues [ 13.239627] nvme nvme0: Identify Descriptors failed (8194) [ 13.246315] nvme nvme1: Identify Descriptors failed (8194) Adding the NVME_QUIRK_NO_NS_DESC_LIST fixes this problem. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=205679 Signed-off-by: Julian Einwag <jeinwag-nvme@marcapo.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-04scsi: iscsi: Verify lengths on passthrough PDUsChris Leech1-0/+9
Open-iSCSI sends passthrough PDUs over netlink, but the kernel should be verifying that the provided PDU header and data lengths fall within the netlink message to prevent accessing beyond that in memory. Cc: stable@vger.kernel.org Reported-by: Adam Nichols <adam@grimm-co.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: iscsi: Ensure sysfs attributes are limited to PAGE_SIZEChris Leech2-83/+90
As the iSCSI parameters are exported back through sysfs, it should be enforcing that they never are more than PAGE_SIZE (which should be more than enough) before accepting updates through netlink. Change all iSCSI sysfs attributes to use sysfs_emit(). Cc: stable@vger.kernel.org Reported-by: Adam Nichols <adam@grimm-co.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: iscsi: Restrict sessions and handles to admin capabilitiesLee Duncan1-0/+6
Protect the iSCSI transport handle, available in sysfs, by requiring CAP_SYS_ADMIN to read it. Also protect the netlink socket by restricting reception of messages to ones sent with CAP_SYS_ADMIN. This disables normal users from being able to end arbitrary iSCSI sessions. Cc: stable@vger.kernel.org Reported-by: Adam Nichols <adam@grimm-co.com> Reviewed-by: Chris Leech <cleech@redhat.com> Reviewed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04kernel: provide create_io_thread() helperJens Axboe2-0/+32
Provide a generic helper for setting up an io_uring worker. Returns a task_struct so that the caller can do whatever setup is needed, then call wake_up_new_task() to kick it into gear. Add a kernel_clone_args member, io_thread, which tells copy_process() to mark the task with PF_IO_WORKER. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io_uring: reliably cancel linked timeoutsPavel Begunkov1-0/+1
Linked timeouts are fired asynchronously (i.e. soft-irq), and use generic cancellation paths to do its stuff, including poking into io-wq. The problem is that it's racy to access tctx->io_wq, as io_uring_task_cancel() and others may be happening at this exact moment. Mark linked timeouts with REQ_F_INLIFGHT for now, making sure there are no timeouts before io-wq destraction. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04io_uring: cancel-match based on flagsPavel Begunkov1-2/+2
Instead of going into request internals, like checking req->file->f_op, do match them based on REQ_F_INFLIGHT, it's set only when we want it to be reliably cancelled. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-04dm verity: fix FEC for RS roots unaligned to block sizeMilan Broz1-11/+12
Optional Forward Error Correction (FEC) code in dm-verity uses Reed-Solomon code and should support roots from 2 to 24. The error correction parity bytes (of roots lengths per RS block) are stored on a separate device in sequence without any padding. Currently, to access FEC device, the dm-verity-fec code uses dm-bufio client with block size set to verity data block (usually 4096 or 512 bytes). Because this block size is not divisible by some (most!) of the roots supported lengths, data repair cannot work for partially stored parity bytes. This fix changes FEC device dm-bufio block size to "roots << SECTOR_SHIFT" where we can be sure that the full parity data is always available. (There cannot be partial FEC blocks because parity must cover whole sectors.) Because the optional FEC starting offset could be unaligned to this new block size, we have to use dm_bufio_set_sector_offset() to configure it. The problem is easily reproduced using veritysetup, e.g. for roots=13: # create verity device with RS FEC dd if=/dev/urandom of=data.img bs=4096 count=8 status=none veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=13 | awk '/^Root hash/{ print $3 }' >roothash # create an erasure that should be always repairable with this roots setting dd if=/dev/zero of=data.img conv=notrunc bs=1 count=8 seek=4088 status=none # try to read it through dm-verity veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=13 $(cat roothash) dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer # wait for possible recursive recovery in kernel udevadm settle veritysetup close test With this fix, errors are properly repaired. device-mapper: verity-fec: 7:1: FEC 0: corrected 8 errors ... Without it, FEC code usually ends on unrecoverable failure in RS decoder: device-mapper: verity-fec: 7:1: FEC 0: failed to correct: -74 ... This problem is present in all kernels since the FEC code's introduction (kernel 4.5). It is thought that this problem is not visible in Android ecosystem because it always uses a default RS roots=2. Depends-on: a14e5ec66a7a ("dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size") Signed-off-by: Milan Broz <gmazyland@gmail.com> Tested-by: Jérôme Carretero <cJ-ko@zougloub.eu> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Cc: stable@vger.kernel.org # 4.5+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-03-04dm bufio: subtract the number of initial sectors in dm_bufio_get_device_sizeMikulas Patocka1-0/+4
dm_bufio_get_device_size returns the device size in blocks. Before returning the value, we must subtract the nubmer of starting sectors. The number of starting sectors may not be divisible by block size. Note that currently, no target is using dm_bufio_set_sector_offset and dm_bufio_get_device_size simultaneously, so this change has no effect. However, an upcoming dm-verity-fec fix needs this change. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2021-03-04btrfs: zoned: do not account freed region of read-only block group as zone_unusableNaohiro Aota1-1/+6
We migrate zone unusable bytes to read-only bytes when a block group is set to read-only, and account all the free region as bytes_readonly. Thus, we should not increase block_group->zone_unusable when the block group is read-only. Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-04btrfs: zoned: use sector_t for zone sectorsNaohiro Aota1-2/+2
We need to use sector_t for zone_sectors, or it would set the zone size to zero when the size >= 4GB (= 2^24 sectors) by shifting the zone_sectors value by SECTOR_SHIFT. We're assuming zones sizes up to 8GiB. Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-04tracing: Fix comment about the trace_event_call flagsSteven Rostedt (VMware)1-9/+2
In the declaration of the struct trace_event_call, the flags has the bits defined in the comment above it. But these bits are also defined by the TRACE_EVENT_FL_* enums just above the declaration of the struct. As the comment about the flags in the struct has become stale and incorrect, just replace it with a reference to the TRACE_EVENT_FL_* enum above. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-04tracing: Skip selftests if tracing is disabledSteven Rostedt (VMware)1-0/+6
If tracing is disabled for some reason (traceoff_on_warning, command line, etc), the ftrace selftests are guaranteed to fail, as their results are defined by trace data in the ring buffers. If the ring buffers are turned off, the tests will fail, due to lack of data. Because tracing being disabled is for a specific reason (warning, user decided to, etc), it does not make sense to enable tracing to run the self tests, as the test output may corrupt the reason for the tracing to be disabled. Instead, simply skip the self tests and report that they are being skipped due to tracing being disabled. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-04tracing: Fix memory leak in __create_synth_event()Vamshi K Sthambamkadi1-1/+3
kmemleak report: unreferenced object 0xc5a6f708 (size 8): comm "ftracetest", pid 1209, jiffies 4294911500 (age 6.816s) hex dump (first 8 bytes): 00 c1 3d 60 14 83 1f 8a ..=`.... backtrace: [<f0aa4ac4>] __kmalloc_track_caller+0x2a6/0x460 [<7d3d60a6>] kstrndup+0x37/0x70 [<45a0e739>] argv_split+0x1c/0x120 [<c17982f8>] __create_synth_event+0x192/0xb00 [<0708b8a3>] create_synth_event+0xbb/0x150 [<3d1941e1>] create_dyn_event+0x5c/0xb0 [<5cf8b9e3>] trace_parse_run_command+0xa7/0x140 [<04deb2ef>] dyn_event_write+0x10/0x20 [<8779ac95>] vfs_write+0xa9/0x3c0 [<ed93722a>] ksys_write+0x89/0xc0 [<b9ca0507>] __ia32_sys_write+0x15/0x20 [<7ce02d85>] __do_fast_syscall_32+0x45/0x80 [<cb0ecb35>] do_fast_syscall_32+0x29/0x60 [<2467454a>] do_SYSENTER_32+0x15/0x20 [<9beaa61d>] entry_SYSENTER_32+0xa9/0xfc unreferenced object 0xc5a6f078 (size 8): comm "ftracetest", pid 1209, jiffies 4294911500 (age 6.816s) hex dump (first 8 bytes): 08 f7 a6 c5 00 00 00 00 ........ backtrace: [<bbac096a>] __kmalloc+0x2b6/0x470 [<aa2624b4>] argv_split+0x82/0x120 [<c17982f8>] __create_synth_event+0x192/0xb00 [<0708b8a3>] create_synth_event+0xbb/0x150 [<3d1941e1>] create_dyn_event+0x5c/0xb0 [<5cf8b9e3>] trace_parse_run_command+0xa7/0x140 [<04deb2ef>] dyn_event_write+0x10/0x20 [<8779ac95>] vfs_write+0xa9/0x3c0 [<ed93722a>] ksys_write+0x89/0xc0 [<b9ca0507>] __ia32_sys_write+0x15/0x20 [<7ce02d85>] __do_fast_syscall_32+0x45/0x80 [<cb0ecb35>] do_fast_syscall_32+0x29/0x60 [<2467454a>] do_SYSENTER_32+0x15/0x20 [<9beaa61d>] entry_SYSENTER_32+0xa9/0xfc In __create_synth_event(), while iterating field/type arguments, the argv_split() will return array of atleast 2 elements even when zero arguments(argc=0) are passed. for e.g. when there is double delimiter or string ends with delimiter To fix call argv_free() even when argc=0. Link: https://lkml.kernel.org/r/20210304094521.GA1826@cosmos Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-04ring-buffer: Add a little more information and a WARN when time stamp going backwards is detectedSteven Rostedt (VMware)1-3/+7
When the CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is enabled, and the time stamps are detected as not being valid, it reports information about the write stamp, but does not show the before_stamp which is still useful information. Also, it should give a warning once, such that tests detect this happening. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-03-04ring-buffer: Force before_stamp and write_stamp to be different on discardSteven Rostedt (VMware)1-0/+11
Part of the logic of the new time stamp code depends on the before_stamp and the write_stamp to be different if the write_stamp does not match the last event on the buffer, as it will be used to calculate the delta of the next event written on the buffer. The discard logic depends on this, as the next event to come in needs to inject a full timestamp as it can not rely on the last event timestamp in the buffer because it is unknown due to events after it being discarded. But by changing the write_stamp back to the time before it, it forces the next event to use a full time stamp, instead of relying on it. The issue came when a full time stamp was used for the event, and rb_time_delta() returns zero in that case. The update to the write_stamp (which subtracts delta) made it not change. Then when the event is removed from the buffer, because the before_stamp and write_stamp still match, the next event written would calculate its delta from the write_stamp, but that would be wrong as the write_stamp is of the time of the event that was discarded. In the case that the delta change being made to write_stamp is zero, set the before_stamp to zero as well, and this will force the next event to inject a full timestamp and not use the current write_stamp. Cc: stable@vger.kernel.org Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>