path: root/tools/perf/scripts/python/export-to-sqlite.py
Age | Commit message | Author | Files | Lines
2021-10-27 | coresight: trbe: Prohibit trace before disabling TRBE | Suzuki K Poulose | 2 | -1/+12

When the TRBE generates an IRQ, we stop the TRBE, collect the trace and then reprogram the TRBE with the updated buffer pointers, whenever possible. We might also leave the TRBE disabled if there is not enough space left in the buffer. However, we do not touch the ETE at all during any of this. This means the ETE is only disabled when the event is disabled later (via irq_work). This is incorrect, as the ETE trace is still on without actually being captured and may be routed to the ATB (even if only for a short duration).

So, upon entering the IRQ handler, always move the CPU into the trace prohibited state before disabling the TRBE. The state is restored if the TRBE is enabled again; otherwise the trace remains prohibited. Since the ETM/ETE driver now controls TRFCR_EL1 per session, tracing can be restored/enabled again when the event is scheduled back in.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210923143919.2944311-6-suzuki.poulose@arm.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
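A minimal sketch of the "prohibit first, then stop" ordering described above; the helper names are illustrative assumptions, not the exact code in the patch:

	#include <asm/sysreg.h>
	#include <asm/barrier.h>

	/* Illustrative sketch only -- set_trbe_disabled() is an assumed helper. */
	static void trbe_stop_with_trace_prohibited(void)
	{
		write_sysreg_s(0, SYS_TRFCR_EL1);	/* prohibit EL0/EL1 trace generation */
		isb();
		tsb_csync();				/* drain any trace already in flight */

		set_trbe_disabled();			/* only now stop the TRBE itself */
		/* TRFCR_EL1 is restored later only if the TRBE is re-enabled */
	}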
2021-10-27 | coresight: trbe: End the AUX handle on truncation | Suzuki K Poulose | 1 | -10/+16

When we detect that there isn't enough space left to start a meaningful session, we disable the TRBE, marking the buffer as TRUNCATED. But we delay the notification to the perf layer via perf_aux_output_end() until the event is scheduled out, triggered from the kernel perf layer. This causes significant blackouts in the trace.

Now that the CoreSight PMU layer can handle a closed "AUX" handle properly, we can close the handle as soon as we detect the case, allowing userspace to collect and re-enable the event.

Also, while in the IRQ handler, move the irq_work_run() to after we have updated the handle, to make sure the "TRUNCATED" flag causes the event to be disabled as soon as possible.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210923143919.2944311-5-suzuki.poulose@arm.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: trbe: Do not truncate buffer on IRQ | Suzuki K Poulose | 1 | -6/+21

The TRBE driver marks the AUX buffer as TRUNCATED when we get an IRQ on a FILL event. This has the rather unwanted side effect of the event being disabled when there may be more space in the ring buffer. So, instead of TRUNCATED we need a different flag to indicate that the trace may have lost a few bytes (i.e. from the point of generating the FILL event until the IRQ is consumed). Either way, userspace must use the size from the RECORD_AUX headers to restrict the trace decoding.

Using the PARTIAL flag causes the perf tool to generate the following warning:

  Warning:
  AUX data had gaps in it XX times out of YY!

  Are you running a KVM guest in the background?

which is pointlessly scary for a user. The other remaining options are:

  - COLLISION  - used by SPE to indicate samples collided
  - Add a new flag - specifically for CoreSight; doesn't sound so good if we can re-use something.

Given that we don't already use the "COLLISION" flag, the above behavior can be notified using this flag for CoreSight.

Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: James Clark <james.clark@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210923143919.2944311-4-suzuki.poulose@arm.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: trbe: Fix handling of spurious interrupts | Suzuki K Poulose | 1 | -13/+9

On a spurious IRQ, right now we disable the TRBE and then re-enable it, resetting the "buffer" pointers (i.e. BASE, LIMIT and, more importantly, WRITE) to the original pointers from the AUX handle. This implies that we overwrite any trace that was written so far (by overwriting TRBPTR), while we should have ignored the IRQ. On detecting a spurious IRQ after examining the TRBSR, simply re-enable the TRBE without touching the other parameters.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210923143919.2944311-3-suzuki.poulose@arm.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: trbe: irq handler: Do not disable TRBE if no action is needed | Suzuki K Poulose | 1 | -6/+6

The IRQ handler of the TRBE driver could race against update_buffer() in consuming the IRQ. If update_buffer() gets to processing the TRBE IRQ first, the TRBSR will already be cleared, so by the time the IRQ handler is triggered there is nothing left to do. Handle these cases and do not disable the TRBE unnecessarily. Since the TRBSR can be read without stopping the TRBE, we can check it before disabling the TRBE.

Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210923143919.2944311-2-suzuki.poulose@arm.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: trbe: Unify the enabling sequence | Suzuki K Poulose | 1 | -19/+18

Unify the sequence for enabling the TRBE. We do this from event_start and also from the TRBE IRQ handler; let's move it to a common helper. The only minor functional change is returning an error when we fail to enable the TRBE; this should be handled already. Since we now have a unique entry point for enabling the TRBE, move the format flag setting to this central place.

Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210914102641.1852544-9-suzuki.poulose@arm.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: trbe: Drop duplicate TRUNCATE flags | Suzuki K Poulose | 1 | -6/+6

We mark the buffer as TRUNCATED when there is no space left in the buffer, but we do it at different points: in __trbe_normal_offset(), and also at all callers of that function via compute_trbe_buffer_limit(), when limit == base (i.e. offset == 0 as returned by __trbe_normal_offset()). Given that the callers already mark the buffer as TRUNCATED, drop the flag setting inside __trbe_normal_offset(). This is in preparation for moving the handling of TRUNCATED into a central place.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20210914102641.1852544-6-suzuki.poulose@arm.com
[Moved comment as Anshuman requested]
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: trbe: Ensure the format flag is always set | Suzuki K Poulose | 1 | -4/+3

When the TRBE is stopped on truncating an event, we may not set the FORMAT flag, even though the size of the record is 0. Let us be consistent and not confuse the user. To ensure that the format flag is always set on all records generated by the TRBE, set the flag when we have a new handle, rather than deferring it to the "end" operation; this makes it clearer. So, we can do this from:

  - arm_trbe_enable()      -> when a new handle is provided by the CoreSight PMU, triggered via etm_event_start()
  - trbe_handle_overflow() -> when we begin a new handle after closing the previous one on overflow

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20210914102641.1852544-5-suzuki.poulose@arm.com
[Fixed inverted words in title]
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: etm-pmu: Ensure the AUX handle is valid | Suzuki K Poulose | 1 | -3/+24

The ETM perf infrastructure closes out a handle during event_stop or on an error in starting the event. In either case, it is possible for a "sink" to update/close the handle under certain circumstances (e.g. no space left in the ring buffer). So, ensure that we handle this gracefully in the PMU driver by verifying that the handle is still valid.

Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210914102641.1852544-4-suzuki.poulose@arm.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: etm4x: Use Trace Filtering controls dynamically | Suzuki K Poulose | 3 | -18/+59

The Trace Filtering support (FEAT_TRF) ensures that the ETM can be prohibited from generating any trace for a given EL. This is a much stricter knob than the TRCVICTLR exception level masks, which don't prevent the ETM from generating Context packets for an "excluded" EL. At the moment, we do a one-time enable of tracing at user and kernel level and leave it untouched for the kernel's lifetime. This implies that the ETM could potentially generate trace packets containing kernel addresses, and thus leak kernel virtual addresses in the trace.

This patch makes the switch dynamic, by honoring the filters set by the user and enforcing them in the TRFCR controls. We also rename cpu_enable_tracing() to the more appropriate cpu_detect_trace_filtering(), and the drvdata member trfc => trfcr to indicate the "value" of TRFCR_EL1.

Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Al Grant <al.grant@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20210914102641.1852544-3-suzuki.poulose@arm.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: etm4x: Save restore TRFCR_EL1 | Suzuki K Poulose | 3 | -12/+57

When the CPU enters a low power mode, the TRFCR_EL1 contents could be reset. Thus we need to save/restore TRFCR_EL1 along with the ETM4x registers to allow tracing. The TRFCR-related helpers are in a new header file, as we need to use them for the TRBE in later patches.

Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210914102641.1852544-2-suzuki.poulose@arm.com
[Fixed cosmetic details]
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: Don't immediately close events that are run on invalid CPU/sink combos | James Clark | 1 | -6/+23

When a traced process runs on a CPU that can't reach the selected sink, the event will be stopped with PERF_HES_STOPPED. This means that even if the process migrates to a valid CPU, tracing will not resume.

This can be reproduced (on N1SDP) by using taskset to start the process on CPU 0, and then switching it to CPU 2 (ETF 1 is only reachable from CPU 2):

  taskset --cpu-list 0 ./perf record -e cs_etm/@tmc_etf1/ --per-thread -- taskset --cpu-list 2 ls

This produces a single 0 length AUX record, and then no more trace:

  0x3c8 [0x30]: PERF_RECORD_AUX offset: 0 size: 0 flags: 0x1 [T]

After the fix, the same command produces normal AUX records.

The perf self test "89: Check Arm CoreSight trace data recording and synthesized samples" no longer fails intermittently. This was because the taskset in the test is after the fork, so there is a period where the task is scheduled on a random CPU rather than forced to a valid one.

Specifically selecting an invalid CPU will still result in a failure to open the event because it will never produce trace:

  ./perf record -C 2 -e cs_etm/@tmc_etf0/
  failed to mmap with 12 (Cannot allocate memory)

The only scenario that has changed is when the CPU mask has a valid CPU/sink combo in it.

Testing
=======

* Coresight self test passes consistently:
    ./perf test Coresight
* CPU wide mode still produces trace:
    ./perf record -e cs_etm// -a
* Invalid -C options still fail to open:
    ./perf record -C 2,3 -e cs_etm/@tmc_etf0/
    failed to mmap with 12 (Cannot allocate memory)
* Migrating a task to a valid sink/CPU now produces trace:
    taskset --cpu-list 0 ./perf record -e cs_etm/@tmc_etf1/ --per-thread -- taskset --cpu-list 2 ls
* If the task remains on an invalid CPU, no trace is emitted:
    taskset --cpu-list 0 ./perf record -e cs_etm/@tmc_etf1/ --per-thread -- ls

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: James Clark <james.clark@arm.com>
Link: https://lore.kernel.org/r/20210922125144.133872-2-james.clark@arm.com
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: tmc-etr: Speed up for bounce buffer in flat mode | Leo Yan | 1 | -4/+22

The AUX bounce buffer is allocated with the API dma_alloc_coherent(); in the low-level architecture code, e.g. for Arm64, this maps the memory with the attribute "Normal non-cacheable" (which can be concluded from the definition of pgprot_dmacoherent() in arch/arm64/include/asm/pgtable.h). Later, when accessing the AUX bounce buffer, the non-cacheable mapping makes access inefficient, because every load instruction must go out to DRAM.

This patch changes the allocation to dma_alloc_noncoherent(), so the driver can access the memory via a cacheable mapping; load instructions can then fetch data from cache lines rather than always reading from DRAM, boosting memory performance. With the cacheable mapping, the driver uses dma_sync_single_for_cpu() to invalidate cachelines prior to reading the bounce buffer, so it avoids reading stale trace data.

Measuring the duration of tmc_update_etr_buffer() with the ftrace function_graph tracer shows a significant performance improvement for copying 4MiB of data from the bounce buffer:

  # echo tmc_etr_get_data_flat_buf > set_graph_notrace   // avoid noise
  # echo tmc_update_etr_buffer > set_graph_function
  # echo function_graph > current_tracer

  before:

  # CPU  DURATION                  FUNCTION CALLS
  # |     |   |                     |   |   |   |
    2)               |  tmc_update_etr_buffer() {
    ...
    2) # 8148.320 us |  }

  after:

  # CPU  DURATION                  FUNCTION CALLS
  # |     |   |                     |   |   |   |
    2)               |  tmc_update_etr_buffer() {
    ...
    2) # 2525.420 us |  }

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210905032144.966766-1-leo.yan@linaro.org
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
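A minimal sketch of the cacheable bounce-buffer scheme described above (illustrative only; variable names and the surrounding driver structure are assumptions):

	#include <linux/dma-mapping.h>
	#include <linux/string.h>

	/* Allocate a cacheable, non-coherent bounce buffer the device writes into. */
	void *buf = dma_alloc_noncoherent(dev, size, &daddr, DMA_FROM_DEVICE, GFP_KERNEL);

	/* ... hardware fills the buffer via 'daddr' ... */

	/* Invalidate stale cachelines so the CPU sees the device's writes. */
	dma_sync_single_for_cpu(dev, daddr, size, DMA_FROM_DEVICE);
	memcpy(dst, buf, len);		/* now served from cache, not forced to DRAM */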
2021-10-27 | coresight: Update comments for removing cs_etm_find_snapshot() | Leo Yan | 3 | -9/+6

Commit 2f01c200d440 ("perf cs-etm: Remove callback cs_etm_find_snapshot()") removed the function cs_etm_find_snapshot() from the perf tool in user space; CoreSight trace now directly uses the common perf function __auxtrace_mmap__read() to calculate the head and size of the AUX trace data in snapshot mode. This patch updates the comments in the drivers to make them generic and not tied to any specific function in the perf tool.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Link: https://lore.kernel.org/r/20210912125748.2816606-3-leo.yan@linaro.org
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: tmc-etr: Use perf_output_handle::head for AUX ring buffer | Leo Yan | 1 | -7/+3

When enabling an Arm CoreSight PMU event, the context for the AUX ring buffer is prepared in the structure perf_output_handle; its field "head" points to the head of the AUX ring buffer and is updated after filling AUX trace data into the buffer. The current code uses an extra field, etr_perf_buffer::head, to maintain the head of the AUX ring buffer, which is not necessary; it's better to use perf_output_handle::head directly.

This patch removes the field etr_perf_buffer::head and directly uses perf_output_handle::head for the head of the AUX ring buffer.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210912125748.2816606-2-leo.yan@linaro.org
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: tmc-etf: Add comment for store ordering | Leo Yan | 1 | -0/+5

Since the function CS_LOCK() contains the memory barrier mb(), it already ensures the visibility of the AUX trace data before updating the aux_head, so no additional explicit barrier is needed. Add a comment to make the barrier usage for the ETF clear.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Link: https://lore.kernel.org/r/20210809111407.596077-4-leo.yan@linaro.org
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: tmc-etr: Add barrier after updating AUX ring buffer | Leo Yan | 1 | -0/+8

A memory barrier is required between the AUX trace data store and the aux_head store. Since the AUX trace data is filled with memcpy(), it's sufficient to use smp_wmb() to ensure the trace data is visible prior to updating aux_head.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20210809111407.596077-3-leo.yan@linaro.org
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
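A rough sketch of the ordering requirement (illustrative only; the variable names are assumptions, not the driver's code):

	/* Fill the AUX data area with plain stores ... */
	memcpy(ring_buffer_dst, trace_src, bytes);

	/* ... and order those stores before the head update that publishes them. */
	smp_wmb();

	/* perf_aux_output_end() advances aux_head; the reader pairs this with a read barrier. */
	perf_aux_output_end(handle, bytes);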
2021-10-27 | coresight: tmc: Configure AXI write burst size | Tanmay Jagdale | 3 | -4/+26

The current driver sets the write burst size initiated by the TMC-ETR on the AXI bus to a fixed value of 16. Make this configurable by reading the value specified in the fwnode; if it is not specified, default to 16. Introduce a "max_burst_size" variable in the tmc_drvdata structure to facilitate this change.

Signed-off-by: Tanmay Jagdale <tanmay@marvell.com>
Reviewed-by: Mike Leach <mike.leach@linaro.org>
Link: https://lore.kernel.org/r/20210901131049.1365367-3-tanmay@marvell.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | dt-bindings: coresight: Add burst size for TMC | Tanmay Jagdale | 1 | -0/+5

Add the "arm,max-burst-size" optional property for the TMC ETR. If specified, this value indicates the maximum burst size that can be initiated by the TMC on the AXI bus.

Signed-off-by: Tanmay Jagdale <tanmay@marvell.com>
Reviewed-by: Mike Leach <mike.leach@linaro.org>
Acked-by: Rob Herring <robh@kernel.org>
Link: https://lore.kernel.org/r/20210901131049.1365367-2-tanmay@marvell.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-27 | coresight: cpu-debug: Control default behavior via Kconfig | Brian Norris | 2 | -1/+14

Debugfs is nice and so are module parameters, but:

  * debugfs doesn't take effect early (e.g., if drivers are locking up before user space gets anywhere)
  * module parameters either add a lot to the kernel command line, or else take effect late as well (if you build =m and configure in /etc/modprobe.d/)

So, in the same spirit as

  CONFIG_PANIC_ON_OOPS (also available via cmdline or modparam)
  CONFIG_INTEL_IOMMU_DEFAULT_ON (also available via cmdline)

add a new Kconfig option. Module parameters and debugfs can still override.

Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Leo Yan <leo.yan@linaro.org>
[Fixed missing double quote in Kconfig title]
Link: https://lore.kernel.org/r/20210903182839.1.I20856983f2841b78936134dcf9cdf6ecafe632b9@changeid
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
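A minimal sketch of the usual pattern for letting a Kconfig symbol set the default of a module parameter; the symbol and parameter names here are assumptions for illustration, not necessarily the ones used by the driver:

	#include <linux/kconfig.h>
	#include <linux/moduleparam.h>

	/* Default comes from Kconfig; modparam and debugfs can still override at runtime. */
	static bool enable_on_boot = IS_ENABLED(CONFIG_CORESIGHT_CPU_DEBUG_DEFAULT_ON);
	module_param(enable_on_boot, bool, 0600);
	MODULE_PARM_DESC(enable_on_boot, "Enable CPU debug functionality at boot");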
2021-10-27 | coresight: cti: Correct the parameter for pm_runtime_put | Tao Zhang | 1 | -1/+1

The input parameter of pm_runtime_put() should be the same in cti_enable_hw() and cti_disable_hw(). The correct parameter to use here is dev->parent.

Signed-off-by: Tao Zhang <quic_taozha@quicinc.com>
Reviewed-by: Leo Yan <leo.yan@linaro.org>
Fixes: 835d722ba10a ("coresight: cti: Initial CoreSight CTI Driver")
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/1629365377-5937-1-git-send-email-quic_taozha@quicinc.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2021-10-21 | arm64: errata: Add detection for TRBE write to out-of-range | Suzuki K Poulose | 4 | -0/+66

Arm Neoverse-N2 and Cortex-A710 cores are affected by an erratum where the TRBE, under some circumstances, might write up to 64 bytes to an address after the Limit as programmed by TRBLIMITR_EL1.LIMIT. This might:

  - Corrupt a page in the ring buffer, which may corrupt trace from a previous session, consumed by userspace.
  - Hit the guard page at the end of the vmalloc area and raise a fault.

To keep the handling simpler, we always leave out the last page of the range which the TRBE is allowed to write. This can be achieved by ensuring that we always have more than a PAGE's worth of space in the range while calculating the LIMIT for the TRBE, and then adjusting the LIMIT pointer to leave that PAGE (TRBLIMITR.LIMIT -= PAGE_SIZE) out of the TRBE range while enabling it. This makes sure that the TRBE will only write to an area within its allowed limit (i.e. [head, head + size]) and we do not have to handle address faults within the driver.

Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20211019163153.3692640-5-suzuki.poulose@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
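A rough sketch of the guard-page idea, assuming the capability and helper names (they are illustrative, not the exact ones from the series):

	/* Sketch only: reserve one page below the programmed LIMIT on affected cores. */
	if (cpus_have_final_cap(ARM64_WORKAROUND_TRBE_WRITE_OUT_OF_RANGE))
		limit -= PAGE_SIZE;	/* a <=64-byte overspill past LIMIT now lands in our own, unused page */

	set_trbe_limit_pointer(limit);	/* assumed helper that programs TRBLIMITR_EL1.LIMIT */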
2021-10-21 | arm64: errata: Add workaround for TSB flush failures | Suzuki K Poulose | 5 | -1/+72

Arm Neoverse-N2 (#2067961) and Cortex-A710 (#2054223) suffer from an erratum where a TSB (trace synchronization barrier) fails to flush the trace data completely when executed from a trace prohibited region. In Linux we always execute it after we have moved the PE to a trace prohibited region, so we can apply the workaround every time a TSB is executed. The workaround is to issue two TSBs consecutively.

NOTE: This erratum is defined as a LOCAL_CPU_ERRATUM, implying that a late CPU could be blocked from booting if it is the first CPU that requires the workaround. This is because we do not allow setting a cpu_hwcap after the SMP boot. The other alternative is to use "this_cpu_has_cap()" instead of the faster system-wide check, which may be a bit of an overhead, given we may have to do this in the nVHE KVM host before a guest entry.

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20211019163153.3692640-4-suzuki.poulose@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
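A minimal sketch of the "two TSBs" workaround; the capability name is an assumption, and the real kernel would typically patch this in via the alternatives framework rather than a runtime branch:

	#include <asm/barrier.h>
	#include <asm/cpufeature.h>

	static inline void trace_sync_barrier(void)
	{
		tsb_csync();
		if (cpus_have_final_cap(ARM64_WORKAROUND_TSB_FLUSH_FAILURE))
			tsb_csync();	/* second TSB guarantees the flush on affected cores */
	}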
2021-10-21 | arm64: errata: Add detection for TRBE overwrite in FILL mode | Suzuki K Poulose | 4 | -0/+71

Arm Neoverse-N2 and Cortex-A710 cores are affected by a CPU erratum where the TRBE will overwrite the trace buffer in FILL mode. The TRBE doesn't stop (as expected in FILL mode) when it reaches the limit; it wraps to the base and continues writing up to 3 cache lines, overwriting any trace that was written previously.

Add the Neoverse-N2 erratum (#2139208) and the Cortex-A710 erratum (#2119858) to the detection logic. This will be used by the TRBE driver in later patches to work around the issue. The detection has been kept within the core arm64 errata framework list to make sure:

  - We don't duplicate the framework in the TRBE driver
  - The errata detection is advertised like the rest of the CPU errata

Note that the Kconfig entries are not fully active until the TRBE driver implements the workaround.

Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Leo Yan <leo.yan@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20211019163153.3692640-3-suzuki.poulose@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-10-21 | arm64: Add Neoverse-N2, Cortex-A710 CPU part definition | Suzuki K Poulose | 1 | -0/+4

Add the CPU part numbers for the new Arm designs.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Link: https://lore.kernel.org/r/20211019163153.3692640-2-suzuki.poulose@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2021-09-26 | Linux 5.15-rc3 | Linus Torvalds | 1 | -1/+1
2021-09-24 | ksmbd: use LOOKUP_BENEATH to prevent the out of share access | Hyunchul Lee | 5 | -206/+140

Instead of removing '..' in a given path, call kern_path() with the LOOKUP_BENEATH flag to prevent out-of-share access.

Ran various tests on this:

  smb2-cat-async smb://127.0.0.1/homes/../out_of_share
  smb2-cat-async smb://127.0.0.1/homes/foo/../../out_of_share
  smbclient //127.0.0.1/homes -c "mkdir ../foo2"
  smbclient //127.0.0.1/homes -c "rename bar ../bar"

Cc: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Cc: Ralph Boehme <slow@samba.org>
Tested-by: Steve French <smfrench@gmail.com>
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-09-24 | mm: fix uninitialized use in overcommit_policy_handler | Chen Jun | 1 | -2/+2

We get an unexpected value of /proc/sys/vm/overcommit_memory after running the following program:

  int main()
  {
      int fd = open("/proc/sys/vm/overcommit_memory", O_RDWR);
      write(fd, "1", 1);
      write(fd, "2", 1);
      close(fd);
  }

write(fd, "2", 1) will pass *ppos = 1 to proc_dointvec_minmax. proc_dointvec_minmax will return 0 without setting new_policy.

  t.data = &new_policy;
  ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos)
      --> do_proc_dointvec
          --> __do_proc_dointvec
                if (write) {
                    if (proc_first_pos_non_zero_ignore(ppos, table))
                        goto out;

  sysctl_overcommit_memory = new_policy;

so sysctl_overcommit_memory will be set to an uninitialized value.

Check whether new_policy has been changed by proc_dointvec_minmax.

Link: https://lkml.kernel.org/r/20210923020524.13289-1-chenjun102@huawei.com
Fixes: 56f3547bfa4d ("mm: adjust vm_committed_as_batch according to vm overcommit policy")
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
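A sketch of the defensive pattern, for the write path only; this is equivalent in spirit to the fix, but the surrounding handler details are assumptions:

	/* Write path of the sysctl handler (sketch). */
	int new_policy = -1;		/* sentinel: never a valid overcommit policy */
	struct ctl_table t = *table;

	t.data = &new_policy;
	ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
	if (ret || new_policy == -1)	/* *ppos != 0: helper returns 0 without writing new_policy */
		return ret;

	sysctl_overcommit_memory = new_policy;	/* only apply a value that was actually parsed */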
2021-09-24 | mm/memory_failure: fix the missing pte_unmap() call | Qi Zheng | 1 | -5/+5

The paired pte_unmap() call is missing before dev_pagemap_mapping_shift() returns. So fix it.

David says: "I guess this code never runs on 32bit / highmem, that's why we didn't notice so far".

[akpm@linux-foundation.org: cleanup]
Link: https://lkml.kernel.org/r/20210923122642.4999-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | kasan: always respect CONFIG_KASAN_STACK | Nathan Chancellor | 1 | -1/+2

Currently, the asan-stack parameter is only passed along if CFLAGS_KASAN_SHADOW is not empty, which requires KASAN_SHADOW_OFFSET to be defined in Kconfig so that the value can be checked. In RISC-V's case, KASAN_SHADOW_OFFSET is not defined in Kconfig, which means that asan-stack does not get disabled with clang even when CONFIG_KASAN_STACK is disabled, resulting in large stack warnings with allmodconfig:

  drivers/video/fbdev/omap2/omapfb/displays/panel-lgphilips-lb035q02.c:117:12:
  error: stack frame size (14400) exceeds limit (2048) in function 'lb035q02_connect' [-Werror,-Wframe-larger-than]
  static int lb035q02_connect(struct omap_dss_device *dssdev)
             ^
  1 error generated.

Ensure that the value of CONFIG_KASAN_STACK is always passed along to the compiler so that these warnings do not happen when CONFIG_KASAN_STACK is disabled.

Link: https://github.com/ClangBuiltLinux/linux/issues/1453
References: 6baec880d7a5 ("kasan: turn off asan-stack for clang-8 and earlier")
Link: https://lkml.kernel.org/r/20210922205525.570068-1-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | sh: pgtable-3level: fix cast to pointer from integer of different size | Geert Uytterhoeven | 1 | -1/+1

If X2TLB=y (CPU_SHX2=y or CPU_SHX3=y, e.g. migor_defconfig), pgd_t.pgd is "unsigned long long", causing:

  In file included from arch/sh/include/asm/pgtable.h:13,
                   from include/linux/pgtable.h:6,
                   from include/linux/mm.h:33,
                   from arch/sh/kernel/asm-offsets.c:14:
  arch/sh/include/asm/pgtable-3level.h: In function `pud_pgtable':
  arch/sh/include/asm/pgtable-3level.h:37:9: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     37 |  return (pmd_t *)pud_val(pud);
        |         ^

Fix this by adding an intermediate cast to "unsigned long", which is basically what the old code did before.

Link: https://lkml.kernel.org/r/2c2eef3c9a2f57e5609100a4864715ccf253d30f.1631713483.git.geert+renesas@glider.be
Fixes: 9cf6fa2458443118 ("mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t *")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Tested-by: Daniel Palmer <daniel@thingy.jp>
Acked-by: Rob Landley <rob@landley.net>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Rich Felker <dalias@libc.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Jacopo Mondi <jacopo+renesas@jmondi.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | mm/debug: sync up latest migrate_reason to migrate_reason_names | Weizhao Ouyang | 2 | -1/+6

Sync up MR_DEMOTION to migrate_reason_names and add a synch prompt.

Link: https://lkml.kernel.org/r/20210921064553.293905-3-o451686892@gmail.com
Fixes: 26aa2d199d6f ("mm/migrate: demote pages during reclaim")
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Xu <weixugc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | mm/debug: sync up MR_CONTIG_RANGE and MR_LONGTERM_PIN | Weizhao Ouyang | 1 | -1/+2

Sync up MR_CONTIG_RANGE and MR_LONGTERM_PIN to migrate_reason_names.

Link: https://lkml.kernel.org/r/20210921064553.293905-2-o451686892@gmail.com
Fixes: 310253514bbf ("mm/migrate: rename migration reason MR_CMA to MR_CONTIG_RANGE")
Fixes: d1e153fea2a8 ("mm/gup: migrate pinned pages out of movable zone")
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Wei Xu <weixugc@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | mm: fs: invalidate bh_lrus for only cold path | Minchan Kim | 3 | -7/+24

The kernel test robot reported a regression in fio.write_iops [1] with commit 8cc621d2f45d ("mm: fs: invalidate BH LRU during page migration"). Since lru_add_drain() is called frequently, invalidating bh_lrus there could increase the bh_lrus cache miss ratio, which needs more IO in the end. This patch moves the bh_lrus invalidation from the hot path (e.g., zap_page_range, pagevec_release) to the cold path (i.e., lru_add_drain_all, lru_cache_disable).

Zhengjun Xing confirmed: "I test the patch, the regression reduced to -2.9%".

[1] https://lore.kernel.org/lkml/20210520083144.GD14190@xsang-OptiPlex-9020/
[2] 8cc621d2f45d, mm: fs: invalidate BH LRU during page migration

Link: https://lkml.kernel.org/r/20210907212347.1977686-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Chris Goldsworthy <cgoldswo@codeaurora.org>
Tested-by: "Xing, Zhengjun" <zhengjun.xing@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | lib/zlib_inflate/inffast: check config in C to avoid unused function warning | Paul Menzel | 1 | -7/+6

Building Linux for ppc64le with Ubuntu clang version 12.0.0-3ubuntu1~21.04.1 shows the warning below:

  arch/powerpc/boot/inffast.c:20:1: warning: unused function 'get_unaligned16' [-Wunused-function]
  get_unaligned16(const unsigned short *p)
  ^
  1 warning generated.

Fix it by moving the check from the preprocessor to C, so the compiler sees the use.

Link: https://lkml.kernel.org/r/20210920084332.5752-1-pmenzel@molgen.mpg.de
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | tools/vm/page-types: remove dependency on opt_file for idle page tracking | Changbin Du | 1 | -1/+1

Idle page tracking can also be used for a process address space, not only file mappings. Without this change, using the '-i' option on a process address space reports the errors below:

  $ sudo ./page-types -p $(pidof bash) -i
  mark page idle: Bad file descriptor
  mark page idle: Bad file descriptor
  mark page idle: Bad file descriptor
  mark page idle: Bad file descriptor
  ...

Link: https://lkml.kernel.org/r/20210917032826.10669-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | scripts/sorttable: riscv: fix undeclared identifier 'EM_RISCV' error | Miles Chen | 1 | -0/+4

Fix the following build failure reported in [1] by adding a conditional definition of EM_RISCV, in order to allow cross-compilation on machines which do not have the EM_RISCV definition in their host headers:

  scripts/sorttable.c:352:7: error: use of undeclared identifier 'EM_RISCV'

EM_RISCV was added to <elf.h> in glibc 2.24, so builds on systems with glibc headers < 2.24 should show this error.

[mkubecek@suse.cz: changelog addition]
Link: https://lore.kernel.org/lkml/e8965b25-f15b-c7b4-748c-d207dda9c8e8@i2se.com/ [1]
Link: https://lkml.kernel.org/r/20210913030625.4525-1-miles.chen@mediatek.com
Fixes: 54fed35fd393 ("riscv: Enable BUILDTIME_TABLE_SORT")
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Reported-by: Stefan Wahren <stefan.wahren@i2se.com>
Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Reviewed-by: Jisheng Zhang <jszhang@kernel.org>
Cc: Michal Kubecek <mkubecek@suse.cz>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Markus Mayer <mmayer@broadcom.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | ocfs2: drop acl cache for directories too | Wengang Wang | 1 | -1/+2

ocfs2_data_convert_worker() currently drops any cached ACL info for FILE before down-converting the meta lock. It should also drop it for DIRECTORY; otherwise the second ACL lookup returns the cached one (from the VFS layer), which could already be stale.

The problem we are seeing is that ACL changes on one node don't get refreshed on other nodes in the following case:

  Node 1                          Node 2
  --------------                  ----------------
  getfacl dir1
                                  getfacl dir1   <-- this is OK
  setfacl -m u:user1:rwX dir1
  getfacl dir1                    <-- see the change for user1
                                  getfacl dir1   <-- can't see change for user1

Link: https://lkml.kernel.org/r/20210903012631.6099-1-wen.gang.wang@oracle.com
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | mm/shmem.c: fix judgment error in shmem_is_huge() | Liu Yuntao | 1 | -2/+2

In the case of SHMEM_HUGE_WITHIN_SIZE, the page index is not rounded up correctly. When the page index points to the first page in a huge page, round_up() cannot bring it to the end of that huge page, but only to the end of the previous one.

An example: HPAGE_PMD_NR on my machine is 512 (2 MB huge page size). After allocating a 3000 KB buffer, I access it at location 2050 KB. In shmem_is_huge(), the corresponding index happens to be 512. After being rounded up by HPAGE_PMD_NR, it will still be 512, which is smaller than i_size, and shmem_is_huge() will return true. As a result, my buffer takes an additional huge page, and that shouldn't happen when shmem_enabled is set to within_size.

Link: https://lkml.kernel.org/r/20210909032007.18353-1-liuyuntao10@huawei.com
Fixes: f3f0e1d2150b2b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Liu Yuntao <liuyuntao10@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: wuxu.wu <wuxu.wu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
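A worked sketch of the corrected rounding, using the numbers from the example above (illustrative; the exact surrounding code in shmem_is_huge() is not shown):

	/* file is 3000 KB -> i_size >> PAGE_SHIFT == 750 pages; access at 2050 KB -> index == 512 */
	/* old: round_up(512, 512)     == 512  <= 750  -> wrongly treated as within size          */
	/* new: round_up(512 + 1, 512) == 1024 >  750  -> correctly rejected under within_size    */
	index = round_up(index + 1, HPAGE_PMD_NR);
	return (i_size >> PAGE_SHIFT) >= index;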
2021-09-24 | xtensa: increase size of gcc stack frame check | Guenter Roeck | 1 | -1/+1

xtensa frame sizes are larger than the frame sizes for almost all other architectures. This results in more than 50 "the frame size of <n> is larger than 1024 bytes" errors when trying to build xtensa:allmodconfig. Increase the frame size limit for xtensa to 1536 bytes to avoid compile errors due to frame size limits.

Link: https://lkml.kernel.org/r/20210912025235.3514761-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | mm/damon: don't use strnlen() with known-bogus source length | Adam Borowski | 1 | -8/+8

gcc knows the true length too, and rightfully complains.

Link: https://lkml.kernel.org/r/20210912204447.10427-1-kilobyte@angband.pl
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Cc: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | kasan: fix Kconfig check of CC_HAS_WORKING_NOSANITIZE_ADDRESS | Marco Elver | 1 | -0/+2

In the main KASAN config option, CC_HAS_WORKING_NOSANITIZE_ADDRESS is checked for instrumentation-based modes. However, if HAVE_ARCH_KASAN_HW_TAGS is true, all modes may still be selected. To fix, also make the software modes depend on CC_HAS_WORKING_NOSANITIZE_ADDRESS.

Link: https://lkml.kernel.org/r/20210910084240.1215803-1-elver@google.com
Fixes: 6a63a63ff1ac ("kasan: introduce CONFIG_KASAN_HW_TAGS")
Signed-off-by: Marco Elver <elver@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | mm, hwpoison: add is_free_buddy_page() in HWPoisonHandlable() | Naoya Horiguchi | 1 | -1/+1

Commit fcc00621d88b ("mm/hwpoison: retry with shake_page() for unhandlable pages") changed the return value of __get_hwpoison_page() to retry for transiently unhandlable cases. However, __get_hwpoison_page() currently fails to properly judge buddy pages as handlable, so hard/soft offline for buddy pages always fails as "unhandlable page". This is totally regrettable.

So let's add is_free_buddy_page() in HWPoisonHandlable(), so that __get_hwpoison_page() returns different return values for buddy pages and unhandlable pages, as intended.

Link: https://lkml.kernel.org/r/20210909004131.163221-1-naoya.horiguchi@linux.dev
Fixes: fcc00621d88b ("mm/hwpoison: retry with shake_page() for unhandlable pages")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-24 | io_uring: make OP_CLOSE consistent with direct open | Pavel Begunkov | 1 | -1/+51

Recently, open/accept gained the ability to manipulate the fixed file table, but it's inconsistent that close can't. Close the gap; keep the API the same as with open/accept, i.e. via sqe->file_slot.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 | block: hold ->invalidate_lock in blkdev_fallocate | Ming Lei | 1 | -11/+10

When running ->fallocate(), blkdev_fallocate() should hold mapping->invalidate_lock to prevent the page cache from being accessed; otherwise stale data may be read from the page cache.

Without this patch, blktests block/009 fails sometimes; with this patch, block/009 always passes.

Also, as Jan pointed out, no pages can be created in the discarded area while you are holding the invalidate_lock, so remove the second truncate_bdev_range().

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210923023751.1441091-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
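A minimal sketch of the locking shape described above (illustrative only; error handling and the mode switch are elided, and the discard call stands in for whatever operation the requested mode performs):

	/* Punch/zero path of blkdev_fallocate(), sketched. */
	filemap_invalidate_lock(inode->i_mapping);

	error = truncate_bdev_range(bdev, file->f_mode, start, end);
	if (!error)
		error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
					     GFP_KERNEL, 0);

	/* No new pages can be instantiated in the range while the lock is held,
	 * so a second truncate_bdev_range() afterwards is unnecessary. */
	filemap_invalidate_unlock(inode->i_mapping);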
2021-09-24 | blktrace: Fix uaf in blk_trace access after removing by sysfs | Zhihao Cheng | 1 | -0/+8

There is a use-after-free problem triggered by the following process:

  P1(sda)                               P2(sdb)
                                        echo 0 > /sys/block/sdb/trace/enable
                                          blk_trace_remove_queue
                                            synchronize_rcu
                                            blk_trace_free
                                              relay_close
  rcu_read_lock
  __blk_add_trace
    trace_note_tsk
    (Iterate running_trace_list)
                                              relay_close_buf
                                                relay_destroy_buf
                                                  kfree(buf)
      trace_note(sdb's bt)
        relay_reserve
          buf->offset <- nullptr deference (use-after-free) !!!
  rcu_read_unlock

  [  502.714379] BUG: kernel NULL pointer dereference, address: 0000000000000010
  [  502.715260] #PF: supervisor read access in kernel mode
  [  502.715903] #PF: error_code(0x0000) - not-present page
  [  502.716546] PGD 103984067 P4D 103984067 PUD 17592b067 PMD 0
  [  502.717252] Oops: 0000 [#1] SMP
  [  502.720308] RIP: 0010:trace_note.isra.0+0x86/0x360
  [  502.732872] Call Trace:
  [  502.733193]  __blk_add_trace.cold+0x137/0x1a3
  [  502.733734]  blk_add_trace_rq+0x7b/0xd0
  [  502.734207]  blk_add_trace_rq_issue+0x54/0xa0
  [  502.734755]  blk_mq_start_request+0xde/0x1b0
  [  502.735287]  scsi_queue_rq+0x528/0x1140
  ...
  [  502.742704]  sg_new_write.isra.0+0x16e/0x3e0
  [  502.747501]  sg_ioctl+0x466/0x1100

Reproduce method:

  ioctl(/dev/sda, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
  ioctl(/dev/sda, BLKTRACESTART)
  ioctl(/dev/sdb, BLKTRACESETUP, blk_user_trace_setup[buf_size=127])
  ioctl(/dev/sdb, BLKTRACESTART)

  echo 0 > /sys/block/sdb/trace/enable &
  // Add delay(mdelay/msleep) before kernel enters blk_trace_free()

  ioctl$SG_IO(/dev/sda, SG_IO, ...)
  // Enters trace_note_tsk() after blk_trace_free() returned
  // Use mdelay in rcu region rather than msleep(which may schedule out)

Remove blk_trace from the running_list before calling blk_trace_free() from sysfs, if blk_trace is in the Blktrace_running state.

Fixes: c71a896154119f ("blktrace: add ftrace plugin")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20210923134921.109194-1-chengzhihao1@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 | block: don't call rq_qos_ops->done_bio if the bio isn't tracked | Ming Lei | 1 | -1/+1

The rq_qos framework is only applied to request-based drivers, so:

  1) rq_qos_done_bio() needn't be called for bio-based drivers
  2) rq_qos_done_bio() needn't be called for bios which aren't tracked, such as bios ended from error handling code

Especially in bio_endio():

  1) the request queue is referred to via bio->bi_bdev->bd_disk->queue, which may be gone since the request queue refcount may not be held in the above two cases
  2) q->rq_qos may be freed in blk_cleanup_queue() when calling into __rq_qos_done_bio()

Fix the potential kernel panic by not calling rq_qos_ops->done_bio if the bio isn't tracked. This is safe because both ioc_rqos_done_bio() and blkcg_iolatency_done_bio() are nops if the bio isn't tracked.

Reported-by: Yu Kuai <yukuai3@huawei.com>
Cc: tj@kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210924110704.1541818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
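A rough sketch of the guard described above, placed in bio_endio(); this is an illustration of the idea rather than the exact upstream hunk:

	/* Only bios that went through the rq_qos path carry BIO_TRACKED;
	 * skipping untracked bios avoids touching a possibly-gone queue. */
	if (bio->bi_bdev && bio_flagged(bio, BIO_TRACKED))
		rq_qos_done_bio(bio->bi_bdev->bd_disk->queue, bio);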
2021-09-24 | io_uring: kill extra checks in io_write() | Pavel Begunkov | 1 | -3/+0

We don't retry short writes and so we would never get to async setup in io_write() in that case. Thus ret2 > 0 is always false and iov_iter_advance() is never used. Apparently, the same is found by Coverity, which complains about the code.

Fixes: cd65869512ab ("io_uring: use iov_iter state save/restore helpers")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b33e61034748ef1022766efc0fb8854cfcf749c.1632500058.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 | io_uring: don't punt files update to io-wq unconditionally | Jens Axboe | 1 | -5/+2

There's no reason to punt it unconditionally; we just need to ensure that the submit lock grabbing is conditional.

Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-24 | io_uring: put provided buffer meta data under memcg accounting | Jens Axboe | 1 | -1/+1

For each provided buffer, we allocate a struct io_buffer to hold the data associated with it. As a large number of buffers can be provided, account that data with memcg.

Fixes: ddf0322db79c ("io_uring: add IORING_OP_PROVIDE_BUFFERS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
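A one-line sketch of the accounting change described above (illustrative; the surrounding allocation loop is not shown):

	/* GFP_KERNEL_ACCOUNT charges the metadata to the allocating task's memcg. */
	struct io_buffer *buf = kmalloc(sizeof(*buf), GFP_KERNEL_ACCOUNT);
	if (!buf)
		return -ENOMEM;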