Age | Commit message (Collapse) | Author | Files | Lines |
|
For zoned block devices, a write BIO issued to a zone that has no
on-going writes will be prepared for execution and allowed to execute
immediately by blk_zone_wplug_handle_write() (called from
blk_zone_plug_bio()). However, if this BIO specifies REQ_NOWAIT, the
allocation of a request for its execution in blk_mq_submit_bio() may
fail after blk_zone_plug_bio() completed, marking the target zone of the
BIO as plugged. When this BIO is retried later on, it will be blocked as
the zone write plug of the target zone is in a plugged state without any
on-going write operation (completion of write operations trigger
unplugging of the next write BIOs for a zone). This leads to a BIO that
is stuck in a zone write plug and never completes, which results in
various issues such as hung tasks.
Avoid this problem by always executing REQ_NOWAIT write BIOs using the
BIO work of a zone write plug. This ensure that we never block the BIO
issuer and can thus safely ignore the REQ_NOWAIT flag when executing the
BIO from the zone write plug BIO work.
Since such BIO may be the first write BIO issued to a zone with no
on-going write, modify disk_zone_wplug_add_bio() to schedule the zone
write plug BIO work if the write plug is not already marked with the
BLK_ZONE_WPLUG_PLUGGED flag. This scheduling is otherwise not necessary
as the completion of the on-going write for the zone will schedule the
execution of the next plugged BIOs.
blk_zone_wplug_handle_write() is also fixed to better handle zone write
plug allocation failures for REQ_NOWAIT BIOs by failing a write BIO
using bio_wouldblock_error() instead of bio_io_error().
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Registering and unregistering cpuhp callback requires global cpu hotplug lock,
which is used everywhere. Meantime q->sysfs_lock is used in block layer
almost everywhere.
It is easy to trigger lockdep warning[1] by connecting the two locks.
Fix the warning by moving blk-mq's cpuhp callback registering out of
q->sysfs_lock. Add one dedicated global lock for covering registering &
unregistering hctx's cpuhp, and it is safe to do so because hctx is
guaranteed to be live if our request_queue is live.
[1] https://lore.kernel.org/lkml/Z04pz3AlvI4o0Mr8@agluck-desk3/
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Newman <peternewman@google.com>
Cc: Babu Moger <babu.moger@amd.com>
Reported-by: Luck Tony <tony.luck@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241206111611.978870-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We need to retrieve 'hctx' from xarray table in the cpuhp callback, so the
callback should be registered after this 'hctx' is added to xarray table.
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Newman <peternewman@google.com>
Cc: Babu Moger <babu.moger@amd.com>
Cc: Luck Tony <tony.luck@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241206111611.978870-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 4ce6e2db00de ("virtio-blk: Ensure no requests in virtqueues before
deleting vqs.") replaces queue quiesce with queue freeze in virtio-blk's
PM callbacks. And the motivation is to drain inflight IOs before suspending.
block layer's queue freeze looks very handy, but it is also easy to cause
deadlock, such as, any attempt to call into bio_queue_enter() may run into
deadlock if the queue is frozen in current context. There are all kinds
of ->suspend() called in suspend context, so keeping queue frozen in the
whole suspend context isn't one good idea. And Marek reported lockdep
warning[1] caused by virtio-blk's freeze queue in virtblk_freeze().
[1] https://lore.kernel.org/linux-block/ca16370e-d646-4eee-b9cc-87277c89c43c@samsung.com/
Given the motivation is to drain in-flight IOs, it can be done by calling
freeze & unfreeze, meantime restore to previous behavior by keeping queue
quiesced during suspend.
Cc: Yi Sun <yi.sun@unisoc.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: virtualization@lists.linux.dev
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Link: https://lore.kernel.org/r/20241112125821.1475793-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
As nvme_tcp_teardown_io_queues() is the only one caller of
nvme_tcp_destroy_admin_queue(), so we can merge it into
nvme_tcp_teardown_io_queues() to simplify the code.
Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
As we quiesce admin_q in nvme_tcp_teardown_admin_queue(), so we should no
need to quiesce it in nvme_tcp_reaardown_io_queues(), make things simple.
Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Kernel will hang on destroy admin_q while we create ctrl failed, such
as following calltrace:
PID: 23644 TASK: ff2d52b40f439fc0 CPU: 2 COMMAND: "nvme"
#0 [ff61d23de260fb78] __schedule at ffffffff8323bc15
#1 [ff61d23de260fc08] schedule at ffffffff8323c014
#2 [ff61d23de260fc28] blk_mq_freeze_queue_wait at ffffffff82a3dba1
#3 [ff61d23de260fc78] blk_freeze_queue at ffffffff82a4113a
#4 [ff61d23de260fc90] blk_cleanup_queue at ffffffff82a33006
#5 [ff61d23de260fcb0] nvme_rdma_destroy_admin_queue at ffffffffc12686ce
#6 [ff61d23de260fcc8] nvme_rdma_setup_ctrl at ffffffffc1268ced
#7 [ff61d23de260fd28] nvme_rdma_create_ctrl at ffffffffc126919b
#8 [ff61d23de260fd68] nvmf_dev_write at ffffffffc024f362
#9 [ff61d23de260fe38] vfs_write at ffffffff827d5f25
RIP: 00007fda7891d574 RSP: 00007ffe2ef06958 RFLAGS: 00000202
RAX: ffffffffffffffda RBX: 000055e8122a4d90 RCX: 00007fda7891d574
RDX: 000000000000012b RSI: 000055e8122a4d90 RDI: 0000000000000004
RBP: 00007ffe2ef079c0 R8: 000000000000012b R9: 000055e8122a4d90
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000004
R13: 000055e8122923c0 R14: 000000000000012b R15: 00007fda78a54500
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
This due to we have quiesced admi_q before cancel requests, but forgot
to unquiesce before destroy it, as a result we fail to drain the
pending requests, and hang on blk_mq_freeze_queue_wait() forever. Here
try to reuse nvme_rdma_teardown_admin_queue() to fix this issue and
simplify the code.
Fixes: 958dc1d32c80 ("nvme-rdma: add clean action for failed reconnection")
Reported-by: Yingfu.zhou <yingfu.zhou@shopee.com>
Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Signed-off-by: Yue.zhao <yue.zhao@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Now while we create new ctrl failed, we have not free the
tagset occupied by admin_q, here try to fix it.
Fixes: fd1418de10b9 ("nvme-tcp: avoid open-coding nvme_tcp_teardown_admin_queue()")
Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Only call into nvme_alloc_host_mem_single which uses
dma_alloc_noncontiguous when there is non-null dma merge boundary.
Without this we'll call into dma_alloc_noncontiguous for device using
dma-direct, which can work fine as long as the preferred size is below the
MAX_ORDER of the page allocator, but blows up with a warning if it is
too large.
Fixes: 63a5c7a4b4c4 ("nvme-pci: use dma_alloc_noncontigous if possible")
Reported-by: Leon Romanovsky <leon@kernel.org>
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
cocci warnings: (new ones prefixed by >>)
>> drivers/nvme/target/pr.c:831:8-15: WARNING: kzalloc should be used for data, instead of kmalloc/memset
The pattern of using 'kmalloc' followed by 'memset' is replaced with
'kzalloc', which is functionally equivalent to 'kmalloc' + 'memset',
but more efficient. 'kzalloc' automatically zeroes the allocated
memory, making it a faster and more streamlined solution.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202411301434.LEckbcWx-lkp@intel.com/
Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The NVMe specification states that MAXCMD is mandatory
for NVMe-over-Fabrics implementations. However, some NVMe/TCP
and NVMe/FC arrays from major vendors have buggy firmware
that reports MAXCMD as zero in the Identify Controller data structure.
Currently, the implementation closes the connection in such cases,
completely preventing the host from connecting to the target.
Fix the issue by printing a clear error message about the firmware bug
and allowing the connection to proceed. It assumes that the
target supports a MAXCMD value of SQSIZE + 1. If any issues arise,
the user can manually adjust SQSIZE to mitigate them.
Fixes: 4999568184e5 ("nvme-fabrics: check max outstanding commands")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Commit 028ddcac477b ("bcache: Remove unnecessary NULL point check in
node allocations") leads a NULL pointer deference in cache_set_flush().
1721 if (!IS_ERR_OR_NULL(c->root))
1722 list_add(&c->root->list, &c->btree_cache);
>From the above code in cache_set_flush(), if previous registration code
fails before allocating c->root, it is possible c->root is NULL as what
it is initialized. __bch_btree_node_alloc() never returns NULL but
c->root is possible to be NULL at above line 1721.
This patch replaces IS_ERR() by IS_ERR_OR_NULL() to fix this.
Fixes: 028ddcac477b ("bcache: Remove unnecessary NULL point check in node allocations")
Signed-off-by: Liequan Che <cheliequan@inspur.com>
Cc: stable@vger.kernel.org
Cc: Zheng Wang <zyytlz.wz@163.com>
Reviewed-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20241202115638.28957-1-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The quirk was initially used as a signal to set the discard_zeroes_data
queue limit because there were some use cases that relied on that
behavior. The queue limit no longer exists as every user of it has been
converted to use the write zeroes operation instead.
The quirk now means to use a discard command as an alias to a write
zeroes request. Two of the devices previously using the quirk support
the write zeroes command directly, so these don't need or want to use
discard when the desired operation is to write zeroes.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add the missing description to fix the following warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/rnull_mod.o
Signed-off-by: FUJITA Tomonori <fujita.tomonori@gmail.com>
Acked-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20241130094521.193924-1-fujita.tomonori@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Clean up the existing export namespace code along the same lines of
commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo)
to __section("foo")") and for the same reason, it is not desired for the
namespace argument to be a macro expansion itself.
Scripted using
git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file;
do
awk -i inplace '
/^#define EXPORT_SYMBOL_NS/ {
gsub(/__stringify\(ns\)/, "ns");
print;
next;
}
/^#define MODULE_IMPORT_NS/ {
gsub(/__stringify\(ns\)/, "ns");
print;
next;
}
/MODULE_IMPORT_NS/ {
$0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g");
}
/EXPORT_SYMBOL_NS/ {
if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) {
if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ &&
$0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ &&
$0 !~ /^my/) {
getline line;
gsub(/[[:space:]]*\\$/, "");
gsub(/[[:space:]]/, "", line);
$0 = $0 " " line;
}
$0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/,
"\\1(\\2, \"\\3\")", "g");
}
}
{ print }' $file;
done
Requested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of
nvme_config_discard") started applying the NVME_QUIRK_DEALLOCATE_ZEROES
quirk even then the Dataset Management is not supported. It turns out
that there versions of these old Intel SSDs that have DSM support
disabled in the firmware, which will now lead to errors everytime
a Write Zeroes command is issued. Fix this by checking for DSM support
before applying the quirk.
Reported-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Fixes: 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
Tested-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The nvme_execute_identify_ns_nvm function uses ZERO_PAGE for copying
SG list with all zeros. As ZERO_PAGE would not necessarily return the
virtual-address of the zero page, we need to first convert the page
address to kernel virtual-address and then use it as source address
for copying the data to SG list with all zeros. Using return address
of ZERO_PAGE(0) as source address for copying data to SG list would
fill the target buffer with random/garbage value and causes the
undesired side effect.
As other identify implemenations uses kzalloc for allocating a zero
filled buffer, we decided use kzalloc for allocating a zero filled
buffer in nvme_execute_identify_ns_nvm function and then use this
buffer for copying all zeros to SG list buffers. So esentially, we
now avoid using ZERO_PAGE.
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Fixes: 64a51080eaba ("nvmet: implement id ns for nvm command set")
Link: https://lore.kernel.org/all/CAHj4cs8OVyxmn4XTvA=y4uQ3qWpdw-x3M3FSUYr-KpE-nhaFEA@mail.gmail.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The continual trickle of small conversion patches is grating on me, and
is really not helping. Just get rid of the 'remove_new' member
function, which is just an alias for the plain 'remove', and had a
comment to that effect:
/*
* .remove_new() is a relic from a prototype conversion of .remove().
* New drivers are supposed to implement .remove(). Once all drivers are
* converted to not use .remove_new any more, it will be dropped.
*/
This was just a tree-wide 'sed' script that replaced '.remove_new' with
'.remove', with some care taken to turn a subsequent tab into two tabs
to make things line up.
I did do some minimal manual whitespace adjustment for places that used
spaces to line things up.
Then I just removed the old (sic) .remove_new member function, and this
is the end result. No more unnecessary conversion noise.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
The point behind strscpy() was to once and for all avoid all the
problems with 'strncpy()' and later broken "fixed" versions like
strlcpy() that just made things worse.
So strscpy not only guarantees NUL-termination (unlike strncpy), it also
doesn't do unnecessary padding at the destination. But at the same time
also avoids byte-at-a-time reads and writes by _allowing_ some extra NUL
writes - within the size, of course - so that the whole copy can be done
with word operations.
It is also stable in the face of a mutable source string: it explicitly
does not read the source buffer multiple times (so an implementation
using "strnlen()+memcpy()" would be wrong), and does not read the source
buffer past the size (like the mis-design that is strlcpy does).
Finally, the return value is designed to be simple and unambiguous: if
the string cannot be copied fully, it returns an actual negative error,
making error handling clearer and simpler (and the caller already knows
the size of the buffer). Otherwise it returns the string length of the
result.
However, there was one final stability issue that can be important to
callers: the stability of the destination buffer.
In particular, the same way we shouldn't read the source buffer more
than once, we should avoid doing multiple writes to the destination
buffer: first writing a potentially non-terminated string, and then
terminating it with NUL at the end does not result in a stable result
buffer.
Yes, it gives the right result in the end, but if the rule for the
destination buffer was that it is _always_ NUL-terminated even when
accessed concurrently with updates, the final byte of the buffer needs
to always _stay_ as a NUL byte.
[ Note that "final byte is NUL" here is literally about the final byte
in the destination array, not the terminating NUL at the end of the
string itself. There is no attempt to try to make concurrent reads and
writes give any kind of consistent string length or contents, but we
do want to guarantee that there is always at least that final
terminating NUL character at the end of the destination array if it
existed before ]
This is relevant in the kernel for the tsk->comm[] array, for example.
Even without locking (for either readers or writers), we want to know
that while the buffer contents may be garbled, it is always a valid C
string and always has a NUL character at 'comm[TASK_COMM_LEN-1]' (and
never has any "out of thin air" data).
So avoid any "copy possibly non-terminated string, and terminate later"
behavior, and write the destination buffer only once.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
bprintf() is unused. Remove it. It was added in the commit 4370aa4aa753
("vsprintf: add binary printf") but as far as I can see was never used,
unlike the other two functions in that patch.
Link: https://lore.kernel.org/20241002173147.210107-1-linux@treblig.org
Reviewed-by: Andy Shevchenko <andy@kernel.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
since 2024.07.26:
assorted minor bug fixes
assorted platform specific tweaks
initial RAPL PSYS (SysWatt) support
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Introduce the counter as a part of global, platform counters structure.
We open the counter for only one cpu, but otherwise treat it as an
ordinary RAPL counter, allowing for grouped perf read.
The counter is disabled by default, because it's interpretation may
require additional, platform specific information, making it unsuitable
for general use.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Add '+' to optstring when early scanning for --no-msr and --no-perf.
It causes option processing to stop as soon as a nonoption argument is
encountered, effectively skipping child's arguments.
Fixes: 3e4048466c39 ("tools/power turbostat: Add --no-msr option")
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Force the --no-perf early to prevent using it as a source. User asks for
raw values, but perf returns them relative to the opening of the file
descriptor.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
On some machines, the graphics device is enumerated as
/sys/class/drm/card1 instead of /sys/class/drm/card0. The current
implementation does not handle this scenario, resulting in the loss of
graphics C6 residency and frequency information.
Add support for /sys/class/drm/card1, ensuring that turbostat can
retrieve and display the graphics columns for these platforms.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Snapshots of the graphics sysfs knobs are taken based on file
descriptors. To optimize this process, open the files and cache the file
descriptors during the graphics probe phase. As a result, the previously
cached pathnames become redundant and are removed.
This change aims to streamline the code without altering its functionality.
No functional change intended.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Currently, there is an inconsistency in how graphics sysfs knobs are
accessed: graphics residency sysfs knobs are opened and closed for each
read, while graphics frequency sysfs knobs are opened once and remain
open until turbostat exits. This inconsistency is confusing and adds
unnecessary code complexity.
Consolidate the access method by opening the sysfs files once and
reusing the file pointers for subsequent accesses. This approach
simplifies the code and ensures a consistent method for accessing
graphics sysfs knobs.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
The graphics sysfs knobs are read-only, making the use of fflush()
before reading them redundant.
Remove the unnecessary fflush() call.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
In various generations, platforms often share a majority of features,
diverging only in a few specific aspects. The current approach of using
hardcoded values in 'platform_features' structure fails to effectively
represent these divergences.
To improve the description of platform divergence:
1. Each newly introduced 'platform_features' structure must have a base,
typically derived from the previous generation.
2. Platform feature values should be inherited from the base structure
rather than being hardcoded.
This approach ensures a more accurate and maintainable representation of
platform-specific features across different generations.
Converts `adl_features` and `lnl_features` to follow this new scheme.
No functional change.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Add initial support for GraniteRapids-D. It shares the same features
with SapphireRapids.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Lunarlake supports CC1/CC6/CC7/PC2/PC6/PC10.
Remove PC3 support on Lunarlake.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
As ARL shares the same features with ADL/RPL/MTL, now 'arl_features' is
used by Lunarlake platform only.
Rename 'arl_features' to 'lnl_features'.
No functional change.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Similar to ADL/RPL/MTL, ARL supports CC1/CC6/CC7/PC2/PC3/PC6/PC8/PC10.
Add back PC8 support on Arrowlake.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Similar to ADL/RPL, MTL support CC1/CC6/CC7/PC2/PC3/PC6/PC8/CP10.
Remove PC7/PC9 support on MTL.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Honor --show CPU and --show Core when "topo.num_cpus == 1".
Previously turbostat assumed that on a 1-CPU system, these
columns should never appear.
Honoring these flags makes it easier for several programs
that parse turbostat output.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
parse_cpu_string() parses the string input either from command line or
from /sys/fs/cgroup/cpuset.cpus.effective to get a list of CPUs that
turbostat can run with.
The cpu string returned by /sys/fs/cgroup/cpuset.cpus.effective contains
a trailing '\n', but strtoul() fails to treat this as an error.
That says, for the code below
val = ("\n", NULL, 10);
val returns 0, and errno is also not set.
As a result, CPU0 is erroneously considered as allowed CPU and this
causes failures when turbostat tries to run on CPU0.
get_counters: Could not migrate to CPU 0
...
turbostat: re-initialized with num_cpus 8, allowed_cpus 5
get_counters: Could not migrate to CPU 0
Add a check to return immediately if '\n' or '\0' is detected.
Fixes: 8c3dd2c9e542 ("tools/power/turbostat: Abstrct function for parsing cpu string")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Intel hybrid platforms expose different perf devices for P and E cores.
Instead of one, "/sys/bus/event_source/devices/cpu" device, there are
"/sys/bus/event_source/devices/{cpu_core,cpu_atom}".
This, however makes it more complicated for the user,
because most of the counters are available on both and had to be
handled manually.
This patch allows users to use "virtual" cpu device that is seemingly
translated to cpu_core and cpu_atom perf devices, depending on the type
of a CPU we are opening the counter for.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
If the very first printed column was for a PMT counter of type xtal_time
we would misalign the column header, because we were always printing the
delimiter.
Signed-off-by: Patryk Wlazlyn <patryk.wlazlyn@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
Fix build regression seen when using old gcc-9 compiler.
Signed-off-by: Todd Brandt <todd.e.brandt@intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
|
|
If a PCI device has an associated device_node with power supplies,
pci_bus_add_device() creates platform devices for use by pwrctrl. When the
PCI device is removed, pci_stop_dev() uses of_find_device_by_node() to
locate the related platform device, then unregisters it.
But when we remove a PCI device with no associated device node,
dev_of_node(dev) is NULL, and of_find_device_by_node(NULL) returns the
first device with "dev->of_node == NULL". The result is that we (a)
mistakenly unregister a completely unrelated platform device, leading to
issues like the first trace below, and (b) dereference the NULL pointer
from dev_of_node() when clearing OF_POPULATED, as in the second trace.
Unregister a platform device only if there is one associated with this PCI
device. This resolves issues seen when doing:
# echo 1 > /sys/bus/pci/devices/.../remove
Sample issue from unregistering the wrong platform device:
WARNING: CPU: 0 PID: 5095 at drivers/regulator/core.c:5885 regulator_unregister+0x140/0x160
Call trace:
regulator_unregister+0x140/0x160
devm_rdev_release+0x1c/0x30
release_nodes+0x68/0x100
devres_release_all+0x98/0xf8
device_unbind_cleanup+0x20/0x70
device_release_driver_internal+0x1f4/0x240
device_release_driver+0x20/0x40
bus_remove_device+0xd8/0x170
device_del+0x154/0x380
device_unregister+0x28/0x88
of_device_unregister+0x1c/0x30
pci_stop_bus_device+0x154/0x1b0
pci_stop_and_remove_bus_device_locked+0x28/0x48
remove_store+0xa0/0xb8
dev_attr_store+0x20/0x40
sysfs_kf_write+0x4c/0x68
Later NULL pointer dereference for of_node_clear_flag(NULL, OF_POPULATED):
Unable to handle kernel NULL pointer dereference at virtual address 00000000000000c0
Call trace:
pci_stop_bus_device+0x190/0x1b0
pci_stop_and_remove_bus_device_locked+0x28/0x48
remove_store+0xa0/0xb8
dev_attr_store+0x20/0x40
sysfs_kf_write+0x4c/0x68
Link: https://lore.kernel.org/r/20241126210443.4052876-1-briannorris@chromium.org
Fixes: 681725afb6b9 ("PCI/pwrctl: Remove pwrctl device without iterating over all children of pwrctl parent")
Reported-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Closes: https://lore.kernel.org/r/1732890621-19656-1-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Brian Norris <briannorris@chromium.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
This reverts commit 3791ea69a4858b81e0277f695ca40f5aae40f312.
It was reported to cause boot-time issues, so revert it for now.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 3791ea69a485 ("serial: sh-sci: Clean sci_ports[0] after at earlycon exit")
Cc: stable <stable@kernel.org>
Cc: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
In the error handling for this function, d is freed without ever
removing it from intc_list which would lead to a use after free.
To fix this, let's only add it to the list after everything has
succeeded.
Fixes: 2dcec7a988a1 ("sh: intc: set_irq_wake() support")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
|
|
When CONFIG_CPUMASK_OFFSTACK and CONFIG_DEBUG_PER_CPU_MAPS are selected,
cpu_max_bits_warn() generates a runtime warning similar as below when
showing /proc/cpuinfo. Fix this by using nr_cpu_ids (the runtime limit)
instead of NR_CPUS to iterate CPUs.
[ 3.052463] ------------[ cut here ]------------
[ 3.059679] WARNING: CPU: 3 PID: 1 at include/linux/cpumask.h:108 show_cpuinfo+0x5e8/0x5f0
[ 3.070072] Modules linked in: efivarfs autofs4
[ 3.076257] CPU: 0 PID: 1 Comm: systemd Not tainted 5.19-rc5+ #1052
[ 3.099465] Stack : 9000000100157b08 9000000000f18530 9000000000cf846c 9000000100154000
[ 3.109127] 9000000100157a50 0000000000000000 9000000100157a58 9000000000ef7430
[ 3.118774] 90000001001578e8 0000000000000040 0000000000000020 ffffffffffffffff
[ 3.128412] 0000000000aaaaaa 1ab25f00eec96a37 900000010021de80 900000000101c890
[ 3.138056] 0000000000000000 0000000000000000 0000000000000000 0000000000aaaaaa
[ 3.147711] ffff8000339dc220 0000000000000001 0000000006ab4000 0000000000000000
[ 3.157364] 900000000101c998 0000000000000004 9000000000ef7430 0000000000000000
[ 3.167012] 0000000000000009 000000000000006c 0000000000000000 0000000000000000
[ 3.176641] 9000000000d3de08 9000000001639390 90000000002086d8 00007ffff0080286
[ 3.186260] 00000000000000b0 0000000000000004 0000000000000000 0000000000071c1c
[ 3.195868] ...
[ 3.199917] Call Trace:
[ 3.203941] [<90000000002086d8>] show_stack+0x38/0x14c
[ 3.210666] [<9000000000cf846c>] dump_stack_lvl+0x60/0x88
[ 3.217625] [<900000000023d268>] __warn+0xd0/0x100
[ 3.223958] [<9000000000cf3c90>] warn_slowpath_fmt+0x7c/0xcc
[ 3.231150] [<9000000000210220>] show_cpuinfo+0x5e8/0x5f0
[ 3.238080] [<90000000004f578c>] seq_read_iter+0x354/0x4b4
[ 3.245098] [<90000000004c2e90>] new_sync_read+0x17c/0x1c4
[ 3.252114] [<90000000004c5174>] vfs_read+0x138/0x1d0
[ 3.258694] [<90000000004c55f8>] ksys_read+0x70/0x100
[ 3.265265] [<9000000000cfde9c>] do_syscall+0x7c/0x94
[ 3.271820] [<9000000000202fe4>] handle_syscall+0xc4/0x160
[ 3.281824] ---[ end trace 8b484262b4b8c24c ]---
Cc: stable@vger.kernel.org
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Reviewed-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
|
|
The number of allocated pages which discarded will not decrease.
Fix it.
Fixes: 9ead7efc6f3f ("brd: implement discard support")
Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20241128170056565nPKSz2vsP8K8X2uk2iaDG@zte.com.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Set new allocated bfqq to bic or remove freed bfqq from bic are both
protected by bfqd->lock, however bfq_limit_depth() is deferencing bfqq
from bic without the lock, this can lead to UAF if the io_context is
shared by multiple tasks.
For example, test bfq with io_uring can trigger following UAF in v6.6:
==================================================================
BUG: KASAN: slab-use-after-free in bfqq_group+0x15/0x50
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x80
print_address_description.constprop.0+0x66/0x300
print_report+0x3e/0x70
kasan_report+0xb4/0xf0
bfqq_group+0x15/0x50
bfqq_request_over_limit+0x130/0x9a0
bfq_limit_depth+0x1b5/0x480
__blk_mq_alloc_requests+0x2b5/0xa00
blk_mq_get_new_requests+0x11d/0x1d0
blk_mq_submit_bio+0x286/0xb00
submit_bio_noacct_nocheck+0x331/0x400
__block_write_full_folio+0x3d0/0x640
writepage_cb+0x3b/0xc0
write_cache_pages+0x254/0x6c0
write_cache_pages+0x254/0x6c0
do_writepages+0x192/0x310
filemap_fdatawrite_wbc+0x95/0xc0
__filemap_fdatawrite_range+0x99/0xd0
filemap_write_and_wait_range.part.0+0x4d/0xa0
blkdev_read_iter+0xef/0x1e0
io_read+0x1b6/0x8a0
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork_asm+0x1b/0x30
</TASK>
Allocated by task 808602:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_slab_alloc+0x83/0x90
kmem_cache_alloc_node+0x1b1/0x6d0
bfq_get_queue+0x138/0xfa0
bfq_get_bfqq_handle_split+0xe3/0x2c0
bfq_init_rq+0x196/0xbb0
bfq_insert_request.isra.0+0xb5/0x480
bfq_insert_requests+0x156/0x180
blk_mq_insert_request+0x15d/0x440
blk_mq_submit_bio+0x8a4/0xb00
submit_bio_noacct_nocheck+0x331/0x400
__blkdev_direct_IO_async+0x2dd/0x330
blkdev_write_iter+0x39a/0x450
io_write+0x22a/0x840
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Freed by task 808589:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x27/0x40
__kasan_slab_free+0x126/0x1b0
kmem_cache_free+0x10c/0x750
bfq_put_queue+0x2dd/0x770
__bfq_insert_request.isra.0+0x155/0x7a0
bfq_insert_request.isra.0+0x122/0x480
bfq_insert_requests+0x156/0x180
blk_mq_dispatch_plug_list+0x528/0x7e0
blk_mq_flush_plug_list.part.0+0xe5/0x590
__blk_flush_plug+0x3b/0x90
blk_finish_plug+0x40/0x60
do_writepages+0x19d/0x310
filemap_fdatawrite_wbc+0x95/0xc0
__filemap_fdatawrite_range+0x99/0xd0
filemap_write_and_wait_range.part.0+0x4d/0xa0
blkdev_read_iter+0xef/0x1e0
io_read+0x1b6/0x8a0
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Fix the problem by protecting bic_to_bfqq() with bfqd->lock.
CC: Jan Kara <jack@suse.cz>
Fixes: 76f1df88bbc2 ("bfq: Limit number of requests consumed by each cgroup")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241129091509.2227136-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
syzbot triggered the following WARN_ON:
WARNING: CPU: 0 PID: 16 at io_uring/tctx.c:51 __io_uring_free+0xfa/0x140 io_uring/tctx.c:51
which is the
WARN_ON_ONCE(!xa_empty(&tctx->xa));
sanity check in __io_uring_free() when a io_uring_task is going through
its final put. The syzbot test case includes injecting memory allocation
failures, and it very much looks like xa_store() can fail one of its
memory allocations and end up with ->head being non-NULL even though no
entries exist in the xarray.
Until this issue gets sorted out, work around it by attempting to
iterate entries in our xarray, and WARN_ON_ONCE() if one is found.
Reported-by: syzbot+cc36d44ec9f368e443d3@syzkaller.appspotmail.com
Link: https://lore.kernel.org/io-uring/673c1643.050a0220.87769.0066.GAE@google.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit ff123eb7741638d55abf82fac090bb3a543c1e74.
Allowing large pages for KASAN shadow mappings isn't inherently wrong,
but adding POPULATE_KASAN_MAP_SHADOW to large_allowed() exposes an issue
in can_large_pud() and can_large_pmd().
Since commit d8073dc6bc04 ("s390/mm: Allow large pages only for aligned
physical addresses"), both can_large_pud() and can_large_pmd() call _pa()
to check if large page physical addresses are aligned. However, _pa()
has a side effect: it allocates memory in POPULATE_KASAN_MAP_SHADOW
mode. This results in massive memory leaks.
The proper fix would be to address both large_allowed() and _pa()'s side
effects, but for now, revert this change to avoid the leaks.
Fixes: ff123eb77416 ("s390/mm: Allow large pages for KASAN shadow mapping")
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
A sigqueue belonging to a posix timer, which target is not a specific
thread but a whole thread group, is preferrably targeted to the current
task if it is part of that thread group.
However nothing prevents a posix timer event from queueing such a
sigqueue from a reaped yet running task. The interruptible code space
between exit_notify() and the final call to schedule() is enough for
posix_timer_fn() hrtimer to fire.
If that happens while the current task is part of the thread group
target, it is proposed to handle it but since its sighand pointer may
have been cleared already, the sigqueue is dropped even if there are
other tasks running within the group that could handle it.
As a result posix timers with thread group wide target may miss signals
when some of their threads are exiting.
Fix this with verifying that the current task hasn't been through
exit_notify() before proposing it as a preferred target so as to ensure
that its sighand is still here and stable.
complete_signal() might still reconsider the choice and find a better
target within the group if current has passed retarget_shared_pending()
already.
Fixes: bcb7ee79029d ("posix-timers: Prefer delivery of signals to the current thread")
Reported-by: Anthony Mallet <anthony.mallet@laas.fr>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241122234811.60455-1-frederic@kernel.org
Closes: https://lore.kernel.org/all/26411.57288.238690.681680@gargle.gargle.HOWL
|
|
Commit 157ce8f381ef ("i2c: Introduce OF component probe function") adds the
header file include/linux/i2c-of-prober.h and a corresponding file entry in
the newly added MAINTAINERS section I2C OF COMPONENT PROBER. This file
entry unfortunately has a typo.
Fortunately, ./scripts/get_maintainer.pl --self-test=patterns detects this
broken reference.
Fix the typo in this file entry in the I2C OF COMPONENT PROBER section.
Fixes: 157ce8f381ef ("i2c: Introduce OF component probe function")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
|