aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/drivers/nvme/host/core.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-05-26Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linuxLinus Torvalds1-10/+195
Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-20nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_diskNilay Shroff1-2/+2
In the NVMe context, the term "shutdown" has a specific technical meaning. To avoid confusion, this commit renames the nvme_mpath_ shutdown_disk function to nvme_mpath_remove_disk to better reflect its purpose (i.e. removing the disk from the system). However, nvme_mpath_remove_disk was already in use, and its functionality is related to releasing or putting the head node disk. To resolve this naming conflict and improve clarity, the existing nvme_mpath_ remove_disk function is also renamed to nvme_mpath_put_disk. This renaming improves code readability and better aligns function names with their actual roles. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme-multipath: introduce delayed removal of the multipath head nodeNilay Shroff1-3/+5
Currently, the multipath head node of an NVMe disk is removed immediately as soon as all paths of the disk are removed. However, this can cause issues in scenarios where: - The disk hot-removal followed by re-addition. - Transient PCIe link failures that trigger re-enumeration, temporarily removing and then restoring the disk. In these cases, removing the head node prematurely may lead to a head disk node name change upon re-addition, requiring applications to reopen their handles if they were performing I/O during the failure. To address this, introduce a delayed removal mechanism of head disk node. During transient failure, instead of immediate removal of head disk node, the system waits for a configurable timeout, allowing the disk to recover. During transient disk failure, if application sends any IO then we queue it instead of failing such IO immediately. If the disk comes back online within the timeout, the queued IOs are resubmitted to the disk ensuring seamless operation. In case disk couldn't recover from the failure then queued IOs are failed to its completion and application receives the error. So this way, if disk comes back online within the configured period, the head node remains unchanged, ensuring uninterrupted workloads without requiring applications to reopen device handles. A new sysfs attribute, named "delayed_removal_secs" is added under head disk blkdev for user who wish to configure time for the delayed removal of head disk node. The default value of this attribute is set to zero second ensuring no behavior change unless explicitly configured. Link: https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/ Link: https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/ Suggested-by: Keith Busch <kbusch@kernel.org> Suggested-by: Christoph Hellwig <hch@infradead.org> [nilay: reworked based on the original idea/POC from Christoph and Keith] Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-14nvme: all namespaces in a subsystem must adhere to a common atomic write sizeAlan Adamson1-3/+27
The first namespace configured in a subsystem sets the subsystem's atomic write size based on its AWUPF or NAWUPF. Subsequent namespaces must have an atomic write size (per their AWUPF or NAWUPF) less than or equal to the subsystem's atomic write size, or their probing will be rejected. Signed-off-by: Alan Adamson <alan.adamson@oracle.com> [hch: fold in review comments from John Garry] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-05-07block: remove the q argument from blk_rq_map_kernChristoph Hellwig1-1/+1
Remove the q argument from blk_rq_map_kern and the internal helpers called by it as the queue can trivially be derived from the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07nvme: unblock ctrl state transition for firmware updateDaniel Wagner1-1/+2
The original nvme subsystem design didn't have a CONNECTING state; the state machine allowed transitions from RESETTING to LIVE directly. With the introduction of nvme fabrics the CONNECTING state was introduce. Over time the nvme-pci started to use the CONNECTING state as well. Eventually, a bug fix for the nvme-fc started to depend that the only valid transition to LIVE was from CONNECTING. Though this change didn't update the firmware update handler which was still depending on RESETTING to LIVE transition. The simplest way to address it for the time being is to switch into CONNECTING state before going to LIVE state. Fixes: d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Daniel Wagner <wagi@kernel.org> Closes: https://lore.kernel.org/all/0134ea15-8d5f-41f7-9e9a-d7e6d82accaa@roeck-us.net Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Guenter Roeck <linux@roeck-us.net>
2025-05-06nvme: fix incorrect sizeofKanchan Joshi1-1/+1
The plid array, head->plids, is meant to store placement IDs, each of type u16. But its size has been incorrectly calculated, as the size of the pointer is being used instead of the size of the object it points to. Use the sizeof(*head->plids) in kcalloc so that we don't allocate extra. Fixes: 38e8397dde63 ("nvme: use fdp streams if write stream is provided") Reported-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: fix write_stream_granularity initializationCaleb Sander Mateos1-1/+1
write_stream_granularity is set to max(info->runs, U32_MAX), which means that any RUNS value less than 2 ** 32 becomes U32_MAX, and any larger value is silently truncated to an unsigned int. Use min() instead to provide the correct semantics, capping RUNS values at U32_MAX. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer") Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250506175413.1936110-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: use fdp streams if write stream is providedKeith Busch1-1/+30
Maps a user requested write stream to an FDP placement ID if possible. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-12-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: register fdp parameters with the block layerKeith Busch1-0/+144
Register the device data placement limits if supported. This is just registering the limits with the block layer. Nothing beyond reporting these attributes is happening in this patch. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-11-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: pass a void pointer to nvme_get/set_features for the resultChristoph Hellwig1-2/+2
That allows passing in structures instead of the u32 result, and thus reduce the amount of bit shifting and masking required to parse the result. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-9-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: add a nvme_get_log_lsi helperChristoph Hellwig1-2/+12
For log pages that need to pass in a LSI value, while at the same time not touching all the existing nvme_get_log callers. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-8-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-16nvme: fixup scan failure for non-ANA multipath controllersHannes Reinecke1-1/+1
Commit 62baf70c3274 caused the ANA log page to be re-read, even on controllers that do not support ANA. While this should generally harmless, some controllers hang on the unsupported log page and never finish probing. Fixes: 62baf70c3274 ("nvme: re-read ANA log page after ns scan completes") Signed-off-by: Hannes Reinecke <hare@kernel.org> Tested-by: Srikanth Aithal <sraithal@amd.com> [hch: more detailed commit message] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2025-04-08nvme: re-read ANA log page after ns scan completesHannes Reinecke1-0/+5
When scanning for new namespaces we might have missed an ANA AEN. The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchonous Event Information - Notice': Asymmetric Namespace Access Change) states: A controller shall not send this even if an Attached Namespace Attribute Changed asynchronous event [...] is sent for the same event. so we need to re-read the ANA log page after we rescanned the namespace list to update the ANA states of the new namespaces. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-07nvme: requeue namespace scan on missed AENsHannes Reinecke1-0/+4
Scanning for namespaces can take some time, so if the target is reconfigured while the scan is running we may miss a Attached Namespace Attribute Changed AEN. Check if the NVME_AER_NOTICE_NS_CHANGED bit is set once the scan has finished, and requeue scanning to pick up any missed change. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-03Merge tag 'block-6.15-20250403' of git://git.kernel.dk/linuxLinus Torvalds1-1/+1
Pull more block updates from Jens Axboe: - NVMe pull request via Keith: - PCI endpoint target cleanup (Damien) - Early import for uring_cmd fixed buffer (Caleb) - Multipath documentation and notification improvements (John) - Invalid pci sq doorbell write fix (Maurizio) - Queue init locking fix - Remove dead nsegs parameter from blk_mq_get_new_requests() * tag 'block-6.15-20250403' of git://git.kernel.dk/linux: block: don't grab elevator lock during queue initialization nvme-pci: skip nvme_write_sq_db on empty rqlist nvme-multipath: change the NVME_MULTIPATH config option nvme: update the multipath warning in nvme_init_ns_head nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io() nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request() nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer nvmet: pci-epf: Keep completion queues mapped block: remove unused nseg parameter
2025-04-01Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds1-4/+2
Pull non-MM updates from Andrew Morton: - The series "powerpc/crash: use generic crashkernel reservation" from Sourabh Jain changes powerpc's kexec code to use more of the generic layers. - The series "get_maintainer: report subsystem status separately" from Vlastimil Babka makes some long-requested improvements to the get_maintainer output. - The series "ucount: Simplify refcounting with rcuref_t" from Sebastian Siewior cleans up and optimizing the refcounting in the ucount code. - The series "reboot: support runtime configuration of emergency hw_protection action" from Ahmad Fatoum improves the ability for a driver to perform an emergency system shutdown or reboot. - The series "Converge on using secs_to_jiffies() part two" from Easwar Hariharan performs further migrations from msecs_to_jiffies() to secs_to_jiffies(). - The series "lib/interval_tree: add some test cases and cleanup" from Wei Yang permits more userspace testing of kernel library code, adds some more tests and performs some cleanups. - The series "hung_task: Dump the blocking task stacktrace" from Masami Hiramatsu arranges for the hung_task detector to dump the stack of the blocking task and not just that of the blocked task. - The series "resource: Split and use DEFINE_RES*() macros" from Andy Shevchenko provides some cleanups to the resource definition macros. - Plus the usual shower of singleton patches - please see the individual changelogs for details. * tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) mailmap: consolidate email addresses of Alexander Sverdlin fs/procfs: fix the comment above proc_pid_wchan() relay: use kasprintf() instead of fixed buffer formatting resource: replace open coded variant of DEFINE_RES() resource: replace open coded variants of DEFINE_RES_*_NAMED() resource: replace open coded variant of DEFINE_RES_NAMED_DESC() resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED() samples: add hung_task detector mutex blocking sample hung_task: show the blocker task if the task is hung on mutex kexec_core: accept unaccepted kexec segments' destination addresses watchdog/perf: optimize bytes copied and remove manual NUL-termination lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap() lib/interval_tree: skip the check before go to the right subtree lib/interval_tree: add test case for span iteration lib/interval_tree: add test case for interval_tree_iter_xxx() helpers lib/rbtree: add random seed lib/rbtree: split tests lib/rbtree: enable userland test suite for rbtree related data structure checkpatch: describe --min-conf-desc-length scripts/gdb/symbols: determine KASLR offset on s390 ...
2025-04-01nvme: update the multipath warning in nvme_init_ns_headJohn Meneghini1-1/+1
The new NVME_MULTIPATH_PARAM config option requires updates to the warning message in nvme_init_ns_head(). Signed-off-by: John Meneghini <jmeneghi@redhat.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-26Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linuxLinus Torvalds1-0/+3
Pull block updates from Jens Axboe: - Fixes for integrity handling - NVMe pull request via Keith: - Secure concatenation for TCP transport (Hannes) - Multipath sysfs visibility (Nilay) - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li) - Correct use of 64-bit BARs for pci-epf target (Niklas) - Socket fix for selinux when used in containers (Peijie) - MD pull request via Yu: - fix recovery can preempt resync (Li Nan) - fix md-bitmap IO limit (Su Yue) - fix raid10 discard with REQ_NOWAIT (Xiao Ni) - fix raid1 memory leak (Zheng Qixing) - fix mddev uaf (Yu Kuai) - fix raid1,raid10 IO flags (Yu Kuai) - some refactor and cleanup (Yu Kuai) - Series cleaning up and fixing bugs in the bad block handling code - Improve support for write failure simulation in null_blk - Various lock ordering fixes - Fixes for locking for debugfs attributes - Various ublk related fixes and improvements - Cleanups for blk-rq-qos wait handling - blk-throttle fixes - Fixes for loop dio and sync handling - Fixes and cleanups for the auto-PI code - Block side support for hardware encryption keys in blk-crypto - Various cleanups and fixes * tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits) nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi) nvme-tcp: fix selinux denied when calling sock_sendmsg nvmet: pci-epf: Always configure BAR0 as 64-bit nvmet: Remove duplicate uuid_copy nvme: zns: Simplify nvme_zone_parse_entry() nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls nvmet-fc: Remove unused functions nvme-pci: remove stale comment nvme-fc: Utilise min3() to simplify queue count calculation nvme-multipath: Add visibility for queue-depth io-policy nvme-multipath: Add visibility for numa io-policy nvme-multipath: Add visibility for round-robin io-policy nvmet: add tls_concat and tls_key debugfs entries nvmet-tcp: support secure channel concatenation nvmet: Add 'sq' argument to alloc_ctrl_args nvme-fabrics: reset admin connection for secure concatenation nvme-tcp: request secure channel concatenation nvme-keyring: add nvme_tls_psk_refresh() nvme: add nvme_auth_derive_tls_psk() nvme: add nvme_auth_generate_digest() ...
2025-03-20nvme-multipath: Add visibility for round-robin io-policyNilay Shroff1-0/+3
This patch helps add nvme native multipath visibility for round-robin io-policy. It creates a "multipath" sysfs directory under head gendisk device node directory and then from "multipath" directory it adds a link to each namespace path device the head node refers. For instance, if we have a shared namespace accessible from two different controllers/paths then we create a soft link to each path device from head disk node as shown below: $ ls -l /sys/block/nvme1n1/multipath/ nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1 nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1 In the above example, nvme1n1 is head gendisk node created for a shared namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1 paths. For round-robin I/O policy, we could easily infer from the above output that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1 and nvme1c3n1. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-17nvme: convert timeouts to secs_to_jiffies()Easwar Hariharan1-4/+2
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies() to avoid the multiplication This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies +secs_to_jiffies (E - * \( 1000 \| MSEC_PER_SEC \) ) Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-11-a43967e36c88@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Acked-by: Keith Busch <kbusch@kernel.org> Cc: Carlos Maiolino <cem@kernel.org> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dick Kennedy <dick.kennedy@broadcom.com> Cc: Dongsheng Yang <dongsheng.yang@easystack.cn> Cc: Fabio Estevam <festevam@gmail.com> Cc: Frank Li <frank.li@nxp.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: James Smart <james.smart@broadcom.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Brown <broonie@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Niklas Cassel <cassel@kernel.org> Cc: Oded Gabbay <ogabbay@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Sebastian Reichel <sre@kernel.org> Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com> Cc: Takashi Iwai <tiwai@suse.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-11nvme: move error logging from nvme_end_req() to __nvme_end_req()Shin'ichiro Kawasaki1-6/+6
Before the Commit 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions"), blk_mq_add_to_batch() did not add failed passthrough requests to batch, and returned false. After the commit, blk_mq_add_to_batch() always adds passthrough requests to batch regardless of whether the request failed or not, and returns true. This affected error logging feature in the NVME driver. Before the commit, the call chain of failed passthrough request was as follows: nvme_handle_cqe() blk_mq_add_to_batch() .. false is returned, then call nvme_pci_complete_rq() nvme_pci_complete_rq() nvme_complete_rq() nvme_end_req() nvme_log_err_passthru() .. error logging __nvme_end_req() .. end of the rqeuest After the commit, the call chain is as follows: nvme_handle_cqe() blk_mq_add_to_batch() .. true is returned, then set nvme_pci_complete_batch() .. nvme_pci_complete_batch() nvme_complete_batch() nvme_complete_batch_req() __nvme_end_req() .. end of the request, without error logging To make the error logging feature work again for passthrough requests, move the nvme_log_err_passthru() call from nvme_end_req() to __nvme_end_req(). While at it, move nvme_log_error() call for non-passthrough requests together with nvme_log_err_passthru(). Even though the trigger commit does not affect non-passthrough requests, move it together for code simplicity. Fixes: 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250311104359.1767728-2-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-20nvme: only allow entering LIVE from CONNECTING stateDaniel Wagner1-2/+0
The fabric transports and also the PCI transport are not entering the LIVE state from NEW or RESETTING. This makes the state machine more restrictive and allows to catch not supported state transitions, e.g. directly switching from RESETTING to LIVE. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-03Merge tag 'nvme-6.14-2025-01-31' of git://git.infradead.org/nvme into block-6.14Jens Axboe1-1/+7
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.14 - Connection fixes for fibre channel transport (Daniel) - Endian fixes (Keith, Christoph) - Cleanup fix for host memory buffer (Francis) - Platform specific power quirks (Georg) - Target memory leak (Sagi) - Use appropriate controller state accessor (Daniel)" * tag 'nvme-6.14-2025-01-31' of git://git.infradead.org/nvme: nvme-fc: use ctrl state getter nvme: make nvme_tls_attrs_group static nvmet: add a missing endianess conversion in nvmet_execute_admin_connect nvmet: the result field in nvmet_alloc_ctrl_args is little endian nvmet: fix a memory leak in controller identify nvme-fc: do not ignore connectivity loss during connecting nvme: handle connectivity loss in nvme_set_queue_count nvme-fc: go straight to connecting state when initializing nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk nvme-pci: remove redundant dma frees in hmb nvmet: fix rw control endian access
2025-01-31block: force noio scope in blk_mq_freeze_queueChristoph Hellwig1-7/+10
When block drivers or the core block code perform allocations with a frozen queue, this could try to recurse into the block device to reclaim memory and deadlock. Thus all allocations done by a process that froze a queue need to be done without __GFP_IO and __GFP_FS. Instead of tying to track all of them down, force a noio scope as part of freezing the queue. Note that nvme is a bit of a mess here due to the non-owner freezes, and they will be addressed separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23nvme: handle connectivity loss in nvme_set_queue_countDaniel Wagner1-1/+7
When the set feature attempts fails with any NVME status code set in nvme_set_queue_count, the function still report success. Though the numbers of queues set to 0. This is done to support controllers in degraded state (the admin queue is still up and running but no IO queues). Though there is an exception. When nvme_set_features reports an host path error, nvme_set_queue_count should propagate this error as the connectivity is lost, which means also the admin queue is not working anymore. Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-20Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linuxLinus Torvalds1-8/+13
Pull io_uring updates from Jens Axboe: "Not a lot in terms of features this time around, mostly just cleanups and code consolidation: - Support for PI meta data read/write via io_uring, with NVMe and SCSI covered - Cleanup the per-op structure caching, making it consistent across various command types - Consolidate the various user mapped features into a concept called regions, making the various users of that consistent - Various cleanups and fixes" * tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits) io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname io_uring: reuse io_should_terminate_tw() for cmds io_uring: Factor out a function to parse restrictions io_uring/rsrc: require cloned buffers to share accounting contexts io_uring: simplify the SQPOLL thread check when cancelling requests io_uring: expose read/write attribute capability io_uring/rw: don't gate retry on completion context io_uring/rw: handle -EAGAIN retry at IO completion time io_uring/rw: use io_rw_recycle() from cleanup path io_uring/rsrc: simplify the bvec iter count calculation io_uring: ensure io_queue_deferred() is out-of-line io_uring/rw: always clear ->bytes_done on io_async_rw setup io_uring/rw: use NULL for rw->free_iovec assigment io_uring/rw: don't mask in f_iocb_flags io_uring/msg_ring: Drop custom destructor io_uring: Move old async data allocation helper to header io_uring/rw: Allocate async data through helper io_uring/net: Allocate msghdr async data through helper io_uring/uring_cmd: Allocate async data through generic helper io_uring/poll: Allocate apoll with generic alloc_cache helper ...
2025-01-17block: Add common atomic writes enable flagJohn Garry1-0/+1
Currently only stacked devices need to explicitly enable atomic writes by setting BLK_FEAT_ATOMIC_WRITES_STACKED flag. This does not work well for device mapper stacking devices, as there many sets of limits are stacked and what is the 'bottom' and 'top' device can swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set for many queue limits, which is messy. Generalize enabling atomic writes enabling by ensuring that all devices must explicitly set a flag - that includes NVMe, SCSI sd, and md raid. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13nvme: fix bogus kzalloc() return check in nvme_init_effects_log()Jens Axboe1-1/+1
nvme_init_effects_log() returns failure when kzalloc() is successful, which is obviously wrong and causes failures to boot. Correct the check. Fixes: d4a95adeabc6 ("nvme: Add error path for xa_store in nvme_init_effects") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13Merge tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme into for-6.14/blockJens Axboe1-6/+28
Pull NVMe updates from Keith: "nvme updates for Linux 6.14 - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis)" * tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme: (25 commits) nvme-pci: use correct size to free the hmb buffer nvme: Add error path for xa_store in nvme_init_effects nvme-pci: fix comment typo Documentation: Document the NVMe PCI endpoint target driver nvmet: New NVMe PCI endpoint function target driver nvmet: Implement arbitration feature support nvmet: Implement interrupt config feature support nvmet: Implement interrupt coalescing feature support nvmet: Implement host identifier set feature support nvmet: Introduce get/set_feature controller operations nvmet: Do not require SGL for PCI target controller commands nvmet: Add support for I/O queue management admin commands nvmet: Introduce nvmet_sq_create() and nvmet_cq_create() nvmet: Introduce nvmet_req_transfer_len() nvmet: Improve nvmet_alloc_ctrl() interface and implementation nvme: Add PCI transport type nvmet: Add drvdata field to struct nvmet_ctrl nvmet: Introduce nvmet_get_cmd_effects_admin() nvmet: Export nvmet_update_cc() and nvmet_cc_xxx() helpers nvmet: Add vendor_id and subsys_vendor_id subsystem attributes ...
2025-01-12nvme: Add error path for xa_store in nvme_init_effectsKeisuke Nishimura1-4/+22
The xa_store() may fail due to memory allocation failure because there is no guarantee that the index NVME_CSI_NVM is already used. This fix introduces a new function to handle the error path. Fixes: cc115cbe12d9 ("nvme: always initialize known command effects") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10nvme: Add error check for xa_store in nvme_get_effects_logKeisuke Nishimura1-2/+6
The xa_store() may fail due to memory allocation failure because there is no guarantee that the index csi is already used. This fix adds an error check of the return value of xa_store() in nvme_get_effects_log(). Fixes: 1cf7a12e09aa ("nvme: use an xarray to lookup the Commands Supported and Effects log") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-10nvme: fix queue freeze vs limits lock orderChristoph Hellwig1-4/+5
Match the locking order used by the core block code by only freezing the queue after taking the limits lock. Unlike most queue updates this does not use the queue_limits_commit_update_frozen helper as the nvme driver want the queue frozen for more than just the limits update. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250110054726.1499538-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-06block: remove BLK_MQ_F_NO_SCHEDChristoph Hellwig1-1/+0
The only queues that really can't support a scheduler are those that do not have a gendisk associated with them, and thus can't be used for non-passthrough commands. In addition to those null_blk can optionally set the flag, which is a bad odd. Replace the null_blk usage with BLK_MQ_F_NO_SCHED_BY_DEFAULT to keep the expected semantics and then remove BLK_MQ_F_NO_SCHED as the non-disk queues never call into elevator_init_mq or blk_register_queue which adds the sysfs attributes. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250106083531.799976-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: remove BLK_MQ_F_SHOULD_MERGEChristoph Hellwig1-1/+0
BLK_MQ_F_SHOULD_MERGE is set for all tag_sets except those that purely process passthrough commands (bsg-lib, ufs tmf, various nvme admin queues) and thus don't even check the flag. Remove it to simplify the driver interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241219060214.1928848-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23nvme: add support for passing on the application tagKanchan Joshi1-0/+10
With user integrity buffer, there is a way to specify the app_tag. Set the corresponding protocol specific flags and send the app_tag down. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20241128112240.8867-9-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flagsAnuj Gupta1-8/+3
This patch introduces BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags which indicate how the hardware should check the integrity payload. BIP_CHECK_GUARD/REFTAG are conversion of existing semantics, while BIP_CHECK_APPTAG is a new flag. The driver can now just rely on block layer flags, and doesn't need to know the integrity source. Submitter of PI decides which tags to check. This would also give us a unified interface for user and kernel generated integrity. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20241128112240.8867-8-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-18nvme: use blk_validate_block_size() for max LBA checkLuis Chamberlain1-1/+1
The block layer already has support to validates proper block sizes with blk_validate_block_size(), we can leverage that as well. No functional changes. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241218020212.3657139-3-mcgrof@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-05Merge tag 'nvme-6.13-2024-12-05' of git://git.infradead.org/nvme into block-6.13Jens Axboe1-3/+5
Pull NVMe fixess from Keith: "nvme fixes for Linux 6.13 - Target fix using incorrect zero buffer (Nilay) - Device specifc deallocate quirk fixes (Christoph, Keith) - Fabrics fix for handling max command target bugs (Maurizio) - Cocci fix usage for kzalloc (Yu-Chen) - DMA size fix for host memory buffer feature (Christoph) - Fabrics queue cleanup fixes (Chunguang)" * tag 'nvme-6.13-2024-12-05' of git://git.infradead.org/nvme: nvme-tcp: simplify nvme_tcp_teardown_io_queues() nvme-tcp: no need to quiesce admin_q in nvme_tcp_teardown_io_queues() nvme-rdma: unquiesce admin_q before destroy it nvme-tcp: fix the memleak while create new ctrl failed nvme-pci: don't use dma_alloc_noncontiguous with 0 merge boundary nvmet: replace kmalloc + memset with kzalloc for data allocation nvme-fabrics: handle zero MAXCMD without closing the connection nvme-pci: remove two deallocate zeroes quirks nvme: don't apply NVME_QUIRK_DEALLOCATE_ZEROES when DSM is not supported nvmet: use kzalloc instead of ZERO_PAGE in nvme_execute_identify_ns_nvm()
2024-12-04nvme-fabrics: handle zero MAXCMD without closing the connectionMaurizio Lombardi1-2/+3
The NVMe specification states that MAXCMD is mandatory for NVMe-over-Fabrics implementations. However, some NVMe/TCP and NVMe/FC arrays from major vendors have buggy firmware that reports MAXCMD as zero in the Identify Controller data structure. Currently, the implementation closes the connection in such cases, completely preventing the host from connecting to the target. Fix the issue by printing a clear error message about the firmware bug and allowing the connection to proceed. It assumes that the target supports a MAXCMD value of SQSIZE + 1. If any issues arise, the user can manually adjust SQSIZE to mitigate them. Fixes: 4999568184e5 ("nvme-fabrics: check max outstanding commands") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-02module: Convert symbol namespace to string literalPeter Zijlstra1-7/+7
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-02nvme: don't apply NVME_QUIRK_DEALLOCATE_ZEROES when DSM is not supportedChristoph Hellwig1-1/+2
Commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard") started applying the NVME_QUIRK_DEALLOCATE_ZEROES quirk even then the Dataset Management is not supported. It turns out that there versions of these old Intel SSDs that have DSM support disabled in the firmware, which will now lead to errors everytime a Write Zeroes command is issued. Fix this by checking for DSM support before applying the quirk. Reported-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com> Fixes: 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard") Tested-by: Saeed Mirzamohammadi <saeed.mirzamohammadi@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-30Merge tag 'block-6.13-20242901' of git://git.kernel.dk/linuxLinus Torvalds1-7/+15
Pull more block updates from Jens Axboe: - NVMe pull request via Keith: - Use correct srcu list traversal (Breno) - Scatter-gather support for metadata (Keith) - Fabrics shutdown race condition fix (Nilay) - Persistent reservations updates (Guixin) - Add the required bits for MD atomic write support for raid0/1/10 - Correct return value for unknown opcode in ublk - Fix deadlock with zone revalidation - Fix for the io priority request vs bio cleanups - Use the correct unsigned int type for various limit helpers - Fix for a race in loop - Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make it easier for actual humans to read - Fix potential UAF when iterating tags - A few fixes for bfq-iosched UAF issues - Fix for brd discard not decrementing the allocated page count - Various little fixes and cleanups * tag 'block-6.13-20242901' of git://git.kernel.dk/linux: (36 commits) brd: decrease the number of allocated pages which discarded block, bfq: fix bfqq uaf in bfq_limit_depth() block: Don't allow an atomic write be truncated in blkdev_write_iter() mq-deadline: don't call req_get_ioprio from the I/O completion handler block: Prevent potential deadlock in blk_revalidate_disk_zones() block: Remove extra part pointer NULLify in blk_rq_init() nvme: tuning pr code by using defined structs and macros nvme: introduce change ptpl and iekey definition block: return bool from get_disk_ro and bdev_read_only block: remove a duplicate definition for bdev_read_only block: return bool from blk_rq_aligned block: return unsigned int from blk_lim_dma_alignment_and_pad block: return unsigned int from queue_dma_alignment block: return unsigned int from bdev_io_opt block: req->bio is always set in the merge code block: don't bother checking the data direction for merges block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()" md/raid10: Atomic write support md/raid1: Atomic write support ...
2024-11-19nvme-fabrics: fix kernel crash while shutting down controllerNilay Shroff1-0/+5
The nvme keep-alive operation, which executes at a periodic interval, could potentially sneak in while shutting down a fabric controller. This may lead to a race between the fabric controller admin queue destroy code path (invoked while shutting down controller) and hw/hctx queue dispatcher called from the nvme keep-alive async request queuing operation. This race could lead to the kernel crash shown below: Call Trace: autoremove_wake_function+0x0/0xbc (unreliable) __blk_mq_sched_dispatch_requests+0x114/0x24c blk_mq_sched_dispatch_requests+0x44/0x84 blk_mq_run_hw_queue+0x140/0x220 nvme_keep_alive_work+0xc8/0x19c [nvme_core] process_one_work+0x200/0x4e0 worker_thread+0x340/0x504 kthread+0x138/0x140 start_kernel_thread+0x14/0x18 While shutting down fabric controller, if nvme keep-alive request sneaks in then it would be flushed off. The nvme_keep_alive_end_io function is then invoked to handle the end of the keep-alive operation which decrements the admin->q_usage_counter and assuming this is the last/only request in the admin queue then the admin->q_usage_counter becomes zero. If that happens then blk-mq destroy queue operation (blk_mq_destroy_ queue()) which could be potentially running simultaneously on another cpu (as this is the controller shutdown code path) would forward progress and deletes the admin queue. So, now from this point onward we are not supposed to access the admin queue resources. However the issue here's that the nvme keep-alive thread running hw/hctx queue dispatch operation hasn't yet finished its work and so it could still potentially access the admin queue resource while the admin queue had been already deleted and that causes the above crash. The above kernel crash is regression caused due to changes implemented in commit a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()"). Ideally we should stop keep-alive before destroyin g the admin queue and freeing the admin tagset so that it wouldn't sneak in during the shutdown operation. However we removed the keep alive stop operation from the beginning of the controller shutdown code path in commit a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()") and added it under nvme_uninit_ctrl() which executes very late in the shutdown code path after the admin queue is destroyed and its tagset is removed. So this change created the possibility of keep-alive sneaking in and interfering with the shutdown operation and causing observed kernel crash. To fix the observed crash, we decided to move nvme_stop_keep_alive() from nvme_uninit_ctrl() to nvme_remove_admin_tag_set(). This change would ensure that we don't forward progress and delete the admin queue until the keep- alive operation is finished (if it's in-flight) or cancelled and that would help contain the race condition explained above and hence avoid the crash. Moving nvme_stop_keep_alive() to nvme_remove_admin_tag_set() instead of adding nvme_stop_keep_alive() to the beginning of the controller shutdown code path in nvme_stop_ctrl(), as was the case earlier before commit a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()"), would help save one callsite of nvme_stop_keep_alive(). Fixes: a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()") Link: https://lore.kernel.org/all/1a21f37b-0f2a-4745-8c56-4dc8628d3983@linux.ibm.com/ Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-19Revert "nvme: make keep-alive synchronous operation"Nilay Shroff1-7/+10
This reverts commit d06923670b5a5f609603d4a9fee4dec02d38de9c. It was realized that the fix implemented to contain the race condition among the keep alive task and the fabric shutdown code path in the commit d06923670b5ia ("nvme: make keep-alive synchronous operation") is not optimal. The reason being keep-alive runs under the workqueue and making it synchronous would waste a workqueue context. Furthermore, we later found that the above race condition is a regression caused due to the changes implemented in commit a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()"). So we decided to revert the commit d06923670b5a ("nvme: make keep-alive synchronous operation") and then fix the regression. Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/ Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-18Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linuxLinus Torvalds1-12/+26
Pull block updates from Jens Axboe: - NVMe updates via Keith: - Use uring_cmd helper (Pavel) - Host Memory Buffer allocation enhancements (Christoph) - Target persistent reservation support (Guixin) - Persistent reservation tracing (Guixen) - NVMe 2.1 specification support (Keith) - Rotational Meta Support (Matias, Wang, Keith) - Volatile cache detection enhancment (Guixen) - MD updates via Song: - Maintainers update - raid5 sync IO fix - Enhance handling of faulty and blocked devices - raid5-ppl atomic improvement - md-bitmap fix - Support for manually defining embedded partition tables - Zone append fixes and cleanups - Stop sending the queued requests in the plug list to the driver ->queue_rqs() handle in reverse order. - Zoned write plug cleanups - Cleanups disk stats tracking and add support for disk stats for passthrough IO - Add preparatory support for file system atomic writes - Add lockdep support for queue freezing. Already found a bunch of issues, and some fixes for that are in here. More will be coming. - Fix race between queue stopping/quiescing and IO queueing - ublk recovery improvements - Fix ublk mmap for 64k pages - Various fixes and cleanups * tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits) MAINTAINERS: Update git tree for mdraid subsystem block: make struct rq_list available for !CONFIG_BLOCK block/genhd: use seq_put_decimal_ull for diskstats decimal values block: don't reorder requests in blk_mq_add_to_batch block: don't reorder requests in blk_add_rq_to_plug block: add a rq_list type block: remove rq_list_move virtio_blk: reverse request order in virtio_queue_rqs nvme-pci: reverse request order in nvme_queue_rqs btrfs: validate queue limits block: export blk_validate_limits nvmet: add tracing of reservation commands nvme: parse reservation commands's action and rtype to string nvmet: report ns's vwc not present md/raid5: Increase r5conf.cache_name size block: remove the ioprio field from struct request block: remove the write_hint field from struct request nvme: check ns's volatile write cache not present nvme: add rotational support nvme: use command set independent id ns if available ...
2024-11-13Merge tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme into for-6.13/blockJens Axboe1-10/+19
Pull NVMe updates from Keith: "nvme updates for Linux 6.13 - Use uring_cmd helper (Pavel) - Host Memory Buffer allocation enhancements (Christoph) - Target persistent reservation support (Guixin) - Persistent reservation tracing (Guixen) - NVMe 2.1 specification support (Keith) - Rotational Meta Support (Matias, Wang, Keith) - Volatile cache detection enhancment (Guixen)" * tag 'nvme-6.13-2024-11-13' of git://git.infradead.org/nvme: (22 commits) nvmet: add tracing of reservation commands nvme: parse reservation commands's action and rtype to string nvmet: report ns's vwc not present nvme: check ns's volatile write cache not present nvme: add rotational support nvme: use command set independent id ns if available nvmet: support for csi identify ns nvmet: implement rotational media information log nvmet: implement endurance groups nvmet: declare 2.1 version compliance nvmet: implement crto property nvmet: implement supported features log nvmet: implement supported log pages nvmet: implement active command set ns list nvmet: implement id ns for nvm command set nvmet: support reservation feature nvme: add reservation command's defines nvme-core: remove repeated wq flags nvmet: make nvmet_wq visible in sysfs nvme-pci: use dma_alloc_noncontigous if possible ...
2024-11-11nvme: check ns's volatile write cache not presentGuixin Liu1-1/+3
When the VWC of a namespace does not exist, the BLK_FEAT_WRITE_CACHE flag should not be set when registering the block device, regardless of whether the controller supports VWC. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvme: add rotational supportWang Yugui1-0/+6
Rotational devices, such as hard-drives, can be detected using the rotational bit in the namespace independent identify namespace data structure. Make the bit visible to the block layer through the rotational queue setting. Signed-off-by: Wang Yugui <wangyugui@e16-tech.com> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvme: use command set independent id ns if availableMatias Bjørling1-3/+4
The NVMe 2.0 specification adds an independent identify namespace data structure that contains generic attributes that apply to all namespace types. Some attributes carry over from the NVM command set identify namespace data structure, and others are new. Currently, the data structure only considered when CRIMS is enabled or when the namespace type is key-value. However, the independent namespace data structure is mandatory for devices that implement features from the 2.0+ specification. Therefore, we can check this data structure first. If unavailable, retrieve the generic attributes from the NVM command set identify namespace data structure. Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>