aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/nvme (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-10-23nvme/tcp: handle tls partially sent records in write_space()Wilfred Mallawa1-0/+3
[ Upstream commit 5a869d017793399fd1d2609ff27e900534173eb3 ] With TLS enabled, records that are encrypted and appended to TLS TX list can fail to see a retry if the underlying TCP socket is busy, for example, hitting an EAGAIN from tcp_sendmsg_locked(). This is not known to the NVMe TCP driver, as the TLS layer successfully generated a record. Typically, the TLS write_space() callback would ensure such records are retried, but in the NVMe TCP Host driver, write_space() invokes nvme_tcp_write_space(). This causes a partially sent record in the TLS TX list to timeout after not being retried. This patch fixes the above by calling queue->write_space(), which calls into the TLS layer to retry any pending records. Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23nvme-auth: update sc_c in host responseMartin George1-1/+5
[ Upstream commit 7e091add9c433bab6912228799bf508e2414acc3 ] The sc_c field is currently not updated in the host response to the controller challenge leading to failures while attempting secure channel concatenation. Fix this by adding a new sc_c variable to the dhchap queue context structure which is appropriately set during negotiate and then used in the host response. Fixes: e88a7595b57f ("nvme-tcp: request secure channel concatenation") Signed-off-by: Martin George <marting@netapp.com> Signed-off-by: Prashanth Adurthi <prashana@netapp.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23nvme-multipath: Skip nr_active increments in RETRY dispositionAmit Chaudhary1-2/+4
[ Upstream commit bb642e2d300ee27dcede65cda7ffc47a7047bd69 ] For queue-depth I/O policy, this patch fixes unbalanced I/Os across nvme multipaths. Issue Description: The RETRY disposition incorrectly increments ns->ctrl->nr_active counter and reinitializes iostat start-time. In such cases nr_active counter never goes back to zero until that path disconnects and reconnects. Such a path is not chosen for new I/Os if multiple RETRY cases on a given a path cause its queue-depth counter to be artificially higher compared to other paths. This leads to unbalanced I/Os across paths. The patch skips incrementing nr_active if NVME_MPATH_CNT_ACTIVE is already set. And it skips restarting io stats if NVME_MPATH_IO_STATS is already set. base-commit: e989a3da2d371a4b6597ee8dee5c72e407b4db7a Fixes: d4d957b53d91eeb ("nvme-multipath: support io stats on the mpath device") Signed-off-by: Amit Chaudhary <achaudhary@purestorage.com> Reviewed-by: Randy Jennings <randyj@purestorage.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-19nvme-pci: Add TUXEDO IBS Gen8 to Samsung sleep quirkGeorg Gottleuber1-0/+2
commit eeaed48980a7aeb0d3d8b438185d4b5a66154ff9 upstream. On the TUXEDO InfinityBook S Gen8, a Samsung 990 Evo NVMe leads to a high power consumption in s2idle sleep (3.5 watts). This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with a lower power consumption, typically around 1 watts. Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com> Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Cc: stable@vger.kernel.org Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-15nvme-tcp: send only permitted commands for secure concatMartin George1-0/+3
[ Upstream commit df4666a4908a6d883f628f93a3e6c80981332035 ] In addition to sending permitted commands such as connect/auth over the initial unencrypted admin connection as part of secure channel concatenation, the host also sends commands such as Property Get and Identify on the same. This is a spec violation leading to secure concat failures. Fix this by ensuring these additional commands are avoided on this connection. Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation") Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15nvmet-fcloop: call done callback even when remote port is goneDaniel Wagner1-3/+5
[ Upstream commit 10c165af35d225eb033f4edc7fcc699a8d2d533d ] When the target port is gone, it's not possible to access any of the request resources. The function should just silently drop the response. The comment is misleading in this regard. Though it's still necessary to call the driver via the ->done callback so the driver is able to release all resources. Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs-OBA0WMt5f7R0dz+rR4HcEz19YLhnyGsj-MRV3jWDsPg@mail.gmail.com/ Fixes: 84eedced1c5b ("nvmet-fcloop: drop response if targetport is gone") Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15nvmet-fc: move lsop put work to nvmet_fc_ls_req_opDaniel Wagner1-10/+9
[ Upstream commit db5a5406fb7e5337a074385c7a3e53c77f2c1bd3 ] It’s possible for more than one async command to be in flight from __nvmet_fc_send_ls_req. For each command, a tgtport reference is taken. In the current code, only one put work item is queued at a time, which results in a leaked reference. To fix this, move the work item to the nvmet_fc_ls_req_op struct, which already tracks all resources related to the command. Fixes: 710c69dbaccd ("nvmet-fc: avoid deadlock on delete association path") Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-15nvme-auth: update bi_directional flagMartin George1-2/+3
[ Upstream commit 6ff1bd7846680dfdaafc68d7fcd0ab7e3bcbc4a0 ] While setting chap->s2 to zero as part of secure channel concatenation, the host missed out to disable the bi_directional flag to indicate that controller authentication is not requested. Fix the same. Fixes: e88a7595b57f ("nvme-tcp: request secure channel concatenation") Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-02nvme: fix PI insert on writeChristoph Hellwig1-7/+11
I recently ran into an issue where the PI generated using the block layer integrity code differs from that from a kernel using the PRACT fallback when the block layer integrity code is disabled, and I tracked this down to us using PRACT incorrectly. The NVM Command Set Specification (section 5.33 in 1.2, similar in older versions) specifies the PRACT insert behavior as: Inserted protection information consists of the computed CRC for the protection information format (refer to section 5.3.1) in the Guard field, the LBAT field value in the Application Tag field, the LBST field value in the Storage Tag field, if defined, and the computed reference tag in the Logical Block Reference Tag. Where the computed reference tag is defined as following for type 1 and type 2 using the text below that is duplicated in the respective bullet points: the value of the computed reference tag for the first logical block of the command is the value contained in the Initial Logical Block Reference Tag (ILBRT) or Expected Initial Logical Block Reference Tag (EILBRT) field in the command, and the computed reference tag is incremented for each subsequent logical block. So we need to set ILBRT field, but we currently don't. Interestingly this works fine on my older type 1 formatted SSD, but Qemu trips up on this. We already set ILBRT for Write Same since commit aeb7bb061be5 ("nvme: set the PRACT bit when using Write Zeroes with T10 PI"). To ease this, move the PI type check into nvme_set_ref_tag. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-08-07nvmet: exit debugfs after discovery subsystem exitsMohamed Khalfella1-1/+1
Commit 528589947c180 ("nvmet: initialize discovery subsys after debugfs is initialized") changed nvmet_init() to initialize nvme discovery after "nvmet" debugfs directory is initialized. The change broke nvmet_exit() because discovery subsystem now depends on debugfs. Debugfs should be destroyed after discovery subsystem. Fix nvmet_exit() to do that. Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs96AfFQpyDKF_MdfJsnOEo=2V7dQgqjFv+k3t7H-=yGhA@mail.gmail.com/ Fixes: 528589947c180 ("nvmet: initialize discovery subsys after debugfs is initialized") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Link: https://lore.kernel.org/r/20250807053507.2794335-1-mkhalfella@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-31nvme: fix various comment typosBjorn Helgaas4-9/+9
Fix typos in comments. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-31nvme-auth: remove unneeded semicolonJiapeng Chong1-2/+2
No functional modification involved. ./drivers/nvme/host/auth.c:745:2-3: Unneeded semicolon. ./drivers/nvme/host/auth.c:755:2-3: Unneeded semicolon. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=22937 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-31nvme-pci: fix leak on sgl setup errorKeith Busch1-1/+1
We need to free the descriptor that was allocated. We also don't necessarily need to unmap each sgl entry, which was previously being attempted unconditionally. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-31nvmet: initialize discovery subsys after debugfs is initializedMohamed Khalfella1-6/+6
During nvme target initialization discovery subsystem is initialized before "nvmet" debugfs directory is created. This results in discovery subsystem debugfs directory to be created in debugfs root directory. nvmet_init() -> nvmet_init_discovery() -> nvmet_subsys_alloc() -> nvmet_debugfs_subsys_setup() In other words, the codepath above is exeucted before nvmet_debugfs is created. We get /sys/kernel/debug/nqn.2014-08.org.nvmexpress.discovery instead of /sys/kernel/debug/nvmet/nqn.2014-08.org.nvmexpress.discovery. Move nvmet_init_discovery() call after nvmet_init_debugfs() to fix it. Fixes: 649fd41420a8 ("nvmet: add debugfs support") Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-31nvme: add capability to connect to an administrative controllerKamaljit Singh1-0/+16
Add capability to connect to an administrative controller by preventing ioq creation for admin-controllers. Add a nvme_admin_ctrl() to check if a controller's CNTRLTYPE indicates that it is an administrative controller and override ctrl->queue_count to 1 for admin controllers, so that only the admin queue and no I/O queues are created for an administrative controller. This override is done in nvme_init_ctrl_finish() after ctrl->cntrltype has been initialized in nvme_init_identify() so nvme_admin_ctrl() will work correctly. Doing this override in generic code (nvme_init_ctrl_finish) makes it transport agnostic and will work properly for nvme/tcp as well as for nvme/rdma. Suggested-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Kamaljit Singh <kamaljit.singh1@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-31nvmet: add support for FDP in fabrics passthru pathNitesh Shetty1-0/+2
Add support for admin_get_feature FDP(0x1d) feature id, thus enabling FDP at the initiator side for the target controller and namespaces attached to it. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-28Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linuxLinus Torvalds12-285/+423
Pull block updates from Jens Axboe: - MD pull request via Yu: - call del_gendisk synchronously (Xiao) - cleanup unused variable (John) - cleanup workqueue flags (Ryo) - fix faulty rdev can't be removed during resync (Qixing) - NVMe pull request via Christoph: - try PCIe function level reset on init failure (Keith Busch) - log TLS handshake failures at error level (Maurizio Lombardi) - pci-epf: do not complete commands twice if nvmet_req_init() fails (Rick Wertenbroek) - misc cleanups (Alok Tiwari) - Removal of the pktcdvd driver This has been more than a decade coming at this point, and some recently revealed breakages that had it causing issues even for cases where it isn't required made me re-pull the trigger on this one. It's known broken and nobody has stepped up to maintain the code - Series for ublk supporting batch commands, enabling the use of multishot where appropriate - Speed up ublk exit handling - Fix for the two-stage elevator fixing which could leak data - Convert NVMe to use the new IOVA based API - Increase default max transfer size to something more reasonable - Series fixing write operations on zoned DM devices - Add tracepoints for zoned block device operations - Prep series working towards improving blk-mq queue management in the presence of isolated CPUs - Don't allow updating of the block size of a loop device that is currently under exclusively ownership/open - Set chunk sectors from stacked device stripe size and use it for the atomic write size limit - Switch to folios in bcache read_super() - Fix for CD-ROM MRW exit flush handling - Various tweaks, fixes, and cleanups * tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits) block: restore two stage elevator switch while running nr_hw_queue update cdrom: Call cdrom_mrw_exit from cdrom_release function sunvdc: Balance device refcount in vdc_port_mpgroup_check nvme-pci: try function level reset on init failure dm: split write BIOs on zone boundaries when zone append is not emulated block: use chunk_sectors when evaluating stacked atomic write limits dm-stripe: limit chunk_sectors to the stripe size md/raid10: set chunk_sectors limit md/raid0: set chunk_sectors limit block: sanitize chunk_sectors for atomic write limits ilog2: add max_pow_of_two_factor() nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails nvme-tcp: log TLS handshake failures at error level docs: nvme: fix grammar in nvme-pci-endpoint-target.rst nvme: fix typo in status code constant for self-test in progress nvmet: remove redundant assignment of error code in nvmet_ns_enable() nvme: fix incorrect variable in io cqes error message nvme: fix multiple spelling and grammar issues in host drivers block: fix blk_zone_append_update_request_bio() kernel-doc md/raid10: fix set but not used variable in sync_request_write() ...
2025-07-28Merge tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds2-3/+6
Pull vfs 'protection info' updates from Christian Brauner: "This adds the new FS_IOC_GETLBMD_CAP ioctl() to query metadata and protection info (PI) capabilities. This ioctl returns information about the files integrity profile. This is useful for userspace applications to understand a files end-to-end data protection support and configure the I/O accordingly. For now this interface is only supported by block devices. However the design and placement of this ioctl in generic FS ioctl space allows us to extend it to work over files as well. This maybe useful when filesystems start supporting PI-aware layouts. A new structure struct logical_block_metadata_cap is introduced, which contains the following fields: - lbmd_flags: bitmask of logical block metadata capability flags - lbmd_interval: the amount of data described by each unit of logical block metadata - lbmd_size: size in bytes of the logical block metadata associated with each interval - lbmd_opaque_size: size in bytes of the opaque block tag associated with each interval - lbmd_opaque_offset: offset in bytes of the opaque block tag within the logical block metadata - lbmd_pi_size: size in bytes of the T10 PI tuple associated with each interval - lbmd_pi_offset: offset in bytes of T10 PI tuple within the logical block metadata - lbmd_pi_guard_tag_type: T10 PI guard tag type - lbmd_pi_app_tag_size: size in bytes of the T10 PI application tag - lbmd_pi_ref_tag_size: size in bytes of the T10 PI reference tag - lbmd_pi_storage_tag_size: size in bytes of the T10 PI storage tag The internal logic to fetch the capability is encapsulated in a helper function blk_get_meta_cap(), which uses the blk_integrity profile associated with the device. The ioctl returns -EOPNOTSUPP, if CONFIG_BLK_DEV_INTEGRITY is not enabled" * tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl() fs: add ioctl to query metadata and protection info capabilities nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE block: introduce pi_tuple_size field in blk_integrity block: rename tuple_size field in blk_integrity to metadata_size
2025-07-28Merge tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds2-9/+15
Pull fallocate updates from Christian Brauner: "fallocate() currently supports creating preallocated files efficiently. However, on most filesystems fallocate() will preallocate blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified. The extent state must later be converted to a written state when the user writes data into this range, which can trigger numerous metadata changes and journal I/O. This may leads to significant write amplification and performance degradation in synchronous write mode. At the moment, the only method to avoid this is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth. Now that more and more flash-based storage devices are available it is possible to efficiently write zeros to SSDs using the unmap write zeroes command if the devices do not write physical zeroes to the media. For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command does not write actual data to the device, instead, NVMe converts the zeroed range to a deallocated state, which works fast and consumes almost no disk write bandwidth. This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices. fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to the file mapping metadata. This flag is beneficial for subsequent pure overwriting within this range, as it can save on block allocation and, consequently, significant metadata changes" * tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: add FALLOC_FL_WRITE_ZEROES support block: add FALLOC_FL_WRITE_ZEROES support block: factor out common part in blkdev_fallocate() fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate dm: clear unmap write zeroes limits when disabling write zeroes scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP nvmet: set WZDS and DRB if device enables unmap write zeroes operation nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
2025-07-18Merge tag 'block-6.16-20250718' of git://git.kernel.dk/linuxLinus Torvalds2-18/+13
Pull block fixes from Jens Axboe: - NVMe changes via Christoph: - revert the cross-controller atomic write size validation that caused regressions (Christoph Hellwig) - fix endianness of command word printout in nvme_log_err_passthru() (John Garry) - fix callback lock for TLS handshake (Maurizio Lombardi) - fix misaccounting of nvme-mpath inflight I/O (Yu Kuai) - fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list() (Zheng Qixing) - Fix for a kobject leak in queue unregistration - Fix for loop async file write start/end handling * tag 'block-6.16-20250718' of git://git.kernel.dk/linux: loop: use kiocb helpers to fix lockdep warning nvmet-tcp: fix callback lock for TLS handshake nvme: fix misaccounting of nvme-mpath inflight I/O nvme: revert the cross-controller atomic write size validation nvme: fix endianness of command word prints in nvme_log_err_passthru() nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list() block: fix kobject leak in blk_unregister_queue
2025-07-17nvme-pci: try function level reset on init failureKeith Busch1-2/+22
NVMe devices from multiple vendors appear to get stuck in a reset state that we can't get out of with an NVMe level Controller Reset. The kernel would report these with messages that look like: Device not ready; aborting reset, CSTS=0x1 These have historically required a power cycle to make them usable again, but in many cases, a PCIe FLR is sufficient to restart operation without a power cycle. Try it if the initial controller reset fails during any nvme reset attempt. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-17nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() failsRick Wertenbroek1-7/+16
Have nvmet_req_init() and req->execute() complete failed commands. Description of the problem: nvmet_req_init() calls __nvmet_req_complete() internally upon failure, e.g., unsupported opcode, which calls the "queue_response" callback, this results in nvmet_pci_epf_queue_response() being called, which will call nvmet_pci_epf_complete_iod() if data_len is 0 or if dma_dir is different from DMA_TO_DEVICE. This results in a double completion as nvmet_pci_epf_exec_iod_work() also calls nvmet_pci_epf_complete_iod() when nvmet_req_init() fails. Steps to reproduce: On the host send a command with an unsupported opcode with nvme-cli, For example the admin command "security receive" $ sudo nvme security-recv /dev/nvme0n1 -n1 -x4096 This triggers a double completion as nvmet_req_init() fails and nvmet_pci_epf_queue_response() is called, here iod->dma_dir is still in the default state of "DMA_NONE" as set by default in nvmet_pci_epf_alloc_iod(), so nvmet_pci_epf_complete_iod() is called. Because nvmet_req_init() failed nvmet_pci_epf_complete_iod() is also called in nvmet_pci_epf_exec_iod_work() leading to a double completion. This not only sends two completions to the host but also corrupts the state of the PCI NVMe target leading to kernel oops. This patch lets nvmet_req_init() and req->execute() complete all failed commands, and removes the double completion case in nvmet_pci_epf_exec_iod_work() therefore fixing the edge cases where double completions occurred. Fixes: 0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver") Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-17nvme-tcp: log TLS handshake failures at error levelMaurizio Lombardi1-3/+8
Update the nvme_tcp_start_tls() function to use dev_err() instead of dev_dbg() when a TLS error is detected. This ensures that handshake failures are visible by default, aiding in debugging. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-17nvme: fix typo in status code constant for self-test in progressAlok Tiwari1-2/+2
Correct a typo error in the NVMe status code constant from NVME_SC_SELT_TEST_IN_PROGRESS to NVME_SC_SELF_TEST_IN_PROGRESS to accurately reflect its meaning. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-17nvmet: remove redundant assignment of error code in nvmet_ns_enable()Alok Tiwari1-2/+0
Remove the unnecessary ret = -EMFILE; assignment since it is immediately overwritten by the result of nvmet_bdev_ns_enable() The initial value (-EMFILE) is redundant because it has no effect on the code logic or outcome. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-17nvme: fix incorrect variable in io cqes error messageAlok Tiwari1-1/+1
Correct the error log to print ctrl->io_cqes instead of incorrectly using ctrl->io_sqes for the io cqes size check. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-17nvme: fix multiple spelling and grammar issues in host driversAlok Tiwari8-14/+14
This commit fixes several typos and grammatical issues across various nvme host driver files: - correct "glace" to "glance" in a comment in apple.c - fix "Idependent" to "Independent" in core.c - change "unsucceesful" to "unsuccessful", "they blk-mq" to "the blk-mq", - fix "terminaed" to "terminated" and other grammar in fc.c - update "O's" to "0's" to clarify meaning in nvme.h - fix a function name reference in a comment in zns.c: *_transter_len() -> *_transfer_len(). - fix sysfs_emit() output format in pci.c (replace x%08x with 0x%08x) These changes improve the code readability and documentation consistency across the NVMe driver. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-15nvmet-tcp: fix callback lock for TLS handshakeMaurizio Lombardi1-2/+2
When restoring the default socket callbacks during a TLS handshake, we need to acquire a write lock on sk_callback_lock. Previously, a read lock was used, which is insufficient for modifying sk_user_data and sk_data_ready. Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-15nvme: fix misaccounting of nvme-mpath inflight I/OYu Kuai1-0/+4
Procedures for nvme-mpath IO accounting: 1) initialize nvme_request and clear flags; 2) set NVME_MPATH_IO_STATS and increase inflight counter when IO started; 3) check NVME_MPATH_IO_STATS and decrease inflight counter when IO is done; However, for the case nvme_fail_nonready_command(), both step 1) and 2) are skipped, and if old nvme_request set NVME_MPATH_IO_STATS and then request is reused, step 3) will still be executed, causing inflight I/O counter to be negative. Fix the problem by clearing nvme_request in nvme_fail_nonready_command(). Fixes: ea5e5f42cd2c ("nvme-fabrics: avoid double completions in nvmf_fail_nonready_command") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs_+dauobyYyP805t33WMJVzOWj=7+51p4_j9rA63D9sog@mail.gmail.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-15nvme: revert the cross-controller atomic write size validationChristoph Hellwig1-9/+0
This was originally added by commit 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") to check the all controllers in a subsystem report the same atomic write size, but the check wasn't quite correct and caused problems for devices with multiple namespaces that report different LBA sizes. Commit f46d273449ba ("nvme: fix atomic write size validation") tried to fix this, but then caused problems for namespace rediscovery after a format with an LBA size change that changes the AWUPF value. This drops the validation and essentially reverts those two commits while keeping the cleanup that went in between the two. We'll need to figure out how to properly check for the mouse trap that nvme left us, but for now revert the check to keep devices working for users who couldn't care less about the atomic write feature. Fixes: 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") Fixes: f46d273449ba ("nvme: fix atomic write size validation") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Alan Adamson <alan.adamson@oracle.com>
2025-07-14nvme: fix endianness of command word prints in nvme_log_err_passthru()John Garry1-6/+6
The command word members of struct nvme_common_command are __le32 type, so use helper le32_to_cpu() to read them properly. Fixes: 9f079dda1433 ("nvme: allow passthru cmd error logging") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-14nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list()Zheng Qixing1-1/+1
When inserting a namespace into the controller's namespace list, the function uses list_add_rcu() when the namespace is inserted in the middle of the list, but falls back to a regular list_add() when adding at the head of the list. This inconsistency could lead to race conditions during concurrent access, as users might observe a partially updated list. Fix this by consistently using list_add_rcu() in both code paths to ensure proper RCU protection throughout the entire function. Fixes: be647e2c76b2 ("nvme: use srcu for iterating namespace list") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-11nvme-pci: don't allocate dma_vec for IOVA mappingsChristoph Hellwig1-2/+2
Not only do IOVA mappings no need the separate dma_vec tracking, it also won't free it and thus leak the allocations. Fixes: b8b7570a7ec8 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping") Reported-by: Klara Modin <klarasmodin@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Klara Modin <klarasmodin@gmail.com> Link: https://lore.kernel.org/r/20250711112250.633269-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-08nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mappingChristoph Hellwig1-52/+62
The current version of the blk_rq_dma_map support in nvme-pci tries to reconstruct the DMA mappings from the on the wire descriptors if they are needed for unmapping. While this is not the case for the direct mapping fast path and the IOVA path, it is needed for the non-IOVA slow path, e.g. when using the interconnect is not dma coherent, when using swiotlb bounce buffering, or a IOMMU mapping that can't coalesce. While the reconstruction is easy and works fine for the SGL path, where the on the wire representation maps 1:1 to DMA mappings, the code to reconstruct the DMA mapping ranges from PRPs can't always work, as a given PRP layout can come from different DMA mappings, and the current code doesn't even always get that right. Give up on this approach and track the actual DMA mapping when actually needed again. Fixes: 7ce3c1dd78fc ("nvme-pci: convert the data mapping to blk_rq_dma_map") Reported-by: Ben Copeland <ben.copeland@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Tested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250707125223.3022531-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-04Merge tag 'block-6.16-20250704' of git://git.kernel.dk/linuxLinus Torvalds4-6/+28
Pull block fixes from Jens Axboe: - NVMe fixes via Christoph: - fix incorrect cdw15 value in passthru error logging (Alok Tiwari) - fix memory leak of bio integrity in nvmet (Dmitry Bogdanov) - refresh visible attrs after being checked (Eugen Hristev) - fix suspicious RCU usage warning in the multipath code (Geliang Tang) - correctly account for namespace head reference counter (Nilay Shroff) - Fix for a regression introduced in ublk in this cycle, where it would attempt to queue a canceled request. - brd RCU sleeping fix, also introduced in this cycle. Bare bones fix, should be improved upon for the next release. * tag 'block-6.16-20250704' of git://git.kernel.dk/linux: brd: fix sleeping function called from invalid context in brd_insert_page() ublk: don't queue request if the associated uring_cmd is canceled nvme-multipath: fix suspicious RCU usage warning nvme-pci: refresh visible attrs after being checked nvmet: fix memory leak of bio integrity nvme: correctly account for namespace head reference counter nvme: Fix incorrect cdw15 value in passthru error logging
2025-07-01nvme-pci: use block layer helpers to calculate num of queuesDaniel Wagner1-2/+3
The calculation of the upper limit for queues does not depend solely on the number of possible CPUs; for example, the isolcpus kernel command-line option must also be considered. To account for this, the block layer provides a helper function to retrieve the maximum number of queues. Use it to set an appropriate upper queue number limit. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-3-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONEAnuj Gupta1-2/+3
protection information is treated as opaque when checksum type is BLK_INTEGRITY_CSUM_NONE. In order to maintain the right metadata semantics, set pi_offset only in cases where checksum type is not BLK_INTEGRITY_CSUM_NONE. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-4-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01block: introduce pi_tuple_size field in blk_integrityAnuj Gupta1-0/+2
Introduce a new pi_tuple_size field in struct blk_integrity to explicitly represent the size (in bytes) of the protection information (PI) tuple. This is a prep patch. Add validation in blk_validate_integrity_limits() to ensure that pi size matches the expected size for known checksum types and never exceeds the pi_tuple_size. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-3-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01block: rename tuple_size field in blk_integrity to metadata_sizeAnuj Gupta2-2/+2
The tuple_size field in blk_integrity currently represents the total size of metadata associated with each data interval. To make the meaning more explicit, rename tuple_size to metadata_size. This is a purely mechanical rename with no functional changes. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-2-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01nvme-multipath: fix suspicious RCU usage warningGeliang Tang1-1/+2
When I run the NVME over TCP test in virtme-ng, I get the following "suspicious RCU usage" warning in nvme_mpath_add_sysfs_link(): ''' [ 5.024557][ T44] nvmet: Created nvm controller 1 for subsystem nqn.2025-06.org.nvmexpress.mptcp for NQN nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77. [ 5.027401][ T183] nvme nvme0: creating 2 I/O queues. [ 5.029017][ T183] nvme nvme0: mapped 2/0/0 default/read/poll queues. [ 5.032587][ T183] nvme nvme0: new ctrl: NQN "nqn.2025-06.org.nvmexpress.mptcp", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77 [ 5.042214][ T25] [ 5.042440][ T25] ============================= [ 5.042579][ T25] WARNING: suspicious RCU usage [ 5.042705][ T25] 6.16.0-rc3+ #23 Not tainted [ 5.042812][ T25] ----------------------------- [ 5.042934][ T25] drivers/nvme/host/multipath.c:1203 RCU-list traversed in non-reader section!! [ 5.043111][ T25] [ 5.043111][ T25] other info that might help us debug this: [ 5.043111][ T25] [ 5.043341][ T25] [ 5.043341][ T25] rcu_scheduler_active = 2, debug_locks = 1 [ 5.043502][ T25] 3 locks held by kworker/u9:0/25: [ 5.043615][ T25] #0: ffff888008730948 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x7ed/0x1350 [ 5.043830][ T25] #1: ffffc900001afd40 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0xcf3/0x1350 [ 5.044084][ T25] #2: ffff888013ee0020 (&head->srcu){.+.+}-{0:0}, at: nvme_mpath_add_sysfs_link.part.0+0xb4/0x3a0 [ 5.044300][ T25] [ 5.044300][ T25] stack backtrace: [ 5.044439][ T25] CPU: 0 UID: 0 PID: 25 Comm: kworker/u9:0 Not tainted 6.16.0-rc3+ #23 PREEMPT(full) [ 5.044441][ T25] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 5.044442][ T25] Workqueue: async async_run_entry_fn [ 5.044445][ T25] Call Trace: [ 5.044446][ T25] <TASK> [ 5.044449][ T25] dump_stack_lvl+0x6f/0xb0 [ 5.044453][ T25] lockdep_rcu_suspicious.cold+0x4f/0xb1 [ 5.044457][ T25] nvme_mpath_add_sysfs_link.part.0+0x2fb/0x3a0 [ 5.044459][ T25] ? queue_work_on+0x90/0xf0 [ 5.044461][ T25] ? lockdep_hardirqs_on+0x78/0x110 [ 5.044466][ T25] nvme_mpath_set_live+0x1e9/0x4f0 [ 5.044470][ T25] nvme_mpath_add_disk+0x240/0x2f0 [ 5.044472][ T25] ? __pfx_nvme_mpath_add_disk+0x10/0x10 [ 5.044475][ T25] ? add_disk_fwnode+0x361/0x580 [ 5.044480][ T25] nvme_alloc_ns+0x81c/0x17c0 [ 5.044483][ T25] ? kasan_quarantine_put+0x104/0x240 [ 5.044487][ T25] ? __pfx_nvme_alloc_ns+0x10/0x10 [ 5.044495][ T25] ? __pfx_nvme_find_get_ns+0x10/0x10 [ 5.044496][ T25] ? rcu_read_lock_any_held+0x45/0xa0 [ 5.044498][ T25] ? validate_chain+0x232/0x4f0 [ 5.044503][ T25] nvme_scan_ns+0x4c8/0x810 [ 5.044506][ T25] ? __pfx_nvme_scan_ns+0x10/0x10 [ 5.044508][ T25] ? find_held_lock+0x2b/0x80 [ 5.044512][ T25] ? ktime_get+0x16d/0x220 [ 5.044517][ T25] ? kvm_clock_get_cycles+0x18/0x30 [ 5.044520][ T25] ? __pfx_nvme_scan_ns_async+0x10/0x10 [ 5.044522][ T25] async_run_entry_fn+0x97/0x560 [ 5.044523][ T25] ? rcu_is_watching+0x12/0xc0 [ 5.044526][ T25] process_one_work+0xd3c/0x1350 [ 5.044532][ T25] ? __pfx_process_one_work+0x10/0x10 [ 5.044536][ T25] ? assign_work+0x16c/0x240 [ 5.044539][ T25] worker_thread+0x4da/0xd50 [ 5.044545][ T25] ? __pfx_worker_thread+0x10/0x10 [ 5.044546][ T25] kthread+0x356/0x5c0 [ 5.044548][ T25] ? __pfx_kthread+0x10/0x10 [ 5.044549][ T25] ? ret_from_fork+0x1b/0x2e0 [ 5.044552][ T25] ? __lock_release.isra.0+0x5d/0x180 [ 5.044553][ T25] ? ret_from_fork+0x1b/0x2e0 [ 5.044555][ T25] ? rcu_is_watching+0x12/0xc0 [ 5.044557][ T25] ? __pfx_kthread+0x10/0x10 [ 5.044559][ T25] ret_from_fork+0x218/0x2e0 [ 5.044561][ T25] ? __pfx_kthread+0x10/0x10 [ 5.044562][ T25] ret_from_fork_asm+0x1a/0x30 [ 5.044570][ T25] </TASK> ''' This patch uses sleepable RCU version of helper list_for_each_entry_srcu() instead of list_for_each_entry_rcu() to fix it. Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy") Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-30nvme-pci: rework the build time assert for NVME_MAX_NR_DESCRIPTORSChristoph Hellwig1-13/+15
The current use of an always_inline helper is a bit convoluted. Instead use macros that represent the arithmetics used for building up the PRP chain. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30nvme-pci: replace NVME_MAX_KB_SZ with NVME_MAX_BYTEChristoph Hellwig1-5/+5
Having a define in kiB units is a bit weird. Also update the comment now that there is not scatterlist limit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30nvme-pci: convert the data mapping to blk_rq_dma_mapChristoph Hellwig1-146/+241
Use the blk_rq_dma_map API to DMA map requests instead of scatterlists. This removes the need to allocate a scatterlist covering every segment, and thus the overall transfer length limit based on the scatterlist allocation. Instead the DMA mapping is done by iterating the bio_vec chain in the request directly. The unmap is handled differently depending on how we mapped: - when using an IOMMU only a single IOVA is used, and it is stored in iova_state - for direct mappings that don't use swiotlb and are cache coherent, unmap is not needed at all - for direct mappings that are not cache coherent or use swiotlb, the physical addresses are rebuild from the PRPs or SGL segments The latter unfortunately adds a fair amount of code to the driver, but it is code not used in the fast path. The conversion only covers the data mapping path, and still uses a scatterlist for the multi-segment metadata case. I plan to convert that as soon as we have good test coverage for the multi-segment metadata path. Thanks to Chaitanya Kulkarni for an initial attempt at a new DMA API conversion for nvme-pci, Kanchan Joshi for bringing back the single segment optimization, Leon Romanovsky for shepherding this through a gazillion rebases and Nitesh Shetty for various improvements. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30nvme-pci: remove superfluous argumentsChristoph Hellwig1-54/+51
The call chain in the prep_rq and completion paths passes around a lot of nvme_dev, nvme_queue and nvme_command arguments that can be trivially derived from the passed in struct request. Remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30nvme-pci: merge the simple PRP and SGL setup into a common helperChristoph Hellwig1-44/+32
nvme_setup_prp_simple and nvme_setup_sgl_simple share a lot of logic. Merge them into a single helper that makes use of the previously added use_sgl tristate. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250625113531.522027-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30nvme-pci: refactor nvme_pci_use_sglsChristoph Hellwig1-14/+27
Move the average segment size into a separate helper, and return a tristate to distinguish the case where can use SGL vs where we have to use SGLs. This will allow the simplify the code and make more efficient decisions in follow on changes. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250625113531.522027-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30nvme-pci: refresh visible attrs after being checkedEugen Hristev1-2/+4
The sysfs attributes are registered early, but the driver does not know whether they are needed or not at that moment. For the CMB attributes, commit e917a849c3fc ("nvme-pci: refresh visible attrs for cmb attributes") solved this problem by calling nvme_update_attrs after mapping the CMB. However the issue persists for the HMB attributes. To solve the problem, moved the call to nvme_update_attrs after nvme_setup_host_mem, which sets up the HMB. Fixes: e917a849c3fc ("nvme-pci: refresh visible attrs for cmb attributes") Fixes: 86adbf0cdb9e ("nvme: simplify transport specific device attribute handling") Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-30nvmet: fix memory leak of bio integrityDmitry Bogdanov1-0/+2
If nvmet receives commands with metadata there is a continuous memory leak of kmalloc-128 slab or more precisely bio->bi_integrity. Since commit bf4c89fc8797 ("block: don't call bio_uninit from bio_endio") each user of bio_init has to use bio_uninit as well. Otherwise the bio integrity is not getting free. Nvmet uses bio_init for inline bios. Uninit the inline bio to complete deallocation of integrity in bio. Fixes: bf4c89fc8797 ("block: don't call bio_uninit from bio_endio") Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-30nvme: correctly account for namespace head reference counterNilay Shroff2-2/+19
The blktests nvme/058 manifests an issue where the NVMe subsystem kobject entry remains stale in sysfs, causing a failure during subsequent NVMe module reloads[1]. Specifically, when attempting to register a new NVMe subsystem, the driver encounters a kobejct name collision because a stale kobject still exists. Though, please note that nvme/058 doesn't report any failure and test case passes and it's only during subsequent NVMe module reloads, the stale nvme sub- system kobject entry in sysfs causes the observed symptom[1]. This issue stems from an imbalance in the get/put usage of the namespace head (nshead) reference counter. The nshead holds a reference to the associated NVMe subsystem. If the nshead reference is not properly released, it prevents the cleanup of the subsystem's kobject, leaving nvme subsystem stale entry behind in sysfs. During the failure case, the last namespace path referencing a nshead is removed, but the nshead reference was not released. This occurs because the release logic currently only puts the nshead reference when its state is LIVE. However, in configurations where ANA (Asymmetric Namespace Access) is enabled, a namespace may be associated with an ANA state that is neither optimized nor non-optimized. In this case, the nshead may never transition to LIVE, and the corresponding nshead reference is then never dropped. In fact nvme/058 associates some of nvme namespaces to an inaccessible ANA state and with that nshead is created but it's state is not transitioned to LIVE. So the current logic would then causes nshead reference to be leaked for non-LIVE states. Another scenario, during namespace allocation, the driver first allocates a nshead and then issues an Identify Namespace command. If this command fails — which can happen in tests like nvme/058 that rapidly enables and disables namespaces — we must release the reference to the newly allocated nshead. However this reference release is currently missing in the failure, causing a nshead reference leak. To fix this, we now unconditionally release the nshead reference when the last nvme path referencing to the nshead is removed, regardless of the head’s state. Also during identify namespace failure case we now properly release the nshead reference. So this ensures proper cleanup of the nshead, and consequently, the NVMe subsystem and its associated kobject. This change prevents stale kobject entries from lingering in sysfs and eliminates the module reload failures observed just after running nvme/058. [1] https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Reported-by: yi.zhang@redhat.com Closes: https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Fixes: 62188639ec16 ("nvme-multipath: introduce delayed removal of the multipath head node") Tested-by: yi.zhang@redhat.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-30nvme: Fix incorrect cdw15 value in passthru error loggingAlok Tiwari1-1/+1
Fix an error in nvme_log_err_passthru() where cdw14 was incorrectly printed twice instead of cdw15. This fix ensures accurate logging of the full passthrough command payload. Fixes: 9f079dda1433 ("nvme: allow passthru cmd error logging") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>