aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/drivers/nvme/host/core.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2024-11-11nvme: add rotational supportWang Yugui1-0/+6
Rotational devices, such as hard-drives, can be detected using the rotational bit in the namespace independent identify namespace data structure. Make the bit visible to the block layer through the rotational queue setting. Signed-off-by: Wang Yugui <wangyugui@e16-tech.com> Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvme: use command set independent id ns if availableMatias Bjørling1-3/+4
The NVMe 2.0 specification adds an independent identify namespace data structure that contains generic attributes that apply to all namespace types. Some attributes carry over from the NVM command set identify namespace data structure, and others are new. Currently, the data structure only considered when CRIMS is enabled or when the namespace type is key-value. However, the independent namespace data structure is mandatory for devices that implement features from the 2.0+ specification. Therefore, we can check this data structure first. If unavailable, retrieve the generic attributes from the NVM command set identify namespace data structure. Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement rotational media information logKeith Busch1-0/+1
Most of the information is stubbed. Supporting these commands is a requirement for supporting rotational media. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-11nvmet: implement endurance groupsKeith Busch1-0/+1
Most of the returned information is just stubbed data. The target must support these in order to report rotational media. Since this driver doesn't know any better, each namespace is its own endurance group with the engid value matching the nsid. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-09Merge tag 'block-6.12-20241108' of git://git.kernel.dk/linuxLinus Torvalds1-7/+14
Pull block fix from Jens Axboe: "Single fix for an issue triggered with PROVE_RCU=y, with nvme using the wrong iterators for an SRCU protected list" * tag 'block-6.12-20241108' of git://git.kernel.dk/linux: nvme/host: Fix RCU list traversal to use SRCU primitive
2024-11-05nvme-core: remove repeated wq flagsChaitanya Kulkarni1-6/+4
In nvme_core_init() nvme_wq, nvme_reset_wq, nvme_delete_wq share same flags :- WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS. Insated of repeating these flags in each call use the common variable. Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-04nvme/host: Fix RCU list traversal to use SRCU primitiveBreno Leitao1-7/+14
The code currently uses list_for_each_entry_rcu() while holding an SRCU lock, triggering false positive warnings with CONFIG_PROVE_RCU=y enabled: drivers/nvme/host/core.c:3770 RCU-list traversed in non-reader section!! While the list is properly protected by SRCU lock, the code uses the wrong list traversal primitive. Replace list_for_each_entry_rcu() with list_for_each_entry_srcu() to correctly indicate SRCU-based protection and eliminate the false warning. Fixes: be647e2c76b2 ("nvme: use srcu for iterating namespace list") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-11-01Merge tag 'block-6.12-20241101' of git://git.kernel.dk/linuxLinus Torvalds1-14/+42
Pull block fixes from Jens Axboe: - Fixup for a recent blk_rq_map_user_bvec() patch - NVMe pull request via Keith: - Spec compliant identification fix (Keith) - Module parameter to enable backward compatibility on unusual namespace formats (Keith) - Target double free fix when using keys (Vitaliy) - Passthrough command error handling fix (Keith) * tag 'block-6.12-20241101' of git://git.kernel.dk/linux: nvme: re-fix error-handling for io_uring nvme-passthrough nvmet-auth: assign dh_key to NULL after kfree_sensitive nvme: module parameter to disable pi with offsets block: fix queue limits checks in blk_rq_map_user_bvec for real nvme: enhance cns version checking
2024-10-30nvme: module parameter to disable pi with offsetsKeith Busch1-2/+17
A recent commit enables integrity checks for formats the previous kernel versions registered with the "nop" integrity profile. This means namespaces using that format become unreadable when upgrading the kernel past that commit. Introduce a module parameter to restore the "nop" integrity profile so that storage can be readable once again. This could be a boot device, so the setting needs to happen at module load time. Fixes: 921e81db524d17 ("nvme: allow integrity when PI is not in first bytes") Reported-by: David Wei <dw@davidwei.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-26nvme: core: switch to non_owner variant of start_freeze/unfreeze queueMing Lei1-2/+7
nvme_start_freeze() and nvme_unfreeze() may be called from same context, so switch them to call non_owner variant of start_freeze/unfreeze queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241025003722.3630252-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22nvme: enhance cns version checkingKeith Busch1-12/+25
The number of CNS bits in the command is specific to the nvme spec version compliance. The existing check is not sufficient for possible CNS values the driver uses that may create confusion between host and device, so enhance the check to consider the version and desired CNS value. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-18Merge tag 'block-6.12-20241018' of git://git.kernel.dk/linuxLinus Torvalds1-24/+17
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Fix target passthrough identifier (Nilay) - Fix tcp locking (Hannes) - Replace list with sbitmap for tracking RDMA rsp tags (Guixen) - Remove unnecessary fallthrough statements (Tokunori) - Remove ready-without-media support (Greg) - Fix multipath partition scan deadlock (Keith) - Fix concurrent PCI reset and remove queue mapping (Maurizio) - Fabrics shutdown fixes (Nilay) - Fix for a kerneldoc warning (Keith) - Fix a race with blk-rq-qos and wakeups (Omar) - Cleanup of checking for always-set tag_set (SurajSonawane2415) - Fix for a crash with CPU hotplug notifiers (Ming) - Don't allow zero-copy ublk on unprivileged device (Ming) - Use array_index_nospec() for CDROM (Josh) - Remove dead code in drbd (David) - Tweaks to elevator loading (Breno) * tag 'block-6.12-20241018' of git://git.kernel.dk/linux: cdrom: Avoid barrier_nospec() in cdrom_ioctl_media_changed() nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function nvme: make keep-alive synchronous operation nvme-loop: flush off pending I/O while shutting down loop controller nvme-pci: fix race condition between reset and nvme_dev_disable() ublk: don't allow user copy for unprivileged device blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race nvme-multipath: defer partition scanning blk-mq: setup queue ->tag_set before initializing hctx elevator: Remove argument from elevator_find_get elevator: do not request_module if elevator exists drbd: Remove unused conn_lowest_minor nvme: disable CC.CRIME (NVME_CC_CRIME) nvme: delete unnecessary fallthru comment nvmet-rdma: use sbitmap to replace rsp free list block: Fix elevator_get_default() checking for NULL q->tag_set nvme: tcp: avoid race between queue_lock lock and destroy nvmet-passthru: clear EUID/NGUID/UUID while using loop target block: fix blk_rq_map_integrity_sg kernel-doc
2024-10-17nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish functionNilay Shroff1-8/+2
We no more need acquiring ctrl->lock before accessing the NVMe controller state and instead we can now use the helper nvme_ctrl_state. So replace the use of ctrl->lock from nvme_keep_alive_finish function with nvme_ctrl_state call. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-17nvme: make keep-alive synchronous operationNilay Shroff1-10/+7
The nvme keep-alive operation, which executes at a periodic interval, could potentially sneak in while shutting down a fabric controller. This may lead to a race between the fabric controller admin queue destroy code path (invoked while shutting down controller) and hw/hctx queue dispatcher called from the nvme keep-alive async request queuing operation. This race could lead to the kernel crash shown below: Call Trace: autoremove_wake_function+0x0/0xbc (unreliable) __blk_mq_sched_dispatch_requests+0x114/0x24c blk_mq_sched_dispatch_requests+0x44/0x84 blk_mq_run_hw_queue+0x140/0x220 nvme_keep_alive_work+0xc8/0x19c [nvme_core] process_one_work+0x200/0x4e0 worker_thread+0x340/0x504 kthread+0x138/0x140 start_kernel_thread+0x14/0x18 While shutting down fabric controller, if nvme keep-alive request sneaks in then it would be flushed off. The nvme_keep_alive_end_io function is then invoked to handle the end of the keep-alive operation which decrements the admin->q_usage_counter and assuming this is the last/only request in the admin queue then the admin->q_usage_counter becomes zero. If that happens then blk-mq destroy queue operation (blk_mq_destroy_ queue()) which could be potentially running simultaneously on another cpu (as this is the controller shutdown code path) would forward progress and deletes the admin queue. So, now from this point onward we are not supposed to access the admin queue resources. However the issue here's that the nvme keep-alive thread running hw/hctx queue dispatch operation hasn't yet finished its work and so it could still potentially access the admin queue resource while the admin queue had been already deleted and that causes the above crash. This fix helps avoid the observed crash by implementing keep-alive as a synchronous operation so that we decrement admin->q_usage_counter only after keep-alive command finished its execution and returns the command status back up to its caller (blk_execute_rq()). This would ensure that fabric shutdown code path doesn't destroy the fabric admin queue until keep-alive request finished execution and also keep-alive thread is not running hw/hctx queue dispatch operation. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-09nvme: disable CC.CRIME (NVME_CC_CRIME)Greg Joyce1-6/+8
Disable NVME_CC_CRIME so that CSTS.RDY indicates that the media is ready and able to handle commands without returning NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY. Signed-off-by: Greg Joyce <gjoyce@linux.ibm.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Tested-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-02move asm/unaligned.h to linux/unaligned.hAl Viro1-1/+1
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-09-24nvme: remove CC register read-back during enablingKeith Busch1-5/+0
Any non-posted read should flush the previous write, so we don't necessarily need to read back the value we just wrote. I've found at least some controllers that respond with 0 for short moments after writing the CC register with EN (enable) cleared, so the read-back is overwriting our valid ctrl_config value and ends up breaking on the subsequent enabling. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-09-17Merge tag 'v6.11' into for-6.12/blockJens Axboe1-11/+12
Merge in 6.11 final to get the fix for preventing deadlocks on an elevator switch, as there's a fixup for that patch. * tag 'v6.11': (1788 commits) Linux 6.11 Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop" pinctrl: pinctrl-cy8c95x0: Fix regcache cifs: Fix signature miscalculation mm: avoid leaving partial pfn mappings around in error case drm/xe/client: add missing bo locking in show_meminfo() drm/xe/client: fix deadlock in show_meminfo() drm/xe/oa: Enable Xe2+ PES disaggregation drm/xe/display: fix compat IS_DISPLAY_STEP() range end drm/xe: Fix access_ok check in user_fence_create drm/xe: Fix possible UAF in guc_exec_queue_process_msg drm/xe: Remove fence check from send_tlb_invalidation drm/xe/gt: Remove double include net: netfilter: move nf flowtable bpf initialization in nf_flow_table_module_init() PCI: Fix potential deadlock in pcim_intx() workqueue: Clear worker->pool in the worker thread context net: tighten bad gso csum offset check in virtio_net_hdr netlink: specs: mptcp: fix port endianness net: dpaa: Pad packets to ETH_ZLEN mptcp: pm: Fix uaf in __timer_delete_sync ...
2024-09-06nvme: Convert comma to semicolonShen Lichuan1-1/+1
To ensure code clarity and prevent potential errors, it's advisable to employ the ';' as a statement separator, except when ',' are intentionally used for specific purposes. Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-26nvme: rename apptag and appmask to lbat and lbatmAnuj Gupta1-2/+2
Rename apptag and appmask to lbat and lbatm so that it matches the field names used in NVMe spec. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-23nvme: use better description for async reset reasonKeith Busch1-1/+2
The NVMe AER notification of a persistent internal error triggers a reset. The existing warning message just says "due to AER", which can be confused with the unrelated PCIe AER condition. Just say what the event was instead of the generic overloaded acronym. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22nvme: move stopping keep-alive into nvme_uninit_ctrl()Ming Lei1-1/+1
Commit 4733b65d82bd ("nvme: start keep-alive after admin queue setup") moves starting keep-alive from nvme_start_ctrl() into nvme_init_ctrl_finish(), but don't move stopping keep-alive into nvme_uninit_ctrl(), so keep-alive work can be started and keep pending after failing to start controller, finally use-after-free is triggered if nvme host driver is unloaded. This patch fixes kernel panic when running nvme/004 in case that connection failure is triggered, by moving stopping keep-alive into nvme_uninit_ctrl(). This way is reasonable because keep-alive is now started in nvme_init_ctrl_finish(). Fixes: 3af755a46881 ("nvme: move nvme_stop_keep_alive() back to original position") Cc: Hannes Reinecke <hare@suse.de> Cc: Mark O'Donovan <shiftee@posteo.net> Reported-by: Changhui Zhong <czhong@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22nvme-tcp: sanitize TLS key handlingHannes Reinecke1-1/+0
There is a difference between TLS configured (ie the user has provisioned/requested a key) and TLS enabled (ie the connection is encrypted with TLS). This becomes important for secure concatenation, where the initial authentication is run on an unencrypted connection (ie with TLS configured, but not enabled), and then the queue is reset to run over TLS (ie TLS configured _and_ enabled). So to differentiate between those two states store the generated key in opts->tls_key (as we're using the same TLS key for all queues), the key serial of the resulting TLS handshake in ctrl->tls_pskid (to signal that TLS on the admin queue is enabled), and a simple flag for the queues to indicated that TLS has been enabled. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22nvme_core: scan namespaces asynchronouslyStuart Hayes1-1/+39
Use async function calls to make namespace scanning happen in parallel. Without the patch, NVME namespaces are scanned serially, so it can take a long time for all of a controller's namespaces to become available, especially with a slower (TCP) interface with large number of namespaces. It is not uncommon to have large numbers (hundreds or thousands) of namespaces on nvme-of with storage servers. The time it took for all namespaces to show up after connecting (via TCP) to a controller with 1002 namespaces was measured on one system: network latency without patch with patch 0 6s 1s 50ms 210s 10s 100ms 417s 18s Measurements taken on another system show the effect of the patch on the time nvme_scan_work() took to complete, when connecting to a linux nvme-of target with varying numbers of namespaces, on a network of 400us. namespaces without patch with patch 1 16ms 14ms 2 24ms 16ms 4 49ms 22ms 8 101ms 33ms 16 207ms 56ms 100 1.4s 0.6s 1000 12.9s 2.0s On the same system, connecting to a local PCIe NVMe drive (a Samsung PM1733) instead of a network target: namespaces without patch with patch 1 13ms 12ms 2 41ms 13ms Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2024-07-31nvme: remove a field from nvme_ns_headKanchan Joshi1-8/+8
pi_offset field is not required to be present in nvme_ns_head. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-29nvme: remove unused parameterKanchan Joshi1-3/+3
First parameter of nvme_init_integrity() is unused. Remove it, and modify the callers. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-26Merge tag 'nvme-6.11-2024-07-26' of git://git.infradead.org/nvme into block-6.11Jens Axboe1-1/+7
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.11 - Fix request without payloads cleanup (Leon) - Use new protection information format (Francis) - Improved debug message for lost pci link (Bart) - Another apst quirk (Wang) - Use appropriate sysfs api for printing chars (Markus)" * tag 'nvme-6.11-2024-07-26' of git://git.infradead.org/nvme: nvme-pci: add missing condition check for existence of mapped data nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPE nvme-pci: Fix the instructions for disabling power management nvme: remove redundant bdev local variable nvme-fabrics: Use seq_putc() in __nvmf_concat_opt_tokens() nvme/pci: Add APST quirk for Lenovo N60z laptop
2024-07-16nvme-core: choose PIF from QPIF if QPIFS supports and PIF is QTYPEFrancis Pravin1-1/+7
As per TP4141a: "If the Qualified Protection Information Format Support(QPIFS) bit is set to 1 and the Protection Information Format(PIF) field is set to 11b (i.e., Qualified Type), then the pif is as defined in the Qualified Protection Information Format (QPIF) field." So, choose PIF from QPIF if QPIFS supports and PIF is QTYPE. Signed-off-by: Francis Pravin <francis.p@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-15Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linuxLinus Torvalds1-95/+191
Pull block updates from Jens Axboe: - NVMe updates via Keith: - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng) - MD updates via Song - sync_action fix and refactoring (Yu Kuai) - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li) - Fix loop detach/open race (Gulam) - Fix lower control limit for blk-throttle (Yu) - Add module descriptions to various drivers (Jeff) - Add support for atomic writes for block devices, and statx reporting for same. Includes SCSI and NVMe (John, Prasad, Alan) - Add IO priority information to block trace points (Dongliang) - Various zone improvements and tweaks (Damien) - mq-deadline tag reservation improvements (Bart) - Ignore direct reclaim swap writes in writeback throttling (Baokun) - Block integrity improvements and fixes (Anuj) - Add basic support for rust based block drivers. Has a dummy null_blk variant for now (Andreas) - Series converting driver settings to queue limits, and cleanups and fixes related to that (Christoph) - Cleanup for poking too deeply into the bvec internals, in preparation for DMA mapping API changes (Christoph) - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas, Ming, Zhu, Damien, Christophe, Chaitanya) * tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits) floppy: add missing MODULE_DESCRIPTION() macro loop: add missing MODULE_DESCRIPTION() macro ublk_drv: add missing MODULE_DESCRIPTION() macro xen/blkback: add missing MODULE_DESCRIPTION() macro block/rnbd: Constify struct kobj_type block: take offset into account in blk_bvec_map_sg again block: fix get_max_segment_size() warning loop: Don't bother validating blocksize virtio_blk: Don't bother validating blocksize null_blk: Don't bother validating blocksize block: Validate logical block size in blk_validate_limits() virtio_blk: Fix default logical block size fallback nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id block: pass a phys_addr_t to get_max_segment_size block: add a bvec_phys helper blk-lib: check for kill signal in ioctl BLKZEROOUT block: limit the Write Zeroes to manually writing zeroes fallback block: refacto blkdev_issue_zeroout block: move read-only and supported checks into (__)blkdev_issue_zeroout ...
2024-07-08Merge tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme into for-6.11/blockJens Axboe1-44/+83
Pull NVMe updates from Keith: "nvme updates for Linux 6.11 - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng)" * tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme: (21 commits) nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id nvme-multipath: implement "queue-depth" iopolicy nvme-multipath: prepare for "queue-depth" iopolicy nvme-pci: do not directly handle subsys reset fallout lpfc_nvmet: implement 'host_traddr' nvme-fcloop: implement 'host_traddr' nvmet-fc: implement host_traddr() nvmet-rdma: implement host_traddr() nvmet-tcp: implement host_traddr() nvmet: add 'host_traddr' callback for debugfs nvmet: add debugfs support mailmap: add entry for Weiwen Hu nvme: rename CDR/MORE/DNR to NVME_STATUS_* nvme: fix status magic numbers nvme: rename nvme_sc_to_pr_err to nvme_status_to_pr_err nvme: split device add from initialization nvme: fc: split controller bringup handling nvme: rdma: split controller bringup handling nvme: tcp: split controller bringup handling ...
2024-07-08nvme: implement ->get_unique_idChristoph Hellwig1-0/+27
Implement the get_unique_id method to allow pNFS SCSI layout access to NVMe namespaces. This is the server side implementation of RFC 9561 "Using the Parallel NFS (pNFS) SCSI Layout to Access Non-Volatile Memory Express (NVMe) Storage Devices". Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-02nvme-multipath: implement "queue-depth" iopolicyThomas Song1-1/+1
The round-robin path selector is inefficient in cases where there is a difference in latency between paths. In the presence of one or more high latency paths the round-robin selector continues to use the high latency path equally. This results in a bias towards the highest latency path and can cause a significant decrease in overall performance as IOs pile on the highest latency path. This problem is acute with NVMe-oF controllers. The queue-depth path selector sends I/O down the path with the lowest number of requests in its request queue. Paths with lower latency will clear requests more quickly and have less requests queued compared to higher latency paths. The goal of this path selector is to make more use of lower latency paths which will bring down overall IO latency and increase throughput and performance. Signed-off-by: Thomas Song <tsong@purestorage.com> [emilne: commandeered patch developed by Thomas Song @ Pure Storage] Co-developed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Ewan D. Milne <emilne@redhat.com> Co-developed-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: John Meneghini <jmeneghi@redhat.com> Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/ Tested-by: Marco Patalano <mpatalan@redhat.com> Tested-by: Jyoti Rani <jrani@purestorage.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Randy Jennings <randyj@purestorage.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-01nvme: don't set io_opt if NOWS is zeroChristoph Hellwig1-1/+2
NOWS is one of the annoying "0's based values" in NVMe, where 0 means one and we thus can't detect if it isn't set. Thus a NOWS value of 0 means that the Namespace Optimal Write Size is a single LBA, which is clearly bogus. Ignore the value in that case and don't propagate an io_opt value to the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240701051800.1245240-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-24nvme: rename CDR/MORE/DNR to NVME_STATUS_*Weiwen Hu1-11/+11
CDR/MORE/DNR fields are not belonging to SC in the NVMe spec, rename them to NVME_STATUS_* to avoid confusion. Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvme: fix status magic numbersWeiwen Hu1-9/+9
Replaced some magic numbers about SC and SCT with enum and macro. Signed-off-by: Weiwen Hu <huweiwen@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvme: split device add from initializationKeith Busch1-23/+35
Combining both creates an ambiguous cleanup scenario for the caller if an error is returned: does the device reference need to be dropped or did the error occur before the device was initialized? If an error occurs after the device is added, then the existing cleanup routines will leak memory. Furthermore, the nvme core is taking it upon itself to free the device's kobj name under certain conditions rather than go through the core device API. We shouldn't be peaking into these implementation details. Split the device initialization from the addition to make it easier to know the error handling actions, fix the existing memory leaks, and stop the device layering violations. Link: https://lore.kernel.org/linux-nvme/c4050a37-ecc9-462c-9772-65e25166f439@grimberg.me/ Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-20nvme: Atomic write supportAlan Adamson1-0/+52
Add support to set block layer request_queue atomic write limits. The limits will be derived from either the namespace or controller atomic parameters. NVMe atomic-related parameters are grouped into "normal" and "power-fail" (or PF) class of parameter. For atomic write support, only PF parameters are of interest. The "normal" parameters are concerned with racing reads and writes (which also applies to PF). See NVM Command Set Specification Revision 1.0d section 2.1.4 for reference. Whether to use per namespace or controller atomic parameters is decided by NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data Structure, NVM Command Set. NVMe namespaces may define an atomic boundary, whereby no atomic guarantees are provided for a write which straddles this per-lba space boundary. The block layer merging policy is such that no merges may occur in which the resultant request would straddle such a boundary. Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from atomic boundary rule. In addition, again unlike SCSI, there is no dedicated atomic write command - a write which adheres to the atomic size limit and boundary is implicitly atomic. If NSFEAT bit 1 is set, the following parameters are of interest: - NAWUPF (Namespace Atomic Write Unit Power Fail) - NABSPF (Namespace Atomic Boundary Size Power Fail) - NABO (Namespace Atomic Boundary Offset) and we set request_queue limits as follows: - atomic_write_unit_max = rounddown_pow_of_two(NAWUPF) - atomic_write_max_bytes = NAWUPF - atomic_write_boundary = NABSPF If in the unlikely scenario that NABO is non-zero, then atomic writes will not be supported at all as dealing with this adds extra complexity. This policy may change in future. In all cases, atomic_write_unit_min is set to the logical block size. If NSFEAT bit 1 is unset, the following parameter is of interest: - AWUPF (Atomic Write Unit Power Fail) and we set request_queue limits as follows: - atomic_write_unit_max = rounddown_pow_of_two(AWUPF) - atomic_write_max_bytes = AWUPF - atomic_write_boundary = 0 A new function, nvme_valid_atomic_write(), is also called from submission path to verify that a request has been submitted to the driver will actually be executed atomically. As mentioned, there is no dedicated NVMe atomic write command (which may error for a command which exceeds the controller atomic write limits). Note on NABSPF: There seems to be some vagueness in the spec as to whether NABSPF applies for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF and how it is affected by bit 1. However Figure 4 does tell to check Figure 97 for info about per-namespace parameters, which NABSPF is, so it is implied. However currently nvme_update_disk_info() does check namespace parameter NABO regardless of this bit. Signed-off-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> jpg: total rewrite Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240620125359.2684798-11-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the skip_tagset_quiesce flag to queue_limitsChristoph Hellwig1-3/+5
Move the skip_tagset_quiesce flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-26-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the pci_p2pdma flag to queue_limitsChristoph Hellwig1-5/+3
Move the pci_p2pdma flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-25-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the stable_writes flag to queue_limitsChristoph Hellwig1-4/+5
Move the stable_writes flag into the queue_limits feature field so that it can be set atomically with the queue frozen. The flag is now inherited by blk_stack_limits, which greatly simplifies the code in dm, and fixed md which previously did not pass on the flag set on lower devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move the nonrot flag to queue_limitsChristoph Hellwig1-1/+0
Move the nonrot flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Use the chance to switch to defaulting to non-rotational and require the driver to opt into rotational, which matches the polarity of the sysfs interface. For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new rotational flag is not set as they clearly are not rotational despite this being a behavior change. There are some other drivers that unconditionally set the rotational flag to keep the existing behavior as they arguably can be used on rotational devices even if that is probably not their main use today (e.g. virtio_blk and drbd). The flag is automatically inherited in blk_stack_limits matching the existing behavior in dm and md. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move cache control settings out of queue->flagsChristoph Hellwig1-2/+5
Move the cache control settings into the queue_limits so that the flags can be set atomically with the device queue frozen. Add new features and flags field for the driver set flags, and internal (usually sysfs-controlled) flags in the block layer. Note that we'll eventually remove enough field from queue_limits to bring it back to the previous size. The disable flag is inverted compared to the previous meaning, which means it now survives a rescan, similar to the max_sectors and max_discard_sectors user limits. The FLUSH and FUA flags are now inherited by blk_stack_limits, which simplified the code in dm a lot, but also causes a slight behavior change in that dm-switch and dm-unstripe now advertise a write cache despite setting num_flush_bios to 0. The I/O path will handle this gracefully, but as far as I can tell the lack of num_flush_bios and thus flush support is a pre-existing data integrity bug in those targets that really needs fixing, after which a non-zero num_flush_bios should be required in dm for targets that map to underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14block: move integrity information into queue_limitsChristoph Hellwig1-34/+36
Move the integrity information into the queue limits so that it can be set atomically with other queue limits, and that the sysfs changes to the read_verify and write_generate flags are properly synchronized. This also allows to provide a more useful helper to stack the integrity fields, although it still is separate from the main stacking function as not all stackable devices want to inherit the integrity settings. Even with that it greatly simplifies the code in md and dm. Note that the integrity field is moved as-is into the queue limits. While there are good arguments for removing the separate blk_integrity structure, this would cause a lot of churn and might better be done at a later time if desired. However the integrity field in the queue_limits structure is now unconditional so that various ifdefs can be avoided or replaced with IS_ENABLED(). Given that tiny size of it that seems like a worthwhile trade off. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14block: remove the blk_integrity_profile structureChristoph Hellwig1-9/+8
Block layer integrity configuration is a bit complex right now, as it indirects through operation vectors for a simple two-dimensional configuration: a) the checksum type of none, ip checksum, crc, crc64 b) the presence or absence of a reference tag Remove the integrity profile, and instead add a separate csum_type flag which replaces the existing ip-checksum field and a new flag that indicates the presence of the reference tag. This removes up to two layers of indirect calls, remove the need to offload the no-op verification of non-PI metadata to a workqueue and generally simplifies the code. The downside is that block/t10-pi.c now has to be built into the kernel when CONFIG_BLK_DEV_INTEGRITY is supported. Given that both nvme and SCSI require t10-pi.ko, it is loaded for all usual configurations that enabled CONFIG_BLK_DEV_INTEGRITY already, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-13nvme: fix namespace removal listKeith Busch1-4/+5
This function wants to move a subset of a list from one element to the tail into another list. It also needs to use the srcu synchronize instead of the regular rcu version. Do this one element at a time because that's the only to do it. Fixes: be647e2c76b27f4 ("nvme: use srcu for iterating namespace list") Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-12nvme: avoid double free special payloadChunguang Xu1-0/+1
If a discard request needs to be retried, and that retry may fail before a new special payload is added, a double free will result. Clear the RQF_SPECIAL_LOAD when the request is cleaned. Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-28nvme: use srcu for iterating namespace listKeith Busch1-40/+59
The nvme pci driver synchronizes with all the namespace queues during a reset to ensure that there's no pending timeout work. Meanwhile the timeout work potentially iterates those same namespaces to freeze their queues. Each of those namespace iterations use the same read lock. If a write lock should somehow get between the synchronize and freeze steps, then forward progress is deadlocked. We had been relying on the nvme controller state machine to ensure the reset work wouldn't conflict with timeout work. That guarantee may be a bit fragile to rely on, so iterate the namespace lists without taking potentially circular locks, as reported by lockdep. Link: https://lore.kernel.org/all/20220930001943.zdbvolc3gkekfmcv@shindev/ Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-23nvme-multipath: fix io accounting on failoverKeith Busch1-1/+1
There are io stats accounting that needs to be handled, so don't call blk_mq_end_request() directly. Use the existing nvme_end_req() helper that already handles everything. Fixes: d4d957b53d91ee ("nvme-multipath: support io stats on the mpath device") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-23nvme: fix multipath batched completion accountingKeith Busch1-5/+10
Batched completions were missing the io stats accounting and bio trace events. Move the common code to a helper and call it from the batched and non-batched functions. Fixes: d4d957b53d91ee ("nvme-multipath: support io stats on the mpath device") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-14Merge tag 'nvme-6.10-2024-05-14' of git://git.infradead.org/nvme into block-6.10Jens Axboe1-3/+3
Pull NVMe updates and fixes from Keith: "nvme updates for Linux 6.10 - Fabrics connection retries (Daniel, Hannes) - Fabrics logging enhancements (Tokunori) - RDMA delete optimization (Sagi)" * tag 'nvme-6.10-2024-05-14' of git://git.infradead.org/nvme: nvme-rdma, nvme-tcp: include max reconnects for reconnect logging nvmet-rdma: Avoid o(n^2) loop in delete_ctrl nvme: do not retry authentication failures nvme-fabrics: short-circuit reconnect retries nvme: return kernel error codes for admin queue connect nvmet: return DHCHAP status codes from nvmet_setup_auth() nvmet: lock config semaphore when accessing DH-HMAC-CHAP key