aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/drivers/nvme/host/core.c (follow)
AgeCommit message (Collapse)AuthorFilesLines
2023-12-19nvme-pci: fix sleeping function called from interrupt contextMaurizio Lombardi1-1/+2
the nvme_handle_cqe() interrupt handler calls nvme_complete_async_event() but the latter may call nvme_auth_stop() which is a blocking function. Sleeping functions can't be called in interrupt context BUG: sleeping function called from invalid context in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/15 Call Trace: <IRQ> __cancel_work_timer+0x31e/0x460 ? nvme_change_ctrl_state+0xcf/0x3c0 [nvme_core] ? nvme_change_ctrl_state+0xcf/0x3c0 [nvme_core] nvme_complete_async_event+0x365/0x480 [nvme_core] nvme_poll_cq+0x262/0xe50 [nvme] Fix the bug by moving nvme_auth_stop() to fw_act_work (executed by the nvme_wq workqueue) Fixes: f50fff73d620 ("nvme: implement In-Band authentication") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: add csi, ms and nuse to sysfsDaniel Wagner1-1/+5
libnvme is using the sysfs for enumarating the nvme resources. Though there are few missing attritbutes in the sysfs. For these libnvme issues commands during discovering. As the kernel already knows all these attributes and we would like to avoid libnvme to issue commands all the time, expose these missing attributes. The nuse value is updated on request because the nuse is a volatile value. Since any user can read the sysfs attribute, a very simple rate limit is added (update once every 5 seconds). A more sophisticated update strategy can be added later if there is actually a need for it. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: rename ns attribute groupDaniel Wagner1-1/+1
Drop the 'id' part of the attribute group name because we want to expose non 'id' related attributes via the ns attribute group. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: refactor ns info setup functionDaniel Wagner1-53/+53
Use nvme_ns_head instead of nvme_ns where possible. This reduces the coupling between the different data structures. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: refactor ns info helpersDaniel Wagner1-15/+21
Pass in the nvme_ns_head pointer directly. This reduces the necessity on the caller side have the nvme_ns data structure present. Thus we can refactor the caller side in the next step as well. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: move ns id info to struct nvme_ns_headDaniel Wagner1-39/+41
Move the namesapce info to struct nvme_ns_head, because it's the same for all associated namespaces. Note: with multipathing enabled the PI information is shared between all paths. If a path is using a different PI configuration it will overwrite the previous settings. This is obviously not correct and such configuration will be rejected in future. For the time being we expect a correctly configured storage. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-06nvme-fabrics: check ioccsz and iorcszGuixin Liu1-0/+14
Make sure that ioccsz and iorcsz returned by target are correct before use it. Per 2.0a base NVMe spec: I/O Queue Command Capsule Supported Size (IOCCSZ): This field defines the maximum I/O command capsule size in 16 byte units. The minimum value that shall be indicated is 4 corresponding to 64 bytes. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-06nvme: introduce nvme_check_ctrl_fabric_info helperGuixin Liu1-18/+24
Inroduce nvme_check_ctrl_fabric_info helper to check fabric controller info returned by target. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme: fix deadlock between reset and scanBitao Hu1-0/+10
If controller reset occurs when allocating namespace, both nvme_reset_work and nvme_scan_work will hang, as shown below. Test Scripts: for ((t=1;t<=128;t++)) do nsid=`nvme create-ns /dev/nvme1 -s 14537724 -c 14537724 -f 0 -m 0 \ -d 0 | awk -F: '{print($NF);}'` nvme attach-ns /dev/nvme1 -n $nsid -c 0 done nvme reset /dev/nvme1 We will find that both nvme_reset_work and nvme_scan_work hung: INFO: task kworker/u249:4:17848 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u249:4 state:D stack: 0 pid:17848 ppid: 2 flags:0x00000028 Workqueue: nvme-reset-wq nvme_reset_work [nvme] Call trace: __switch_to+0xb4/0xfc __schedule+0x22c/0x670 schedule+0x4c/0xd0 blk_mq_freeze_queue_wait+0x84/0xc0 nvme_wait_freeze+0x40/0x64 [nvme_core] nvme_reset_work+0x1c0/0x5cc [nvme] process_one_work+0x1d8/0x4b0 worker_thread+0x230/0x440 kthread+0x114/0x120 INFO: task kworker/u249:3:22404 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u249:3 state:D stack: 0 pid:22404 ppid: 2 flags:0x00000028 Workqueue: nvme-wq nvme_scan_work [nvme_core] Call trace: __switch_to+0xb4/0xfc __schedule+0x22c/0x670 schedule+0x4c/0xd0 rwsem_down_write_slowpath+0x32c/0x98c down_write+0x70/0x80 nvme_alloc_ns+0x1ac/0x38c [nvme_core] nvme_validate_or_alloc_ns+0xbc/0x150 [nvme_core] nvme_scan_ns_list+0xe8/0x2e4 [nvme_core] nvme_scan_work+0x60/0x500 [nvme_core] process_one_work+0x1d8/0x4b0 worker_thread+0x260/0x440 kthread+0x114/0x120 INFO: task nvme:28428 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:nvme state:D stack: 0 pid:28428 ppid: 27119 flags:0x00000000 Call trace: __switch_to+0xb4/0xfc __schedule+0x22c/0x670 schedule+0x4c/0xd0 schedule_timeout+0x160/0x194 do_wait_for_common+0xac/0x1d0 __wait_for_common+0x78/0x100 wait_for_completion+0x24/0x30 __flush_work.isra.0+0x74/0x90 flush_work+0x14/0x20 nvme_reset_ctrl_sync+0x50/0x74 [nvme_core] nvme_dev_ioctl+0x1b0/0x250 [nvme_core] __arm64_sys_ioctl+0xa8/0xf0 el0_svc_common+0x88/0x234 do_el0_svc+0x7c/0x90 el0_svc+0x1c/0x30 el0_sync_handler+0xa8/0xb0 el0_sync+0x148/0x180 The reason for the hang is that nvme_reset_work occurs while nvme_scan_work is still running. nvme_scan_work may add new ns into ctrl->namespaces list after nvme_reset_work frozen all ns->q in ctrl->namespaces list. The newly added ns is not frozen, so nvme_wait_freeze will wait forever. Unfortunately, ctrl->namespaces_rwsem is held by nvme_reset_work, so nvme_scan_work will also wait forever. Now we are deadlocked! PROCESS1 PROCESS2 ============== ============== nvme_scan_work ... nvme_reset_work nvme_validate_or_alloc_ns nvme_dev_disable nvme_alloc_ns nvme_start_freeze down_write ... nvme_ns_add_to_ctrl_list ... up_write nvme_wait_freeze ... down_read nvme_alloc_ns blk_mq_freeze_queue_wait down_write Fix by marking the ctrl with say NVME_CTRL_FROZEN flag set in nvme_start_freeze and cleared in nvme_unfreeze. Then the scan can check it before adding the new namespace (under the namespaces_rwsem). Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Reviewed-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme: ensure reset state check orderingKeith Busch1-20/+22
A different CPU may be setting the ctrl->state value, so ensure proper barriers to prevent optimizing to a stale state. Normally it isn't a problem to observe the wrong state as it is merely advisory to take a quicker path during initialization and error recovery, but seeing an old state can report unexpected ENETRESET errors when a reset request was in fact successful. Reported-by: Minh Hoang <mh2022@meta.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hannes Reinecke <hare@suse.de>
2023-12-01nvme-core: check for too small lba shiftKeith Busch1-2/+3
The block layer doesn't support logical block sizes smaller than 512 bytes. The nvme spec doesn't support that small either, but the driver isn't checking to make sure the device responded with usable data. Failing to catch this will result in a kernel bug, either from a division by zero when stacking, or a zero length bio. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27nvme: check for valid nvme_identify_ns() before using itEwan D. Milne1-0/+9
When scanning namespaces, it is possible to get valid data from the first call to nvme_identify_ns() in nvme_alloc_ns(), but not from the second call in nvme_update_ns_info_block(). In particular, if the NSID becomes inactive between the two commands, a storage device may return a buffer filled with zero as per 4.1.5.1. In this case, we can get a kernel crash due to a divide-by-zero in blk_stack_limits() because ns->lba_shift will be set to zero. PID: 326 TASK: ffff95fec3cd8000 CPU: 29 COMMAND: "kworker/u98:10" #0 [ffffad8f8702f9e0] machine_kexec at ffffffff91c76ec7 #1 [ffffad8f8702fa38] __crash_kexec at ffffffff91dea4fa #2 [ffffad8f8702faf8] crash_kexec at ffffffff91deb788 #3 [ffffad8f8702fb00] oops_end at ffffffff91c2e4bb #4 [ffffad8f8702fb20] do_trap at ffffffff91c2a4ce #5 [ffffad8f8702fb70] do_error_trap at ffffffff91c2a595 #6 [ffffad8f8702fbb0] exc_divide_error at ffffffff928506e6 #7 [ffffad8f8702fbd0] asm_exc_divide_error at ffffffff92a00926 [exception RIP: blk_stack_limits+434] RIP: ffffffff92191872 RSP: ffffad8f8702fc80 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff95efa0c91800 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001 RBP: 00000000ffffffff R8: ffff95fec7df35a8 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: ffff95fed33c09a8 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffffad8f8702fce0] nvme_update_ns_info_block at ffffffffc06d3533 [nvme_core] #9 [ffffad8f8702fd18] nvme_scan_ns at ffffffffc06d6fa7 [nvme_core] This happened when the check for valid data was moved out of nvme_identify_ns() into one of the callers. Fix this by checking in both callers. Link: https://bugzilla.kernel.org/show_bug.cgi?id=218186 Fixes: 0dd6fff2aad4 ("nvme: bring back auto-removal of deleted namespaces during sequential scan") Cc: stable@vger.kernel.org Signed-off-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27nvme-core: fix a memory leak in nvme_ns_info_from_identify()Maurizio Lombardi1-2/+5
In case of error, free the nvme_id_ns structure that was allocated by nvme_identify_ns(). Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27nvme: fine-tune sending of first keep-aliveMark O'Donovan1-2/+11
Keep-alive commands are sent half-way through the kato period. This normally works well but fails when the keep-alive system is started when we are more than half way through the kato. This can happen on larger setups or due to host delays. With this change we now time the initial keep-alive command from the controller initialisation time, rather than the keep-alive mechanism activation time. Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-22nvme: move nvme_stop_keep_alive() back to original positionHannes Reinecke1-1/+1
Stopping keep-alive not only stops the keep-alive workqueue, but also needs to be synchronized with I/O termination as we must not send a keep-alive command after all I/O had been terminated. So to avoid any regressions move the call to stop_keep_alive() back to its original position and ensure that keep-alive is correctly stopped failing to setup the admin queue. Fixes: 4733b65d82bd ("nvme: start keep-alive after admin queue setup") Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme: catch errors from nvme_configure_metadata()Hannes Reinecke1-6/+13
nvme_configure_metadata() is issuing I/O, so we might incur an I/O error which will cause the connection to be reset. But in that case any further probing will race with reset and cause UAF errors. So return a status from nvme_configure_metadata() and abort probing if there was an I/O error. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-08nvme: keyring: fix conditional compilationHannes Reinecke1-8/+1
The keyring and auth functions can be called from both the host and the target side and are controlled by Kconfig options for each of the combinations, but the declarations are controlled by #ifdef checks on the shared Kconfig symbols. This leads to link failures in combinations where one of the frontends is built-in and the other one is a module, and the keyring code ends up in a module that is not reachable from the builtin code: ld: drivers/nvme/host/core.o: in function `nvme_core_exit': core.c:(.exit.text+0x4): undefined reference to `nvme_keyring_exit' ld: drivers/nvme/host/core.o: in function `nvme_core_init': core.c:(.init.text+0x94): undefined reference to `nvme_keyring_init ld: drivers/nvme/host/tcp.o: in function `nvme_tcp_setup_ctrl': tcp.c:(.text+0x4c18): undefined reference to `nvme_tls_psk_default' Address this by moving nvme_keyring_init()/nvme_keyring_exit() into module init/exit functions for the keyring module. Fixes: be8e82caa6859 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Hannes Reinecke <hare@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme: start keep-alive after admin queue setupHannes Reinecke1-3/+3
Setting up I/O queues might take quite some time on larger and/or busy setups, so KATO might expire before all I/O queues could be set up. Fix this by start keep alive from the ->init_ctrl_finish() callback, and stopping it when calling nvme_cancel_admin_tagset(). Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Mark O'Donovan <shiftee@posteo.net> [fixed nvme-fc compile error] Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme: update firmware version after commitDaniel Wagner1-1/+14
The firmware version sysfs entry needs to be updated after a successfully firmware activation. nvme-cli stopped issuing an Identify Controller command to list the current firmware information and relies on sysfs showing the current firmware version. Reported-by: Kenji Tomonaga <tkenbo@gmail.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Tested-by: Kenji Tomonaga <tkenbo@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> [fixed off-by one afi index] Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-12nvme: rework NVME_AUTH Kconfig selectionHannes Reinecke1-1/+1
Having a single Kconfig symbol NVME_AUTH conflates the selection of the authentication functions from nvme/common and nvme/host, causing kbuild robot to complain when building the nvme target only. So introduce a Kconfig symbol NVME_HOST_AUTH for the nvme host bits and use NVME_AUTH for the common functions only. And move the CRYPTO selection into nvme/common to make it easier to read. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202310120733.TlPOVeJm-lkp@intel.com/ Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11nvme-tcp: enable TLS handshake upcallHannes Reinecke1-1/+1
Add a fabrics option 'tls' and start the TLS handshake upcall with the default PSK. When TLS is started the PSK key serial number is displayed in the sysfs attribute 'tls_key' Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11nvme-keyring: register '.nvme' keyringHannes Reinecke1-2/+8
Register a '.nvme' keyring to hold keys for TLS and DH-HMAC-CHAP and add a new config option NVME_KEYRING. We need a separate keyring for NVMe as the configuration is done via individual commands (eg for configfs), and the usual per-session or per-process keyrings can't be used. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-09-14Merge tag 'nvme-6.6-2023-09-14' of git://git.infradead.org/nvme into block-6.6Jens Axboe1-19/+35
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.6 - nvme-tcp iov len fix (Varun) - nvme-hwmon const qualifier for safety (Krzysztof) - nvme-fc null pointer checks (Nigel) - nvme-pci no numa node fix (Pratyush) - nvme timeout fix for non-compliant controllers (Keith)" * tag 'nvme-6.6-2023-09-14' of git://git.infradead.org/nvme: nvme: avoid bogus CRTO values nvme-pci: do not set the NUMA node of device if it has none nvme-fc: Prevent null pointer dereference in nvme_fc_io_getuuid() nvme: host: hwmon: constify pointers to hwmon_channel_info nvmet-tcp: pass iov_len instead of sg->length to bvec_set_page()
2023-09-14nvme: avoid bogus CRTO valuesKeith Busch1-19/+35
Some devices are reporting controller ready mode support, but return 0 for CRTO. These devices require a much higher time to ready than that, so they are failing to initialize after the driver starter preferring that value over CAP.TO. The spec requires that CAP.TO match the appropritate CRTO value, or be set to 0xff if CRTO is larger than that. This means that CAP.TO can be used to validate if CRTO is reliable, and provides an appropriate fallback for setting the timeout value if not. Use whichever is larger. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217863 Reported-by: Cláudio Sampaio <patola@gmail.com> Reported-by: Felix Yan <felixonmars@archlinux.org> Tested-by: Felix Yan <felixonmars@archlinux.org> Based-on-a-patch-by: Felix Yan <felixonmars@archlinux.org> Cc: stable@vger.kernel.org Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-21nvme: fix possible hang when removing a controller during error recoveryMing Lei1-3/+7
Error recovery can be interrupted by controller removal, then the controller is left as quiesced, and IO hang can be caused. Fix the issue by unquiescing controller unconditionally when removing namespaces. This way is reasonable and safe given forward progress can be made when removing namespaces. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reported-by: Chunguang Xu <brookxu.cn@gmail.com> Closes: https://lore.kernel.org/linux-nvme/cover.1685350577.git.chunguang.xu@shopee.com/ Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-13nvme: don't reject probe due to duplicate IDs for single-ported PCIe devicesChristoph Hellwig1-3/+33
While duplicate IDs are still very harmful, including the potential to easily see changing devices in /dev/disk/by-id, it turn out they are extremely common for cheap end user NVMe devices. Relax our check for them for so that it doesn't reject the probe on single-ported PCIe devices, but prints a big warning instead. In doubt we'd still like to see quirk entries to disable the potential for changing supposed stable device identifier links, but this will at least allow users how have two (or more) of these devices to use them without having to manually add a new PCI ID entry with the quirk through sysfs or by patching the kernel. Fixes: 2079f41ec6ff ("nvme: check that EUI/GUID/UUID are globally unique") Cc: stable@vger.kernel.org # 6.0+ Co-developed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-03Merge tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linuxLinus Torvalds1-1/+5
Pull more block updates from Jens Axboe: "Mostly items that came in a bit late for the initial pull request, wanted to make sure they had the appropriate amount of linux-next soak before going upstream. Outside of stragglers, just generic fixes for either merge window items, or longer standing bugs" * tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux: (25 commits) md/raid0: add discard support for the 'original' layout nvme: disable controller on reset state failure nvme: sync timeout work on failed reset nvme: ensure unquiesce on teardown cdrom/gdrom: Fix build error nvme: improved uring polling block: add request polling helper nvme-mpath: fix I/O failure with EAGAIN when failing over I/O nvme: host: fix command name spelling blk-sysfs: add a new attr_group for blk_mq blk-iocost: move wbt_enable/disable_default() out of spinlock blk-wbt: cleanup rwb_enabled() and wbt_disabled() blk-wbt: remove dead code to handle wbt enable/disable with io inflight blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled blk-mq: fix two misuses on RQF_USE_SCHED blk-throttle: Fix io statistics for cgroup v1 bcache: Fix bcache device claiming bcache: Alloc holder object before async registration raid10: avoid spin_lock from fastpath from raid10_unplug() md: fix 'delete_mutex' deadlock ...
2023-06-30Merge tag 'nvme-6.5-2023-06-30' of git://git.infradead.org/nvme into block-6.5Jens Axboe1-1/+5
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.5 - Reduce spamming kernel logs on repeated controller updates (Breno) - Improved struct packing (Christophe JAILLET) - Misspelled command name in error logging (Damien) - Failover fix for temporary frozen queue (Sagi) - Reset error handling fixes (Keith)" * tag 'nvme-6.5-2023-06-30' of git://git.infradead.org/nvme: nvme: disable controller on reset state failure nvme: sync timeout work on failed reset nvme: ensure unquiesce on teardown nvme-mpath: fix I/O failure with EAGAIN when failing over I/O nvme: host: fix command name spelling nvmet: Reorder fields in 'struct nvmet_ns' nvme: Print capabilities changes just once
2023-06-30Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-148/+1
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi, lpfc, qla2xxx). We have a couple of major core changes impacting other systems: - Command Duration Limits, which spills into block and ATA - block level Persistent Reservation Operations, which touches block, nvme, target and dm Both of these are added with merge commits containing a cover letter explaining what's going on" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits) scsi: core: Improve warning message in scsi_device_block() scsi: core: Replace scsi_target_block() with scsi_block_targets() scsi: core: Don't wait for quiesce in scsi_device_block() scsi: core: Don't wait for quiesce in scsi_stop_queue() scsi: core: Merge scsi_internal_device_block() and device_block() scsi: sg: Increase number of devices scsi: bsg: Increase number of devices scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue scsi: ufs: ufs-pci: Add support for Intel Arrow Lake scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT scsi: ufs: wb: Add explicit flush_threshold sysfs attribute scsi: ufs: ufs-qcom: Switch to the new ICE API scsi: ufs: dt-bindings: qcom: Add ICE phandle scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR scsi: ufs: core: Remove dedicated hwq for dev command scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes ...
2023-06-26Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds1-658/+14
Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner) - bcache updates via Coly: - Fix a race at init time (Mingzhe Zou) - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye) - use page pinning in the block layer for dio (David) - convert old block dio code to page pinning (David, Christoph) - cleanups for pktcdvd (Andy) - cleanups for rnbd (Guoqing) - use the unchecked __bio_add_page() for the initial single page additions (Johannes) - fix overflows in the Amiga partition handling code (Michael) - improve mq-deadline zoned device support (Bart) - keep passthrough requests out of the IO schedulers (Christoph, Ming) - improve support for flush requests, making them less special to deal with (Christoph) - add bdev holder ops and shutdown methods (Christoph) - fix the name_to_dev_t() situation and use cases (Christoph) - decouple the block open flags from fmode_t (Christoph) - ublk updates and cleanups, including adding user copy support (Ming) - BFQ sanity checking (Bart) - convert brd from radix to xarray (Pankaj) - constify various structures (Thomas, Ivan) - more fine grained persistent reservation ioctl capability checks (Jingbo) - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman) * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits) scsi/sg: don't grab scsi host module reference ext4: Fix warning in blkdev_put() block: don't return -EINVAL for not found names in devt_from_devname cdrom: Fix spectre-v1 gadget block: Improve kernel-doc headers blk-mq: don't insert passthrough request into sw queue bsg: make bsg_class a static const structure ublk: make ublk_chr_class a static const structure aoe: make aoe_class a static const structure block/rnbd: make all 'class' structures const block: fix the exclusive open mask in disk_scan_partitions block: add overflow checks for Amiga partition support block: change all __u32 annotations to __be32 in affs_hardblocks.h block: fix signed int overflow in Amiga partition support block: add capacity validation in bdev_add_partition() block: fine-granular CAP_SYS_ADMIN for Persistent Reservation block: disallow Persistent Reservation on partitions reiserfs: fix blkdev_put() warning from release_journal_dev() block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() block: document the holder argument to blkdev_get_by_path ...
2023-06-21nvme: Print capabilities changes just onceBreno Leitao1-1/+5
This current dev_info() could be very verbose and being printed very frequently depending on some userspace application sending some specific commands. Just print this message once and skip it until the controller resets. Use a controller flag (NVME_CTRL_DIRTY_CAPABILITY) to track if the capability needs a reset. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme: skip optional id ctrl csi if it failedKeith Busch1-1/+4
A frequently recieved report is the driver requests the optional Command Set Specific Identify Controller structure. Some controllers report this in their error log, which tiggers other warnings to user space monitoring the devices. These error entries are harmless and of questionable value to save in the log, but let's reduce their occurance by not resending the command if it previously failed. This will not prevent the errors on the initial module load, but will greatly reduce their occurance on any rescans and resumes from suspend. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217445 Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme-core: use nvme_ns_head_multipath instead of ns->head->diskIrvin Cote1-1/+1
Change the way we check for a multipath nshead so as to consistently use the same check to assert the same condition. Signed-off-by: Irvin Cote <irvincoteg@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme: Increase block size variable size to 32-bitDaniel Gomez1-1/+1
Increase block size variable size to 32-bit unsigned to be able to support block devices larger than 32k (starting from 64 KiB). Physical and logical block size already support unsigned 32-bit. Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme-core: remove redundant check from nvme_init_ns_headIrvin Cote1-1/+1
nvme_find_ns_head already checks that the list of namescpaces in an already existing namespace head is not empty Signed-off-by: Irvin Cote <irvincoteg@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme: move sysfs code to a dedicated sysfs.c fileMax Gurtovoy1-654/+2
The core.c file became long and hard to maintain. Create a dedicated file to centralize the sysfs functionality. This is a common practice to separate sysfs/configfs related logic from the main driver logic .c file. For example, in the nvmet module the configfs interface has its own dedicated file. This patch does not include any functional changes. Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> [merged dhchap memleak fixes, include nvme-auth.h] Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme-core: fix dev_pm_qos memleakChaitanya Kulkarni1-0/+1
Call dev_pm_qos_hide_latency_tolerance() in the error unwind patch to avoid following kmemleak:- blktests (master) # kmemleak-clear; ./check nvme/044; blktests (master) # kmemleak-scan ; kmemleak-show nvme/044 (Test bi-directional authentication) [passed] runtime 2.111s ... 2.124s unreferenced object 0xffff888110c46240 (size 96): comm "nvme", pid 33461, jiffies 4345365353 (age 75.586s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<0000000069ac2cec>] kmalloc_trace+0x25/0x90 [<000000006acc66d5>] dev_pm_qos_update_user_latency_tolerance+0x6f/0x100 [<00000000cc376ea7>] nvme_init_ctrl+0x38e/0x410 [nvme_core] [<000000007df61b4b>] 0xffffffffc05e88b3 [<00000000d152b985>] 0xffffffffc05744cb [<00000000f04a4041>] vfs_write+0xc5/0x3c0 [<00000000f9491baf>] ksys_write+0x5f/0xe0 [<000000001c46513d>] do_syscall_64+0x3b/0x90 [<00000000ecf348fe>] entry_SYSCALL_64_after_hwframe+0x72/0xdc Link: https://lore.kernel.org/linux-nvme/CAHj4cs-nDaKzMx2txO4dbE+Mz9ePwLtU0e3egz+StmzOUgWUrA@mail.gmail.com/ Fixes: f50fff73d620 ("nvme: implement In-Band authentication") Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme-core: add missing fault-injection cleanupChaitanya Kulkarni1-0/+1
Add missing fault-injection cleanup in nvme_init_ctrl() in the error unwind path that also fixes following message for blktests:- linux-block (for-next) # grep debugfs debugfs-err.log [ 147.853464] debugfs: Directory 'nvme1' with parent '/' already present! [ 147.853973] nvme1: failed to create debugfs attr [ 148.802490] debugfs: Directory 'nvme1' with parent '/' already present! [ 148.803244] nvme1: failed to create debugfs attr [ 148.877304] debugfs: Directory 'nvme1' with parent '/' already present! [ 148.877775] nvme1: failed to create debugfs attr [ 149.816652] debugfs: Directory 'nvme1' with parent '/' already present! [ 149.818011] nvme1: failed to create debugfs attr Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme-core: fix memory leak in dhchap_ctrl_secretChaitanya Kulkarni1-2/+5
Free dhchap_secret in nvme_ctrl_dhchap_ctrl_secret_store() before we return when nvme_auth_generate_key() returns error. Fixes: f50fff73d620 ("nvme: implement In-Band authentication") Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12nvme-core: fix memory leak in dhchap_secret_storeChaitanya Kulkarni1-2/+5
Free dhchap_secret in nvme_ctrl_dhchap_secret_store() before we return fix following kmemleack:- unreferenced object 0xffff8886376ea800 (size 64): comm "check", pid 22048, jiffies 4344316705 (age 92.199s) hex dump (first 32 bytes): 44 48 48 43 2d 31 3a 30 30 3a 6e 78 72 35 4b 67 DHHC-1:00:nxr5Kg 75 58 34 75 6f 41 78 73 4a 61 34 63 2f 68 75 4c uX4uoAxsJa4c/huL backtrace: [<0000000030ce5d4b>] __kmalloc+0x4b/0x130 [<000000009be1cdc1>] nvme_ctrl_dhchap_secret_store+0x8f/0x160 [nvme_core] [<00000000ac06c96a>] kernfs_fop_write_iter+0x12b/0x1c0 [<00000000437e7ced>] vfs_write+0x2ba/0x3c0 [<00000000f9491baf>] ksys_write+0x5f/0xe0 [<000000001c46513d>] do_syscall_64+0x3b/0x90 [<00000000ecf348fe>] entry_SYSCALL_64_after_hwframe+0x72/0xdc unreferenced object 0xffff8886376eaf00 (size 64): comm "check", pid 22048, jiffies 4344316736 (age 92.168s) hex dump (first 32 bytes): 44 48 48 43 2d 31 3a 30 30 3a 6e 78 72 35 4b 67 DHHC-1:00:nxr5Kg 75 58 34 75 6f 41 78 73 4a 61 34 63 2f 68 75 4c uX4uoAxsJa4c/huL backtrace: [<0000000030ce5d4b>] __kmalloc+0x4b/0x130 [<000000009be1cdc1>] nvme_ctrl_dhchap_secret_store+0x8f/0x160 [nvme_core] [<00000000ac06c96a>] kernfs_fop_write_iter+0x12b/0x1c0 [<00000000437e7ced>] vfs_write+0x2ba/0x3c0 [<00000000f9491baf>] ksys_write+0x5f/0xe0 [<000000001c46513d>] do_syscall_64+0x3b/0x90 [<00000000ecf348fe>] entry_SYSCALL_64_after_hwframe+0x72/0xdc Fixes: f50fff73d620 ("nvme: implement In-Band authentication") Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-12block: replace fmode_t with a block-specific type for block open flagsChristoph Hellwig1-1/+1
The only overlap between the block open flags mapped into the fmode_t and other uses of fmode_t are FMODE_READ and FMODE_WRITE. Define a new blk_mode_t instead for use in blkdev_get_by_{dev,path}, ->open and ->ioctl and stop abusing fmode_t. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-28-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: remove the unused mode argument to ->releaseChristoph Hellwig1-1/+1
The mode argument to the ->release block_device_operation is never used, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: pass a gendisk to ->openChristoph Hellwig1-2/+2
->open is only called on the whole device. Make that explicit by passing a gendisk instead of the block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-30nvme: improve handling of long keep alivesUday Shankar1-1/+15
Upon keep alive completion, nvme_keep_alive_work is scheduled with the same delay every time. If keep alive commands are completing slowly, this may cause a keep alive timeout. The following trace illustrates the issue, taking KATO = 8 and TBKAS off for simplicity: 1. t = 0: run nvme_keep_alive_work, send keep alive 2. t = ε: keep alive reaches controller, controller restarts its keep alive timer 3. t = 4: host receives keep alive completion, schedules nvme_keep_alive_work with delay 4 4. t = 8: run nvme_keep_alive_work, send keep alive Here, a keep alive having RTT of 4 causes a delay of at least 8 - ε between the controller receiving successive keep alives. With ε small, the controller is likely to detect a keep alive timeout. Fix this by calculating the RTT of the keep alive command, and adjusting the scheduling delay of the next keep alive work accordingly. Reported-by: Costa Sapuntzakis <costa@purestorage.com> Reported-by: Randy Jennings <randyj@purestorage.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-30nvme: check IO start time when deciding to defer KAUday Shankar1-1/+13
When a command completes, we set a flag which will skip sending a keep alive at the next run of nvme_keep_alive_work when TBKAS is on. However, if the command was submitted long ago, it's possible that the controller may have also restarted its keep alive timer (as a result of receiving the command) long ago. The following trace demonstrates the issue, assuming TBKAS is on and KATO = 8 for simplicity: 1. t = 0: submit I/O commands A, B, C, D, E 2. t = 0.5: commands A, B, C, D, E reach controller, restart its keep alive timer 3. t = 1: A completes 4. t = 2: run nvme_keep_alive_work, see recent completion, do nothing 5. t = 3: B completes 6. t = 4: run nvme_keep_alive_work, see recent completion, do nothing 7. t = 5: C completes 8. t = 6: run nvme_keep_alive_work, see recent completion, do nothing 9. t = 7: D completes 10. t = 8: run nvme_keep_alive_work, see recent completion, do nothing 11. t = 9: E completes At this point, 8.5 seconds have passed without restarting the controller's keep alive timer, so the controller will detect a keep alive timeout. Fix this by checking the IO start time when deciding to defer sending a keep alive command. Only set comp_seen if the command started after the most recent run of nvme_keep_alive_work. With this change, the completions of B, C, and D will not set comp_seen and the run of nvme_keep_alive_work at t = 4 will send a keep alive. Reported-by: Costa Sapuntzakis <costa@purestorage.com> Reported-by: Randy Jennings <randyj@purestorage.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-30nvme: double KA polling frequency to avoid KATO with TBKAS onUday Shankar1-1/+17
With TBKAS on, the completion of one command can defer sending a keep alive for up to twice the delay between successive runs of nvme_keep_alive_work. The current delay of KATO / 2 thus makes it possible for one command to defer sending a keep alive for up to KATO, which can result in the controller detecting a KATO. The following trace demonstrates the issue, taking KATO = 8 for simplicity: 1. t = 0: run nvme_keep_alive_work, no keep-alive sent 2. t = ε: I/O completion seen, set comp_seen = true 3. t = 4: run nvme_keep_alive_work, see comp_seen == true, skip sending keep-alive, set comp_seen = false 4. t = 8: run nvme_keep_alive_work, see comp_seen == false, send a keep-alive command. Here, there is a delay of 8 - ε between receiving a command completion and sending the next command. With ε small, the controller is likely to detect a keep alive timeout. Fix this by running nvme_keep_alive_work with a delay of KATO / 4 whenever TBKAS is on. Going through the above trace now gives us a worst-case delay of 4 - ε, which is in line with the recommendation of sending a command every KATO / 2 in the NVMe specification. Reported-by: Costa Sapuntzakis <costa@purestorage.com> Reported-by: Randy Jennings <randyj@purestorage.com> Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-30nvme: fix miss command type checkmin15.li1-1/+3
In the function nvme_passthru_end(), only the value of the command opcode is checked, without checking the command type (IO command or Admin command). When we send a Dataset Management command (The opcode of the Dataset Management command is the same as the Set Feature command), kernel thinks it is a set feature command, then sets the controller's keep alive interval, and calls nvme_keep_alive_work(). Signed-off-by: min15.li <min15.li@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-05-22Merge patch series "Use block pr_ops in LIO"Martin K. Petersen1-148/+1
Mike Christie <michael.christie@oracle.com> says: The patches in this thread allow us to use the block pr_ops with LIO's target_core_iblock module to support cluster applications in VMs. They were built over Linus's tree. They also apply over linux-next and Martin's tree and Jens's trees. Currently, to use windows clustering or linux clustering (pacemaker + cluster labs scsi fence agents) in VMs with LIO and vhost-scsi, you have to use tcmu or pscsi or use a cluster aware FS/framework for the LIO pr file. Setting up a cluster FS/framework is pain and waste when your real backend device is already a distributed device, and pscsi and tcmu are nice for specific use cases, but iblock gives you the best performance and allows you to use stacked devices like dm-multipath. So these patches allow iblock to work like pscsi/tcmu where they can pass a PR command to the backend module. And then iblock will use the pr_ops to pass the PR command to the real devices similar to what we do for unmap today. The patches are separated in the following groups: Patch 1 - 2: - Add block layer callouts for reading reservations and rename reservation error code. Patch 3 - 5: - SCSI support for new callouts. Patch 6: - DM support for new callouts. Patch 7 - 13: - NVMe support for new callouts. Patch 14 - 18: - LIO support for new callouts. This patchset has been tested with the libiscsi PGR ops and with window's failover cluster verification test. Note that for scsi backend devices we need this patchset: https://lore.kernel.org/linux-scsi/20230123221046.125483-1-michael.christie@oracle.com/T/#m4834a643ffb5bac2529d65d40906d3cfbdd9b1b7 to handle UAs. To reduce the size of this patchset that's being done separately to make reviewing easier. And to make merging easier this patchset and the one above do not have any conflicts so can be merged in different trees. Link: https://lore.kernel.org/r/20230407200551.12660-1-michael.christie@oracle.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-18Merge tag 'nvme-6.4-2023-05-18' of git://git.infradead.org/nvme into block-6.4Jens Axboe1-1/+5
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.4 - More device quirks (Sagi, Hristo, Adrian, Daniel) - Controller delete race (Maurizo) - Multipath cleanup fix (Christoph)" * tag 'nvme-6.4-2023-05-18' of git://git.infradead.org/nvme: nvme-pci: Add quirk for Teamgroup MP33 SSD nvme: do not let the user delete a ctrl before a complete initialization nvme-multipath: don't call blk_mark_disk_dead in nvme_mpath_remove_disk nvme-pci: clamp max_hw_sectors based on DMA optimized limitation nvme-pci: add quirk for missing secondary temperature thresholds nvme-pci: add NVME_QUIRK_BOGUS_NID for HS-SSD-FUTURE 2048G
2023-05-17nvme: do not let the user delete a ctrl before a complete initializationMaurizio Lombardi1-1/+5
If a userspace application performes a "delete_controller" command early during the ctrl initialization, the delete operation may race against the init code and the kernel will crash. nvme nvme5: Connect command failed: host path error nvme nvme5: failed to connect queue: 0 ret=880 PF: supervisor write access in kernel mode PF: error_code(0x0002) - not-present page blk_mq_quiesce_queue+0x18/0x90 nvme_tcp_delete_ctrl+0x24/0x40 [nvme_tcp] nvme_do_delete_ctrl+0x7f/0x8b [nvme_core] nvme_sysfs_delete.cold+0x8/0xd [nvme_core] kernfs_fop_write_iter+0x124/0x1b0 new_sync_write+0xff/0x190 vfs_write+0x1ef/0x280 Fix the crash by checking the NVME_CTRL_STARTED_ONCE bit; if it's not set it means that the nvme controller is still in the process of getting initialized and the kernel will return an -EBUSY error to userspace. Set the NVME_CTRL_STARTED_ONCE later in the nvme_start_ctrl() function, after the controller start operation is completed. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>