aboutsummaryrefslogtreecommitdiffstats
path: root/drivers (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-05-20vfio/mlx5: Enable the DMA link APILeon Romanovsky3-203/+143
Remove intermediate scatter-gather table completely and enable new DMA link API. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/f71638d50c9c79a462f2e0423501b1de77617656.1747747694.git.leon@kernel.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-20vfio/mlx5: Rewrite create mkey flow to allow better code reuseLeon Romanovsky2-70/+91
Change the creation of mkey to be performed in multiple steps: data allocation, DMA setup and actual call to HW to create that mkey. In this new flow, the whole input to MKEY command is saved to eliminate the need to keep array of pointers for DMA addresses for receive list and in the future patches for send list too. In addition to memory size reduce and elimination of unnecessary data movements to set MKEY input, the code is prepared for future reuse. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/d4ad0384fbd1e23a607cbbe9e5756748f3a761d9.1747747694.git.leon@kernel.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-20vfio/mlx5: Explicitly use number of pages instead of allocated lengthLeon Romanovsky3-41/+57
allocated_length is a multiple of page size and number of pages, so let's change the functions to accept number of pages. This improves code readability, simplifies buffer handling, and enables combining DMA send/receive operations, as will be introduced in the next patches. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/76f39993d2ca0311b3bcfe56038a669d03926815.1747747694.git.leon@kernel.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-20Merge branch 'dma-mapping-for-6.16-two-step-api' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux into v6.16/vfio/nextAlex Williamson3-125/+479
Merge two step DMA mapping API as basis for mlx5-vfio-pci uses. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-20iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()Rob Clark1-13/+14
In situations where mapping/unmapping sequence can be controlled by userspace, attempting to map over a region that has not yet been unmapped is an error. But not something that should spam dmesg. Now that there is a quirk, we can also drop the selftest_running flag, and use the quirk instead for selftests. Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Rob Clark <robdclark@chromium.org> Link: https://lore.kernel.org/r/20250519175348.11924-6-robdclark@gmail.com [will: Rename quirk to IO_PGTABLE_QUIRK_NO_WARN per Robin's suggestion] Signed-off-by: Will Deacon <will@kernel.org>
2025-05-20octeontx2-pf: Add tracepoint for NIX_PARSE_SSubbaraya Sundeep3-1/+32
The NIX_PARSE_S structure populated by hardware in the NIX RX CQE has parsing information for the received packet. A tracepoint to dump the all words of NIX_PARSE_S is helpful in debugging packet parser. Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Link: https://patch.msgid.link/1747331048-15347-1-git-send-email-sbhatta@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-20net: phy: make mdio consumer / device layer a separate moduleHeiner Kallweit6-45/+17
After having factored out the provider part from mdio_bus.c, we can make the mdio consumer / device layer a separate module. This also allows to remove Kconfig symbol MDIO_DEVICE. The module init / exit functions from mdio_bus.c no longer have to be called from phy_device.c. The link order defined in drivers/net/phy/Makefile ensures that init / exit functions are called in the right order. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/dba6b156-5748-44ce-b5e2-e8dc2fcee5a7@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-20platform/x86: think-lmi: Fix attribute name usage for non-compliant itemsMark Pearson2-12/+15
A few, quite rare, WMI attributes have names that are not compatible with filenames, e.g. "Intel VT for Directed I/O (VT-d)". For these cases the '/' gets replaced with '\' for display, but doesn't get switched again when doing the WMI access. Fix this by keeping the original attribute name and using that for sending commands to the BIOS Fixes: a40cd7ef22fb ("platform/x86: think-lmi: Add WMI interface support on Lenovo platforms") Signed-off-by: Mark Pearson <mpearson-lenovo@squebb.ca> Link: https://lore.kernel.org/r/20250520005027.3840705-1-mpearson-lenovo@squebb.ca Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-05-20platform/x86: thinkpad_acpi: Ignore battery threshold change event notificationMark Pearson1-0/+5
If user modifies the battery charge threshold an ACPI event is generated. Confirmed with Lenovo FW team this is only generated on user event. As no action is needed, ignore the event and prevent spurious kernel logs. Reported-by: Derek Barbosa <debarbos@redhat.com> Closes: https://lore.kernel.org/platform-driver-x86/7e9a1c47-5d9c-4978-af20-3949d53fb5dc@app.fastmail.com/T/#m5f5b9ae31d3fbf30d7d9a9d76c15fb3502dfd903 Signed-off-by: Mark Pearson <mpearson-lenovo@squebb.ca> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Armin Wolf <W_Armin@gmx.de> Link: https://lore.kernel.org/r/20250517023348.2962591-1-mpearson-lenovo@squebb.ca Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-05-20spi: sh-msiof: Transfer size improvements and I2SMark Brown1-223/+134
Merge series from Geert Uytterhoeven <geert+renesas@glider.be>: This patch series (A) improves single transfer sizes in the MSIOF driver, using two methods: - By increasing the assumed FIFO sizes, impacting both PIO and DMA transfers, - By using two groups, impacting DMA transfers, and (B) lets the recently-introduced MSIOF I2S drive reuse the SPI driver's register definitions. All of this is covered with a thick sauce of fixes for (harmless) bugs, cleanups, and refactorings. Note that the driver uses the limitations as specified in the hardware documentation. For discovering the actual FIFO sizes, I wrote some crude test code that can be found at [2]. This is based on spi/for-next and sound-asoc/for-next, and has been tested on a variery of R-Car SoCs. [1] https://lore.kernel.org/cover.1746180072.git.geert+renesas@glider.be [2] https://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git/log/?h=topic/msiof-fifo
2025-05-20Add sound card support for QCS9100 and QCS9075Mark Brown129-584/+1286
Merge series from Mohammad Rafi Shaik <mohammad.rafi.shaik@oss.qualcomm.com>: This patchset adds support for sound card on Qualcomm QCS9100 and QCS9075 boards.
2025-05-20regmap: Move selecting for REGMAP_MDIO and REGMAP_IRQAndrew Davis1-2/+2
If either REGMAP_IRQ or REGMAP_MDIO are set then REGMAP is also set. This then enables the selecting of IRQ_DOMAIN or MDIO_BUS from REGMAP based on the above two symbols respectively. This makes it very easy to end up with "circular dependencies". Instead select the IRQ_DOMAIN or MDIO_BUS from the symbols that make use of them. This is almost equivalent to before but makes it less likely to end up with false circular dependency detections. Signed-off-by: Andrew Davis <afd@ti.com> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Closes: https://lore.kernel.org/r/bfe991fa-f54c-4d58-b2e0-34c4e4eb48f4@linaro.org/ Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://patch.msgid.link/20250516141722.13772-1-afd@ti.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-05-20i2c: core: add useful info when defer probeXu Yang1-2/+2
Add an useful info when failed to get irq/wakeirq due to -EPROBE_DEFER. Before: [ 15.737361] i2c 2-0050: deferred probe pending: (reason unknown) After: [ 15.816295] i2c 2-0050: deferred probe pending: tcpci: can't get irq Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Reviewed-by: Carlos Song <carlos.song@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2025-05-20ata: libata-eh: Keep DIPM disabled while modifying the allowed LPM statesNiklas Cassel1-2/+1
Currently, it is possible that LPM is enabled while calling the set_lpm() callback. The current code performs a SET FEATURES command to disable DIPM if policy < ATA_LPM_MED_POWER_WITH_DIPM, this means that it will currently disable DIPM for policies: ATA_LPM_UNKNOWN, ATA_LPM_MAX_POWER, ATA_LPM_MED_POWER (but not for policy ATA_LPM_MED_POWER_WITH_DIPM). The code called after calling the set_lpm() callback will later perform a SET FEATURES command to enable DIPM, if policy >= ATA_LPM_MED_POWER_WITH_DIPM. As we can see DIPM will not be disabled before calling set_lpm() if the LPM policy is: ATA_LPM_MED_POWER_WITH_DIPM, ATA_LPM_MIN_POWER_WITH_PARTIAL, or ATA_LPM_MIN_POWER. Make sure that we always disable DIPM before calling the set_lpm() callback. This is because the set_lpm() callback is the function (for AHCI) that sets the proper bits in PxSCTL.IPM, reflecting the support of the HBA. PxSCTL.IPM controls the LPM states that the device is allowed to enter. If the device tries to enter a state disabled by PxSCTL.IPM, the host will NAK the transition. If we do not disable DIPM before modifying PxSCTL.IPM, it is possible that DIPM will try (and will be allowed to) enter a LPM state that the HBA does not support (since we have not yet written PxSCTL.IPM, the HBA wasn't able to NAK the transition). While at it, remove the guard of host support for DIPM around the disabling of DIPM. While it makes sense to take host support for DIPM into account when enabling DIPM, it makes zero sense to take host support into account when disabling DIPM. If the host does not support DIPM, that is an even bigger reason why DIPM should be disabled on the device side. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-05-20ata: libata-eh: Rename no_dipm variable to be more clearNiklas Cassel1-4/+5
Rename the no_dipm variable to host_has_dipm, by inverting the expression, and and also having a clearer name. No functional change. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-05-20ata: libata-eh: Rename hipm and dipm variablesNiklas Cassel1-6/+9
Rename the hipm and dipm variables to have a clearer name. Also fold in the usage of no_dipm, as that is required in order to give the dipm variable a more descriptive name. No functional change. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-05-20ata: libata-eh: Add ata_eh_set_lpm() WARN_ON_ONCENiklas Cassel1-0/+7
link->lpm_policy is initialized to ATA_LPM_UNKNOWN in ata_eh_reset(). ata_eh_set_lpm() is then only called if link->lpm_policy != ap->target_lpm_policy (after reset) and then only if link->lpm_policy > ATA_LPM_MAX_POWER (before revalidation). This means that ata_eh_set_lpm() is currently never called with policy == ATA_LPM_UNKNOWN. Add a WARN_ON_ONCE so that it is more obvious from reading the code that this function is never called with policy == ATA_LPM_UNKNOWN. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-05-20ata: libata-eh: Update DIPM comments to reflect realityNiklas Cassel1-5/+8
The comments describing which LPM policies that has DIPM enabled predates the introduction of the LPM policies ATA_LPM_MIN_POWER_WITH_PARTIAL and ATA_LPM_MED_POWER_WITH_DIPM. Update the DIPM comments to reflect reality. Also remove the sentence that claims that "Order device and link configurations such that the host always allows DIPM requests." This comment is written before 24e0e61db3cb ("ata: libata: disallow dev-initiated LPM transitions to unsupported states"). Even though the set_lpm() call is done before enabling DIPM, the host will not always allow DIPM requests. For all LPM polcies where DIPM is enabled, only DIPM requests to LPM states that are supported by the HBA will be allowed. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-05-20PCI: Remove hybrid-devres usage warnings from kernel-docPhilipp Stanner1-16/+0
pci/iomap.c still contains warnings about those functions not behaving in a managed manner if pcim_enable_device() was called. Since all hybrid behavior that users could know about has been removed by now, those explicit warnings are no longer necessary. Remove the hybrid-devres usage warnings from the kernel-doc. Signed-off-by: Philipp Stanner <phasta@kernel.org> [kwilczynski: commit log] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/r/20250519112959.25487-8-phasta@kernel.org
2025-05-20PCI: Remove redundant set of request functionsPhilipp Stanner1-105/+5
When the demangling of the hybrid devres functions within PCI was implemented, it was necessary to implement several PCI functions a second time to avoid cyclic calls, since the hybrid functions in pci.c call the managed functions in devres.c, which in turn can be directly used outside of PCI and needed request infrastructure, too. Therefore, __pcim_request_region_range(), __pci_release_region_range() and wrappers around them were implemented. The hybrid nature has recently been removed from all functions in pci.c. Therefore, the functions in devres.c can now directly use their counterparts in pci.c without causing a call-cycle. Remove __pcim_request_region_range(), __pcim_request_region_range() and the wrappers. Use the corresponding request functions from pci.c in devres.c Signed-off-by: Philipp Stanner <phasta@kernel.org> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/r/20250519112959.25487-7-phasta@kernel.org
2025-05-20PCI: Remove exclusive requests flags from _pcim_request_region()Philipp Stanner1-21/+15
pcim_request_region_exclusive(), the only user in PCI devres that needed exclusive region requests, has been removed. All features related to exclusive requests can, therefore, be removed, too. Remove them. Signed-off-by: Philipp Stanner <phasta@kernel.org> [kwilczynski: commit log] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250519112959.25487-6-phasta@kernel.org
2025-05-20gpiolib: remove unneeded #ifdefBartosz Golaszewski1-2/+0
We are already within another `#ifdef CONFIG_GPIOLIB_IRQCHIP` in gpiochip_to_irq() so there's no need for another guard. Remove it. Acked-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20250519-gpio-irq-kconfig-fixes-v1-3-fe6ba1c6116d@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-05-20gpio: mpc8xxx: select GPIOLIB_IRQCHIPBartosz Golaszewski1-1/+1
This driver uses gpiochip_irq_reqres() and gpiochip_irq_relres() which are only built with GPIOLIB_IRQCHIP=y. Add the missing Kconfig select. Fixes: 7688a54d5b53 ("gpio: mpc8xxx: Make irq_chip immutable") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505180309.1nosQMkI-lkp@intel.com/ Acked-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20250519-gpio-irq-kconfig-fixes-v1-2-fe6ba1c6116d@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-05-20gpio: pxa: select GPIOLIB_IRQCHIPBartosz Golaszewski1-0/+1
This driver uses gpiochip_irq_reqres() and gpiochip_irq_relres() which are only built with GPIOLIB_IRQCHIP=y. Add the missing Kconfig select. Fixes: 20117cf426b6 ("gpio: pxa: Make irq_chip immutable") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505181429.mzyIatOU-lkp@intel.com/ Acked-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20250519-gpio-irq-kconfig-fixes-v1-1-fe6ba1c6116d@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-05-20cpufreq: scmi: Skip SCMI devices that aren't used by the CPUsMike Tipton1-1/+35
Currently, all SCMI devices with performance domains attempt to register a cpufreq driver, even if their performance domains aren't used to control the CPUs. The cpufreq framework only supports registering a single driver, so only the first device will succeed. And if that device isn't used for the CPUs, then cpufreq will scale the wrong domains. To avoid this, return early from scmi_cpufreq_probe() if the probing SCMI device isn't referenced by the CPU device phandles. This keeps the existing assumption that all CPUs are controlled by a single SCMI device. Signed-off-by: Mike Tipton <quic_mdtipton@quicinc.com> Reviewed-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Cristian Marussi <cristian.marussi@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2025-05-20Merge branch 'rust/cpufreq-dt' into cpufreq/arm/linux-nextViresh Kumar6-230/+408
2025-05-20cpufreq: Add Rust-based cpufreq-dt driverViresh Kumar3-0/+239
Introduce a Rust-based implementation of the cpufreq-dt driver, covering most of the functionality provided by the existing C version. Some features, such as retrieving platform data from `cpufreq-dt-platdev.c`, are still pending. The driver has been tested with QEMU, and frequency scaling works as expected. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2025-05-19hwmon: (isl28022) Fix current reading calculationYikai Tsai1-2/+4
According to the ISL28022 datasheet, bit15 of the current register is representing -32768. Fix the calculation to properly handle this bit, ensuring correct measurements for negative values. Signed-off-by: Yikai Tsai <yikai.tsai.wiwynn@gmail.com> Link: https://lore.kernel.org/r/20250519084055.3787-2-yikai.tsai.wiwynn@gmail.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2025-05-20nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_diskNilay Shroff3-14/+14
In the NVMe context, the term "shutdown" has a specific technical meaning. To avoid confusion, this commit renames the nvme_mpath_ shutdown_disk function to nvme_mpath_remove_disk to better reflect its purpose (i.e. removing the disk from the system). However, nvme_mpath_remove_disk was already in use, and its functionality is related to releasing or putting the head node disk. To resolve this naming conflict and improve clarity, the existing nvme_mpath_ remove_disk function is also renamed to nvme_mpath_put_disk. This renaming improves code readability and better aligns function names with their actual roles. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme: introduce multipath_always_on module paramNilay Shroff1-6/+66
Currently, a multipath head disk node is not created for single- ported NVMe adapters or private namespaces with non-unique NSID. However, creating a head node in these cases can help transparently handle transient PCIe link failures. Without a head node, features like delayed removal cannot be leveraged, making it difficult to tolerate such link failures. To address this, this commit introduces nvme_core module parameter multipath_always_on. When multipath_always_on is set to true, it forces the creation of a multipath head node regardless NVMe disk or namespace type. So this option allows the use of delayed removal of head node functionality even for single-ported NVMe disks and private namespaces with a unique NSID and thus helps transparently handle transient PCIe link failures. By default multipath_always_on is set to false, thus preserving the existing behavior. Setting it to true enables improved fault tolerance in PCIe setups. Moreover, please note that enabling this option would also implicitly enable nvme_core.multipath. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme-multipath: introduce delayed removal of the multipath head nodeNilay Shroff4-14/+147
Currently, the multipath head node of an NVMe disk is removed immediately as soon as all paths of the disk are removed. However, this can cause issues in scenarios where: - The disk hot-removal followed by re-addition. - Transient PCIe link failures that trigger re-enumeration, temporarily removing and then restoring the disk. In these cases, removing the head node prematurely may lead to a head disk node name change upon re-addition, requiring applications to reopen their handles if they were performing I/O during the failure. To address this, introduce a delayed removal mechanism of head disk node. During transient failure, instead of immediate removal of head disk node, the system waits for a configurable timeout, allowing the disk to recover. During transient disk failure, if application sends any IO then we queue it instead of failing such IO immediately. If the disk comes back online within the timeout, the queued IOs are resubmitted to the disk ensuring seamless operation. In case disk couldn't recover from the failure then queued IOs are failed to its completion and application receives the error. So this way, if disk comes back online within the configured period, the head node remains unchanged, ensuring uninterrupted workloads without requiring applications to reopen device handles. A new sysfs attribute, named "delayed_removal_secs" is added under head disk blkdev for user who wish to configure time for the delayed removal of head disk node. The default value of this attribute is set to zero second ensuring no behavior change unless explicitly configured. Link: https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/ Link: https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/ Suggested-by: Keith Busch <kbusch@kernel.org> Suggested-by: Christoph Hellwig <hch@infradead.org> [nilay: reworked based on the original idea/POC from Christoph and Keith] Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme-pci: derive and better document max segments limitsChristoph Hellwig1-6/+16
Redefine the max segments and max integrity limits based on the limiting factors. This keeps exactly the same values for 4k PAGE_SIZE systems, but increases the number of segments for larger page size as it properly derives the scatterlist allocation based limit for them instead of assuming a 4k PAGE_SIZE. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>
2025-05-20nvme-pci: use struct_size for allocation struct nvme_devChristoph Hellwig1-2/+2
This avoids open coding the variable size array arithmetics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20nvme-pci: add a symolic name for the small pool sizeLeon Romanovsky1-6/+8
Open coding magic numbers in multiple places is never a good idea. Signed-off-by: Leon Romanovsky <leon@kernel.org> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
2025-05-20nvme-pci: use a better encoding for small prp pool allocationsChristoph Hellwig1-43/+39
Add a separate flag to encode that the transfer is using the small page sized pool, and use a normal 0..n count for the number of descriptors. Contains improvements and suggestions from Kanchan Joshi <joshi.k@samsung.com> and Leon Romanovsky <leon@kernel.org>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20nvme-pci: rename the descriptor poolsChristoph Hellwig1-42/+42
They are used for both PRPs and SGLs, and we use descriptor elsewhere when referring to their allocations, so use that name here as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20nvme-pci: remove struct nvme_descriptorChristoph Hellwig1-33/+24
There is no real point in having a union of two pointer types here, just use a void pointer as we mix and match types between the arms of the union between the allocation and freeing side already. Also rename the nr_allocations field to nr_descriptors to better describe what it does. Signed-off-by: Christoph Hellwig <hch@lst.de> [leon: ported forward to include metadata SGL support] Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2025-05-20nvme-pci: store aborted state in flags variableLeon Romanovsky1-4/+10
Instead of keeping dedicated "bool aborted" variable, switch to a flags flags that can be used for other flags as well. Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
2025-05-20nvme-pci: don't try to use SGLs for metadata on the admin queueChristoph Hellwig1-1/+4
No admin command defined in an NVMe specification supports metadata, but to protect against vendor specific commands using metadata ensure that we don't try to use SGLs for metadata on the admin queue, as NVMe does not support SGLs on the admin queue for the PCI transport. Do this by checking if the data transfer has been setup using SGLs as that is required for using SGLs for metadata. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20nvme-pci: make PRP list DMA pools per-NUMA-nodeCaleb Sander Mateos1-60/+84
NVMe commands with over 8 KB of discontiguous data allocate PRP list pages from the per-nvme_device dma_pool prp_page_pool or prp_small_pool. Each call to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock. These device-global spinlocks are a significant source of contention when many CPUs are submitting to the same NVMe devices. On a workload issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes to 23 NVMe devices, we observed 2.4% of CPU time spent in _raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free. Ideally, the dma_pools would be per-hctx to minimize contention. But that could impose considerable resource costs in a system with many NVMe devices and CPUs. As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each nvme_queue to the set of DMA pools corresponding to its device and its hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by about half, to 1.2%. Preventing the sharing of PRP list pages across NUMA nodes also makes them cheaper to initialize. Link: https://lore.kernel.org/linux-nvme/CADUfDZqa=OOTtTTznXRDmBQo1WrFcDw1hBA7XwM7hzJ-hpckcA@mail.gmail.com/T/#u Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme-pci: factor out a nvme_init_hctx_common() helperCaleb Sander Mateos1-13/+15
nvme_init_hctx() and nvme_admin_init_hctx() are very similar. In preparation for adding more logic, factor out a nvme_init_hctx-common() helper. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme-fc: do not reference lsrsp after failureDaniel Wagner1-3/+10
The lsrsp object is maintained by the LLDD. The lifetime of the lsrsp object is implicit. Because there is no explicit cleanup/free call into the LLDD, it is not safe to assume after xml_rsp_fails, that the lsrsp is still valid. The LLDD could have freed the object already. With the recent changes how fcloop tracks the resources, this is the case. Thus don't access lsrsp after xml_rsp_fails. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: don't wait for lport cleanupDaniel Wagner1-17/+4
The lifetime of the fcloop_lsreq is not tight to the lifetime of the host or target port, thus there is no need anymore to synchronize the cleanup path anymore. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: add missing fcloop_callback_host_doneDaniel Wagner1-3/+9
Add the missing fcloop_call_host_done calls so that the caller frees resources when something goes wrong. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fc: take tgtport refs for portentryDaniel Wagner1-5/+39
Ensure that the tgtport is not going away as long portentry has a pointer on it. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fc: free pending reqs on tgtport unregisterDaniel Wagner1-7/+34
When nvmet_fc_unregister_targetport is called by the LLDD, it's not possible to communicate with the host, thus all pending request will not be process. Thus explicitly free them. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: drop response if targetport is goneDaniel Wagner1-5/+14
When the target port is gone, the lsrsp pointer is invalid. Thus don't call the done function anymore instead just drop the response. This happens when the target sends a disconnect association. After this the target starts tearing down all resources and doesn't expect any response. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: allocate/free fcloop_lsreq directlyDaniel Wagner1-18/+45
fcloop depends on the host or the target to allocate the fcloop_lsreq object. This means that the lifetime of the fcloop_lsreq is tied to either the host or the target. Consequently, the host or the target must cooperate during shutdown. Unfortunately, this approach does not work well when the target forces a shutdown, as there are dependencies that are difficult to resolve in a clean way. The simplest solution is to decouple the lifetime of the fcloop_lsreq object by managing them directly within fcloop. Since this is not a performance-critical path and only a small number of LS objects are used during setup and cleanup, it does not significantly impact performance to allocate them during normal operation. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: prevent double port deletionDaniel Wagner1-2/+17
The delete callback can be called either via the unregister function or from the transport directly. Thus it is necessary ensure resources are not freed multiple times. Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: access fcpreq only when holding reqlockDaniel Wagner1-15/+16
The abort handling logic expects that the state and the fcpreq are only accessed when holding the reqlock lock. While at it, only handle the aborts in the abort handler. Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>