aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/gpu/drm/xe (follow)
AgeCommit message (Collapse)AuthorFilesLines
2025-10-23drm/xe: Fix an IS_ERR() vs NULL bug in xe_tile_alloc_vram()Dan Carpenter1-2/+2
commit 6c9e64e83b22405622d1f47417cdb0d20d49ca35 upstream. The xe_vram_region_alloc() function returns NULL on error. It never returns error pointers. Update the error checking to match. Fixes: 4b0a5f5ce784 ("drm/xe: Unify the initialization of VRAM regions") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/5449065e-9758-4711-b706-78771c0753c4@sabinyo.mountain Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23drm/xe: Don't allow evicting of BOs in same VM in array of VM bindsMatthew Brost2-9/+25
[ Upstream commit 7ac74613e5f2ef3450f44fd2127198662c2563a9 ] An array of VM binds can potentially evict other buffer objects (BOs) within the same VM under certain conditions, which may lead to NULL pointer dereferences later in the bind pipeline. To prevent this, clear the allow_res_evict flag in the xe_bo_validate call. v2: - Invert polarity of no_res_evict (Thomas) - Add comment in code explaining issue (Thomas) Cc: stable@vger.kernel.org Reported-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6268 Fixes: 774b5fa509a9 ("drm/xe: Avoid evicting object of the same vm in none fault mode") Fixes: 77f2ef3f16f5 ("drm/xe: Lock all gpuva ops during VM bind IOCTL") Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Tested-by: Paulo Zanoni <paulo.r.zanoni@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20251009110618.3481870-1-matthew.brost@intel.com (cherry picked from commit 8b9ba8d6d95fe75fed6b0480bb03da4b321bea08) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> [ removed exec parameter from xe_bo_validate() calls ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23drm/xe: Move rebar to be done earlierLucas De Marchi3-8/+29
[ Upstream commit d30203739be798d3de5c84db3060e96f00c54e82 ] There may be cases in which the BAR0 also needs to move to accommodate the bigger BAR2. However if it's not released, the BAR2 resize fails. During the vram probe it can't be released as it's already in use by xe_mmio for early register access. Add a new function in xe_vram and let xe_pci call it directly before even early device probe. This allows the BAR2 to resize in cases BAR0 also needs to move, assuming there aren't other reasons to hold that move: [] xe 0000:03:00.0: vgaarb: deactivate vga console [] xe 0000:03:00.0: [drm] Attempting to resize bar from 8192MiB -> 16384MiB [] xe 0000:03:00.0: BAR 0 [mem 0x83000000-0x83ffffff 64bit]: releasing [] xe 0000:03:00.0: BAR 2 [mem 0x4000000000-0x41ffffffff 64bit pref]: releasing [] pcieport 0000:02:01.0: bridge window [mem 0x4000000000-0x41ffffffff 64bit pref]: releasing [] pcieport 0000:01:00.0: bridge window [mem 0x4000000000-0x41ffffffff 64bit pref]: releasing [] pcieport 0000:01:00.0: bridge window [mem 0x4000000000-0x43ffffffff 64bit pref]: assigned [] pcieport 0000:02:01.0: bridge window [mem 0x4000000000-0x43ffffffff 64bit pref]: assigned [] xe 0000:03:00.0: BAR 2 [mem 0x4000000000-0x43ffffffff 64bit pref]: assigned [] xe 0000:03:00.0: BAR 0 [mem 0x83000000-0x83ffffff 64bit]: assigned [] pcieport 0000:00:01.0: PCI bridge to [bus 01-04] [] pcieport 0000:00:01.0: bridge window [mem 0x83000000-0x840fffff] [] pcieport 0000:00:01.0: bridge window [mem 0x4000000000-0x44007fffff 64bit pref] [] pcieport 0000:01:00.0: PCI bridge to [bus 02-04] [] pcieport 0000:01:00.0: bridge window [mem 0x83000000-0x840fffff] [] pcieport 0000:01:00.0: bridge window [mem 0x4000000000-0x43ffffffff 64bit pref] [] pcieport 0000:02:01.0: PCI bridge to [bus 03] [] pcieport 0000:02:01.0: bridge window [mem 0x83000000-0x83ffffff] [] pcieport 0000:02:01.0: bridge window [mem 0x4000000000-0x43ffffffff 64bit pref] [] xe 0000:03:00.0: [drm] BAR2 resized to 16384M [] xe 0000:03:00.0: [drm:xe_pci_probe [xe]] BATTLEMAGE e221:0000 dgfx:1 gfx:Xe2_HPG (20.02) ... For BMG there are additional fix needed in the PCI side, but this helps getting it to a working resize. All the rebar logic is more pci-specific than xe-specific and can be done very early in the probe sequence. In future it would be good to move it out of xe_vram.c, but this refactor is left for later. Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: stable@vger.kernel.org # 6.12+ Link: https://lore.kernel.org/intel-xe/fafda2a3-fc63-ce97-d22b-803f771a4d19@linux.intel.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20250918-xe-pci-rebar-2-v1-2-6c094702a074@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 45e33f220fd625492c11e15733d8e9b4f9db82a4) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23drm/xe: Unify the initialization of VRAM regionsPiotr Piórkowski10-125/+164
[ Upstream commit 4b0a5f5ce7849aab7a67ba9f113ed75626f6de36 ] Currently in the drivers we have defined VRAM regions per device and per tile. Initialization of these regions is done in two completely different ways. To simplify the logic of the code and make it easier to add new regions in the future, let's unify the way we initialize VRAM regions. v2: - fix doc comments in struct xe_vram_region - remove unnecessary includes (Jani) v3: - move code from xe_vram_init_regions_managers to xe_tile_init_noalloc (Matthew) - replace ioremap_wc to devm_ioremap_wc for mapping VRAM BAR (Matthew) - Replace the tile id parameter with vram region in the xe_pf_begin function. v4: - remove tile back pointer from struct xe_vram_region - add new back pointers: xe and migarte to xe_vram_region Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Cc: Stuart Summers <stuart.summers@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> # rev3 Acked-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250714184818.89201-6-piotr.piorkowski@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Stable-dep-of: d30203739be7 ("drm/xe: Move rebar to be done earlier") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23drm/xe: Move struct xe_vram_region to a dedicated headerPiotr Piórkowski13-72/+91
[ Upstream commit 7a20b4f558f4291161f71a5b7384262db9ccd6b0 ] Let's move the xe_vram_region structure to a new header dedicated to VRAM to improve modularity and avoid unnecessary dependencies when only VRAM-related structures are needed. v2: Fix build if CONFIG_DRM_XE_DEVMEM_MIRROR is enabled v3: Fix build if CONFIG_DRM_XE_DISPLAY is enabled v4: Move helper to get tile dpagemap to xe_svm.c Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Suggested-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> # rev3 Acked-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250714184818.89201-4-piotr.piorkowski@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Stable-dep-of: d30203739be7 ("drm/xe: Move rebar to be done earlier") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23drm/xe: Use dynamic allocation for tile and device VRAM region structuresPiotr Piórkowski16-77/+202
[ Upstream commit f92cfd72d9a650f90260c54accd840c6500c4c3a ] In future platforms, we will need to represent the device and tile VRAM regions in a more dynamic way, so let's abandon the static allocation of these structures and start use a dynamic allocation. v2: - Add a helpers for accessing fields of the xe_vram_region structure v3: - Add missing EXPORT_SYMBOL_IF_KUNIT for xe_vram_region_actual_physical_size Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Cc: Stuart Summers <stuart.summers@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250714184818.89201-3-piotr.piorkowski@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Stable-dep-of: d30203739be7 ("drm/xe: Move rebar to be done earlier") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23drm/xe: Use devm_ioremap_wc for VRAM mapping and drop manual unmapPiotr Piórkowski1-4/+2
[ Upstream commit 922ae875230be91c7f05f2aa90d176b6693e2601 ] Let's replace the manual call to ioremap_wc function with devm_ioremap_wc function, ensuring that VRAM mappings are automatically released when the driver is detached. Since devm_ioremap_wc registers the mapping with the device's managed resources, the explicit iounmap call in vram_fini is no longer needed, so let's remove it. Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Suggested-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Acked-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250714184818.89201-2-piotr.piorkowski@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Stable-dep-of: d30203739be7 ("drm/xe: Move rebar to be done earlier") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23drm/xe/evict: drop bogus assertMatthew Auld1-8/+0
[ Upstream commit 225bc03d85427e7e3821d6f99f4f2d4a09350dda ] This assert can trigger here with non pin_map users that select LATE_RESTORE, since the vmap is allowed to be NULL given that save/restore can now use the blitter instead. The check here doesn't seem to have much value anymore given that we no longer move pinned memory, so any existing vmap is left well alone, and doesn't need to be recreated upon restore, so just drop the assert here. Fixes: 86f69c26113c ("drm/xe: use backup object for pinned save/restore") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/6213 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20251010152457.177884-2-matthew.auld@intel.com (cherry picked from commit a10b4a69c7f8f596d2c5218fbe84430734fab3b2) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23drm/xe: Enable media sampler power gatingVinay Belgaumkar2-0/+9
[ Upstream commit 1852d27aa998272696680607b65a2ceac966104e ] Where applicable, enable media sampler power gating. Also, add it to the powergate_info debugfs. v2: Remove the sampler powergate status since it is cleared quickly anyway. v3: Use vcs mask (Rodrigo) and fix the version check for media v4: Remove extra spaces v5: Media samplers are independent of vcs mask, use Media version 1255 (Matt Roper) Fixes: 38e8c4184ea0 ("drm/xe: Enable Coarse Power Gating") Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://lore.kernel.org/r/20251010011047.2047584-1-vinay.belgaumkar@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 4cbc08649a54c3d533df9832342d52d409dfbbf0) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-23drm/xe: Increase global invalidation timeout to 1000usKenneth Graunke1-1/+1
commit e5ae8d1eb08a3e27fff4ae264af4c8056d908639 upstream. The previous timeout of 500us seems to be too small; panning the map in the Roll20 VTT in Firefox on a KDE/Wayland desktop reliably triggered timeouts within a few seconds of usage, causing the monitor to freeze and the following to be printed to dmesg: [Jul30 13:44] xe 0000:03:00.0: [drm] *ERROR* GT0: Global invalidation timeout [Jul30 13:48] xe 0000:03:00.0: [drm] *ERROR* [CRTC:82:pipe A] flip_done timed out I haven't hit a single timeout since increasing it to 1000us even after several multi-hour testing sessions. Fixes: 0dd2dd0182bc ("drm/xe: Move DSB l2 flush to a more sensible place") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5710 Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Cc: stable@vger.kernel.org Cc: Maarten Lankhorst <dev@lankhorst.se> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://lore.kernel.org/r/20250912223254.147940-1-kenneth@whitecape.org Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 146046907b56578263434107f5a7d5051847c459) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-23drm/xe/guc: Check GuC running state before deregistering exec queueShuicheng Lin1-1/+12
commit 9f64b3cd051b825de0a2a9f145c8e003200cedd5 upstream. In normal operation, a registered exec queue is disabled and deregistered through the GuC, and freed only after the GuC confirms completion. However, if the driver is forced to unbind while the exec queue is still running, the user may call exec_destroy() after the GuC has already been stopped and CT communication disabled. In this case, the driver cannot receive a response from the GuC, preventing proper cleanup of exec queue resources. Fix this by directly releasing the resources when GuC is not running. Here is the failure dmesg log: " [ 468.089581] ---[ end trace 0000000000000000 ]--- [ 468.089608] pci 0000:03:00.0: [drm] *ERROR* GT0: GUC ID manager unclean (1/65535) [ 468.090558] pci 0000:03:00.0: [drm] GT0: total 65535 [ 468.090562] pci 0000:03:00.0: [drm] GT0: used 1 [ 468.090564] pci 0000:03:00.0: [drm] GT0: range 1..1 (1) [ 468.092716] ------------[ cut here ]------------ [ 468.092719] WARNING: CPU: 14 PID: 4775 at drivers/gpu/drm/xe/xe_ttm_vram_mgr.c:298 ttm_vram_mgr_fini+0xf8/0x130 [xe] " v2: use xe_uc_fw_is_running() instead of xe_guc_ct_enabled(). As CT may go down and come back during VF migration. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org Cc: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20251010172529.2967639-2-shuicheng.lin@intel.com (cherry picked from commit 9b42321a02c50a12b2beb6ae9469606257fbecea) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19drm/xe/uapi: loosen used tracking restrictionMatthew Auld1-9/+6
commit 2d1684a077d62fddfac074052c162ec6573a34e1 upstream. Currently this is hidden behind perfmon_capable() since this is technically an info leak, given that this is a system wide metric. However the granularity reported here is always PAGE_SIZE aligned, which matches what the core kernel is already willing to expose to userspace if querying how many free RAM pages there are on the system, and that doesn't need any special privileges. In addition other drm drivers seem happy to expose this. The motivation here if with oneAPI where they want to use the system wide 'used' reporting here, so not the per-client fdinfo stats. This has also come up with some perf overlay applications wanting this information. Fixes: 1105ac15d2a1 ("drm/xe/uapi: restrict system wide accounting") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Joshua Santosh <joshua.santosh.ranjan@intel.com> Cc: José Roberto de Souza <jose.souza@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250919122052.420979-2-matthew.auld@intel.com (cherry picked from commit 4d0b035fd6dae8ee48e9c928b10f14877e595356) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-10-19drm/xe/i2c: Don't rely on d3cold.allowed flag in system PM pathRaag Jadav1-1/+1
[ Upstream commit 1af59cd5cc2b65d7fc95165f056695ce3f171133 ] In S3 and above sleep states, the device can loose power regardless of d3cold.allowed flag. Bring up I2C controller explicitly in system PM path to ensure its normal operation after losing power. v2: Cover S3 and above states (Rodrigo) Fixes: 0ea07b69517a ("drm/xe/pm: Wire up suspend/resume for I2C controller") Signed-off-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://lore.kernel.org/r/20250918103200.2952576-1-raag.jadav@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit e4863f1159befcd70df24fcb5458afaf2feab043) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-10-19drm/xe/hw_engine_group: Fix double write lock release in error pathShuicheng Lin1-5/+1
[ Upstream commit 08fdfd260e641da203f80aff8d3ed19c5ecceb7d ] In xe_hw_engine_group_get_mode(), a write lock is acquired before calling switch_mode(), which in turn invokes xe_hw_engine_group_suspend_faulting_lr_jobs(). On failure inside xe_hw_engine_group_suspend_faulting_lr_jobs(), the write lock is released there, and then again in xe_hw_engine_group_get_mode(), leading to a double release. Fix this by keeping both acquire and release operation in xe_hw_engine_group_get_mode(). Fixes: 770bd1d34113 ("drm/xe/hw_engine_group: Ensure safe transition between execution modes") Cc: Francois Dugast <francois.dugast@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://lore.kernel.org/r/20250925023145.1203004-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 662d98b8b373007fa1b08ba93fee11f6fd3e387c) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-09-22drm/xe: Don't copy pinned kernel bos twice on suspendThomas Hellström1-2/+2
We were copying the bo content the bos on the list "xe->pinned.late.kernel_bo_present" twice on suspend. Presumingly the intent is to copy the pinned external bos on the first pass. This is harmless since we (currently) should have no pinned external bos needing copy since a) exernal system bos don't have compressed content, b) We do not (yet) allow pinning of VRAM bos. Still, fix this up so that we copy pinned external bos on the first pass. We're about to allow bos pinned in VRAM. Fixes: c6a4d46ec1d7 ("drm/xe: evict user memory in PM notifier") Cc: Matthew Auld <matthew.auld@intel.com> Cc: <stable@vger.kernel.org> # v6.16+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250918092207.54472-2-thomas.hellstrom@linux.intel.com (cherry picked from commit 9e69bafece43dcefec864f00b3ec7e088aa7fcbc) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-22drm/xe: Fix build with CONFIG_MODULES=nLucas De Marchi1-1/+1
When building with CONFIG_MODULES=n, the __exit functions are dropped. However our init functions may call them for error handling, so they are not good candidates for the exit sections. Fix this error reported by 0day: ld.lld: error: relocation refers to a symbol in a discarded section: xe_configfs_exit >>> defined in vmlinux.a(drivers/gpu/drm/xe/xe_configfs.o) >>> referenced by xe_module.c >>> drivers/gpu/drm/xe/xe_module.o:(init_funcs) in archive vmlinux.a This is the only exit function using __exit. Drop it to fix the build. Cc: Riana Tauro <riana.tauro@intel.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506092221.1FmUQmI8-lkp@intel.com/ Fixes: 16280ded45fb ("drm/xe: Add configfs to enable survivability mode") Reviewed-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Link: https://lore.kernel.org/r/20250912-fix-nomodule-build-v1-1-d11b70a92516@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit d9b2623319fa20c2206754284291817488329648) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-22drm/xe/vf: Don't expose sysfs attributes not applicable for VFsMichal Wajdeczko1-1/+1
VFs can't read BMG_PCIE_CAP(0x138340) register nor access PCODE (already guarded by the info.skip_pcode flag) so we shouldn't expose attributes that require any of them to avoid errors like: [] xe 0000:03:00.1: [drm] Tile0: GT0: VF is trying to read an \ inaccessible register 0x138340+0x0 [] RIP: 0010:xe_gt_sriov_vf_read32+0x6c2/0x9a0 [xe] [] Call Trace: [] xe_mmio_read32+0x110/0x280 [xe] [] auto_link_downgrade_capable_show+0x2e/0x70 [xe] [] dev_attr_show+0x1a/0x70 [] sysfs_kf_seq_show+0xaa/0x120 [] kernfs_seq_show+0x41/0x60 Fixes: 0e414bf7ad01 ("drm/xe: Expose PCIe link downgrade attributes") Fixes: cdc36b66cd41 ("drm/xe: Expose fan control and voltage regulator version") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Lukasz Laguna <lukasz.laguna@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250916170029.3313-2-michal.wajdeczko@intel.com (cherry picked from commit a2d6223d224f333f705ed8495bf8bebfbc585c35) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-17drm/xe/guc: Set RCS/CCS yield policyDaniele Ceraolo Spurio6-5/+98
All recent platforms (including all the ones officially supported by the Xe driver) do not allow concurrent execution of RCS and CCS workloads from different address spaces, with the HW blocking the context switch when it detects such a scenario. The DUAL_QUEUE flag helps with this, by causing the GuC to not submit a context it knows will not be able to execute. This, however, causes a new problem: if RCS and CCS queues have pending workloads from different address spaces, the GuC needs to choose from which of the 2 queues to pick the next workload to execute. By default, the GuC prioritizes RCS submissions over CCS ones, which can lead to CCS workloads being significantly (or completely) starved of execution time. The driver can tune this by setting a dedicated scheduling policy KLV; this KLV allows the driver to specify a quantum (in ms) and a ratio (percentage value between 0 and 100), and the GuC will prioritize the CCS for that percentage of each quantum. Given that we want to guarantee enough RCS throughput to avoid missing frames, we set the yield policy to 20% of each 80ms interval. v2: updated quantum and ratio, improved comment, use xe_guc_submit_disable in gt_sanitize Fixes: d9a1ae0d17bd ("drm/xe/guc: Enable WA_DUAL_QUEUE for newer platforms") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Tested-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://lore.kernel.org/r/20250905235632.3333247-2-daniele.ceraolospurio@intel.com (cherry picked from commit 88434448438e4302e272b2a2b810b42e05ea024b) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> [Rodrigo added #include "xe_guc_submit.h" while backporting]
2025-09-17drm/xe: Fix error handling if PXP fails to startDaniele Ceraolo Spurio6-41/+72
Since the PXP start comes after __xe_exec_queue_init() has completed, we need to cleanup what was done in that function in case of a PXP start error. __xe_exec_queue_init calls the submission backend init() function, so we need to introduce an opposite for that. Unfortunately, while we already have a fini() function pointer, it performs other operations in addition to cleaning up what was done by the init(). Therefore, for clarity, the existing fini() has been renamed to destroy(), while a new fini() has been added to only clean up what was done by the init(), with the latter being called by the former (via xe_exec_queue_fini). Fixes: 72d479601d67 ("drm/xe/pxp/uapi: Add userspace and LRC support for PXP-using queues") Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://lore.kernel.org/r/20250909221240.3711023-3-daniele.ceraolospurio@intel.com (cherry picked from commit 626667321deb4c7a294725406faa3dd71c3d445d) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-17drm/xe/sysfs: Add cleanup action in xe_device_sysfs_initZongyao Bai1-2/+6
On partial failure, some sysfs files created before the failure might not be removed. Add common cleanup step to remove them all immediately, as is should be harmless to attempt to remove non-existing files. Fixes: 0e414bf7ad01 ("drm/xe: Expose PCIe link downgrade attributes") Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Stuart Summers <stuart.summers@intel.com> Cc: Shuicheng Lin <shuicheng.lin@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Zongyao Bai <zongyao.bai@intel.com> Reviewed-by: Shuicheng Lin <shuicheng.lin@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250915214716.1327379-2-zongyao.bai@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 1a869168d91f1a1a2b0db22cea0295c67908e5d8) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-16drm/xe: Fix a NULL vs IS_ERR() in xe_vm_add_compute_exec_queue()Dan Carpenter1-2/+2
The xe_preempt_fence_create() function returns error pointers. It never returns NULL. Update the error checking to match. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/aJTMBdX97cof_009@stanley.mountain Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 75cc23ffe5b422bc3cbd5cf0956b8b86e4b0e162) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-15drm/xe: defer free of NVM auxiliary container to device release callbackNitin Gote1-1/+4
Do not kfree the intel_dg_nvm_dev in xe_nvm_fini() right after auxiliary_device_delete/uninit. The auxiliary_device embeds the device/kobject (and its name); freeing it too early can race with asynchronous device_del/udev processing and cause a use-after-free. Signed-off-by: Nitin Gote <nitin.r.gote@intel.com> Fixes: c28bfb107dac ("drm/xe/nvm: add on-die non-volatile memory device") Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250911052823.226696-1-nitin.r.gote@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit d4c3ed963e41d488695cf91068eabb8eb9f538ec) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-15drm/xe/hwmon: Remove type castingMallesh Koujalagi1-16/+19
Refactor: eliminate type casts by using proper u32 declarations. v2: - Address review comments. (Karthik) v3: - Use the proper u32 type and drop cast. (Lucas De Marchi) - Modify variable when actually using u64 value. - Change r value to reg_value with u32 type. v4: - Remove newline between trailer and Signed-off-by. (Lucas De Marchi) - Change reg_val to val for more user-friendly logging. - Use mul_u32_u32 function since both values are u32. v5: - mul_u32_u32 function with shift. (Lucas De Marchi) Fixes: 7596d839f6228 ("drm/xe/hwmon: Add support to manage power limits though mailbox") Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250912113458.2815172-1-mallesh.koujalagi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 4e1d3b5e6423dc841acd8691d75626b3d3b2b6a8) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-15drm/xe/pf: Drop rounddown_pow_of_two fair LMEM limitationMichal Wajdeczko1-1/+0
This effectively reverts commit 4c3fe5eae46b ("drm/xe/pf: Limit fair VF LMEM provisioning") since we don't need it any more after non-contig VRAM allocations were fixed. This allows larger LMEM auto-provisioning for VFs, so instead: [ ] GT0: PF: LMEM available(14096M) fair(1 x 8192M) [ ] GT0: PF: VF1 provisioned with 8589934592 (8.00 GiB) LMEM or [ ] GT0: PF: LMEM available(14096M) fair(2 x 4096M) [ ] GT0: PF: VF1..VF2 provisioned with 4294967296 (4.00 GiB) LMEM we may get: [ ] GT0: PF: LMEM available(14096M) fair(1 x 14096M) [ ] GT0: PF: VF1 provisioned with 14780727296 (13.8 GiB) LMEM and [ ] GT0: PF: LMEM available(14096M) fair(2 x 7048M) [ ] GT0: PF: VF1..VF2 provisioned with 7390363648 (6.88 GiB) LMEM Fixes: 1e32ffbc9dc8 ("drm/xe/sriov: support non-contig VRAM provisioning") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Link: https://lore.kernel.org/r/20250910222439.32869-1-michal.wajdeczko@intel.com (cherry picked from commit 95c1cfa306087142989bff34ea0e05dcd95ddc58) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-15drm/xe/tile: Release kobject for the failure pathShuicheng Lin1-5/+7
Call kobject_put() for the failure path to release the kobject v2: remove extra newline. (Matt) Fixes: e3d0839aa501 ("drm/xe/tile: Abort driver load for sysfs creation failure") Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Link: https://lore.kernel.org/r/20250819153950.2973344-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit b98775bca99511cc22ab459a2de646cd2fa7241f) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-09drm/xe: Extend Wa_13011645652 to PTL-H, WCLJulia Filipchuk1-1/+2
Expand workaround to additional graphics architectures. Cc: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Cc: Stuart Summers <stuart.summers@intel.com> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: intel-xe@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v6.17+ Signed-off-by: Julia Filipchuk <julia.filipchuk@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250903190122.1028373-2-julia.filipchuk@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 6fc957185e1691bb6dfa4193698a229db537c2a2) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-09drm/xe: Block exec and rebind worker while evicting for suspend / hibernateThomas Hellström6-1/+88
When the xe pm_notifier evicts for suspend / hibernate, there might be racing tasks trying to re-validate again. This can lead to suspend taking excessive time or get stuck in a live-lock. This behaviour becomes much worse with the fix that actually makes re-validation bring back bos to VRAM rather than letting them remain in TT. Prevent that by having exec and the rebind worker waiting for a completion that is set to block by the pm_notifier before suspend and is signaled by the pm_notifier after resume / wakeup. It's probably still possible to craft malicious applications that block suspending. More work is pending to fix that. v3: - Avoid wait_for_completion() in the kernel worker since it could potentially cause work item flushes from freezable processes to wait forever. Instead terminate the rebind workers if needed and re-launch at resume. (Matt Auld) v4: - Fix some bad naming and leftover debug printouts. - Fix kerneldoc. - Use drmm_mutex_init() for the xe->rebind_resume_lock (Matt Auld). - Rework the interface of xe_vm_rebind_resume_worker (Matt Auld). Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4288 Fixes: c6a4d46ec1d7 ("drm/xe: evict user memory in PM notifier") Cc: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: <stable@vger.kernel.org> # v6.16+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250904160715.2613-4-thomas.hellstrom@linux.intel.com (cherry picked from commit 599334572a5a99111015fbbd5152ce4dedc2f8b7) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-09drm/xe: Allow the pm notifier to continue on failureThomas Hellström1-10/+7
Its actions are opportunistic anyway and will be completed on device suspend. Marking as a fix to simplify backporting of the fix that follows in the series. v2: - Keep the runtime pm reference over suspend / hibernate and document why. (Matt Auld, Rodrigo Vivi): Fixes: c6a4d46ec1d7 ("drm/xe: evict user memory in PM notifier") Cc: Matthew Auld <matthew.auld@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: <stable@vger.kernel.org> # v6.16+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250904160715.2613-3-thomas.hellstrom@linux.intel.com (cherry picked from commit ebd546fdffddfcaeab08afdd68ec93052c8fa740) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-09drm/xe: Attempt to bring bos back to VRAM after evictionThomas Hellström5-16/+16
VRAM+TT bos that are evicted from VRAM to TT may remain in TT also after a revalidation following eviction or suspend. This manifests itself as applications becoming sluggish after buffer objects get evicted or after a resume from suspend or hibernation. If the bo supports placement in both VRAM and TT, and we are on DGFX, mark the TT placement as fallback. This means that it is tried only after VRAM + eviction. This flaw has probably been present since the xe module was upstreamed but use a Fixes: commit below where backporting is likely to be simple. For earlier versions we need to open- code the fallback algorithm in the driver. v2: - Remove check for dgfx. (Matthew Auld) - Update the xe_dma_buf kunit test for the new strategy (CI) - Allow dma-buf to pin in current placement (CI) - Make xe_bo_validate() for pinned bos a NOP. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5995 Fixes: a78a8da51b36 ("drm/ttm: replace busy placement with flags v6") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: <stable@vger.kernel.org> # v6.9+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250904160715.2613-2-thomas.hellstrom@linux.intel.com (cherry picked from commit cb3d7b3b46b799c96b54f8e8fe36794a55a77f0b) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-09drm/xe/configfs: Don't touch survivability_mode on finiMichal Wajdeczko1-1/+2
This is a user controlled configfs attribute, we should not modify that outside the configfs attr.store() implementation. Fixes: bc417e54e24b ("drm/xe: Enable configfs support for survivability mode") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250904103521.7130-1-michal.wajdeczko@intel.com (cherry picked from commit 079a5c83dbd23db7a6eed8f558cf75e264d8a17b) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-09-02drm/xe: Fix incorrect migration of backed-up object to VRAMThomas Hellström1-2/+1
If an object is backed up to shmem it is incorrectly identified as not having valid data by the move code. This means moving to VRAM skips the -EMULTIHOP step and the bo is cleared. This causes all sorts of weird behaviour on DGFX if an already evicted object is targeted by the shrinker. Fix this by using ttm_tt_is_swapped() to identify backed-up objects. Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5996 Fixes: 00c8efc3180f ("drm/xe: Add a shrinker for xe bos") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: <stable@vger.kernel.org> # v6.15+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250828134837.5709-1-thomas.hellstrom@linux.intel.com (cherry picked from commit 1047bd82794a1eab64d643f196d09171ce983f44) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-26drm/xe: switch to local xbasename() helperCarlos Llamas1-1/+9
Commit b0a2ee5567ab ("drm/xe: prepare xe_gen_wa_oob to be multi-use") introduced a call to basename(). The GNU version of this function is not portable and fails to build with alternative libc implementations like musl or bionic. This causes the following build error: drivers/gpu/drm/xe/xe_gen_wa_oob.c:130:12: error: assignment to ‘const char *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion] 130 | fn = basename(fn); | ^ While a POSIX version of basename() could be used, it would require a separate header plus the behavior differs from GNU version in that it might modify its argument. Not great. Instead, implement a local xbasename() helper based on strrchr() that provides the same functionality and avoids portability issues. Fixes: b0a2ee5567ab ("drm/xe: prepare xe_gen_wa_oob to be multi-use") Suggested-by: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Tiffany Yang <ynaffit@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://lore.kernel.org/r/20250825155743.1132433-1-cmllamas@google.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 41be792f5baaf90d744a9a9e82994ce560ca9582) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-26drm/xe: Don't trigger rebind on initial dma-buf validationMatthew Brost1-1/+2
On the first validate of an imported dma-buf (initial bind), the device has no GPU mappings, so a rebind is unnecessary. Rebinding here is harmful in multi-GPU setups and for VMs using preempt-fence mode, as it would evict in-flight GPU work. v2: - Drop dma_buf_validated, check for XE_PL_SYSTEM (Thomas) Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250825152841.3837378-1-matthew.brost@intel.com (cherry picked from commit ffdf968762e4fb3cdae54e811ec3525e67440a60) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-26drm/xe/vm: Clear the scratch_pt pointer on errorThomas Hellström1-2/+6
Avoid triggering a dereference of an error pointer on cleanup in xe_vm_free_scratch() by clearing any scratch_pt error pointer. Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Fixes: 06951c2ee72d ("drm/xe: Use NULL PTEs as scratch PTEs") Cc: Brian Welty <brian.welty@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250821143045.106005-4-thomas.hellstrom@linux.intel.com (cherry picked from commit 358ee50ab565f3c8ea32480e9d03127a81ba32f8) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-26drm/xe/vm: Don't pin the vm_resv during validationThomas Hellström2-16/+4
The pinning has the odd side-effect that unlocking *any* resv during validation triggers an "unlocking pinned lock" warning. Cc: Matthew Brost <matthew.brost@intel.com> Fixes: 5cc3325584c4 ("drm/xe: Rework eviction rejection of bound external bos") Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250821143045.106005-2-thomas.hellstrom@linux.intel.com (cherry picked from commit 0a51bf3e54dd8b77e6f1febbbb66def0660862d2) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-26drm/xe/xe_sync: avoid race during ufence signalingZbigniew Kempczyński1-1/+1
Marking ufence as signalled after copy_to_user() is too late. Worker thread which signals ufence by memory write might be raced with another userspace vm-bind call. In map/unmap scenario unmap may still see ufence is not signalled causing -EBUSY. Change the order of marking / write to user-fence fixes this issue. Fixes: 977e5b82e090 ("drm/xe: Expose user fence from xe_sync_entry") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5536 Signed-off-by: Zbigniew Kempczyński <zbigniew.kempczynski@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250820083903.2109891-2-zbigniew.kempczynski@intel.com (cherry picked from commit 8ae04fe9ffc93d6bc3bc63ac08375427d69cee06) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-21drm/xe: Fix vm_bind_ioctl double free bugChristoph Manszewski1-1/+2
If the argument check during an array bind fails, the bind_ops are freed twice as seen below. Fix this by setting bind_ops to NULL after freeing. ================================================================== BUG: KASAN: double-free in xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] Free of addr ffff88813bb9b800 by task xe_vm/14198 CPU: 5 UID: 0 PID: 14198 Comm: xe_vm Not tainted 6.16.0-xe-eudebug-cmanszew+ #520 PREEMPT(full) Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-P DDR5 RVP, BIOS ADLPFWI1.R00.2411.A02.2110081023 10/08/2021 Call Trace: <TASK> dump_stack_lvl+0x82/0xd0 print_report+0xcb/0x610 ? __virt_addr_valid+0x19a/0x300 ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] kasan_report_invalid_free+0xc8/0xf0 ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] check_slab_allocation+0x102/0x130 kfree+0x10d/0x440 ? should_fail_ex+0x57/0x2f0 ? xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] xe_vm_bind_ioctl+0x1b2/0x21f0 [xe] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe] ? __lock_acquire+0xab9/0x27f0 ? lock_acquire+0x165/0x300 ? drm_dev_enter+0x53/0xe0 [drm] ? find_held_lock+0x2b/0x80 ? drm_dev_exit+0x30/0x50 [drm] ? drm_ioctl_kernel+0x128/0x1c0 [drm] drm_ioctl_kernel+0x128/0x1c0 [drm] ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe] ? find_held_lock+0x2b/0x80 ? __pfx_drm_ioctl_kernel+0x10/0x10 [drm] ? should_fail_ex+0x57/0x2f0 ? __pfx_xe_vm_bind_ioctl+0x10/0x10 [xe] drm_ioctl+0x352/0x620 [drm] ? __pfx_drm_ioctl+0x10/0x10 [drm] ? __pfx_rpm_resume+0x10/0x10 ? do_raw_spin_lock+0x11a/0x1b0 ? find_held_lock+0x2b/0x80 ? __pm_runtime_resume+0x61/0xc0 ? rcu_is_watching+0x20/0x50 ? trace_irq_enable.constprop.0+0xac/0xe0 xe_drm_ioctl+0x91/0xc0 [xe] __x64_sys_ioctl+0xb2/0x100 ? rcu_is_watching+0x20/0x50 do_syscall_64+0x68/0x2e0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fa9acb24ded Fixes: b43e864af0d4 ("drm/xe/uapi: Add DRM_XE_VM_BIND_FLAG_CPU_ADDR_MIRROR") Cc: Matthew Brost <matthew.brost@intel.com> Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Christoph Manszewski <christoph.manszewski@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://lore.kernel.org/r/20250813101231.196632-2-christoph.manszewski@intel.com (cherry picked from commit a01b704527c28a2fd43a17a85f8996b75ec8492a) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-21drm/xe: Move ASID allocation and user PT BO tracking into xe_vm_createPiotr Piórkowski1-19/+15
Currently, ASID assignment for user VMs and page-table BO accounting for client memory tracking are performed in xe_vm_create_ioctl. To consolidate VM object initialization, move this logic to xe_vm_create. v2: - removed unnecessary duplicate BO tracking code - using the local variable xef to verify whether the VM is being created by userspace Fixes: 658a1c8e0a66 ("drm/xe: Assign ioctl xe file handler to vm in xe_vm_create") Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250811104358.2064150-3-piotr.piorkowski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com> (cherry picked from commit 30e0c3f43a414616e0b6ca76cf7f7b2cd387e1d4) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> [Rodrigo: Added fixes tag]
2025-08-19drm/xe: Assign ioctl xe file handler to vm in xe_vm_createPiotr Piórkowski4-8/+9
In several code paths, such as xe_pt_create(), the vm->xef field is used to determine whether a VM originates from userspace or the kernel. Previously, this handler was only assigned in xe_vm_create_ioctl(), after the VM was created by xe_vm_create(). However, xe_vm_create() triggers page table creation, and that function assumes vm->xef should be already set. This could lead to incorrect origin detection. To fix this problem and ensure consistency in the initialization of the VM object, let's move the assignment of this handler to xe_vm_create. v2: - take reference to the xe file object only when xef is not NULL - release the reference to the xe file object on the error path (Matthew) Fixes: 7f387e6012b6 ("drm/xe: add XE_BO_FLAG_PINNED_LATE_RESTORE") Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250811104358.2064150-2-piotr.piorkowski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com> (cherry picked from commit 9337166fa1d80f7bb7c7d3a8f901f21c348c0f2a) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-14drm/xe/pf: Set VF LMEM BAR sizeMichał Winiarski2-0/+23
LMEM is partitioned between multiple VFs and we expect that the more VFs we have, the less LMEM is assigned to each VF. This means that we can achieve full LMEM BAR access without the need to attempt full VF LMEM BAR resize via pci_resize_resource(). Always try to set the largest possible BAR size that allows to fit the number of enabled VFs and inform the user in case the resize attempt is not successful. Signed-off-by: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20250527120637.665506-7-michal.winiarski@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 32a4d1b98e6663101fd0abfaf151c48feea7abb1) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-12drm/xe/hwmon: Add SW clamp for power limits writesKarthik Poosa1-0/+29
Clamp writes to power limits powerX_crit/currX_crit, powerX_cap, powerX_max, to the maximum supported by the pcode mailbox when sysfs-provided values exceed this limit. Although the pcode already performs clamping, values beyond the pcode mailbox's supported range get truncated, leading to incorrect critical power settings. This patch ensures proper clamping to prevent such truncation. v2: - Address below review comments. (Riana) - Split comments into multiple sentences. - Use local variables for readability. - Add a debug log. - Use u64 instead of unsigned long. v3: - Change drm_dbg logs to drm_info. (Badal) v4: - Rephrase the drm_info log. (Rodrigo, Riana) - Rename variable max_mbx_power_limit to max_supp_power_limit, as limit is same for platforms with and without mailbox power limit support. Signed-off-by: Karthik Poosa <karthik.poosa@intel.com> Fixes: 92d44a422d0d ("drm/xe/hwmon: Expose card reactive critical power") Fixes: fb1b70607f73 ("drm/xe/hwmon: Expose power attributes") Reviewed-by: Riana Tauro <riana.tauro@intel.com> Link: https://lore.kernel.org/r/20250808185310.3466529-1-karthik.poosa@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit d301eb950da59f962bafe874cf5eb6d61a85b2c2) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-12drm/xe: Defer buffer object shrinker write-backs and GPU waitsThomas Hellström1-4/+47
When the xe buffer-object shrinker allows GPU waits and write-back, (typically from kswapd), perform multiple passes, skipping subsequent passes if the shrinker number of scanned objects target is reached. 1) Without GPU waits and write-back 2) Without write-back 3) With both GPU-waits and write-back This is to avoid stalls and costly write- and readbacks unless they are really necessary. v2: - Don't test for scan completion twice. (Stuart Summers) - Update tags. Reported-by: melvyn <melvyn2@dnsense.pub> Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/5557 Cc: Summers Stuart <stuart.summers@intel.com> Fixes: 00c8efc3180f ("drm/xe: Add a shrinker for xe bos") Cc: <stable@vger.kernel.org> # v6.15+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250805074842.11359-1-thomas.hellstrom@linux.intel.com (cherry picked from commit 80944d334182ce5eb27d00e2bf20a88bfc32dea1) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-12drm/xe/migrate: prevent potential UAFMatthew Auld1-4/+5
If we hit the error path, the previous fence (if there is one) has already been put() prior to this, so doing a fence_wait could lead to UAF. Tweak the flow to do to the put() until after we do the wait. Fixes: 270172f64b11 ("drm/xe: Update xe_ttm_access_memory to use GPU for non-visible access") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Maciej Patelczyk <maciej.patelczyk@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250731093807.207572-8-matthew.auld@intel.com (cherry picked from commit 9b7ca35ed28fe5fad86e9d9c24ebd1271e4c9c3e) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-12drm/xe/migrate: don't overflow max copy sizeMatthew Auld1-0/+6
With non-page aligned copy, we need to use 4 byte aligned pitch, however the size itself might still be close to our maximum of ~8M, and so the dimensions of the copy can easily exceed the S16_MAX limit of the copy command leading to the following assert: xe 0000:03:00.0: [drm] Assertion `size / pitch <= ((s16)(((u16)~0U) >> 1))` failed! platform: BATTLEMAGE subplatform: 1 graphics: Xe2_HPG 20.01 step A0 media: Xe2_HPM 13.01 step A1 tile: 0 VRAM 10.0 GiB GT: 0 type 1 WARNING: CPU: 23 PID: 10605 at drivers/gpu/drm/xe/xe_migrate.c:673 emit_copy+0x4b5/0x4e0 [xe] To fix this account for the pitch when calculating the number of current bytes to copy. Fixes: 270172f64b11 ("drm/xe: Update xe_ttm_access_memory to use GPU for non-visible access") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Maciej Patelczyk <maciej.patelczyk@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250731093807.207572-7-matthew.auld@intel.com (cherry picked from commit 8c2d61e0e916e077fda7e7b8e67f25ffe0f361fc) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-12drm/xe/migrate: prevent infinite recursionMatthew Auld1-12/+17
If the buf + offset is not aligned to XE_CAHELINE_BYTES we fallback to using a bounce buffer. However the bounce buffer here is allocated on the stack, and the only alignment requirement here is that it's naturally aligned to u8, and not XE_CACHELINE_BYTES. If the bounce buffer is also misaligned we then recurse back into the function again, however the new bounce buffer might also not be aligned, and might never be until we eventually blow through the stack, as we keep recursing. Instead of using the stack use kmalloc, which should respect the power-of-two alignment request here. Fixes a kernel panic when triggering this path through eudebug. v2 (Stuart): - Add build bug check for power-of-two restriction - s/EINVAL/ENOMEM/ Fixes: 270172f64b11 ("drm/xe: Update xe_ttm_access_memory to use GPU for non-visible access") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Maciej Patelczyk <maciej.patelczyk@intel.com> Cc: Stuart Summers <stuart.summers@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250731093807.207572-6-matthew.auld@intel.com (cherry picked from commit 38b34e928a08ba594c4bbf7118aa3aadacd62fff) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-08Merge tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernelLinus Torvalds10-8/+112
Pull drm fixes from Dave Airlie: "This is the fixes that built up in the merge window, mostly amdgpu and xe with one i915 display fix, seems like things are pretty good for rc1. i915: - DP LPFS fixes xe: - SRIOV: PF fixes and removal of need of module param - Fix driver unbind around Devcoredump - Mark xe driver as BROKEN if kernel page size is not 4kB amdgpu: - GC 9.5.0 fixes - SMU fix - DCE 6 DC fixes - mmhub client ID fixes - VRR fix - Backlight fix - UserQ fix - Legacy reset fix - Misc fixes amdkfd: - CRIU fix - Debugfs fix" * tag 'drm-next-2025-08-08' of https://gitlab.freedesktop.org/drm/kernel: (28 commits) drm/amdgpu: add missing vram lost check for LEGACY RESET drm/amdgpu/discovery: fix fw based ip discovery drm/amdkfd: Destroy KFD debugfs after destroy KFD wq amdgpu/amdgpu_discovery: increase timeout limit for IFWI init drm/amdgpu: Update SDMA firmware version check for user queue support drm/amdgpu: Add NULL check for asic_funcs drm/amd/display: Revert "drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value" drm/amd/display: fix a Null pointer dereference vulnerability drm/amd/display: Add primary plane to commits for correct VRR handling drm/amdgpu: update mmhub 3.3 client id mappings drm/amdgpu: update mmhub 3.0.1 client id mappings drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job drm/amd/display: Fix DCE 6.0 and 6.4 PLL programming. drm/amd/display: Don't overwrite dce60_clk_mgr drm/amdkfd: Fix checkpoint-restore on multi-xcc drm/amd: Restore cached manual clock settings during resume drm/amd: Restore cached power limit during resume drm/amdgpu: Update external revid for GC v9.5.0 drm/amdgpu: Update supported modes for GC v9.5.0 Mark xe driver as BROKEN if kernel page size is not 4kB ...
2025-08-04Mark xe driver as BROKEN if kernel page size is not 4kBSimon Richter1-0/+1
This driver, for the time being, assumes that the kernel page size is 4kB, so it fails on loong64 and aarch64 with 16kB pages, and ppc64el with 64kB pages. Signed-off-by: Simon Richter <Simon.Richter@hogyros.de> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Cc: stable@vger.kernel.org # v6.8+ Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://lore.kernel.org/r/20250802024152.3021-1-Simon.Richter@hogyros.de (cherry picked from commit 0521a868222ffe636bf202b6e9d29292c1e19c62) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-04drm/xe/pf: Make sure PF is ready to configure VFsMichal Wajdeczko6-2/+63
The PF driver might be resumed just to configure VFs, but since it is doing some asynchronous GuC reconfigurations after fresh reset, we should wait until all pending works are completed. This is especially important in case of LMEM provisioning, since we also need to update the LMTT and send invalidation requests to all GuCs, which are expected to be already in the VGT mode. Fixes: 68ae022278a1 ("drm/xe/pf: Force GuC virtualization mode") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://lore.kernel.org/r/20250801142822.180530-3-michal.wajdeczko@intel.com (cherry picked from commit c6c86441c465ea440dfb5039f1c26e629a6fd64c) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-04drm/xe/pf: Disable PF restart worker on device removalMichal Wajdeczko1-1/+31
We can't let restart worker run once device is removed, since other data that it might want to access could be already released. Explicitly disable worker as part of device cleanup action. Fixes: a4d1c5d0b99b ("drm/xe/pf: Move VFs reprovisioning to worker") Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://lore.kernel.org/r/20250801142822.180530-2-michal.wajdeczko@intel.com (cherry picked from commit a424353937c24554bb242a6582ed8f018b4a411c) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-08-04drm/xe/devcoredump: Defer devcoredump initialization during probeBalasubramani Vivekanandan2-4/+10
Doing devcoredump initializing before GT though look harmless, it leads to problem during driver unbind. Because of this order, GT/Engine release functions will be called before xe devcoredump release function (xe_driver_devcoredump_fini) leading to the following kernel crash[1] because the devcoredump functions might still use GT/Engine datastructures after those are freed. The following crash is observed while running the IGT xe_wedged@wedged-at-any-timeout. The test forces a wedged state by submitting a workload which hangs. Then does a unbind/rebind of the driver to recover from the wedged state. The hanged workload leads to a devcoredump. The following crash is noticed when the devcoredump capture races with the driver unbind. During driver unbind, the release function hw_engine_fini() will be called which assigns NULL to hwe->gt. But the same data structure is accessed during the coredump capture in the function xe_engine_snapshot_print by reading snapshot->hwe->gt. With this patch, we make sure the devcoredump is stopped before deinitializing the core driver functions. [1]: BUG: kernel NULL pointer dereference, address: 0000000000000000 Workqueue: events_unbound xe_devcoredump_deferred_snap_work [xe] RIP: 0010:xe_engine_snapshot_print+0x47/0x420 [xe] Call Trace: <TASK> ? drm_printf+0x64/0x90 __xe_devcoredump_read+0x23f/0x2d0 [xe] ? __pfx___drm_printfn_coredump+0x10/0x10 ? __pfx___drm_puts_coredump+0x10/0x10 xe_devcoredump_deferred_snap_work+0x17a/0x190 [xe] process_one_work+0x22e/0x6f0 worker_thread+0x1e8/0x3d0 ? __pfx_worker_thread+0x10/0x10 kthread+0x11f/0x250 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x47/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 v2: Detailed commit description (Rodrigo) v3: FIXME added (Rodrigo, Stuart) Fixes: 4209d635a823 ("drm/xe: Remove devcoredump during driver release") Reviewed-by: Stuart Summers <stuart.summers@intel.com> Link: https://lore.kernel.org/r/20250731061300.14320-1-balasubramani.vivekanandan@intel.com Signed-off-by: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com> Link: https://lore.kernel.org/r/20250801052356.21885-1-balasubramani.vivekanandan@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 1fdc4c381ff765479d76ccf3134717c430c871b8) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>