Topic/drm amdgpu gtt lock timeout ms by chun-wan · Pull Request #215 · ROCm/amdgpu

chun-wan · 2026-05-20T20:07:03Z

Add a new module parameter gtt_lock_timeout_ms (default 0 = unbounded,
byte-identical to the current mutex_lock() behaviour) that caps the
wait time for adev->mman.gtt_window_lock in the three amdgpu_ttm.c
paths that take it. At the deadline the helper returns -ETIME so the
caller fails fast instead of parking indefinitely on a wedged SDMA ring.

Motivation

In multi-tenant HIP/AMDGPU serving workloads, a single wedged SDMA
ring can park every caller of amdgpu_ttm_copy_mem_to_mem(),
amdgpu_ttm_clear_buffer(), and amdgpu_fill_buffer() on the global
mman.gtt_window_lock mutex for minutes-plus while a single in-flight
copy waits for the ring. The hold time is bounded only by the SDMA
recovery path (which itself may need tens of seconds). The legacy
unbounded mutex_lock() then converts a localised SDMA hang into a
system-wide GTT-window stall affecting every VRAM-touching ioctl from
every tenant on the device.

gtt_lock_timeout_ms lets operators opt the GTT window contention
path into a bounded-wait failure mode (return -ETIME) so the higher-
level survival policy can choose whether to retry, re-queue, or
surface the error to the application.

Technical Details

- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
  Declare int amdgpu_gtt_lock_timeout_ms; + module_param_named()
  + MODULE_PARM_DESC() + DOC comment. Default 0 = unbounded.
- drivers/gpu/drm/amd/amdgpu/amdgpu.h
  Add extern int amdgpu_gtt_lock_timeout_ms;.
- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
  Add static int amdgpu_ttm_lock_gtt_window(struct amdgpu_device *adev)
  helper. When amdgpu_gtt_lock_timeout_ms == 0, the helper
  degrades to a plain mutex_lock(&adev->mman.gtt_window_lock)
  and returns 0 (byte-identical to current behaviour). Otherwise
  it runs a mutex_trylock() loop that sleeps roughly 1 ms per
  iteration via schedule_timeout_interruptible(), bounded by the
  configured deadline, returning -ETIME on expiry or -ERESTARTSYS
  on a pending signal. Non-zero values below 100 ms are clamped up
  to 100 ms to avoid a fail-only path.
  Three callsites switch from mutex_lock(&...gtt_window_lock) to
  r = amdgpu_ttm_lock_gtt_window(adev); if (r) return r;:
  - amdgpu_ttm_copy_mem_to_mem() (SDMA buffer<->buffer copy)
  - amdgpu_ttm_clear_buffer() (SDMA buffer clear/wipe)
  - amdgpu_fill_buffer() (SDMA buffer fill)
  mutex_unlock() at each function's error label is unchanged.
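
The helper's control flow can be illustrated with a small userspace model. This is only a sketch, not the patch itself: it swaps the kernel mutex for a pthread mutex, uses nanosleep() in place of schedule_timeout_interruptible(), and does not model the -ERESTARTSYS signal path. The names effective_timeout_ms(), now_ms() and lock_gtt_window_model() are hypothetical, and treating negative values as "unbounded" is an assumption the description does not confirm.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Floor for non-zero timeouts, as stated in the PR description. */
#define GTT_LOCK_TIMEOUT_MIN_MS 100

/*
 * Hypothetical helper: 0 means "unbounded"; other values are clamped up
 * to the floor. Treating negative values as unbounded is an assumption.
 */
static int effective_timeout_ms(int timeout_ms)
{
	if (timeout_ms <= 0)
		return 0;
	if (timeout_ms < GTT_LOCK_TIMEOUT_MIN_MS)
		return GTT_LOCK_TIMEOUT_MIN_MS;
	return timeout_ms;
}

/* Monotonic clock in milliseconds, standing in for jiffies. */
static long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Userspace model of the bounded lock: returns 0 with the lock held,
 * or -ETIME once the deadline has passed without acquiring it.
 */
static int lock_gtt_window_model(pthread_mutex_t *lock, int timeout_ms)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };
	long long deadline;

	assert(lock != NULL);

	timeout_ms = effective_timeout_ms(timeout_ms);
	if (timeout_ms == 0) {
		/* Unbounded mode: plain blocking lock, as before the patch. */
		pthread_mutex_lock(lock);
		return 0;
	}

	deadline = now_ms() + timeout_ms;
	for (;;) {
		if (pthread_mutex_trylock(lock) == 0)
			return 0;
		if (now_ms() >= deadline)
			return -ETIME;
		nanosleep(&one_ms, NULL); /* back off ~1 ms between attempts */
	}
}
```

A caller in this model follows the same shape as the three callsites: take the lock through the helper, return the error immediately on failure, and unlock only on the success path.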

JIRA ID

ROCM-21571

Test Plan

Reproducer: multi-tenant SDMA-bound workload with one wedged ring
plus concurrent VRAM-touching ioctls from co-tenants. Open-source
reproducer to be added in a follow-up test patch.

Expected behaviour:

- gtt_lock_timeout_ms unset / 0: stock behaviour; all existing
  tests pass byte-identically (helper degrades to mutex_lock()).
- gtt_lock_timeout_ms=4000: under multi-tenant SDMA load, the
  worst-case wait on the lock is ~4 seconds (matching the
  configured deadline), with -ETIME propagated back to the caller
  cleanly.

Test Result

- All existing drm-tip self-tests pass at gtt_lock_timeout_ms=0
  (default). No behavioural change for stock callers.
- Multi-tenant SDMA reproducer confirms bounded ~4 s tail latency on
  the GTT window-lock contention path with gtt_lock_timeout_ms=4000
  vs unbounded minutes pre-patch.

Reviewers may suggest using mutex_lock_killable_timeout() if
preferred over the explicit trylock-and-sleep loop; the current form
keeps the dependency surface minimal (no kernel API additions).

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Signed-off-by: chun-wan <chun-wan@users.noreply.github.com>

Some drivers have different flows for hibernation and suspend. If the driver opportunistically will skip thaw() then it needs a hint to know what is happening after the hibernate. Introduce a new symbol pm_hibernation_mode_is_suspend() that drivers can call to determine if suspending the system for this purpose. Tested-by: Ionut Nechita <ionut_n2001@yahoo.com> Tested-by: Kenneth Crudup <kenny@panix.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit 495c8d3)

[Why] commit 530694f ("drm/amdgpu: do not resume device in thaw for normal hibernation") optimized the flow for systems that are going into S4 where the power would be turned off. Basically the thaw() callback wouldn't resume the device if the hibernation image was successfully created since the system would be powered off. This however isn't the correct flow for a system entering into s0i3 after the hibernation image is created. Some of the amdgpu callbacks have different behavior depending upon the intended state of the suspend. [How] Use pm_hibernation_mode_is_suspend() as an input to decide whether to run resume during thaw() callback. Reported-by: Ionut Nechita <ionut_n2001@yahoo.com> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4573 Tested-by: Ionut Nechita <ionut_n2001@yahoo.com> Fixes: 530694f ("drm/amdgpu: do not resume device in thaw for normal hibernation") Acked-by: Alex Deucher <alexander.deucher@amd.com> Tested-by: Kenneth Crudup <kenny@panix.com> Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org> Cc: 6.17+ <stable@vger.kernel.org> # 6.17+: 495c8d3: PM: hibernate: Add pm_hibernation_mode_is_suspend() Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> (cherry picked from commit 0a6e9e0)

a. hmm_range is either NULL or a valid pointer so we do not need to set range to NULL ever. b. keep the hmm_range_free in the end irrespective of the other conditions to avoid some additional checks and also avoid double free issue. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com>

These were not set so soft recovery was inadvertently disabled. Fixes: 6ac55ea ("drm/amdgpu: move reset support type checks into the caller") Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Update the asic_invalidate_hdp and asic_flush_hdp functions to check whether the IP function exists and return early if it does not. v2: Use else/if (Kevin). Update function name (Lijo). Signed-off-by: Asad Kamal <asad.kamal@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

Remove the NULL check from amdgpu_hmm_range_free for hmm_pfns as caller is responsible not to call amdgpu_hmm_range_free more than once. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>

Read CPER raw data from debugfs node "/sys/kernel/debug/dri/*/amdgpu_ring_cper". Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Change-Id: I01753bf4a1052a22144f6c2758a39d2b91c2212d

Remove amdgpu_asic_flush_hdp & amdgpu_asic_invalidate_hdp functions and directly use the mapped ones Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com>

Fix the error: drivers/gpu/drm/amd/amdgpu/../ras/ras_mgr/amdgpu_ras_mgr.c:132: undefined reference to `__udivdi3' Fixes: b5bae0f01786d ("drm/amd/ras: Add amdgpu ras management function") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202510272144.6SUHUoWx-lkp@intel.com/ Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com>

The IPID register value for bad page threshold CPER holds socket_id info now according to the latest definition. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Change-Id: I847509f0282e246a171194c4fdbe1dfe0b297bb0

When the EDID of an analog display is not available, we can't know the possible modes supported by the display. However, we still need to offer the user to select from a variety of common modes. It will be up to the user to select the best one, though. This is how it works on other operating systems as well as the legacy display code path in amdgpu. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com>

Check if we have an amdgpu_dm_connector->dc_sink first before adding common modes for analog outputs. If we don't have a sink yet we can safely skip this. Fixes: 0c9f9ca99238 ("drm/amd/display: Add common modes to analog displays without EDID") Signed-off-by: Harry Wentland <harry.wentland@amd.com> Reviewed-by: Timur Kristóf <timur.kristof@gmail.com>

v1: the driver should handle return value of smu_v13_0_6_printk_clk_levels() to return the correct size for sysfs reads. v2: fix the issue of size calculation error in smu_v13_0_6_print_clks() Fixes: 0354cd650daa ("drm/amd/pm: Avoid writing nulls into `pp_od_clk_voltage`") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

Use the correct label to complete all cleanup work. Fixes: 4d154b1 ("drm/amd/pm: Add support for DPM policies") Fixes: d2e690ff5d3cf ("drm/amd/pm: Add temperature metrics sysfs entry") Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

Add helper macros to define metrics struct definitions. It will define structs with field type followed by actual field. A helper macro is also added to initialize the field encoding for all fields and to initialize the field members to 0xFFs. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com>

The contents of si_dpm.h seem to have been copied from the old radeon driver, including a lot of structs and fields which were only relevant to GPU generations even older than SI. A lot of these can be deleted without causing much churn to the actual SI DPM code. Let's delete them to make the code easier to understand. Signed-off-by: Timur Kristóf <timur.kristof@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

No functional modification involved. ./drivers/gpu/drm/amd/display/dc/resource/dcn401/dcn401_resource.c:1674:3-4: Unneeded semicolon. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=26821 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

No functional modification involved. ./drivers/gpu/drm/amd/display/dc/resource/dcn32/dcn32_resource.c:1850:3-4: Unneeded semicolon. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=26821 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

No functional modification involved. ./drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:7392:3-4: Unneeded semicolon. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=26821 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Commit e6a8a00 ("drm/amd/display: Rename dml2 to dml2_0 folder") renames the directory dml2 to dml2_0 in ./drivers/gpu/drm/amd/display/dc, but misses to adjust the file entry in AMD DISPLAY CORE - DML. Adjust the file entry after this directory renaming. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Resetting VCN resets the entire tile, including jpeg. When resetting the VCN, we need to ensure that JPEG data blocks are accessible and we also need to handle the JPEG queue. Add a helper function to restore the JPEG queue during the VCN reset. v2: split the jpeg helper in two, in the top helper we can stop the sched workqueues and attempt to wait for any outstanding fences. Then in the bottom helper, we can force completion, re-init the rings, and restart the sched workqueues (Alex) v3: merge patches 4 and 5 into one patch (Alex) Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

Fix the vcn reset sequence in vcn_v4_0_3_ring_reset() to restore JPEG power state and unlock the JPEG powergating mutex before running the JPEG ring post-reset helper. Fixes: c50beca39115 ("drm/amdgpu/vcn4.0.3: rework reset handling") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>

Ensure GFX engine is idle before switching PTL state to prevent register access violations and CP hang. This addresses the race condition where in-flight GPU commands could conflict with PTL state changes. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

sdma ring reset is not supported in SRIOV. kfd driver does not check reset mask, and could queue sdma ring reset during unmap_queues_cpsch. Avoid the ring reset for sriov. Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

v4: use func "amdgpu_gfx_get_hdp_flush_mask" to get ref_and_mask for gfx9 through gfx12. v3: Unify the get_ref_and_mask function in amdgpu_gfx_funcs, to support both GFX11 and earlier generations v2: place "get_ref_and_mask" in amdgpu_gfx_funcs instead of amdgpu_ring, since this function only assigns the cp entry. v1: both gfx ring and mes ring use cp0 to flush hdp, cause conflict. use function get_ref_and_mask to assign the cp entry. reassign mes to use cp8 instead. Signed-off-by: chong li <chongli2@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com>

This patch allows the kfd driver to function correctly when AMD GPU devices are unplugged and replugged at run time. When an AMD GPU device is unplugged, the kfd driver stops all queues and then gracefully terminates existing kfd processes by sending SIGBUS to the user process. After that, user space can still use the remaining AMD GPU devices. When all AMD GPU devices in the system have been removed, the kfd driver will not respond to new requests. Unplugged AMD GPU devices can be re-plugged; the kfd driver will use the added devices to function as usual. The purpose of this patch is to make the kfd driver behave as expected during and after AMD GPU device unplug/replug at run time. Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit a482d054b7e7c7f2a35161d79e6629fa0f7f29d1) Change-Id: Ie33ea428914708546f7f96a627747f01bc6fcfdd

… paths Ungate GPU CG/PG in device_fini_hw and device_halt to protect GPU register accesses, e.g. GC registers are accessed in amdgpu_irq_disable_all() and amdgpu_fence_driver_hw_fini(). Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> (cherry picked from commit 33fe740db26e0c94791dde8b2926d3ab36c9e6ae) Change-Id: I09895beaff20b3caf125b15e17bc330392552393

…ntation Separate the PTL (Peak Tops Limiter) control logic into a stable public API layer and an internal implementation layer. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Suggested-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

Use a bitmap to track PTL disable requests from sysfs and profiler. PTL is only re-enabled once all sources have released their disable requests, avoiding premature enablement. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
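
The bitmap scheme described in this commit message can be sketched in a few lines of userspace C. This is an illustration under assumptions, not the driver code: the struct ptl_model type, the source enum values, and the function names are all hypothetical, and only the rule "PTL is enabled again once every source has released its disable request" is taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit positions, one per source that may disable PTL. */
enum ptl_disable_source {
	PTL_DISABLE_SRC_SYSFS = 0,
	PTL_DISABLE_SRC_PROFILER = 1,
};

/* Hypothetical state: one bit set per outstanding disable request. */
struct ptl_model {
	unsigned long disable_bitmap;
};

/* PTL counts as enabled only when no source holds a disable request. */
static bool ptl_model_enabled(const struct ptl_model *ptl)
{
	assert(ptl != NULL);
	return ptl->disable_bitmap == 0;
}

/* Record a disable request from one source. */
static void ptl_model_request_disable(struct ptl_model *ptl,
				      enum ptl_disable_source src)
{
	assert(ptl != NULL);
	ptl->disable_bitmap |= 1UL << src;
}

/*
 * Drop one source's request. Returns true if PTL may be re-enabled,
 * i.e. this was the last outstanding request.
 */
static bool ptl_model_release_disable(struct ptl_model *ptl,
				      enum ptl_disable_source src)
{
	assert(ptl != NULL);
	ptl->disable_bitmap &= ~(1UL << src);
	return ptl_model_enabled(ptl);
}
```

With one bit per source, a release from sysfs cannot re-enable PTL while the profiler still holds its request, which is the premature enablement the commit sets out to avoid.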

Add a new kernel module parameter 'amdgpu.ptl' to allow users to enable or disable PTL feature at driver loading time. Parameter values:
*) 0 or -1: disable PTL (default)
*) 1: enable PTL
*) 2: permanently disable PTL
Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
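
The value mapping listed for 'amdgpu.ptl' could be decoded along these lines. This is a sketch only: the enum, its member names, and ptl_policy_from_param() are hypothetical, and how the driver treats values outside the documented set is not stated in the commit message, so the PTL_POLICY_INVALID branch is an assumption.

```c
#include <assert.h>

/* Hypothetical decoded form of the 'amdgpu.ptl' module parameter. */
enum ptl_policy {
	PTL_POLICY_DISABLED,             /* 0 or -1: disable PTL (default) */
	PTL_POLICY_ENABLED,              /* 1: enable PTL */
	PTL_POLICY_PERMANENTLY_DISABLED, /* 2: permanently disable PTL */
	PTL_POLICY_INVALID,              /* assumption: anything else */
};

/* Map the raw integer parameter onto the documented policies. */
static enum ptl_policy ptl_policy_from_param(int value)
{
	switch (value) {
	case -1:
	case 0:
		return PTL_POLICY_DISABLED;
	case 1:
		return PTL_POLICY_ENABLED;
	case 2:
		return PTL_POLICY_PERMANENTLY_DISABLED;
	default:
		return PTL_POLICY_INVALID;
	}
}
```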

Add an explicit check on cmd->resp.status after psp_cmd_submit_buf() returns to ensure PTL state is only updated on actual success. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

Invalidating a dmabuf will impact other users of the shared BO. In the scenario where process A moves the BO, it needs to inform process B about the move and process B will need to update its page table.

The commit fixes a synchronisation bug caused by the use of the ticket: it made amdgpu_vm_handle_moved behave as if updating the page table immediately was correct but in this case it's not.

An example is the following scenario, with 2 GPUs and glxgears running on GPU0 and Xorg running on GPU1, on a system where P2P PCI isn't supported:

glxgears:
  export linear buffer from GPU0 and import using GPU1
  submit frame rendering to GPU0
  submit tiled->linear blit
Xorg:
  copy of linear buffer

The sequence of jobs would be:
  drm_sched_job_run    # GPU0, frame rendering
  drm_sched_job_queue  # GPU0, blit
  drm_sched_job_done   # GPU0, frame rendering
  drm_sched_job_run    # GPU0, blit
  move linear buffer for GPU1 access
    # amdgpu_dma_buf_move_notify -> update pt  # GPU0

At this point the blit job on GPU0 is still running and would likely produce a page fault.

Cc: stable@vger.kernel.org
Fixes: a448cb0 ("drm/amdgpu: implement amdgpu_gem_prime_move_notify v2")
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>

Firmware and monitoring tools may not be ready to receive a CPER when we read the bad pages, so send the CPERs at the end of RAS initialization to ensure that the FW is ready to receive and process the CPER. This removes the previous CPER submission that was added during bad page load, and sends both in-band and out-of-band at the same time. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com>

This patch allows the kfd driver to function correctly when AMD GPU devices are unplugged and replugged at run time. When an AMD GPU device is unplugged, the kfd driver stops all queues and then gracefully terminates existing kfd processes by sending SIGBUS to the user process. After that, user space can still use the remaining AMD GPU devices. When all AMD GPU devices in the system have been removed, the kfd driver will not respond to new requests. Unplugged AMD GPU devices can be re-plugged; the kfd driver will use the added devices to function as usual. The purpose of this patch is to make the kfd driver behave as expected during and after AMD GPU device unplug/replug at run time. Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>

GC 9.4.4 uses SPI busy status for idle detection instead of GRBM GUI_ACTIVE. Add version check to use SPI_BUSY for 9.4.4 while keeping GRBM_STATUS GUI_ACTIVE check for other GC versions. v2: move this check into amdgpu_ptl_perf_monitor_ctrl(Lijo) Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

Move amdgpu_amdkfd_stop/start_sched calls from kfd_ptl_control() into amdgpu_ptl_perf_monitor_ctrl() so all PTL callers (KFD ioctl, sysfs, GFX init) get consistent scheduling management. Add amdgpu_amdkfd_stop/start_sched_all() wrappers to stop and restart KFD scheduling on all nodes without assuming node ID ordering. v3: * call start/stop for PTL Set Only v2: * move the stop/start sched function to amdgpu_ptl_perf_monitor_ctrl(Lijo) * add wrapper amdgpu_amdkfd_stop_sched_all and amdgpu_amdkfd_start_sched_all (Lijo) Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

Set the AMDGPU_PTL_DISABLE_SYSFS bit in adev->psp.disable_bitmap during gfx_v9_4_3_perf_monitor_ptl_init(). This ensures that PTL is initially disabled via the SYSFS mechanism, matching the intended default state and preventing unintended PTL enablement before explicit user action. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

Fix this by skipping the sysfs disable mapping when the GPU is currently undergoing a reset or suspend flow. Additionally, add debug logging in psp_ptl_invoke() to better trace PTL state and format queries/updates cmd. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

superm1 and others added 30 commits October 28, 2025 10:49

drm/amdkcl: test whether pm_hibernation_mode_is_suspend() is available

59db155

It's caused by the commit: 1fcbc0b6a8 "drm/amd: Fix hybrid sleep" Signed-off-by: Chengjun Yao <Chengjun.Yao@amd.com> Reviewed-by: Bob Zhou <Bob.Zhou@amd.com>

Merge amd-staging-dkms-6.16 into amd-mainline-dkms-6.16

d8c009c

Signed-off-by: Yang Su <Yang.Su2@amd.com>

Bump AMDGPU version to 6.16.9

47a0d3c

Signed-off-by: Yang Su <Yang.Su2@amd.com>

drm/amdgpu/pm: Add definition for gpu_metrics v1.9

478f84b

Add gpu metrics definition which is only a set of gpu metrics attributes. A field is encoded by its id, type and number of instances. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com>

drm/amdgpu: move reset debug disable handling

b96d877

Move everything to the supported resets masks rather than having an explicit misc checks for this. Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Revert "drm/amd/display: Add common modes to analog displays without …

f027f44

…EDID" This reverts commit 11b66b2. It caused Jira ticket SWDEV-563655, so revert it temporarily. Signed-off-by: Chengjun Yao <Chengjun.Yao@amd.com>

drm/amdgpu: Fix error injection parameter error

1e10f8b

Fix error injection parameter error. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Merge amd-staging-dkms-6.16 into amd-mainline-dkms-6.16

abf04b2

Signed-off-by: Yang Su <Yang.Su2@amd.com>

Bump AMDGPU version to 6.16.10

0b5f7c9

Signed-off-by: Yang Su <Yang.Su2@amd.com>

drm/amd/pm: Use gpu metrics 1.9 for SMUv13.0.6

6115a3a

Fill and publish GPU metrics in v1.9 format for SMUv13.0.6 SOCs Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com>

Jie1zhang and others added 29 commits February 4, 2026 21:03

drm/amdgpu: Avoid excessive dmesg log

55e4164

KIQ access is not guaranteed to work reliably under all reset situations. Avoid flooding dmesg with HDP flush failure messages. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com>

Revert "drm/amdgpu: Wait for GFX idle before PTL state transition"

ff677bf

This reverts commit c61fab0. Reason for revert: the patch causes a performance drop, as tested by the CE team; the revert was requested by the management team. Change-Id: I0db3f9f819554566e259bbb1292e7690db958ced

drm/amdgpu: add new data types F8 and Vector for PTL

d5a51b9

Add F8 and VECTOR to amdgpu_ptl_fmt and PSP format mapping. Update PTL format strings and GFX format enum to keep PSP/KFD in sync. Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

Revert "drm/amdkfd: kfd driver supports hot unplug/replug amdgpu devi…

28f5e8a

…ces" This reverts commit 1562589. Reason for revert: <this patch breaks dkms install> Change-Id: If54b6c5f703eb136db61770d5ddafcba22bf4620

drm/amdgpu: create PTL sysfs after XGMI reset-on-init restore

13f9e43

Create PTL sysfs in xgmi_reset_on_init restore path for MINIMAL_XGMI Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

drm/amdkfd: fix unhalt_cpsch warning during module unload

5914d95

Downgrade unhalt_cpsch warning to dev_dbg when sched is already stopped Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

drm/amdgpu: only set PTL SYSFS disable bit when PTL is disabled

fa0bd36

Only set the bit when PTL is actually being disabled (state=0) Signed-off-by: Perry Yuan <perry.yuan@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>

drm/amdgpu: return when ras table checksum is error

dc79b34

end the function flow when ras table checksum is error Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com>

drm/amdgpu: compatible with specific RAS old eeprom format

ef22bd6

Handle RAS eeprom record when UMC_CHANNEL_IDX_V2 is set. v2: get UMC_CHANNEL_IDX_V2 flag before the clear of it. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

drm/amdgpu: clear related counter after RAS eeprom reset

1c85654

Make eeprom data and its counter consistent. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

chun-wan force-pushed the topic/drm-amdgpu-gtt-lock-timeout-ms branch from f0a2d8f to f4b93d8 Compare May 20, 2026 20:16

chun-wan wants to merge 2668 commits into ROCm:master from chun-wan:topic/drm-amdgpu-gtt-lock-timeout-ms


20 participants
