Ensure is_done is set for unmanaged commands by richard-byron · Pull Request #9719 · Xilinx/XRT

richard-byron · 2026-04-08T22:00:02Z

Problem solved by the commit

Fixes hitting assert in run_impl destructor when unmanaged have actually finished but their state has not been updated
Resolves potential deadlock by removing lock from is_done. The lock here should not be necessary, and is_done is called from set_dtrace_control_file while it already holds the mutex
Fixes potential race conditions with m_callbacks being accessed outside of locks, which could theoretically results in callbacks being called multiple times, or not at all.

Bug / issue (if any) fixed, which PR introduced the bug, how it was discovered

Introduced in #9649

How problem was solved, alternative solutions (if any) and why they were rejected

There are simpler solutions. Just calling get_state() before is_done() works, as long as you can convince the compiler it actually needs to update m_done. That would leave the other issues unresolved, but I'm not aware of them actually causing problems.

Risks (if any) associated the changes in the commit

Might change behavior with multiple threads waiting on runs to complete.

What has been tested and how, request additional testing if necessary

Tested on several designs, including ones that were hitting the assert and ones that weren't. More testing would definitely be desirable.

xrt-pr-bot · 2026-04-08T22:00:22Z

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

XRT Admins: Add the build label to authorize this PR build
OR Add @richard-byron as a repository collaborator

xrt-pr-bot · 2026-04-13T19:16:46Z

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

XRT Admins: Add the build label to authorize this PR build
OR Add @richard-byron as a repository collaborator

xrt-pr-bot · 2026-04-13T19:34:57Z

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

XRT Admins: Add the build label to authorize this PR build
OR Add @richard-byron as a repository collaborator

xrt-pr-bot · 2026-04-13T19:42:10Z

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

XRT Admins: Add the build label to authorize this PR build
OR Add @richard-byron as a repository collaborator

stsoe

Hi @richard-byron. Thanks for the suggested changes.

If run_impl dtor complains, then that is because wait2() was called as expected. The execution model requires wait to be called, but the dtor could potentially call existing get_state() to update m_done before complaining or aborting.

I see the deadlock you mention, thanks!, but that should be fixed without the changes in this PR.

I also see the potential race in m_callbacks, but that I think is only one place missing a lock, e.g. pop_callback()?

I am not exited about these changes because I can think in particular get_state() results in potential problems when it no longer calls notify(). I added few specific comments.

I would like if we could fix just the deadlock, and the unnecessary pointer to callback vector. I do not want to work-around missing calls to functions that update m_done, (wait2() or get_state()).

stsoe · 2026-04-13T20:01:53Z

  {
-    std::lock_guard<std::mutex> lk(m_mutex);
+    if (!m_done && !m_managed) {
+      m_done = get_state() >= ERT_CMD_STATE_COMPLETED;


I am not comfortable updating m_done outside the lock. I could be in process of being update by run()

stsoe · 2026-04-13T20:03:23Z

-    auto state = get_state_raw();
-    notify(state);  // update command state accordingly
-    return state;
+    return get_state_raw();


This is behavior change, if get_state() reflects completed, m_done can still be false.

stsoe · 2026-04-13T20:04:00Z

+    // lock must not be held while calling callback function
    if (complete)
-      m_callbacks.get()->back()(state);
+      fcn(state);


Calling moved fcn ?

stsoe · 2026-04-13T20:19:25Z

 private:
+  struct callback_properties {
+    callback_function_type  fn;
+    mutable bool            called;


I am not following why called is necessary?

stsoe · 2026-04-13T20:25:46Z

+      // cannot lock mutex while calling the callbacks
+      // so copy address of callbacks while holding the lock
+      // then execute callbacks without lock
+      if (!m_callbacks.empty()) {
+        copy.reserve(m_callbacks.size());
+        for (auto& cb : m_callbacks) {
+          if (!cb.called) {
+            cb.called = true;
+            copy.emplace_back(cb.fn);
+          }


I don't see what was wrong with earlier logic, nor do I see the need for called?

stsoe · 2026-04-13T20:26:06Z

  mutable std::condition_variable m_exec_done;

-  std::unique_ptr<callback_list> m_callbacks;
+  callback_list m_callbacks; // don't see any reason this was a pointer?


richard-byron · 2026-04-13T21:45:03Z

I think the minimal working change is:

remove the lock in is_done() (without the other change here. The read of m_done is atomic anyway, so this is not required and avoids the deadlock)
set m_done in wait() when m_managed is false (as in the PR currently)

I wasn't aware wait was required to be called, or I probably would have just done that

You could just add get_state() in the destructor instead, but that was getting optimized out when I tried it, unless I printed the return value.

stsoe · 2026-04-13T22:14:02Z

I think the minimal working change is:

remove the lock in is_done() (without the other change here. The read of m_done is atomic anyway, so this is not required and avoids the deadlock)

m_done is not atomic, maybe it should be. As is, it should not be updated outside the lock.

set m_done in wait() when m_managed is false (as in the PR currently)

You mean as the PR does in is_done() ? With m_done changed to atomic and with the PR change in is_done(), the call to wait prior to run_impl dtor becomes optional. This sounds okay to me.

richard-byron · 2026-04-13T22:53:56Z

What I mean when I say m_done is atomic is that the read from memory should already be atomic (unless it was using 64 bits for a bool on a 32 bit processor, or something crazy like that). The lock doesn't actually do anything.
I don't think m_done needs to be atomic, but it probably makes things clearer if it is.

is_done() could then become:
m_done.compare_exchange_strong(false, get_state_raw() >= ERT_CMD_STATE_COMPLETED); return m_done;
but my question with that is, do we need to call m_hwqueue.poll() first? In that case get_state() seems to be the better call, but we probably don't want is_done() triggering a call to notify() and potentially firing off callbacks?

I did also think about is_done(bool updateState = false) and having the destructor call is_done(true); but wasn't sure I liked that either.

As long as it's ok to require wait to be called, I think the cleanest solution is just to add:
m_done.compare_exchange_strong(false, get_state_raw() >= ERT_CMD_STATE_COMPLETED);

to the end of both kernel_command wait() functions (and remove the lock from is_done).

stsoe · 2026-04-13T23:07:00Z

Without lock (or atomic) how does one guarantee memory ordering or that writers change become visible to readers?

yeah, poll and notify need to be called. I don't want this over engineered. Let's change m_done to atomic. Let is_done call get_state() with no changes, such that notify is called also without any changes. Something like this (and atomic) being the only change.

  bool
  is_done() const
  {
    if (!m_done && !m_managed)
      return (get_state() >= ERT_CMD_STATE_COMPLETED);

    return m_done;
  }

xrt-pr-bot · 2026-04-15T16:54:20Z

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

XRT Admins: Add the build label to authorize this PR build
OR Add @richard-byron as a repository collaborator

richard-byron

The problem with calling get_state() inside is_done() is that we might end up trying to take the lock twice. Avoiding that requires making additional changes, but I think this implementation avoids all threading issues without changing behavior.

stsoe

The execution model requires that applications call xrt::run::wait2() to ensure completion is reflected in the state of the run implementation. I feel there are too many unrelated changes in this PR, maybe they work, but I cannot easily understand why they are necessary.

There is a fix required to resolve the dtrace logging deadlock you brought to our attention. We will ticket this and resolve differently.

richard-byron · 2026-04-15T17:18:05Z

Calling wait2 does not update m_done, so the assert in the destructor can still fire spuriously. If calling wait2 is a requirement though, then you're correct that there is a simpler fix - just update m_done in wait/wait2.

stsoe · 2026-04-15T18:02:55Z

Calling wait2 does not update m_done, so the assert in the destructor can still fire spuriously. If calling wait2 is a requirement though, then you're correct that there is a simpler fix - just update m_done in wait/wait2.

It's rather convoluted, I agree. But wait2() definitely ends up reflecting completion with m_done set to true. If a command is "managed" then notify() is called with state change when command monitor observes the command completing. If command is "unmanaged" then m_hwqueue.wait() calls notify(). In both cases m_done reflects completion.

richard-byron · 2026-04-15T19:44:14Z

I'm fairly certain m_done was not being set for "unmanaged" commands that had actually completed, even after waiting. The graph I was testing certainly had issues though - so possibly there was some error causing that. Happy to just close this in any case.

richard-byron requested a review from stsoe as a code owner April 8, 2026 22:00

richard-byron force-pushed the master branch from 7738ea1 to c7b4e8d Compare April 13, 2026 19:33

Ensure is_done is set for unmanaged commands

a32f443

richard-byron force-pushed the master branch from c7b4e8d to a32f443 Compare April 13, 2026 19:41

stsoe requested changes Apr 13, 2026

View reviewed changes

Minimal changes to ensure is_done reports command status correctly

d3ea5bb

richard-byron commented Apr 15, 2026

View reviewed changes

stsoe reviewed Apr 15, 2026

View reviewed changes

richard-byron closed this Apr 15, 2026

Conversation

richard-byron commented Apr 8, 2026

Problem solved by the commit

Bug / issue (if any) fixed, which PR introduced the bug, how it was discovered

How problem was solved, alternative solutions (if any) and why they were rejected

Risks (if any) associated the changes in the commit

What has been tested and how, request additional testing if necessary

Uh oh!

xrt-pr-bot Bot commented Apr 8, 2026

Uh oh!

xrt-pr-bot Bot commented Apr 13, 2026

Uh oh!

xrt-pr-bot Bot commented Apr 13, 2026

Uh oh!

xrt-pr-bot Bot commented Apr 13, 2026

Uh oh!

stsoe left a comment

Choose a reason for hiding this comment

Uh oh!

stsoe Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

stsoe Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

stsoe Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

stsoe Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

stsoe Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

stsoe Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

richard-byron commented Apr 13, 2026

Uh oh!

stsoe commented Apr 13, 2026

Uh oh!

richard-byron commented Apr 13, 2026

Uh oh!

stsoe commented Apr 13, 2026

Uh oh!

xrt-pr-bot Bot commented Apr 15, 2026

Uh oh!

richard-byron left a comment

Choose a reason for hiding this comment

Uh oh!

stsoe left a comment

Choose a reason for hiding this comment

Uh oh!

richard-byron commented Apr 15, 2026

Uh oh!

stsoe commented Apr 15, 2026

Uh oh!

richard-byron commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants