Skip to content

Ensure is_done is set for unmanaged commands#9719

Closed
richard-byron wants to merge 2 commits intoXilinx:masterfrom
richard-byron:master
Closed

Ensure is_done is set for unmanaged commands#9719
richard-byron wants to merge 2 commits intoXilinx:masterfrom
richard-byron:master

Conversation

@richard-byron
Copy link
Copy Markdown

Problem solved by the commit

  • Fixes hitting assert in run_impl destructor when unmanaged have actually finished but their state has not been updated
  • Resolves potential deadlock by removing lock from is_done. The lock here should not be necessary, and is_done is called from set_dtrace_control_file while it already holds the mutex
  • Fixes potential race conditions with m_callbacks being accessed outside of locks, which could theoretically results in callbacks being called multiple times, or not at all.

Bug / issue (if any) fixed, which PR introduced the bug, how it was discovered

Introduced in #9649

How problem was solved, alternative solutions (if any) and why they were rejected

There are simpler solutions. Just calling get_state() before is_done() works, as long as you can convince the compiler it actually needs to update m_done. That would leave the other issues unresolved, but I'm not aware of them actually causing problems.

Risks (if any) associated the changes in the commit

Might change behavior with multiple threads waiting on runs to complete.

What has been tested and how, request additional testing if necessary

Tested on several designs, including ones that were hitting the assert and ones that weren't. More testing would definitely be desirable.

@richard-byron richard-byron requested a review from stsoe as a code owner April 8, 2026 22:00
@xrt-pr-bot
Copy link
Copy Markdown

xrt-pr-bot Bot commented Apr 8, 2026

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

  • XRT Admins: Add the build label to authorize this PR build
  • OR Add @richard-byron as a repository collaborator

1 similar comment
@xrt-pr-bot
Copy link
Copy Markdown

xrt-pr-bot Bot commented Apr 13, 2026

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

  • XRT Admins: Add the build label to authorize this PR build
  • OR Add @richard-byron as a repository collaborator

@xrt-pr-bot
Copy link
Copy Markdown

xrt-pr-bot Bot commented Apr 13, 2026

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

  • XRT Admins: Add the build label to authorize this PR build
  • OR Add @richard-byron as a repository collaborator

@xrt-pr-bot
Copy link
Copy Markdown

xrt-pr-bot Bot commented Apr 13, 2026

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

  • XRT Admins: Add the build label to authorize this PR build
  • OR Add @richard-byron as a repository collaborator

Copy link
Copy Markdown
Collaborator

@stsoe stsoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @richard-byron. Thanks for the suggested changes.

If run_impl dtor complains, then that is because wait2() was called as expected. The execution model requires wait to be called, but the dtor could potentially call existing get_state() to update m_done before complaining or aborting.

I see the deadlock you mention, thanks!, but that should be fixed without the changes in this PR.

I also see the potential race in m_callbacks, but that I think is only one place missing a lock, e.g. pop_callback()?

I am not exited about these changes because I can think in particular get_state() results in potential problems when it no longer calls notify(). I added few specific comments.

I would like if we could fix just the deadlock, and the unnecessary pointer to callback vector. I do not want to work-around missing calls to functions that update m_done, (wait2() or get_state()).

{
std::lock_guard<std::mutex> lk(m_mutex);
if (!m_done && !m_managed) {
m_done = get_state() >= ERT_CMD_STATE_COMPLETED;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not comfortable updating m_done outside the lock. I could be in process of being update by run()

auto state = get_state_raw();
notify(state); // update command state accordingly
return state;
return get_state_raw();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is behavior change, if get_state() reflects completed, m_done can still be false.

// lock must not be held while calling callback function
if (complete)
m_callbacks.get()->back()(state);
fcn(state);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling moved fcn ?

private:
struct callback_properties {
callback_function_type fn;
mutable bool called;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not following why called is necessary?

Comment on lines +958 to +967
// cannot lock mutex while calling the callbacks
// so copy address of callbacks while holding the lock
// then execute callbacks without lock
if (!m_callbacks.empty()) {
copy.reserve(m_callbacks.size());
for (auto& cb : m_callbacks) {
if (!cb.called) {
cb.called = true;
copy.emplace_back(cb.fn);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see what was wrong with earlier logic, nor do I see the need for called?

mutable std::condition_variable m_exec_done;

std::unique_ptr<callback_list> m_callbacks;
callback_list m_callbacks; // don't see any reason this was a pointer?
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree !

@richard-byron
Copy link
Copy Markdown
Author

I think the minimal working change is:

  • remove the lock in is_done() (without the other change here. The read of m_done is atomic anyway, so this is not required and avoids the deadlock)
  • set m_done in wait() when m_managed is false (as in the PR currently)

I wasn't aware wait was required to be called, or I probably would have just done that

You could just add get_state() in the destructor instead, but that was getting optimized out when I tried it, unless I printed the return value.

@stsoe
Copy link
Copy Markdown
Collaborator

stsoe commented Apr 13, 2026

I think the minimal working change is:

  • remove the lock in is_done() (without the other change here. The read of m_done is atomic anyway, so this is not required and avoids the deadlock)

m_done is not atomic, maybe it should be. As is, it should not be updated outside the lock.

  • set m_done in wait() when m_managed is false (as in the PR currently)

You mean as the PR does in is_done() ? With m_done changed to atomic and with the PR change in is_done(), the call to wait prior to run_impl dtor becomes optional. This sounds okay to me.

@richard-byron
Copy link
Copy Markdown
Author

What I mean when I say m_done is atomic is that the read from memory should already be atomic (unless it was using 64 bits for a bool on a 32 bit processor, or something crazy like that). The lock doesn't actually do anything.
I don't think m_done needs to be atomic, but it probably makes things clearer if it is.

is_done() could then become:
m_done.compare_exchange_strong(false, get_state_raw() >= ERT_CMD_STATE_COMPLETED); return m_done;
but my question with that is, do we need to call m_hwqueue.poll() first? In that case get_state() seems to be the better call, but we probably don't want is_done() triggering a call to notify() and potentially firing off callbacks?

I did also think about is_done(bool updateState = false) and having the destructor call is_done(true); but wasn't sure I liked that either.

As long as it's ok to require wait to be called, I think the cleanest solution is just to add:
m_done.compare_exchange_strong(false, get_state_raw() >= ERT_CMD_STATE_COMPLETED);

to the end of both kernel_command wait() functions (and remove the lock from is_done).

@stsoe
Copy link
Copy Markdown
Collaborator

stsoe commented Apr 13, 2026

Without lock (or atomic) how does one guarantee memory ordering or that writers change become visible to readers?

yeah, poll and notify need to be called. I don't want this over engineered. Let's change m_done to atomic. Let is_done call get_state() with no changes, such that notify is called also without any changes. Something like this (and atomic) being the only change.

  bool
  is_done() const
  {
    if (!m_done && !m_managed)
      return (get_state() >= ERT_CMD_STATE_COMPLETED);

    return m_done;
  }

@xrt-pr-bot
Copy link
Copy Markdown

xrt-pr-bot Bot commented Apr 15, 2026

⚠️ Authorization Failed

@richard-byron is not a repository collaborator.

To proceed:

  • XRT Admins: Add the build label to authorize this PR build
  • OR Add @richard-byron as a repository collaborator

Copy link
Copy Markdown
Author

@richard-byron richard-byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with calling get_state() inside is_done() is that we might end up trying to take the lock twice. Avoiding that requires making additional changes, but I think this implementation avoids all threading issues without changing behavior.

Copy link
Copy Markdown
Collaborator

@stsoe stsoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The execution model requires that applications call xrt::run::wait2() to ensure completion is reflected in the state of the run implementation. I feel there are too many unrelated changes in this PR, maybe they work, but I cannot easily understand why they are necessary.

There is a fix required to resolve the dtrace logging deadlock you brought to our attention. We will ticket this and resolve differently.

@richard-byron
Copy link
Copy Markdown
Author

Calling wait2 does not update m_done, so the assert in the destructor can still fire spuriously. If calling wait2 is a requirement though, then you're correct that there is a simpler fix - just update m_done in wait/wait2.

@stsoe
Copy link
Copy Markdown
Collaborator

stsoe commented Apr 15, 2026

Calling wait2 does not update m_done, so the assert in the destructor can still fire spuriously. If calling wait2 is a requirement though, then you're correct that there is a simpler fix - just update m_done in wait/wait2.

It's rather convoluted, I agree. But wait2() definitely ends up reflecting completion with m_done set to true. If a command is "managed" then notify() is called with state change when command monitor observes the command completing. If command is "unmanaged" then m_hwqueue.wait() calls notify(). In both cases m_done reflects completion.

@richard-byron
Copy link
Copy Markdown
Author

I'm fairly certain m_done was not being set for "unmanaged" commands that had actually completed, even after waiting. The graph I was testing certainly had issues though - so possibly there was some error causing that. Happy to just close this in any case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants