Ensure is_done is set for unmanaged commands#9719
Ensure is_done is set for unmanaged commands#9719richard-byron wants to merge 2 commits intoXilinx:masterfrom
Conversation
|
@richard-byron is not a repository collaborator. To proceed:
|
1 similar comment
|
@richard-byron is not a repository collaborator. To proceed:
|
|
@richard-byron is not a repository collaborator. To proceed:
|
|
@richard-byron is not a repository collaborator. To proceed:
|
stsoe
left a comment
There was a problem hiding this comment.
Hi @richard-byron. Thanks for the suggested changes.
If run_impl dtor complains, then that is because wait2() was called as expected. The execution model requires wait to be called, but the dtor could potentially call existing get_state() to update m_done before complaining or aborting.
I see the deadlock you mention, thanks!, but that should be fixed without the changes in this PR.
I also see the potential race in m_callbacks, but that I think is only one place missing a lock, e.g. pop_callback()?
I am not exited about these changes because I can think in particular get_state() results in potential problems when it no longer calls notify(). I added few specific comments.
I would like if we could fix just the deadlock, and the unnecessary pointer to callback vector. I do not want to work-around missing calls to functions that update m_done, (wait2() or get_state()).
| { | ||
| std::lock_guard<std::mutex> lk(m_mutex); | ||
| if (!m_done && !m_managed) { | ||
| m_done = get_state() >= ERT_CMD_STATE_COMPLETED; |
There was a problem hiding this comment.
I am not comfortable updating m_done outside the lock. I could be in process of being update by run()
| auto state = get_state_raw(); | ||
| notify(state); // update command state accordingly | ||
| return state; | ||
| return get_state_raw(); |
There was a problem hiding this comment.
This is behavior change, if get_state() reflects completed, m_done can still be false.
| // lock must not be held while calling callback function | ||
| if (complete) | ||
| m_callbacks.get()->back()(state); | ||
| fcn(state); |
| private: | ||
| struct callback_properties { | ||
| callback_function_type fn; | ||
| mutable bool called; |
There was a problem hiding this comment.
I am not following why called is necessary?
| // cannot lock mutex while calling the callbacks | ||
| // so copy address of callbacks while holding the lock | ||
| // then execute callbacks without lock | ||
| if (!m_callbacks.empty()) { | ||
| copy.reserve(m_callbacks.size()); | ||
| for (auto& cb : m_callbacks) { | ||
| if (!cb.called) { | ||
| cb.called = true; | ||
| copy.emplace_back(cb.fn); | ||
| } |
There was a problem hiding this comment.
I don't see what was wrong with earlier logic, nor do I see the need for called?
| mutable std::condition_variable m_exec_done; | ||
|
|
||
| std::unique_ptr<callback_list> m_callbacks; | ||
| callback_list m_callbacks; // don't see any reason this was a pointer? |
|
I think the minimal working change is:
I wasn't aware wait was required to be called, or I probably would have just done that You could just add get_state() in the destructor instead, but that was getting optimized out when I tried it, unless I printed the return value. |
You mean as the PR does in |
|
What I mean when I say m_done is atomic is that the read from memory should already be atomic (unless it was using 64 bits for a bool on a 32 bit processor, or something crazy like that). The lock doesn't actually do anything. is_done() could then become: I did also think about As long as it's ok to require wait to be called, I think the cleanest solution is just to add: to the end of both kernel_command wait() functions (and remove the lock from is_done). |
|
Without lock (or atomic) how does one guarantee memory ordering or that writers change become visible to readers? yeah, poll and notify need to be called. I don't want this over engineered. Let's change m_done to atomic. Let is_done call get_state() with no changes, such that notify is called also without any changes. Something like this (and atomic) being the only change. |
|
@richard-byron is not a repository collaborator. To proceed:
|
richard-byron
left a comment
There was a problem hiding this comment.
The problem with calling get_state() inside is_done() is that we might end up trying to take the lock twice. Avoiding that requires making additional changes, but I think this implementation avoids all threading issues without changing behavior.
stsoe
left a comment
There was a problem hiding this comment.
The execution model requires that applications call xrt::run::wait2() to ensure completion is reflected in the state of the run implementation. I feel there are too many unrelated changes in this PR, maybe they work, but I cannot easily understand why they are necessary.
There is a fix required to resolve the dtrace logging deadlock you brought to our attention. We will ticket this and resolve differently.
|
Calling wait2 does not update m_done, so the assert in the destructor can still fire spuriously. If calling wait2 is a requirement though, then you're correct that there is a simpler fix - just update m_done in wait/wait2. |
It's rather convoluted, I agree. But wait2() definitely ends up reflecting completion with m_done set to true. If a command is "managed" then notify() is called with state change when command monitor observes the command completing. If command is "unmanaged" then m_hwqueue.wait() calls notify(). In both cases m_done reflects completion. |
|
I'm fairly certain m_done was not being set for "unmanaged" commands that had actually completed, even after waiting. The graph I was testing certainly had issues though - so possibly there was some error causing that. Happy to just close this in any case. |
Problem solved by the commit
Bug / issue (if any) fixed, which PR introduced the bug, how it was discovered
Introduced in #9649
How problem was solved, alternative solutions (if any) and why they were rejected
There are simpler solutions. Just calling get_state() before is_done() works, as long as you can convince the compiler it actually needs to update m_done. That would leave the other issues unresolved, but I'm not aware of them actually causing problems.
Risks (if any) associated the changes in the commit
Might change behavior with multiple threads waiting on runs to complete.
What has been tested and how, request additional testing if necessary
Tested on several designs, including ones that were hitting the assert and ones that weren't. More testing would definitely be desirable.