report hung radostrace ops on interrupt#92
Conversation
|
test result - https://paste.ubuntu.com/p/FvpMC8XsCX/ |
| clog << "process killed" << endl; | ||
| got_sigint = 1; | ||
| } else { | ||
| exit(signum); |
There was a problem hiding this comment.
exit(3) isn't a async-signal-safe function: https://www.man7.org/linux/man-pages/man7/signal-safety.7.html
better to use _exit() instead.
The same concern is applicable for clog too. We need to review "signal safety" across the codease. But that can be done separately later.
| int found = 0; | ||
|
|
||
| memset(&key, 0, sizeof(key)); | ||
| while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) { |
There was a problem hiding this comment.
To get the first key that's not in the map, bpf_map_get_next_key needs to be called with NULL as key. But a valid "key" is passed now (key is zero-initalized but it's not NULL), so this is likely to miss the correct first key.
struct client_op_k *cur = NULL;
while (bpf_map_get_next_key(map_fd, cur, &next_key) == 0) {
if (bpf_map_lookup_elem(map_fd, &next_key, &val) == 0) {
...
}
key = next_key;
cur = &key;
}
would fix it.
|
I also think "hung ops" better describes than "hang ops", so suggest name changes as such. |
aa20743 to
b05de17
Compare
taodd
left a comment
There was a problem hiding this comment.
We also need to add a test to deliberately cause a hang Op
| } | ||
| if (name_len > 0) { | ||
| bpf_probe_read_user(val->object_name, name_len, (void *)name_base); | ||
| } |
b05de17 to
762503a
Compare
|
test result in microceph - https://paste.ubuntu.com/p/9NqMgrcz4K/ |
|
Thanks @zhhuabj
|
|
Also, the patch needs to be rebased |
762503a to
f67ffe2
Compare
Hi @dongdong, Thank you for the feedback! I've addressed all three suggestions: 1, pid from send_op: Moved val->pid = get_pid() to uprobe_send_op so incomplete ops also have a valid pid. and here is the latest test result - https://paste.ubuntu.com/p/3sSGqb3WGv/ |
Previously, radostrace silently discarded any ops that were in-flight (tracked in the BPF ops map via _send_op) but had not yet completed (_finish_op not yet fired) when the session ended. This made it impossible for users to identify hung or slow operations when interrupting a trace or when a timeout expired. This commit adds a report_hung_ops() function that iterates the BPF ops hash map at exit and prints any remaining entries as "incomplete" rows, using the same tabular format as the normal "complete" output. Signed-off-by: Zhang Hua <joshua.zhang@canonical.com>
f67ffe2 to
cf846e9
Compare
Handle SIGINT without exiting immediately so radostrace can dump operations still pending in the BPF ops map before cleanup.