Lazy Heartbeat Scheduling #44
Here are my results. My system isn't particularly quiet, so take it with a grain of salt. https://gist.github.com/CorvusPrudens/16c4f6dc26770c8578a869978eb85ab8
There seems to be a potential segfault within this code:
```rust
if sleeping == 0 {
    return;
}
cold_path();
```
How odd, this cold_path hint seems to be the source of the segfault...
Removing that `cold_path()` still leads to "exit code: 0xc0000005, STATUS_ACCESS_VIOLATION" on my machine (Win11 on a Ryzen 7600x) once the fork_join bench hits forte.
Enabling any LTO (fat or thin) fixes the issue.
There's some UB somewhere in the promotion code which I am still tracking down.
I looked through your code and couldn't find anything for the life of me.
So I chucked some Codex GPT5.5-High at it while doing the dishes, and it seems to have found an upstream issue in tick_counter.
They do:
```rust
/// Returns a current value of the tick counter to use as a starting point
#[cfg(target_arch = "x86_64")]
#[inline]
pub fn start() -> u64 {
    let rax: u64;
    unsafe {
        asm!(
            "mfence",
            "lfence",
            "rdtsc",
            "shl rdx, 32",
            "or rax, rdx",
            out("rax") rax
        );
    }
    rax
}
```

in https://github.com/sheroz/tick_counter/blob/7511cf114ca536f94c35b675b5a12d5b9a762e96/src/lib.rs#L143
Here `rdx` gets clobbered (`rdtsc` writes the high 32 bits of the timestamp into it, and the `shl` modifies it further) but is never declared as an output or clobber. Rust assumes that nothing but the declared outputs changes within an `asm!` block, so the surrounding code may keep a live value in `rdx` across the block and run into UB when it later reads the clobbered register.
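For reference, a minimal fix would be to declare `rdx` as a discarded output so the compiler knows the block clobbers it. This is a sketch of that idea, not necessarily the patch upstream (or this thread) actually landed:

```rust
use std::arch::asm;

#[cfg(target_arch = "x86_64")]
#[inline]
pub fn start() -> u64 {
    let rax: u64;
    unsafe {
        asm!(
            "mfence",
            "lfence",
            "rdtsc",
            "shl rdx, 32",
            "or rax, rdx",
            out("rax") rax,
            // rdtsc writes the high half of the timestamp here and the shl
            // modifies it, so declare the clobber with a discarded output.
            out("rdx") _,
        );
    }
    rax
}
```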
Interesting! This would be consistent with my observations; the issues started when tick-counter was introduced.
I'll report this upstream and try switching to cputicks. I'd really rather not have to maintain my own inline asm for this.
Thanks for investigating!!
Ah, this seems to have already been reported. sheroz/tick_counter#15 (comment)
What fun, I really should have audited that dep more closely.
Edit: unified comments and added more benches.

Ran the bench with the tick_counter issue fixed locally:
- Default:
- Default + mimalloc:
- My fix: replaced tick_counter usage with the following code:

Old benches:
- Gist of the bench as is:
- Gist using thin LTO to avoid STATUS_ACCESS_VIOLATION on the fork_join bench:
- Gist using fat LTO + misc build flags on nightly (see notes inside gist for details):
- Gist using the config above + enabling -Cpanic=abort for benches:

Got curious about the impact of some misc. build flags, so I chucked them in there. The panic=abort results are quite interesting: a few benchmarks show significant improvements while most barely changed. And on fork_join - throughput_forte there is a significant speedup on smaller workloads but a regression from (11, 2047) onwards.

Also ran with mimalloc out of curiosity:
- Gist of the bench as is + mimalloc:
- Gist using thin LTO + mimalloc:
- Gist using nightly + misc flags + panic=abort + mimalloc:
Switched to cputicks. This PR is mostly waiting for a release of that crate so the git dependency can be dropped.
Reran cargo bench under Win11_x64 with latest stable on two different CPUs:
- Bench with Ryzen 5 7600x
- Bench with Intel i5-9600k

No real surprises, except that Rayon perf degraded significantly more, relative to the other crates, when running fork_join (20, 1048575) on the older, cache-poor CPU.
Sick! I'm reasonably happy with those results. Chili is still beating us on x86 Windows, and especially on resource-constrained x86 Windows, but we are mostly matching rayon. There are probably a few more things we can steal from chili (and enable conditionally on Windows) which could close that gap. I'm going to merge this, I think, and then I'll put out the next beta release as soon as the git-dep is no longer necessary.
At the start of the year, I started circulating this crate and soliciting feedback. The general response was: it doesn't perform well. Womp womp. Sad trumpet noises.
Over the last few months, I built several prototypes aimed at improving performance. This PR is the culmination of those efforts. Some of the intermediate experiments got pretty messy, so for simplicity I've chosen to just pull the working experiments together into this Mega-PR.
There are three major changes:
1. I've added a new lock-free worker registration and sleep-tracking system (basically, we use an atomic bit-set for tracking seat information; see the sketch after this list). This significantly reduces the overhead of worker wakeups and allows us to wake workers much more eagerly, but it also means thread pools now have a hard limit of 32 members. I think this is a reasonable constraint (and we can theoretically expand it to 64 on platforms that support 64-bit atomics).
2. There's a totally new (potentially novel) approach to workload balancing that combines work-stealing, heartbeat scheduling, and lazy scheduling. This is detailed in the new readme, so I won't repeat it here.
3. I've also done another comprehensive pass on docs and safety comments.
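To illustrate what an atomic bit-set of worker seats can look like, here is a minimal sketch; the names (`SeatMask`, `claim_seat`, `release_seat`) are hypothetical and this is not forte's actual implementation:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical sketch: one bit per worker "seat", at most 32 seats.
/// Bit i set => seat i is claimed.
struct SeatMask(AtomicU32);

impl SeatMask {
    const fn new() -> Self {
        SeatMask(AtomicU32::new(0))
    }

    /// Claim the lowest free seat lock-free, or return None if all 32 are taken.
    fn claim_seat(&self) -> Option<u32> {
        let mut current = self.0.load(Ordering::Relaxed);
        loop {
            let free = (!current).trailing_zeros(); // index of the lowest clear bit
            if free >= 32 {
                return None; // pool is full
            }
            match self.0.compare_exchange_weak(
                current,
                current | (1 << free),
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(free),
                Err(actual) => current = actual, // lost a race; retry with the fresh value
            }
        }
    }

    /// Release a previously claimed seat.
    fn release_seat(&self, index: u32) {
        self.0.fetch_and(!(1 << index), Ordering::Release);
    }
}
```

The same shape works for sleep tracking (set a bit when a worker parks, scan the mask to pick one to wake), and widening to an AtomicU64 raises the cap to 64 on platforms with 64-bit atomics.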
This code will form the basis for `1.0.0-alpha.5` and (depending on how the next round of feedback goes) may end up being released as `1.0`. Hopefully changes will be more incremental going forwards.