Reduce audible startup sync correction#255
Conversation
|
We should use same value as sendspin cpp |
|
I took a look at sendspin-cpp. Pardon if this is known already but I figured I would share my findings. The relevant functionality is around sync_task.cpp:236. The implementation seems fairly different around how the stream is synchronized at startup and during playback; beyond 5000us initially or 500us during playback, it will "hard sync" by dropping large blocks of frames, and below that level it will "soft sync" by adding or removing up to one frame per decoded block. For soft sync, at 48kHz this amounts to roughly 0.1% pitch distortion during the period of time it may take to sync (as long as the difference remains below the threshold). The implementation in sendspin-cli is materially different, though achieves a similar outcome; however before this patch the equivalent "soft sync" correction could warp the audio by up to 4% in its effort to bring the stream in sync in under 2 seconds. My patch here makes sendspin-cli significantly more in line with sendspin-cpp, despite the implementation difference. The pitch distortion targets a maximum of around 0.2%, which despite being double what sendspin-cpp targets, is still well below the audible range. If anything I'd say we could allow sendspin-cpp to add or remove up to two frames per block to sync more quickly while still maintaining inaudible status, but I think it's fine as-is practically. One option is to make sendspin-cli much more aligned with the implementation in sendspin-cpp, which would be a significant effort of again rewriting the core. Rather than this, based on my analysis I'd go with my patch to align the general behavior between the two. I believe it does so in the simplest way possible without massively changing the implementation in sendspin-cli. |
|
All our implementations should act the same. I think that we should decide which value (1 or 2 frames per block) we prefer, and when we should do hard syncs etc, and make sure it's done in all our implementations. |
|
From an end-user (audible) perspective, after this PR plus interpolating the corrected frames (cpp blends adjacent samples on insert/drop, cli currently duplicates/skips a raw frame), I think the two approaches are practically indistinguishable. That said, I don't think we should force every implementation onto one exact strategy. Each has real strengths:
On 1 vs 2 frames per block: in |
|
I agree with @maximmaxim345, the implementations are appropriate for each language, and they end up implementing the same general approach. I don't think making them precisely the same is a desirable goal as each has different strengths and constraints. This PR aligns them with each other behaviorally, while allowing each implementation to do what it's good at; CLI to Linux support and sample rates, and cpp to embedded and resource constrained environments. Best of both worlds I think. |
Summary
This reduces audible pitch shift / warble during Sendspin client playback startup and other stream transitions.
The main issue appears to be that the client can start playback with a consistent initial sync offset, then correct that offset quite aggressively using sample insert/drop correction. On my Linux endpoint this was especially noticeable at the beginning of tracks. Some discussion of this occurred at #107, however it was indeterminate and difficult to pin down the cause.
With analysis and help from GPT5.5, with my full understanding and detailed review, this PR makes three related changes:
Analysis
I was hearing pitch shift and warble on a fully up-to-date sendspin endpoint, most noticeably at playback start and during track changes.
Looking through
sendspin/audio.py, the most suspicious path was the sync correction logic. The previous drop correction branch did this:That effectively produced a duplicate-then-skip pattern, which is more audible than a simple one-frame drop.
Separately, the correction loop allowed up to +/-4% playback speed correction over a 2 second target window. On real playback that is enough to sound like pitch movement, especially right after startup when the first sync estimate is still settling.
This was changed to a maximum +/-0.2% correction over an 8 second window, which is more conservative, but still within reasonable sync delay expectations (counting to 8 will help provide confidence; if we believe users would reasonably be OK with out-of-sync clients converging within 8 seconds, then this is a reasonable default). Importantly, this is well below the threshold of audible pitch shift or warble while still providing a means to converge.
In any case, if the sync is too far out, a reanchor will be triggered.
Additionally, a 750ms sync correction delay was added; this does not delay audible playback, it only suppresses sample insert/drop correction briefly after playback enters the PLAYING state. That gives the DAC timing and clock-sync estimates a short window to settle before the client starts making speed adjustments based on them. In practice this avoids reacting to the first unstable startup measurements while still allowing scheduled playback to begin on time.
Real Endpoint Comparison
I tested this on my actual Sendspin Linux endpoint using the same daemon config, audio device, and negotiated format:
flac:48000:24:2The original version repeatedly started streams around
-42 mssync error, then corrected aggressively.Original observed debug stats:
00098.29%to100.11%82155Patched observed debug stats:
00099.78%to100.04%10621The startup offset is still visible, but correction is much less aggressive and avoids the previous drop-frame artifact.
Tests
Added focused regression tests for:
Local verification:
Results:
Testing for a day in real world scenarios has also been very successful.