
Define minimum sync standards for the player@v1 role #104

Open
maximmaxim345 wants to merge 4 commits into main from feat/define-minimum-player-syncing-standards
Conversation

@maximmaxim345 maximmaxim345 commented Jun 16, 2026

Member

The specification never defined a minimum bar for synced playback. As a result, badly out-of-sync players were still spec compliant, even though their end-user experience was unusable when grouped with other Sendspin players.

The new rules:

  • require the use of time-filter and the bursting strategy
  • client/time cadence floor
  • inaudible corrections in steady state
  • max ±0.5% speed in steady state (sliding average over the maximum chunk size, 150 ms)
  • ±2 ms accuracy in steady state
  • a rare one-shot resync (startup, underrun, large disturbance) is exempt from the speed and accuracy bounds
  • no startup warble
  • server chunk duration bounded to 15-150 ms (covers Opus at 20 ms and FLAC at 105 ms)

To give new implementations a head start, this PR also adds a simple suggested strategy: discrete, bit-exact sample deletion and insertion. The player drops or duplicates N whole frames to correct drift, leaving the audio untouched except at the moments it corrects. N scales with sample rate. Other algorithms like ASRC are still encouraged.
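A minimal sketch of the drop/duplicate idea described above, not the spec's normative algorithm. The function name, the sign convention for `error_us`, the choice to correct in the middle of the chunk, and the single-frame step are all assumptions made for illustration; the 100 µs dead band is the value proposed in this PR.

```python
def apply_soft_correction(frames, error_us, dead_band_us=100.0):
    """Return a corrected copy of one chunk of PCM frames.

    frames:   list of frames (one frame = one sample per channel).
    error_us: ASSUMED sign convention: positive means the player is late,
              so one frame is dropped to catch up; negative means the
              player is early, so one frame is duplicated to fall back.
    """
    # Inside the dead band (or with a degenerate chunk), leave audio untouched.
    if abs(error_us) <= dead_band_us or len(frames) < 2:
        return list(frames)

    mid = len(frames) // 2  # hypothetical choice of correction position
    if error_us > 0:
        # Drop the frame at `mid`: the chunk plays one frame shorter.
        return frames[:mid] + frames[mid + 1:]
    # Duplicate the frame at `mid`: the chunk plays one frame longer.
    return frames[:mid + 1] + frames[mid:]
```

All other frames pass through bit-exact, which is the property the suggested strategy relies on.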

Constants

I'm still not sure what exact values we should pick. The values I picked in this draft are:

  • Maximum speed deviation ±0.5%: tighter than the cli's old ±4% warble; looser than what cpp/cli hold today (~0.1%/~0.2%). This is a ~8.6 cent pitch shift, on the edge of being inaudible with music. In steady state pitch tracks clock drift, so this cap is rarely reached.
  • Accuracy floor ±2 ms: achievable continuously by native clients. Might be difficult for some implementations like sendspin-js.
  • Accuracy target ±1 ms: in-room target, enough so individual speakers are not discernible when grouped.
  • Chunk duration bounds 15-150 ms: the 150 ms cap gives headroom over aiosendspin's current 105 ms max (FLAC at 44.1 kHz). The 15 ms floor keeps enough samples per chunk to correct within the ±0.5% cap.
  • Soft correction baseline ≈21 µs per chunk: one frame at 48 kHz, scaled by sample rate. Implementations may use a larger step to keep up with drift, bounded by the ±0.5% cap.
  • Dead band 100 µs: matches sendspin-cpp's existing value.
  • One-shot resync threshold 2 ms: set equal to the accuracy floor so the snap fires before the floor is violated.
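The arithmetic behind these constants can be checked with a short sketch. The helper names are hypothetical, and two simplifications are assumed: the speed cap is applied per chunk (the draft defines it as a sliding average over 150 ms), and an error exactly at the threshold triggers the resync.

```python
import math

MAX_SPEED_DEVIATION_PERMILLE = 5   # ±0.5% speed cap from this draft
DEAD_BAND_US = 100.0               # dead band from this draft
RESYNC_THRESHOLD_US = 2000.0       # one-shot resync threshold (2 ms)


def cents(speed_ratio):
    """Pitch shift in cents for a playback speed ratio (1.005 = +0.5%)."""
    return 1200.0 * math.log2(speed_ratio)


def frame_duration_us(sample_rate_hz):
    """Duration of one frame in microseconds."""
    return 1_000_000.0 / sample_rate_hz


def max_frames_per_chunk(chunk_ms, sample_rate_hz):
    """Whole frames that can be dropped or duplicated in one chunk
    without exceeding the speed cap (per-chunk simplification)."""
    return (chunk_ms * sample_rate_hz * MAX_SPEED_DEVIATION_PERMILLE) // 1_000_000


def classify(error_us):
    """Pick an action for a given sync error (boundary handling is assumed)."""
    magnitude = abs(error_us)
    if magnitude <= DEAD_BAND_US:
        return "hold"      # inside the dead band: do nothing
    if magnitude >= RESYNC_THRESHOLD_US:
        return "resync"    # rare one-shot snap, exempt from the bounds
    return "soft"          # drop or duplicate frames within the speed cap


print(round(cents(1.005), 2))              # about 8.63 cents
print(round(frame_duration_us(48000), 2))  # about 20.83 µs per frame
print(max_frames_per_chunk(15, 48000))     # 3 frames in a 15 ms chunk
```

Under these assumptions a 15 ms chunk at 48 kHz holds 720 frames, so the cap leaves room for 3 corrected frames per chunk, which is consistent with the rationale given for the 15 ms floor.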

Related issues

Make the correction-quality rule outcome-based (Inaudible corrections)
and exempt a rare one-shot resync from both the speed cap and the
accuracy floor. The speed cap is now a sliding average over 150 ms.
Interpolation only very slightly decreased distortion, so let's drop it to
keep the spec simpler. Other (better) strategies like ASRC are
encouraged.
@maximmaxim345 maximmaxim345 marked this pull request as ready for review June 17, 2026 07:53
maximmaxim345 added a commit to Sendspin/sendspin-js that referenced this pull request Jun 18, 2026
With the proposed minimum sync standards in Sendspin/spec#104 the
default `sync` mode in `sendspin-js` was no longer spec compliant.

This PR tweaks them to follow the spec.
There were two problems before this PR:
- Drift was corrected with up to ±2% playback-rate changes and was
therefore audible
- And startup errors could sit at 100-150 ms for tens of seconds because
a resync re-anchored its backlog forward instead of dropping it.
@maximmaxim345

Member Author

The limits/constants even work for tricky platforms like sendspin-js running on Chromium (which randomizes the clock for security AFAIK). It still held within about ±1 ms there.

This works because we define accuracy from the Kalman filter to the output, not end-to-end. And since we also define how the filter has to be implemented and used, this essentially factors out the things we can't control, like network delay and stability.

So even listening through a VPN across the world, it might not be perfectly in sync, but the implementation is still spec compliant, since it's doing as well as it can.

Comment thread README.md

Each client is responsible for maintaining its own synchronization with the server's timestamps.

- **Accuracy floor:** In steady state, implementations MUST keep this error within ±2 ms. The only exception is the one-shot resynchronization exempted from the speed cap above, which MUST be rare.

Contributor

2 ms seems like too high a bound. Or are we limited in JS from consistently getting below 1 ms?

Member Author

sendspin-js shouldn't be limiting the spec, so we can lower it.
What about ±1 ms? I wouldn't go lower without more testing, though. From some initial tests, ±0.5 ms looks difficult to achieve with a USB DAC playing with sendspin-cli.

Comment thread README.md
Each client is responsible for maintaining its own synchronization with the server's timestamps.

- **Accuracy floor:** In steady state, implementations MUST keep this error within ±2 ms. The only exception is the one-shot resynchronization exempted from the speed cap above, which MUST be rare.
- **Accuracy target:** Implementations SHOULD aim for ±1 ms.

Contributor

Similarly, are we just limited by the JS implementation from being stricter here?

@maximmaxim345 maximmaxim345 Jun 24, 2026

Member Author

How about ±0.5 ms? It's a recommendation, so we could technically go as low as we want. It's what implementations target, after all; if they run on slow hardware, they just might end up slightly above the threshold (but still compliant).

Comment thread README.md

- **Chunk duration bounds:** A server MUST NOT send an audio chunk longer than 150 ms, and SHOULD NOT send one shorter than 15 ms (the final chunk of a stream or the chunk before a format change MAY be shorter).
- The server sends audio to late-joining clients with future timestamps only, allowing them to buffer and start playback in sync with existing clients.
- After sending [`stream/start`](#server--client-streamstart) or [`stream/clear`](#server--client-streamclear) messages, servers must schedule the first audio timestamp far enough in the future to satisfy each player's [`required_lead_time_ms`](#client--server-clientstate-player-object) (startup warmup) and [`min_buffer_ms`](#client--server-clientstate-player-object) (ongoing jitter buffer). For live streams the buffer cannot grow after playback begins, so the larger of the two must already be reached before the first chunk plays.
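As an illustration of the chunk duration rule quoted above, a server-side check might look like the following sketch. The function name, the return labels, and the `is_final_or_pre_format_change` flag are hypothetical; only the 15 ms and 150 ms bounds come from the proposed text.

```python
MAX_CHUNK_MS = 150.0  # MUST NOT send a chunk longer than this
MIN_CHUNK_MS = 15.0   # SHOULD NOT send a chunk shorter than this


def check_chunk_duration(frame_count, sample_rate_hz,
                         is_final_or_pre_format_change=False):
    """Classify a chunk against the proposed duration bounds."""
    duration_ms = frame_count * 1000.0 / sample_rate_hz
    if duration_ms > MAX_CHUNK_MS:
        return "violation"    # breaks the MUST NOT rule
    if duration_ms < MIN_CHUNK_MS and not is_final_or_pre_format_change:
        return "discouraged"  # breaks the SHOULD NOT rule
    return "ok"


print(check_chunk_duration(960, 48000))  # a 20 ms chunk, e.g. one Opus frame
```

The short-chunk exemption is passed in by the caller because only the server knows whether a chunk ends a stream or precedes a format change.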

Contributor

For a live stream, why would we take the larger of the two (not exactly this PR but a question that's been bugging me as I implement the player timing changes!)?

Presumably, on a live stream, you don't care about missing the first chunk of audio because... it's live, you've already missed the ongoing stuff. The network jitter is really the only one to care about in that scenario; i.e., how much buffer do I need to avoid audio drops?

Member Author

required_lead_time isn't strictly about the missing content; it's about how much time should pass between the stream/start and the timestamp in the first chunk. So to have required_lead and min_buffer work consistently for the player (no matter if it's live or not), taking the larger of the two should make it work without having a buffer.

Buffered streams can of course ignore min_buffer since they can just fetch more audio data. But with live streams, using only required_lead or only min_buffer would cause it to break one way or the other. If you only use required_lead and it's smaller than min_buffer, the buffer starts too shallow, and since a live stream can't grow it after playback starts, you get dropouts the first time the network jitters. And if you only use min_buffer and it's smaller than required_lead, then the first chunk lands less than required_lead after stream/start, which contradicts the spec, since that's exactly the gap it's supposed to guarantee.
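The rule described in this comment can be sketched as follows. This is one reading of the discussion, not confirmed server behavior; the function name and the `is_live` flag are hypothetical, while `required_lead_time_ms` and `min_buffer_ms` are the player fields named in the spec text under review.

```python
def first_chunk_lead_ms(required_lead_time_ms, min_buffer_ms, is_live):
    """How far in the future the first audio timestamp is scheduled."""
    if is_live:
        # A live buffer cannot grow after playback starts, so both
        # constraints must already hold when the first chunk plays.
        return max(required_lead_time_ms, min_buffer_ms)
    # ASSUMPTION from the thread: a buffered stream can fetch more audio
    # later, so only the startup lead time constrains the first chunk.
    return required_lead_time_ms


print(first_chunk_lead_ms(200, 500, is_live=True))   # 500
print(first_chunk_lead_ms(200, 500, is_live=False))  # 200
```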

Contributor

Then why do we have the two different parameters if we just take the max? If a client has a slow startup but a rock-solid network, I think it would be more useful to use just the network jitter measurement, even if that means the client needs to throw away a few chunks at the start (which is a violation of the spec as written, to be fair).

Let's see if I'm understanding by phrasing it a bit differently. If it isn't a live stream, then only required_lead is honored. If it is a live stream, then we take the max of required_lead and min_buffer. Which just effectively means the min_buffer parameter only matters if min_buffer > required_lead in the live case. Do I have that breakdown right for how the server behaves? I feel like we are missing an opportunity for the other case.

We can move this to Discord or a separate discussion so I don't keep polluting this PR!
