fix(tunnel): batch header read honors request_timeout_secs (#1088)#1108
Merged
therealaleph merged 1 commit — May 13, 2026
Conversation
Confirmed by a fresh field report in #1106.
therealaleph added a commit that referenced this pull request on May 13, 2026:
(#1088, #620) Bumps v1.9.23 → v1.9.24. Two PRs from @dazzling-no-more:

- #1108 (#1088): batch header read honors request_timeout_secs. Closes the 10 s inner-timeout cliff that was cascade-killing tunnel sessions under slow Apps Script edges. +22 regression tests (231 total).
- #1117 (#620): cargo-chef Dockerfile so tunnel-node builds without BuildKit. Cloud Run's gcloud-deploy path now works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
Issue #1088: under Full mode, a single slow Apps Script edge cascade-killed every in-flight tunnel session sharing its batch. Users on 1.9.21+ saw frequent 10 s "batch timeout" errors and lost download progress on Telegram / browser sessions.
Root cause: `read_http_response` in `domain_fronter.rs` had a hardcoded 10 s header-read timeout that ran inside `tunnel_batch_request_to` — independent of, and shorter than, the outer `tokio::time::timeout(batch_timeout, ...)` in `fire_batch`. Apps Script cold starts routinely land in the 8–12 s range (PR #1040's A/B test recorded 4/30 H1 batches timing out at exactly 10 s after the H2→H1 switch), so the inner cliff was firing as a false-positive batch timeout well before `request_timeout_secs` (default 30 s) could.

Secondary issue: even with a parameterized timeout, the per-read `timeout(d, stream.read(...))` form would silently extend its budget if a peer drip-fed bytes just under `d` each. A slow edge could keep the loop alive past the outer `batch_timeout` and defeat the whole wiring.

Fix
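The distinction the fix turns on — per-read budget vs. absolute deadline — can be sketched with a std-only model (helper names here are illustrative, not from the codebase; the real code races async reads against `tokio::time::timeout_at`):

```rust
// Std-only model of the two timeout shapes (illustrative names; the real
// implementation uses tokio::time::timeout_at with one absolute deadline).
use std::time::Duration;

/// Per-read budget: fires only if a SINGLE read exceeds the budget. A peer
/// that drip-feeds bytes just under the budget never trips it, however long
/// the headers take in total.
fn per_read_times_out(read_gaps: &[Duration], per_read_budget: Duration) -> bool {
    read_gaps.iter().any(|gap| *gap > per_read_budget)
}

/// Absolute deadline: fires once cumulative elapsed time crosses the budget,
/// regardless of how the reads are paced.
fn deadline_times_out(read_gaps: &[Duration], total_budget: Duration) -> bool {
    read_gaps.iter().sum::<Duration>() > total_budget
}

fn main() {
    // 38 header bytes drip-fed at 3 s/byte: every read is well under a 10 s
    // per-read budget, but the total elapsed is 114 s.
    let gaps = [Duration::from_secs(3); 38];
    assert!(!per_read_times_out(&gaps, Duration::from_secs(10))); // never fires
    assert!(deadline_times_out(&gaps, Duration::from_secs(10))); // fires as it should
    println!("per-read budget defeated by drip-feed; absolute deadline holds");
}
```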
Two changes, both in `domain_fronter.rs`:

- `tunnel_batch_request_to` passes `batch_timeout` to the header read via the new `read_http_response_with_header_timeout` helper. `Config::request_timeout_secs` is now the only knob controlling how long we wait for an Apps Script edge to start responding. Other callers (relay path, exit-node) keep the historical 10 s value.
- The header read uses a single absolute deadline (`tokio::time::timeout_at(deadline, ...)`) instead of per-read `timeout()`. Total elapsed across all header reads is bounded by `header_read_timeout`, regardless of read cadence.

Bonus in `tunnel_client.rs`: `TunnelMux::reply_timeout` co-varies with `batch_timeout` — computed at construction as `fronter.batch_timeout()` + 5 s slack instead of the fixed 35 s const. Operators raising `request_timeout_secs` no longer have sessions abandon `reply_rx` just before `fire_batch`'s HTTP round-trip would complete.

What this does NOT do
Earlier iterations of this branch added client-side batch retry (re-firing the same payload against a different deployment on timeout). That code was removed before merge: tunnel-node's `drain_now` mutates the per-session buffer when building a poll response, so a lost response means lost bytes — replaying an empty poll silently skips bytes (a gap on the client side), and replaying a payload op double-writes (corrupting TCP / double-delivering UDP). No op shape is safe to replay without server-side ack / sequence support. Sessions whose batch times out simply re-poll on the next tick, the same recovery surface as pre-#1088.

Changes
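To make the replay hazard described above concrete, here is a toy model (type and field shapes are assumed for illustration, not the real tunnel-node code):

```rust
// Toy model (assumed shapes): drain_now consumes the per-session buffer while
// building a poll response, so a response lost in transit cannot be rebuilt
// by polling again — the bytes are simply gone.
struct Session {
    buf: Vec<u8>,
}

impl Session {
    /// Drains and returns everything buffered for this session.
    fn drain_now(&mut self) -> Vec<u8> {
        std::mem::take(&mut self.buf)
    }
}

fn main() {
    let mut s = Session { buf: b"abcdef".to_vec() };
    let response = s.drain_now(); // poll response built... then lost in transit
    assert_eq!(response, b"abcdef".to_vec());
    let retry = s.drain_now(); // a naive client retry sees an empty buffer
    assert!(retry.is_empty()); // the six bytes are gone: a silent client-side gap
}
```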
`domain_fronter.rs`:

- `tunnel_batch_request_to`: pass `batch_timeout` to the header read (+9 lines).
- New `read_http_response_with_header_timeout(stream, deadline)` helper (+25 lines).
- Replace per-read `timeout()` with `timeout_at(deadline, ...)` in the header loop.

`tunnel_client.rs`:

- New `TunnelMux::reply_timeout` field, derived from `batch_timeout` at construction.
- `tunnel_loop` reads `mux.reply_timeout()` instead of the `REPLY_TIMEOUT` const.
- Replace the `REPLY_TIMEOUT: Duration` const with `REPLY_TIMEOUT_SLACK: Duration = 5 s`.

Test coverage
Three new tests, all using `tokio::time::pause()` so they run in under 1 s of wall time:

- `read_http_response_respects_configured_header_timeout` — response starts at virtual T=15 s; must succeed under a 30 s budget. Catches any regression to the hardcoded 10 s inner cliff.
- `read_http_response_header_deadline_is_total_not_per_read` — drip-feeds 38 header bytes at 3 s/byte. Each read returns within 3 s (under the 10 s per-read budget) but the total exceeds 10 s. Asserts `FronterError::Timeout`. Catches any regression to per-iteration `timeout()`.
- `mux_reply_timeout_tracks_batch_timeout_plus_slack` — constructs a real `DomainFronter` with `request_timeout_secs: 60` via `TunnelMux::start`; asserts `mux.reply_timeout() == 60 s + REPLY_TIMEOUT_SLACK`. Catches any hardcoded-value regression in the runtime derivation.

Verification
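For reference, the derivation that the third test pins down can be sketched as follows (the constant name is from this PR; the free-function shape is assumed for illustration — the real value lives on `TunnelMux`):

```rust
use std::time::Duration;

// Slack added on top of the batch timeout, per this PR (replaces the old
// fixed REPLY_TIMEOUT const).
const REPLY_TIMEOUT_SLACK: Duration = Duration::from_secs(5);

// Assumed shape of the construction-time derivation: reply_timeout co-varies
// with whatever batch_timeout the fronter was configured with.
fn derive_reply_timeout(batch_timeout: Duration) -> Duration {
    batch_timeout + REPLY_TIMEOUT_SLACK
}

fn main() {
    // request_timeout_secs: 60 → reply_timeout = 65 s, as the test asserts.
    assert_eq!(
        derive_reply_timeout(Duration::from_secs(60)),
        Duration::from_secs(65)
    );
}
```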
- `cargo test`: 231/231 ✅ (was 228 pre-patch, + 3 new)
- `cargo check`: clean ✅
- `cargo clippy -W unused`: clean ✅

Interaction with PR #1040 / #1041 (H2 → H1 switch)
PR #1040's A/B numbers showed 4/30 H1 batches timing out at 10 s after the H2→H1 switch — that was this same 10 s inner cliff. This PR addresses the residual timeout regression #1040 introduced without reverting the H2→H1 decision, which remains correct for tunnel-batch coalesced ops.
Closes #1088.