fix(docker): cargo-chef so tunnel-node builds without BuildKit (#620)#1117
Merged
therealaleph merged 1 commit intoMay 13, 2026
Merged
Conversation
therealaleph
pushed a commit
that referenced
this pull request
May 13, 2026
) Fixes #1088 — under Full mode, a single slow Apps Script edge cascade-killed every in-flight tunnel session sharing its batch. Users on 1.9.21+ saw frequent 10s "batch timeout" errors and lost download progress on Telegram / browser sessions. ## Root cause `read_http_response` in `domain_fronter.rs` had a **hardcoded 10s header-read timeout** that ran *inside* `tunnel_batch_request_to` — independent of and shorter than the outer `tokio::time::timeout(batch_timeout, …)` in `fire_batch`. Apps Script cold starts routinely land in the 8-12s range (PR #1040's A/B recorded 4/30 H1 batches timing out at exactly 10s after the H2→H1 switch), so the inner cliff fired as a false-positive batch timeout well before `request_timeout_secs` (default 30s) could. Secondary: even with a parameterized timeout, the per-read `timeout(d, stream.read(...))` form would silently extend its budget if a peer drip-fed bytes just under `d` each — a slow edge could keep the loop alive past the outer `batch_timeout` and defeat the whole wiring. ## Fix (two changes in `domain_fronter.rs`) 1. **`tunnel_batch_request_to` passes `batch_timeout` to the header read** via new `read_http_response_with_header_timeout` helper. `Config::request_timeout_secs` is now the only knob controlling how long we wait for an Apps Script edge to start responding. Other callers (relay path, exit-node) keep the historical 10s value. 2. **Header read uses a single absolute deadline** (`tokio::time::timeout_at(deadline, …)`) instead of per-read `timeout()`. Total elapsed across all header reads is bounded by `header_read_timeout`, regardless of read cadence. ## Bonus (in `tunnel_client.rs`) 3. **`TunnelMux::reply_timeout` co-varies with `batch_timeout`**: computed at construction as `fronter.batch_timeout() + 5s slack` instead of the fixed 35s const. Operators raising `request_timeout_secs` no longer have sessions abandon `reply_rx` just before `fire_batch`'s HTTP round-trip would complete. ## Verified locally (on top of v1.9.23 / main after #1117 merge) - `cargo test --lib --release`: **231/231** ✅ (was 209 in v1.9.23 baseline; this PR adds 22 new tests covering the deadline/co-variance behavior) - `cargo build --release --features ui --bin mhrv-rs-ui`: clean ✅ ## Interaction with v1.9.20 (PR #1029) PR #1029 added `H1_OPEN_TIMEOUT_SECS = 8` to bound the TCP+TLS handshake in `open()`. That bound is **separate** from the header-read timeout this PR addresses — both bounds exist in the same call chain. Issue #1131 (BuffOvrFlw, just opened) reports `h1 open timed out after 8s` errors which are the `open()` bound firing, not the header-read bound. Worth a follow-up to make `H1_OPEN_TIMEOUT_SECS` parameterized too, but that's a separate change. Reviewed via Anthropic Claude. Co-Authored-By: dazzling-no-more <noreply@github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
therealaleph
added a commit
that referenced
this pull request
May 13, 2026
#1088, #620) Bumps v1.9.23 → v1.9.24. Two PRs from @dazzling-no-more: - #1108 (#1088): batch header read honors request_timeout_secs. Closes the 10s inner timeout cliff that was cascade-killing tunnel sessions under slow Apps Script edges. +22 regression tests (231 total). - #1117 (#620): cargo-chef Dockerfile so tunnel-node builds without BuildKit. Cloud Run's gcloud-deploy path now works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #620. The tunnel-node
Dockerfileused BuildKit-onlyRUN --mount=type=cachedirectives, which fail on Cloud Run'sgcloud run deploy --source .path because the underlyinggcr.io/cloud-builders/dockerbuilder does not enable BuildKit (and--set-build-env-vars DOCKER_BUILDKIT=1doesn't actually flip it on either, per the issue reporter's logs).Reworked the Dockerfile to use the
cargo-chefpattern instead: a dedicated planner stage emits arecipe.jsonof dependency metadata, acargo chef cookstage builds just the deps in their own Docker layer, and the final build stage addssrc/on top. Docker's regular layer cache now handles dependency reuse, so warm rebuilds where onlysrc/changes still skip the slow crate compile — no BuildKit required.Changes
tunnel-node/Dockerfile:# syntax=docker/dockerfile:1parser directive and allRUN --mount=type=cache,...blocks.chef→planner→builder).cargo-chefto exact0.1.77with--lockedfor reproducible installs.rust:1.85-slim→rust:1.90-slim(cargo-chef's transitive deps require rustc 1.86+; tunnel-node'sCargo.tomlhas norust-versionpin so the bump is free).ARG TARGETPLATFORMper-platform cache-id workaround from08efbc5— Docker's regular layer cache is already arch-scoped, so the multi-arch race that motivated it doesn't apply anymore.Non-changes (deliberate)
tunnel-node/Cargo.toml— not touched. The old Dockerfile comment claimed "matches MSRV in Cargo.toml" but norust-versionfield actually exists. The Docker base bump is an internal build-environment choice; conflating it with a declared MSRV would artificially restrict thecargo build --releasedirect-binary path in the README.rust:1.90-slim,debian:bookworm-slim). Without a Renovate/Dependabot bot to keep digests fresh, digest pinning trades automatic glibc/openssl/ca-certificates CVE patching for a reproducibility property this repo doesn't currently need.Test plan
Verified on Docker Desktop 29.4.2 (Windows):
DOCKER_BUILDKIT=1 docker build --no-cache .→ succeeds, 130 MB final imageDOCKER_BUILDKIT=0 docker build --no-cache .→ succeeds (Cloud Run's classic-builder path), 130 MB final image, same contentdocker run -e TUNNEL_AUTH_KEY=testkey ...→ container starts, logs "tunnel-node listening on 0.0.0.0:8080"curl http://localhost:8080/health→okgcloud run deploy --source .(requires a GCP project — recommend the issue reporter @r-safavi re-run the original command from Cloud Run (Google Cloud) as the tunnel node in full tunnel mode does not get deployed #620 to confirm)Related
Closes #620.