fix(docker): cargo-chef so tunnel-node builds without BuildKit (#620)#1117

Merged
therealaleph merged 1 commit into therealaleph:main from dazzling-no-more:fix/tunnel-node-buildkit
May 13, 2026
Conversation

@dazzling-no-more
Contributor

Summary

Fixes #620. The tunnel-node Dockerfile used BuildKit-only RUN --mount=type=cache directives, which fail on Cloud Run's gcloud run deploy --source . path because the underlying gcr.io/cloud-builders/docker builder does not enable BuildKit (and --set-build-env-vars DOCKER_BUILDKIT=1 doesn't actually flip it on either, per the issue reporter's logs).

Reworked the Dockerfile to use the cargo-chef pattern instead: a dedicated planner stage emits a recipe.json of dependency metadata, a cargo chef cook stage builds just the deps in their own Docker layer, and the final build stage adds src/ on top. Docker's regular layer cache now handles dependency reuse, so warm rebuilds where only src/ changes still skip the slow crate compile — no BuildKit required.

Changes

  • tunnel-node/Dockerfile:
    • Dropped # syntax=docker/dockerfile:1 parser directive and all RUN --mount=type=cache,... blocks.
    • Added cargo-chef multi-stage build (chef → planner → builder).
    • Pinned cargo-chef to exact 0.1.77 with --locked for reproducible installs.
    • Bumped base from rust:1.85-slim → rust:1.90-slim (cargo-chef's transitive deps require rustc 1.86+; tunnel-node's Cargo.toml has no rust-version pin, so the bump is free).
    • Removed ARG TARGETPLATFORM per-platform cache-id workaround from 08efbc5 — Docker's regular layer cache is already arch-scoped, so the multi-arch race that motivated it doesn't apply anymore.
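
The reworked stages can be sketched as follows. This is a minimal illustration, not the PR's actual Dockerfile: the binary name `tunnel-node`, the `WORKDIR`, and the runtime-stage contents are assumptions, and the real file may differ in paths and flags.

```dockerfile
# No `# syntax=` directive: nothing here needs BuildKit.
FROM rust:1.90-slim AS chef
RUN cargo install cargo-chef --version 0.1.77 --locked
WORKDIR /app

# Planner: emit dependency metadata only.
FROM chef AS planner
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

# Builder: cook deps in their own layer, then add src/ on top.
FROM chef AS builder
COPY --from=planner /app/recipe.json recipe.json
RUN cargo chef cook --release --recipe-path recipe.json
COPY . .
RUN cargo build --release --bin tunnel-node

FROM debian:bookworm-slim AS runtime
COPY --from=builder /app/target/release/tunnel-node /usr/local/bin/tunnel-node
ENTRYPOINT ["/usr/local/bin/tunnel-node"]
```

Because recipe.json stays byte-identical when only src/ changes, the `cargo chef cook` layer is a cache hit on warm rebuilds with the classic builder's ordinary layer cache.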

Non-changes (deliberate)

  • tunnel-node/Cargo.toml — not touched. The old Dockerfile comment claimed "matches MSRV in Cargo.toml" but no rust-version field actually exists. The Docker base bump is an internal build-environment choice; conflating it with a declared MSRV would artificially restrict the cargo build --release direct-binary path in the README.
  • Base image digest pinning — left on tag references (rust:1.90-slim, debian:bookworm-slim). Without a Renovate/Dependabot bot to keep digests fresh, digest pinning trades automatic glibc/openssl/ca-certificates CVE patching for a reproducibility property this repo doesn't currently need.

Test plan

Verified on Docker Desktop 29.4.2 (Windows):

  • DOCKER_BUILDKIT=1 docker build --no-cache . → succeeds, 130 MB final image
  • DOCKER_BUILDKIT=0 docker build --no-cache . → succeeds (Cloud Run's classic-builder path), 130 MB final image, same content
  • docker run -e TUNNEL_AUTH_KEY=testkey ... → container starts, logs "tunnel-node listening on 0.0.0.0:8080"
  • curl http://localhost:8080/health → ok
  • End-to-end Cloud Run deploy via gcloud run deploy --source . — requires a GCP project, so recommend the issue reporter @r-safavi re-run the original command from #620 ("Cloud Run (Google Cloud) as the tunnel node in full tunnel mode does not get deployed") to confirm

Related

Closes #620.

@github-actions Bot added the type: fix label (fix: PR — auto-applied by release-drafter) May 13, 2026
@therealaleph therealaleph merged commit 4d2ce91 into therealaleph:main May 13, 2026
1 check passed
therealaleph pushed a commit that referenced this pull request May 13, 2026

Fixes #1088 — under Full mode, a single slow Apps Script edge cascade-killed every in-flight tunnel session sharing its batch. Users on 1.9.21+ saw frequent 10s "batch timeout" errors and lost download progress on Telegram / browser sessions.

## Root cause

`read_http_response` in `domain_fronter.rs` had a **hardcoded 10s header-read timeout** that ran *inside* `tunnel_batch_request_to` — independent of and shorter than the outer `tokio::time::timeout(batch_timeout, …)` in `fire_batch`. Apps Script cold starts routinely land in the 8-12s range (PR #1040's A/B recorded 4/30 H1 batches timing out at exactly 10s after the H2→H1 switch), so the inner cliff fired as a false-positive batch timeout well before `request_timeout_secs` (default 30s) could.

Secondary: even with a parameterized timeout, the per-read `timeout(d, stream.read(...))` form would silently extend its budget if a peer drip-fed bytes at intervals just under `d` — a slow edge could keep the loop alive past the outer `batch_timeout` and defeat the whole wiring.

## Fix (two changes in `domain_fronter.rs`)

1. **`tunnel_batch_request_to` passes `batch_timeout` to the header read** via new `read_http_response_with_header_timeout` helper. `Config::request_timeout_secs` is now the only knob controlling how long we wait for an Apps Script edge to start responding. Other callers (relay path, exit-node) keep the historical 10s value.

2. **Header read uses a single absolute deadline** (`tokio::time::timeout_at(deadline, …)`) instead of per-read `timeout()`. Total elapsed across all header reads is bounded by `header_read_timeout`, regardless of read cadence.
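
The absolute-deadline idea can be shown in a synchronous sketch (the real code is async and uses `tokio::time::timeout_at`; the function name, error strings, and header-terminator scan below are illustrative, not the crate's actual API):

```rust
use std::io::Read;
use std::time::{Duration, Instant};

// Hypothetical sketch: bound the TOTAL time spent reading headers with one
// absolute deadline, instead of restarting a fresh timeout on every read().
// A peer drip-feeding bytes cannot extend the budget past the deadline.
fn read_headers_with_deadline<R: Read>(
    stream: &mut R,
    header_read_timeout: Duration,
) -> Result<Vec<u8>, &'static str> {
    let deadline = Instant::now() + header_read_timeout; // fixed once, up front
    let mut buf = Vec::new();
    let mut chunk = [0u8; 1024];
    loop {
        // Async equivalent: tokio::time::timeout_at(deadline, stream.read(..)).
        if Instant::now() >= deadline {
            return Err("header read deadline exceeded");
        }
        let n = stream.read(&mut chunk).map_err(|_| "io error")?;
        if n == 0 {
            return Err("eof before end of headers");
        }
        buf.extend_from_slice(&chunk[..n]);
        if buf.windows(4).any(|w| w == b"\r\n\r\n") {
            return Ok(buf); // blank line: end of the header block
        }
    }
}
```

The contrast with the per-read form: `timeout(d, read())` grants a fresh `d` to every read, so N reads can consume nearly N·d in total; `timeout_at(deadline, read())` charges every read against the same fixed instant.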

## Bonus (in `tunnel_client.rs`)

3. **`TunnelMux::reply_timeout` co-varies with `batch_timeout`**: computed at construction as `fronter.batch_timeout() + 5s slack` instead of the fixed 35s const. Operators raising `request_timeout_secs` no longer have sessions abandon `reply_rx` just before `fire_batch`'s HTTP round-trip would complete.
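
The co-variance in point 3 amounts to a one-line derivation at construction time; a sketch (the free function `reply_timeout_for` is illustrative — the real value is a field computed on `TunnelMux`):

```rust
use std::time::Duration;

// Slack on top of the fronter's batch_timeout, per the PR description.
const REPLY_SLACK: Duration = Duration::from_secs(5);

// Derive reply_timeout from batch_timeout instead of a fixed 35s const,
// so raising request_timeout_secs also raises how long reply_rx is kept alive.
fn reply_timeout_for(batch_timeout: Duration) -> Duration {
    batch_timeout + REPLY_SLACK
}
```

With the default `request_timeout_secs` of 30, this reproduces the historical 35s value, so default deployments see no behavior change.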

## Verified locally (on top of v1.9.23 / main after #1117 merge)

- `cargo test --lib --release`: **231/231** ✅ (was 209 in v1.9.23 baseline; this PR adds 22 new tests covering the deadline/co-variance behavior)
- `cargo build --release --features ui --bin mhrv-rs-ui`: clean ✅

## Interaction with v1.9.20 (PR #1029)

PR #1029 added `H1_OPEN_TIMEOUT_SECS = 8` to bound the TCP+TLS handshake in `open()`. That bound is **separate** from the header-read timeout this PR addresses — both bounds exist in the same call chain. Issue #1131 (BuffOvrFlw, just opened) reports `h1 open timed out after 8s` errors which are the `open()` bound firing, not the header-read bound. Worth a follow-up to make `H1_OPEN_TIMEOUT_SECS` parameterized too, but that's a separate change.

Reviewed via Anthropic Claude.

Co-Authored-By: dazzling-no-more <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
therealaleph added a commit that referenced this pull request May 13, 2026
#1088, #620)

Bumps v1.9.23 → v1.9.24. Two PRs from @dazzling-no-more:
- #1108 (#1088): batch header read honors request_timeout_secs.
  Closes the 10s inner timeout cliff that was cascade-killing tunnel
  sessions under slow Apps Script edges. +22 regression tests (231 total).
- #1117 (#620): cargo-chef Dockerfile so tunnel-node builds without
  BuildKit. Cloud Run's gcloud-deploy path now works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dazzling-no-more dazzling-no-more deleted the fix/tunnel-node-buildkit branch May 13, 2026 11:49


Development

Successfully merging this pull request may close these issues.

Cloud Run (Google Cloud) as the tunnel node in full tunnel mode does not get deployed
