fix(docker): cargo-chef so tunnel-node builds without BuildKit (#620)#1117

Merged
therealaleph merged 1 commit into therealaleph:main from dazzling-no-more:fix/tunnel-node-buildkit
May 13, 2026
Conversation

@dazzling-no-more
Contributor

Summary

Fixes #620. The tunnel-node Dockerfile used BuildKit-only RUN --mount=type=cache directives, which fail on Cloud Run's gcloud run deploy --source . path because the underlying gcr.io/cloud-builders/docker builder does not enable BuildKit (and --set-build-env-vars DOCKER_BUILDKIT=1 doesn't actually flip it on either, per the issue reporter's logs).

Reworked the Dockerfile to use the cargo-chef pattern instead: a dedicated planner stage emits a recipe.json of dependency metadata, a cargo chef cook stage builds just the deps in their own Docker layer, and the final build stage adds src/ on top. Docker's regular layer cache now handles dependency reuse, so warm rebuilds where only src/ changes still skip the slow crate compile — no BuildKit required.

Changes

  • tunnel-node/Dockerfile:
    • Dropped # syntax=docker/dockerfile:1 parser directive and all RUN --mount=type=cache,... blocks.
    • Added cargo-chef multi-stage build (chef → planner → builder).
    • Pinned cargo-chef to exact 0.1.77 with --locked for reproducible installs.
    • Bumped base from rust:1.85-slim → rust:1.90-slim (cargo-chef's transitive deps require rustc 1.86+; tunnel-node's Cargo.toml has no rust-version pin, so the bump is free).
    • Removed ARG TARGETPLATFORM per-platform cache-id workaround from 08efbc5 — Docker's regular layer cache is already arch-scoped, so the multi-arch race that motivated it doesn't apply anymore.
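
The reworked stages can be sketched as follows. This is a minimal illustration, not the PR's actual Dockerfile: the binary name `tunnel-node`, the `WORKDIR`, and the runtime-stage contents are assumptions, and the real file may differ in paths and flags.

```dockerfile
# No `# syntax=` directive: nothing here needs BuildKit.
FROM rust:1.90-slim AS chef
RUN cargo install cargo-chef --version 0.1.77 --locked
WORKDIR /app

# Planner: emit dependency metadata only.
FROM chef AS planner
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

# Builder: cook deps in their own layer, then add src/ on top.
FROM chef AS builder
COPY --from=planner /app/recipe.json recipe.json
RUN cargo chef cook --release --recipe-path recipe.json
COPY . .
RUN cargo build --release --bin tunnel-node

FROM debian:bookworm-slim AS runtime
COPY --from=builder /app/target/release/tunnel-node /usr/local/bin/tunnel-node
ENTRYPOINT ["/usr/local/bin/tunnel-node"]
```

Because recipe.json stays byte-identical when only src/ changes, the `cargo chef cook` layer is a cache hit on warm rebuilds with the classic builder's ordinary layer cache.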

Non-changes (deliberate)

  • tunnel-node/Cargo.toml — not touched. The old Dockerfile comment claimed "matches MSRV in Cargo.toml" but no rust-version field actually exists. The Docker base bump is an internal build-environment choice; conflating it with a declared MSRV would artificially restrict the cargo build --release direct-binary path in the README.
  • Base image digest pinning — left on tag references (rust:1.90-slim, debian:bookworm-slim). Without a Renovate/Dependabot bot to keep digests fresh, digest pinning trades automatic glibc/openssl/ca-certificates CVE patching for a reproducibility property this repo doesn't currently need.

Test plan

Verified on Docker Desktop 29.4.2 (Windows):

  • DOCKER_BUILDKIT=1 docker build --no-cache . → succeeds, 130 MB final image
  • DOCKER_BUILDKIT=0 docker build --no-cache . → succeeds (Cloud Run's classic-builder path), 130 MB final image, same content
  • docker run -e TUNNEL_AUTH_KEY=testkey ... → container starts, logs "tunnel-node listening on 0.0.0.0:8080"
  • curl http://localhost:8080/health → ok
  • End-to-end Cloud Run deploy via gcloud run deploy --source . — requires a GCP project, so recommend the issue reporter @r-safavi re-run the original command from #620 ("Cloud Run (Google Cloud) as the tunnel node in full tunnel mode does not get deployed") to confirm

Related

Closes #620.

@github-actions Bot added the type: fix label (fix: PR — auto-applied by release-drafter) May 13, 2026
@therealaleph therealaleph merged commit 4d2ce91 into therealaleph:main May 13, 2026
1 check passed
therealaleph pushed a commit that referenced this pull request May 13, 2026

Fixes #1088 — under Full mode, a single slow Apps Script edge cascade-killed every in-flight tunnel session sharing its batch. Users on 1.9.21+ saw frequent 10s "batch timeout" errors and lost download progress on Telegram / browser sessions.

## Root cause

`read_http_response` in `domain_fronter.rs` had a **hardcoded 10s header-read timeout** that ran *inside* `tunnel_batch_request_to` — independent of and shorter than the outer `tokio::time::timeout(batch_timeout, …)` in `fire_batch`. Apps Script cold starts routinely land in the 8-12s range (PR #1040's A/B recorded 4/30 H1 batches timing out at exactly 10s after the H2→H1 switch), so the inner cliff fired as a false-positive batch timeout well before `request_timeout_secs` (default 30s) could.

Secondary: even with a parameterized timeout, the per-read `timeout(d, stream.read(...))` form would silently extend its budget if a peer drip-fed bytes at intervals just under `d` — a slow edge could keep the loop alive past the outer `batch_timeout` and defeat the whole wiring.

## Fix (two changes in `domain_fronter.rs`)

1. **`tunnel_batch_request_to` passes `batch_timeout` to the header read** via new `read_http_response_with_header_timeout` helper. `Config::request_timeout_secs` is now the only knob controlling how long we wait for an Apps Script edge to start responding. Other callers (relay path, exit-node) keep the historical 10s value.

2. **Header read uses a single absolute deadline** (`tokio::time::timeout_at(deadline, …)`) instead of per-read `timeout()`. Total elapsed across all header reads is bounded by `header_read_timeout`, regardless of read cadence.
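
The absolute-deadline idea can be shown in a synchronous sketch (the real code is async and uses `tokio::time::timeout_at`; the function name, error strings, and header-terminator scan below are illustrative, not the crate's actual API):

```rust
use std::io::Read;
use std::time::{Duration, Instant};

// Hypothetical sketch: bound the TOTAL time spent reading headers with one
// absolute deadline, instead of restarting a fresh timeout on every read().
// A peer drip-feeding bytes cannot extend the budget past the deadline.
fn read_headers_with_deadline<R: Read>(
    stream: &mut R,
    header_read_timeout: Duration,
) -> Result<Vec<u8>, &'static str> {
    let deadline = Instant::now() + header_read_timeout; // fixed once, up front
    let mut buf = Vec::new();
    let mut chunk = [0u8; 1024];
    loop {
        // Async equivalent: tokio::time::timeout_at(deadline, stream.read(..)).
        if Instant::now() >= deadline {
            return Err("header read deadline exceeded");
        }
        let n = stream.read(&mut chunk).map_err(|_| "io error")?;
        if n == 0 {
            return Err("eof before end of headers");
        }
        buf.extend_from_slice(&chunk[..n]);
        if buf.windows(4).any(|w| w == b"\r\n\r\n") {
            return Ok(buf); // blank line: end of the header block
        }
    }
}
```

The contrast with the per-read form: `timeout(d, read())` grants a fresh `d` to every read, so N reads can consume nearly N·d in total; `timeout_at(deadline, read())` charges every read against the same fixed instant.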

## Bonus (in `tunnel_client.rs`)

3. **`TunnelMux::reply_timeout` co-varies with `batch_timeout`**: computed at construction as `fronter.batch_timeout() + 5s slack` instead of the fixed 35s const. Operators raising `request_timeout_secs` no longer have sessions abandon `reply_rx` just before `fire_batch`'s HTTP round-trip would complete.
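
The co-variance in point 3 amounts to a one-line derivation at construction time; a sketch (the free function `reply_timeout_for` is illustrative — the real value is a field computed on `TunnelMux`):

```rust
use std::time::Duration;

// Slack on top of the fronter's batch_timeout, per the PR description.
const REPLY_SLACK: Duration = Duration::from_secs(5);

// Derive reply_timeout from batch_timeout instead of a fixed 35s const,
// so raising request_timeout_secs also raises how long reply_rx is kept alive.
fn reply_timeout_for(batch_timeout: Duration) -> Duration {
    batch_timeout + REPLY_SLACK
}
```

With the default `request_timeout_secs` of 30, this reproduces the historical 35s value, so default deployments see no behavior change.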

## Verified locally (on top of v1.9.23 / main after #1117 merge)

- `cargo test --lib --release`: **231/231** ✅ (was 209 in v1.9.23 baseline; this PR adds 22 new tests covering the deadline/co-variance behavior)
- `cargo build --release --features ui --bin mhrv-rs-ui`: clean ✅

## Interaction with v1.9.20 (PR #1029)

PR #1029 added `H1_OPEN_TIMEOUT_SECS = 8` to bound the TCP+TLS handshake in `open()`. That bound is **separate** from the header-read timeout this PR addresses — both bounds exist in the same call chain. Issue #1131 (BuffOvrFlw, just opened) reports `h1 open timed out after 8s` errors which are the `open()` bound firing, not the header-read bound. Worth a follow-up to make `H1_OPEN_TIMEOUT_SECS` parameterized too, but that's a separate change.

Reviewed via Anthropic Claude.

Co-Authored-By: dazzling-no-more <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
therealaleph added a commit that referenced this pull request May 13, 2026
#1088, #620)

Bumps v1.9.23 → v1.9.24. Two PRs from @dazzling-no-more:
- #1108 (#1088): batch header read honors request_timeout_secs.
  Closes the 10s inner timeout cliff that was cascade-killing tunnel
  sessions under slow Apps Script edges. +22 regression tests (231 total).
- #1117 (#620): cargo-chef Dockerfile so tunnel-node builds without
  BuildKit. Cloud Run's gcloud-deploy path now works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dazzling-no-more dazzling-no-more deleted the fix/tunnel-node-buildkit branch May 13, 2026 11:49


Development

Successfully merging this pull request may close these issues.

Cloud Run (Google Cloud) as the tunnel node in full tunnel mode does not get deployed
