Skip to content

fix(sandbox): venv execve envp, loopback TLS, remove since, timeout race #493

Merged
xiaods merged 9 commits into
mainfrom
dev
Jun 5, 2026
Merged

fix(sandbox): venv execve envp, loopback TLS, remove since, timeout race #493
xiaods merged 9 commits into
mainfrom
dev

Conversation

@xiaods

@xiaods xiaods commented Jun 5, 2026

Copy link
Copy Markdown
Owner

No description provided.

xiaods added 3 commits June 5, 2026 21:30
- venv: Dockerfile creates /workspace/.venv with ENV PATH; sandboxd passes
  proper envp to execve (was empty, losing Dockerfile PATH); ensureVenv
  restored in handleExec for workspace reset recovery
- loopback: resolveCreds now receives endpoint for InsecureSkipVerify
  when connecting to 127.0.0.1 (previously only dialMTLS path had this)
- since: remove non-functional --since filter from proto through CLI,
  gRPC, and sandboxd — mtime was always 0 so filtering was broken
- timeout: exec.zig spawnTimeoutKiller probes kill(pid,0) before
  SIGKILL to avoid killing a reused PID after process already exited
Previously the build-sandbox-cli job used raw go build without -X
ldflags, so version.Version stayed at its Go default "dev". Switch to
zig build sandboxcli which runs hack/version.sh and injects the release
tag via build.zig's addSandboxCLIBuild function, matching how the main
k8e binary is built.
… compat

gVisor's sentry userspace netstack cannot handle Cilium's eBPF L7 DNS
proxy redirects at all — not just TCP but UDP too. The split-DNS
approach (UDP+L7, TCP plain) from 06b5c4b assumed UDP was safe, but
gVisor rejects the eBPF redirect on both protocols, causing DNS timeout
and complete egress failure.

Replace with a single protocol:ANY DNS rule without L7 rules. The
Cilium agent independently resolves FQDNs from the toFQDNs policy list,
so egress filtering still works without DNS response interception.
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Test Results

306 tests  ±0   306 ✅ ±0   4m 20s ⏱️ -1s
110 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 9ce1689. ± Comparison against base commit 64282da.

♻️ This comment has been updated with latest results.

xiaods added 6 commits June 5, 2026 22:13
…ds complexity

- Add loopbackTLSConfig helper with VerifyConnection callback that validates
  the server cert chain against RootCAs while skipping hostname verification
  (loopback certs are not issued for 127.0.0.1). Replaces 4 bare
  InsecureSkipVerify sites, so the loaded cert pool is no longer dead code.
- Refactor resolveCreds (cognitive complexity 23→below 15) by extracting
  resolveCredsFromEnv, resolveCredsFromTLSFiles, resolveCredsFromKubeconfig,
  and resolveCredsFallback helpers.
When a pod is deleted externally (kubectl delete pod), getPodIP
returned the cached status.podIP without verifying the pod still
exists, causing the HTTP dial to stall until timeout. Now it queries
pods by label first and returns codes.NotFound if the pod is gone.

Also extend isSessionExpired to recognize codes.Unavailable as a
safety net for other unreachable-pod scenarios.
Cert revocation (CRL/OCSP) checking is not applicable for loopback
connections authenticated by the internal cluster CA with short-lived
certificates. Add NOSONAR annotation to document this design decision.
…opbackTLSConfig

- Extract pollForPodIP helper from getPodIP to keep complexity below 15
- Add NOSONAR annotation on loopbackTLSConfig function declaration
…ehavior

TestIsSessionExpired_grpcUnavailableNotNoPodIP expected false for
codes.Unavailable, but isSessionExpired now recognizes Unavailable as
an expired session. Rename test and flip expectation to true.
@sonarqubecloud

sonarqubecloud Bot commented Jun 5, 2026

Copy link
Copy Markdown

@xiaods xiaods merged commit ee334e2 into main Jun 5, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant