test(e2e): mTLS datapath scenario for Kind by Agent-Hellboy · Pull Request #339 · Agent-Hellboy/mcp-runtime

Agent-Hellboy · 2026-06-24T19:30:02Z

Summary

Add test/e2e/scenarios/mtls.sh with a dedicated mtls Kind E2E scenario that verifies test-mode mcp-runtime-ca, deploys an auth.mode: mtls MCPServer, rejects unauthenticated initialize on Traefik websecure, and accepts session-bound client certs from mcp-runtime adapter enroll.
Wire the scenario into kind.sh, PR path selection (select_pr_scenarios.sh), and scenario validation tests.

Test plan

bash test/e2e/scenarios_test.sh
E2E_SCENARIOS=mtls bash test/e2e/kind.sh (requires test-mode cluster with workload PKI from setup)

Notes

Depends on test-mode auto-provisioning of ClusterIssuer/mcp-runtime-ca and operator/runtime MCP_MTLS_CLUSTER_ISSUER wiring (merged via feat: mTLS/SPIFFE auth mode (Traefik-terminate + trusted header) #331 / follow-on setup PRs).
Select with E2E_SCENARIOS=mtls or via PR path selection for operator mTLS, cert-manager, spiffe-identity plugin, and gateway changes.

Made with Cursor

gemini-code-assist

Code Review

This pull request introduces a new end-to-end (E2E) test scenario for verifying the mTLS datapath in Kind environments, including the deployment of an mTLS-enabled MCPServer, certificate generation, and validation of client certificate authentication. The review feedback highlights two key improvements in the test script: first, removing the -f flag from curl during the spoofing check to prevent false positives when handling HTTP error responses, and second, ensuring both the gateway TLS and CA secrets are verified after the timeout loop to provide a clearer error message on failure.

Exercise test-mode workload PKI, Traefik websecure mTLS termination, adapter enroll, and session-bound client certificates without relying on governance headers. Co-authored-by: Cursor <cursoragent@cursor.com>

The terminate-and-re-encrypt mTLS model serves traffic through a Traefik IngressRoute and deletes the legacy passthrough IngressRouteTCP, but checkIngressReady still looked for the deleted IngressRouteTCP. That left every mTLS MCPServer stuck at PartiallyReady (ingressReady never true), which made the mTLS datapath e2e scenario time out waiting for phase Ready. Point the readiness check at the IngressRoute and cover both the ready and not-ready cases.

Also address e2e review feedback in scenarios/mtls.sh:
- drop curl -f on the spoofed-headers check so the gateway's 4xx/5xx body is captured and validated instead of silently skipped;
- fail fast with a clear message if either the gateway-mtls or mtls-ca secret is missing after the wait loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
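The effect of dropping -f on the spoofed-headers check can be sketched as follows. This is an illustration only, not the code in scenarios/mtls.sh: the function names, the spoofed header name, and the URL argument are hypothetical. Without -f, curl still writes the response body on a 4xx/5xx, so both the status and the body can be inspected.

```bash
# Returns 0 when the HTTP status is a 4xx or 5xx.
is_error_status() {
  case "$1" in
    4[0-9][0-9]|5[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical spoof check: send a forged identity header and expect a deny.
# The header name and the URL are placeholders, not taken from the PR.
spoof_check() {
  local url="$1" body_file status
  body_file="$(mktemp)"
  # No -f: curl keeps the error body; -w prints the status code on stdout.
  status="$(curl -sS -k -o "$body_file" -w '%{http_code}' \
    -H 'X-Example-Identity: spoofed' "$url")" || return 1
  if ! is_error_status "$status"; then
    echo "expected a deny, got HTTP ${status}" >&2
    return 1
  fi
  cat "$body_file"   # the caller can assert on the deny body
}
```

With -f, curl would exit 22 on the deny and write nothing to the body file, so a check written this way would have nothing to validate.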

The mtls datapath drives traffic through Traefik's websecure entrypoint (TRAEFIK_TLS_PORT -> 8443); the plaintext web port-forward (18080) is never used by the scenario. Calling ensure_traefik_port_forward made the scenario time out waiting on localhost:18080 deep into the run for a port it doesn't need. Keep only the websecure port-forward. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wait_for_policy_text hardcoded the default ${SERVER_NAME}-gateway-policy ConfigMap, so the mtls scenario's waits for its grant and session never observed mtls-mcp-server-gateway-policy and timed out. Add an optional server argument (default SERVER_NAME) and pass MTLS_SERVER_NAME from the mtls scenario. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
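The optional-argument pattern described above might look like the sketch below. The helper name policy_configmap_name is hypothetical; only the `${SERVER_NAME}-gateway-policy` naming and the default to SERVER_NAME come from the commit message.

```bash
# Resolve the policy ConfigMap name for a server.
# $1 is optional and falls back to the global SERVER_NAME.
policy_configmap_name() {
  local server="${1:-${SERVER_NAME}}"
  printf '%s-gateway-policy\n' "$server"
}
```

A caller in the mtls scenario would then pass its own server name, for example `policy_configmap_name "$MTLS_SERVER_NAME"`, while existing callers keep working unchanged.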

The mcp-sentinel-config ConfigMap (rendered by renderAnalyticsConfigManifest) never received MCP_MTLS_CLUSTER_ISSUER, so runtime-api — which consumes it via envFrom and uses it to sign adapter/session CSRs — saw an empty issuer and returned "503 workload certificate issuer is not configured" on adapter enroll. The replacement only lived in renderAnalyticsManifest, which handles the other manifests that don't carry the key. Inject the issuer in the config renderer and drop the dead replacement. Tighten the mtls e2e check to assert the issuer on the ConfigMap (the real source via envFrom) instead of an inline env var that never existed, so a propagation regression fails fast with a clear message. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…erts HandleAdapterCertificate base64-encoded the raw CSR DER into the CertificateRequest spec.request, but cert-manager's admission webhook requires a PEM-encoded CSR and rejected it with "error decoding certificate request PEM block" — surfacing as "503 issue adapter certificate" on `adapter enroll` and breaking the mTLS datapath. PEM-encode the CSR before base64. Verified against a live mcp-runtime-ca issuer (request now issues) and covered by a regression test asserting spec.request decodes to a PEM CERTIFICATE REQUEST. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
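The encoding layers involved can be illustrated in shell. The actual fix lives in the Go handler; this is only a sketch of the difference between "base64 of DER" and "base64 of PEM", and both function names are hypothetical.

```bash
# Read raw CSR DER bytes on stdin and write a PEM CERTIFICATE REQUEST block.
der_to_pem_csr() {
  local body
  # PEM bodies are base64 wrapped at 64 columns.
  body="$(base64 | tr -d '\n' | fold -w 64)"
  printf '%s\n' \
    '-----BEGIN CERTIFICATE REQUEST-----' \
    "$body" \
    '-----END CERTIFICATE REQUEST-----'
}

# Value for CertificateRequest spec.request: base64 of the PEM text,
# not base64 of the DER bytes.
csr_spec_request() {
  der_to_pem_csr | base64 | tr -d '\n'
}
```

Usage would be along the lines of `csr_spec_request < adapter.csr.der`, where the file name is a placeholder.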

An mtls scenario failure (e.g. adapter cert issuance) previously left no server-side evidence in the CI artifact bundle, forcing local repro to find the root cause. Add an ERR-trap diagnostics collector (with set -E so it fires inside functions and the adapter-enroll command substitution) that dumps the MCPServer, CertificateRequests, certs/secrets, sentinel ConfigMap, and runtime-api/platform-api/mtls-server logs into WORKDIR, which the EXIT trap archives. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
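A minimal sketch of the ERR-trap pattern, assuming a stand-in collector: the real one dumps kubectl output, while this one only records where the failure happened. The function names and the diagnostics.txt file are hypothetical. The scenario body runs in a child bash so that its `set -e` exit does not abort the caller.

```bash
run_scenario_with_diagnostics() {
  local workdir="$1"
  WORKDIR="$workdir" bash -s <<'EOF'
# -e aborts on the first failure; -E (errtrace) makes functions and
# command substitutions inherit the ERR trap.
set -eE

collect_diagnostics() {
  # The real collector would write kubectl get/describe/logs output here.
  printf 'failure inside %s\n' "${FUNCNAME[1]:-main}" >> "${WORKDIR}/diagnostics.txt"
}
trap collect_diagnostics ERR

enroll_adapter() { false; }   # stands in for a failing adapter enroll
enroll_adapter
echo "not reached"
EOF
}
```

Without -E, the shell would exit inside enroll_adapter before any trap ran, which matches the "no server-side evidence" symptom described above.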

The session-name regex used [^\\s]+, which (double-escaped) matched "not backslash or the letter s" rather than non-whitespace. Once adapter enroll started succeeding, its `.../session/<name> (expires ...)` output made the capture swallow " (expire" up to the s in "expires", so the policy wait searched for a garbled name and timed out. Match the session token explicitly ([A-Za-z0-9._-]+). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
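The corrected capture can be sketched with sed. The function name and the sample line in the usage note are hypothetical; only the `.../session/<name> (expires ...)` shape and the `[A-Za-z0-9._-]+` class come from the commit message.

```bash
# Extract the session name from adapter-enroll output.
# Inside a bracket expression a backslash is literal, so [^\\s] means
# "not a backslash and not the letter s"; an explicit token class avoids that.
extract_session_name() {
  printf '%s\n' "$1" | sed -nE 's#.*/session/([A-Za-z0-9._-]+).*#\1#p'
}
```

For example, `extract_session_name 'grant/demo/session/mtls-abc.1 (expires soon)'` prints `mtls-abc.1`, stopping at the space instead of running on into "(expires".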

The datapath curls connected to https://127.0.0.1:<port> (SNI 127.0.0.1), so Traefik served its default cert and did not bind the router's client-cert TLS options to that handshake — server verification failed (curl exit 60) and mTLS enforcement wasn't exercised. Give the mtls server an ingressHost so the operator emits a Host()-scoped IngressRoute, reach it via curl --resolve so the SNI selects that router (and its client-cert options), and use -k because the caller-facing server cert is Traefik's self-signed default in HTTP test-mode (no default TLSStore). The scenario still verifies the client-cert mTLS property: no client cert is rejected, the session-bound cert is accepted, spoofed headers denied. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
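A hedged sketch of the curl invocation this describes. The function names are hypothetical, and the host, port, and certificate paths are caller-supplied placeholders; `--resolve`, `--cert`, `--key`, and `-k` are standard curl options.

```bash
# Build the --resolve entry that pins a hostname to the local port-forward.
resolve_entry() {
  printf '%s:%s:127.0.0.1\n' "$1" "$2"
}

# Call the mTLS host through the port-forward. Using the hostname in the URL
# makes curl send it as SNI, so Traefik selects the Host()-scoped router.
mtls_curl() {
  local host="$1" port="$2" cert="$3" key="$4"
  shift 4
  curl -sS -k \
    --resolve "$(resolve_entry "$host" "$port")" \
    --cert "$cert" --key "$key" \
    "$@" "https://${host}:${port}/"
}
```

Any extra arguments (headers, request body) pass straight through to curl, so the same helper could serve both the rejected and the accepted request by varying the certificate arguments.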

…overlay The mtls terminate-and-re-encrypt datapath is delivered through Traefik CRDs (IngressRoute/Middleware/TLSOption/ServersTransport) and the spiffe-identity local plugin, but the http overlay's args patch replaced the base args and dropped --providers.kubernetescrd, the websecure TLS default, and the spiffe-identity plugin registration. Traefik therefore ignored every IngressRoute the operator created and the mtls host 404'd. Re-add the CRD provider (scoped to the server namespaces), websecure http.tls, and the spiffe-identity plugin module while keeping web plaintext for the HTTP test flows. Verified on a live cluster: with the CRD provider enabled Traefik loads the IngressRoute and routes the host. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Capture the Traefik routing layer (IngressRoute/Middleware/TLSOption/ServersTransport CRs, the Traefik deployment and controller logs), the mcp-servers workload/pod describe/events, and a verbose curl replay of the datapath request (status, headers, body) into the archived artifact bundle, so a routing failure (404 vs 502 vs deny) is diagnosable from CI instead of requiring local repro. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Widening the deadline wasn't enough: watch() captures its baseline modtime in its own goroutine, so under CI load it could start after the test rewrote the cert, baseline on cert-2's modtime, and never observe a change (the reload never fired, so it hit the deadline at ~10s). Push the cert modtime strictly forward on every poll so the watcher sees a fresh change no matter when its goroutine started, making the reload deterministic instead of racing the goroutine start. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Traefik v2.10 starts informers for every IngressRoute/Middleware/TLSOption/ServersTransport/... kind in BOTH traefik.io and the legacy traefik.containo.us group whenever those CRDs are installed. The traefik-ingressclass ClusterRole only granted a subset of traefik.io, so list/watch on traefik.containo.us (and traefik.io udp/tcp kinds) was forbidden. That stalls the kubernetescrd provider, so no IngressRoute is loaded and mtls hosts fall through to Traefik's default cert and 404. Grant get/list/watch on the full kind set across both groups. Verified on a live cluster: the forbidden errors clear and the provider syncs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Agent-Hellboy · 2026-07-01T20:32:13Z

Addressed the review comments and rebased on latest main (resolves the merge conflict):

spoof-check curl -fsS → now uses -sS -k and captures the body so the deny is validated.
secret-wait → now checks both -gateway-mtls and -mtls-ca after the timeout loop.

While getting the datapath green end-to-end on the Kind e2e, several real gaps were found and fixed (each verified on a live cluster):

operator checkIngressReady tracked the deleted IngressRouteTCP instead of the IngressRoute → mtls servers stuck PartiallyReady;
runtime-api sent the adapter CSR as DER (cert-manager needs PEM) → 503 on adapter enroll;
MCP_MTLS_CLUSTER_ISSUER was never propagated into the sentinel ConfigMap;
http ingress overlay dropped --providers.kubernetescrd + the spiffe-identity plugin;
Traefik ClusterRole lacked RBAC for the traefik.containo.us CRD group (+ udp/tcp kinds), stalling the CRD provider.

Also expanded the e2e failure diagnostics (Traefik CRs/logs, verbose curl replay) so future routing failures are debuggable from the artifact bundle.

…er port Traefik's KubernetesCRD provider resolves IngressRoute service references by looking up the port in Service.spec.ports. The mTLS IngressRoute was passing gateway.port (8091 — the container/targetPort) instead of servicePort (80 — the Service's exposed port). Traefik logged "service port not found: 8091" and skipped the route entirely, falling back to a 404 with its default cert. The correct value is ServicePort (80). Traefik discovers pod endpoints from the Service and connects to each pod at the resolved targetPort (8091) automatically — the IngressRoute only needs the Service-level port. Adds a regression assertion to TestReconcileMTLSIngressGeneratesTraefikResources and seeds ServicePort=80 in the shared mtlsServer() fixture. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

When cert-manager issues the CA bundle secrets after MCPServer creation, Traefik receives a watch event and reloads its TLS config (TLSOption and ServersTransport). The reload briefly closes the websecure (8443) listener, which breaks the port-forward with "broken pipe" and leaves port 18443 refusing connections. Add recover_traefik_tls_port_forward_if_needed() to kind.sh (mirrors recover_traefik_port_forward_if_needed for the HTTP port) and call it in mtls.sh before the authenticated mTLS curl, which runs ~30s after the initial port-forward start (grant + adapter-enroll round-trips in between). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… curls Traefik applies TLSOption and ServersTransport atomically during a config reload triggered by cert-manager issuing the CA bundle secret. Until that reload completes, TLSOption lacks its CAFiles so RequireAndVerifyClientCert is not enforced: the TLS handshake succeeds without a client cert, Traefik routes the request via plain HTTP (ServersTransport not yet loaded), and the TLS-only gateway returns HTTP 400. Add wait_for_mtls_traefik_ready() in kind.sh that polls until a no-cert curl exits 35 (TLS handshake failure = certificate_required alert). This proves TLSOption is active; because both TLSOption and ServersTransport are reloaded in the same Traefik config snapshot, exit-35 also implies ServersTransport is using TLS to the backend. Call it in mtls.sh after the port-forward is up and before the reject/accept curl sequence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ty check In Kind clusters, kubectl port-forward exits on any TCP error from the upstream pod — including the RST Traefik sends after a TLS certificate_required alert. The previous wait_for_mtls_traefik_ready approach probed the websecure port with curl, which triggered this RST on every iteration, creating an infinite port-forward restart loop that always timed out (exit 56 each time). Replace with wait_for_mtls_traefik_stable, which polls kubectl logs until no level=error lines mentioning the server name appear in a 6s window. This proves TLSOption (RequireAndVerifyClientCert + CA loaded) and ServersTransport (TLS to gateway) are both applied without touching the websecure port at all. After the stability wait, recover the port-forward once (Traefik's initial secret-loading retries may have broken it) before proceeding to the reject/accept curl sequence. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
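The log-based stability check could be sketched as below. This is not the function from kind.sh: the helper names, the `traefik` namespace, and the `deploy/traefik` target are assumptions, and only the "no level=error lines mentioning the server name in a 6s window" rule comes from the commit message.

```bash
# Count log lines on stdin that are level=error and mention the server name.
count_server_errors() {
  grep 'level=error' | grep -c -F -- "$1" || true
}

# Poll recent Traefik logs until a 6s window has no matching error lines.
# Namespace and deployment name are assumptions for illustration.
wait_for_traefik_stable() {
  local server="$1" timeout="${2:-120}" end logs errors
  end=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$end" ]; do
    if logs="$(kubectl logs -n traefik deploy/traefik --since=6s 2>/dev/null)"; then
      errors="$(printf '%s\n' "$logs" | count_server_errors "$server")"
      [ "$errors" -eq 0 ] && return 0
    fi
    sleep 2
  done
  return 1
}
```

Because the check only reads logs, it never opens a connection to the websecure port and so cannot trigger the port-forward restart loop.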

Without scheme: https, Traefik defaults to plain HTTP for backends on port 80. The ServersTransport TLS config (CA, client cert, serverName) is only applied when Traefik connects via HTTPS, so omitting the scheme caused Traefik to forward requests over HTTP and the gateway responded with 400 "Client sent an HTTP request to an HTTPS server". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The gateway reloads its policy from the volume-mounted ConfigMap every 5s. wait_for_policy_text confirms the ConfigMap API is updated, but the kubelet may not have propagated the change to the pod's volume mount yet, so the next gateway reload can still read the old policy. Previously the accepting initialize curl ran immediately after wait_for_policy_text, racing the policy reload cycle and returning 401 session_not_found. Retry up to 15 times (30s) until the response contains "result", with port-forward recovery and a 2s sleep between attempts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
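The retry described here follows a common shape, sketched below with a hypothetical helper name. The command to run is passed in, so the port-forward recovery step is left out of this illustration.

```bash
# Run a command up to <attempts> times, sleeping <delay> seconds between
# tries, until its output contains <needle>. Prints the matching output.
retry_until_contains() {
  local attempts="$1" delay="$2" needle="$3" out i=1
  shift 3
  while [ "$i" -le "$attempts" ]; do
    out="$("$@" 2>/dev/null)" || true
    case "$out" in
      *"$needle"*) printf '%s\n' "$out"; return 0 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

With the numbers from the commit message, a call would look like `retry_until_contains 15 2 '"result"' <initialize curl command>`, giving the 30s budget mentioned above.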

wait_for_policy_text checks the ConfigMap via the API server, but the kubelet propagates ConfigMap updates to volume mounts on its own sync period. On loaded Kind CI nodes this lag exceeded 30s, so all 15 curl retries saw a stale policy file in the gateway pod and returned 401 session_not_found. Add wait_for_gateway_policy_file: it kubectl-execs into the gateway pod and polls /var/run/mcp-runtime/policy/policy.json (the actual file the gateway reloads every 5s) until the session name appears, with a 180s deadline. The curl retry loop is kept as a safety net for the gateway's 5s reload tick after the file is updated.

Diagnostics improvements:
- Dump the in-pod policy file and ConfigMap side by side so future failures can distinguish kubelet propagation lag from operator bugs.
- Use fully-qualified group names for Traefik CRD dumps (traefik.io / traefik.containo.us) to avoid ambiguous bare names that return empty lists when both API groups are registered.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
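A deadline-based poll of the in-pod file might be sketched like this. The helper name is hypothetical and the reader command is passed in, so the same loop works with kubectl exec in the scenario or with a local file elsewhere; only the policy file path and the 180s deadline come from the commit message.

```bash
# Poll a reader command until its output contains <needle> or <timeout>
# seconds pass. Returns 0 on a match and 1 when the deadline is reached.
wait_for_text_before_deadline() {
  local needle="$1" timeout="$2" interval="$3" end out
  shift 3
  end=$(( $(date +%s) + timeout ))
  while :; do
    out="$("$@" 2>/dev/null)" || out=""
    case "$out" in
      *"$needle"*) return 0 ;;
    esac
    [ "$(date +%s)" -ge "$end" ] && return 1
    sleep "$interval"
  done
}
```

In the scenario this could be called as `wait_for_text_before_deadline "$session" 180 2 kubectl exec -n "$ns" "$pod" -- cat /var/run/mcp-runtime/policy/policy.json`, where `$session`, `$ns`, and `$pod` are placeholders for values the script already knows.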

gemini-code-assist Bot reviewed Jun 24, 2026

Comment thread test/e2e/scenarios/mtls.sh Outdated

Comment thread test/e2e/scenarios/mtls.sh Outdated

Agent-Hellboy force-pushed the test/mtls_e2e_datapath branch from fb5ac6e to 724831c Compare June 30, 2026 10:08

Agent-Hellboy and others added 13 commits July 2, 2026 01:59

test(e2e): add mtls datapath scenario for Kind

80b4e5e

Agent-Hellboy force-pushed the test/mtls_e2e_datapath branch from 505098a to ea70f47 Compare July 1, 2026 20:31

Agent-Hellboy and others added 7 commits July 2, 2026 02:52

Agent-Hellboy wants to merge 20 commits into main from test/mtls_e2e_datapath
