test(e2e): mTLS datapath scenario for Kind #339

Open
Agent-Hellboy wants to merge 20 commits into main from test/mtls_e2e_datapath

Conversation

@Agent-Hellboy

Owner

Summary

  • Add test/e2e/scenarios/mtls.sh with a dedicated mtls Kind E2E scenario that verifies test-mode mcp-runtime-ca, deploys an auth.mode: mtls MCPServer, rejects unauthenticated initialize on Traefik websecure, and accepts session-bound client certs from mcp-runtime adapter enroll.
  • Wire the scenario into kind.sh, PR path selection (select_pr_scenarios.sh), and scenario validation tests.

Test plan

  • bash test/e2e/scenarios_test.sh
  • E2E_SCENARIOS=mtls bash test/e2e/kind.sh (requires test-mode cluster with workload PKI from setup)

Notes

  • Depends on test-mode auto-provisioning of ClusterIssuer/mcp-runtime-ca and operator/runtime MCP_MTLS_CLUSTER_ISSUER wiring (merged via feat: mTLS/SPIFFE auth mode (Traefik-terminate + trusted header) #331 / follow-on setup PRs).
  • Select with E2E_SCENARIOS=mtls or via PR path selection for operator mTLS, cert-manager, spiffe-identity plugin, and gateway changes.

Made with Cursor

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new end-to-end (E2E) test scenario for verifying the mTLS datapath in Kind environments, including the deployment of an mTLS-enabled MCPServer, certificate generation, and validation of client certificate authentication. The review feedback highlights two key improvements in the test script: first, removing the -f flag from curl during the spoofing check to prevent false positives when handling HTTP error responses, and second, ensuring both the gateway TLS and CA secrets are verified after the timeout loop to provide a clearer error message on failure.

Comment thread test/e2e/scenarios/mtls.sh Outdated
Comment thread test/e2e/scenarios/mtls.sh Outdated
@Agent-Hellboy Agent-Hellboy force-pushed the test/mtls_e2e_datapath branch from fb5ac6e to 724831c Compare June 30, 2026 10:08
Agent-Hellboy and others added 13 commits July 2, 2026 01:59
Exercise test-mode workload PKI, Traefik websecure mTLS termination, adapter
enroll, and session-bound client certificates without relying on governance headers.

Co-authored-by: Cursor <cursoragent@cursor.com>
The terminate-and-re-encrypt mTLS model serves traffic through a Traefik
IngressRoute and deletes the legacy passthrough IngressRouteTCP, but
checkIngressReady still looked for the deleted IngressRouteTCP. That left
every mTLS MCPServer stuck at PartiallyReady (ingressReady never true),
which made the mTLS datapath e2e scenario time out waiting for phase
Ready. Point the readiness check at the IngressRoute and cover both the
ready and not-ready cases.

Also address e2e review feedback in scenarios/mtls.sh:
- drop curl -f on the spoofed-headers check so the gateway's 4xx/5xx
  body is captured and validated instead of silently skipped;
- fail fast with a clear message if either the gateway-mtls or mtls-ca
  secret is missing after the wait loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
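
The spoofed-headers change in the commit above can be sketched as follows. This is a minimal illustration, not the code in scenarios/mtls.sh: the header name, the set of statuses treated as a deny, and both function names are assumptions. Only the curl flags (`-sS`, `-k`, `-o`, `-w`) are standard.

```shell
#!/usr/bin/env bash
# Sketch only: header name, deny statuses, and function names are assumptions.

# Decide whether a captured response counts as a deny.
# Without `curl -f`, a 4xx/5xx still yields a status and a body to inspect.
is_denied_response() {
  local status="$1" body="$2"
  case "$status" in
    401|403) ;;                 # assumed deny statuses
    *) return 1 ;;
  esac
  # A JSON-RPC success carries a "result" member; a deny must not.
  case "$body" in
    *'"result"'*) return 1 ;;
  esac
  return 0
}

# Hypothetical wrapper: send a request carrying a spoofed identity header.
spoof_check() {
  local url="$1" body_file status
  body_file="$(mktemp)"
  # No -f: curl exits 0 on HTTP errors, so the body is written and kept.
  status="$(curl -sS -k -o "$body_file" -w '%{http_code}' \
    -H 'X-Example-Identity: spoofed' "$url")" || { rm -f "$body_file"; return 2; }
  is_denied_response "$status" "$(cat "$body_file")"
  local rc=$?
  rm -f "$body_file"
  return "$rc"
}
```

With `-f`, curl would exit non-zero on the 4xx and discard the body, so the check could not tell a real deny from an unrelated failure.
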
The mtls datapath drives traffic through Traefik's websecure entrypoint
(TRAEFIK_TLS_PORT -> 8443); the plaintext web port-forward (18080) is
never used by the scenario. Calling ensure_traefik_port_forward made the
scenario time out waiting on localhost:18080 deep into the run for a
port it doesn't need. Keep only the websecure port-forward.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wait_for_policy_text hardcoded the default ${SERVER_NAME}-gateway-policy
ConfigMap, so the mtls scenario's waits for its grant and session never
observed mtls-mcp-server-gateway-policy and timed out. Add an optional
server argument (default SERVER_NAME) and pass MTLS_SERVER_NAME from the
mtls scenario.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
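
A rough sketch of the optional server argument described in the commit above. The real helper lives in the e2e scripts; the body below, the default server name, the namespace, and the `policy_configmap_name` helper are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: defaults, namespace, and the loop body are assumptions.
SERVER_NAME="${SERVER_NAME:-mcp-server}"   # hypothetical default

# Name of the policy ConfigMap for a server (defaults to SERVER_NAME).
policy_configmap_name() {
  local server="${1:-$SERVER_NAME}"
  printf '%s-gateway-policy\n' "$server"
}

# wait_for_policy_text <text> [server] [timeout_seconds]
wait_for_policy_text() {
  local text="$1" server="${2:-$SERVER_NAME}" timeout="${3:-120}"
  local namespace="${MCP_NAMESPACE:-mcp-servers}"   # assumed namespace
  local cm deadline
  cm="$(policy_configmap_name "$server")"
  deadline=$((SECONDS + timeout))
  while [ "$SECONDS" -lt "$deadline" ]; do
    if kubectl get configmap "$cm" -n "$namespace" -o yaml 2>/dev/null \
        | grep -qF -- "$text"; then
      return 0
    fi
    sleep 2
  done
  echo "timed out waiting for '$text' in configmap/$cm" >&2
  return 1
}
```

The mtls scenario would then call something like `wait_for_policy_text "$session_name" "$MTLS_SERVER_NAME"`, while existing callers keep the one-argument form.
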
The mcp-sentinel-config ConfigMap (rendered by renderAnalyticsConfigManifest)
never received MCP_MTLS_CLUSTER_ISSUER, so runtime-api — which consumes it via
envFrom and uses it to sign adapter/session CSRs — saw an empty issuer and
returned "503 workload certificate issuer is not configured" on adapter enroll.
The replacement only lived in renderAnalyticsManifest, which handles the other
manifests that don't carry the key. Inject the issuer in the config renderer and
drop the dead replacement.

Tighten the mtls e2e check to assert the issuer on the ConfigMap (the real
source via envFrom) instead of an inline env var that never existed, so a
propagation regression fails fast with a clear message.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erts

HandleAdapterCertificate base64-encoded the raw CSR DER into the
CertificateRequest spec.request, but cert-manager's admission webhook
requires a PEM-encoded CSR and rejected it with "error decoding
certificate request PEM block" — surfacing as "503 issue adapter
certificate" on `adapter enroll` and breaking the mTLS datapath.
PEM-encode the CSR before base64. Verified against a live mcp-runtime-ca
issuer (request now issues) and covered by a regression test asserting
spec.request decodes to a PEM CERTIFICATE REQUEST.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
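
The DER versus PEM distinction in the commit above can be checked from the shell. This is a sketch with hypothetical function names, not the runtime-api code (which is not shell): it shows the encoding cert-manager expects in `spec.request` and a regression-style check that a value decodes to a PEM CSR.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Convert a DER CSR file into a spec.request value:
# base64 of the PEM text, not base64 of the raw DER bytes.
csr_der_to_spec_request() {
  local der_file="$1"
  openssl req -inform DER -in "$der_file" -outform PEM | base64 | tr -d '\n'
}

# Does a spec.request value decode to a PEM CERTIFICATE REQUEST?
spec_request_is_pem_csr() {
  local decoded
  decoded="$(printf '%s' "$1" | base64 --decode 2>/dev/null)" || return 1
  case "$decoded" in
    '-----BEGIN CERTIFICATE REQUEST-----'*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A value produced from raw DER starts with the ASN.1 sequence bytes rather than the PEM header, so `spec_request_is_pem_csr` rejects it.
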
An mtls scenario failure (e.g. adapter cert issuance) previously left no
server-side evidence in the CI artifact bundle, forcing local repro to
find the root cause. Add an ERR-trap diagnostics collector (with set -E
so it fires inside functions and the adapter-enroll command substitution)
that dumps the MCPServer, CertificateRequests, certs/secrets, sentinel
ConfigMap, and runtime-api/platform-api/mtls-server logs into WORKDIR,
which the EXIT trap archives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
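
A minimal sketch of the ERR-trap pattern described in the commit above, assuming bash. The collector body and the output file name are placeholders; the real collector dumps cluster state with kubectl.

```shell
#!/usr/bin/env bash
# Sketch only: the collector body and file name are assumptions.

collect_diagnostics() {
  local status="$1" out="${WORKDIR:?WORKDIR must be set}/mtls-diagnostics.txt"
  printf 'command failed with status %s\n' "$status" >>"$out"
  # The real collector would append cluster state here, for example:
  #   kubectl get certificaterequests -A -o yaml >>"$out" 2>&1 || true
}

enable_failure_diagnostics() {
  # -E (errtrace) makes functions, command substitutions, and subshells
  # inherit the ERR trap; without it only top-level failures trigger it.
  set -E
  trap 'collect_diagnostics "$?"' ERR
}
```

Calling `enable_failure_diagnostics` once, after `WORKDIR` is set, is enough; the existing EXIT trap can then archive whatever the collector wrote.
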
The session-name regex used [^\\s]+, which (double-escaped) matched "not
backslash or the letter s" rather than non-whitespace. Once adapter
enroll started succeeding, its `.../session/<name> (expires ...)` output
made the capture swallow " (expire" up to the s in "expires", so the
policy wait searched for a garbled name and timed out. Match the session
token explicitly ([A-Za-z0-9._-]+).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
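
The regex bug in the commit above is easy to reproduce with bash's `=~`. This is a sketch: the function names and the sample enroll line are assumptions modelled on the `.../session/<name> (expires ...)` shape quoted in the commit message.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the sample input are assumptions.

# Buggy: inside a bracket expression the backslash is literal, so [^\\s]
# means "any character except a backslash or the letter s".
extract_session_name_buggy() {
  local re='/session/([^\\s]+)'
  [[ "$1" =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

# Fixed: match the session token characters explicitly.
extract_session_name() {
  local re='/session/([A-Za-z0-9._-]+)'
  [[ "$1" =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}
```

For a line such as `enrolled adapters/demo/session/a1b2-c3 (expires 2026-07-02T03:00:00Z)`, the buggy pattern runs on through the space and parenthesis and stops only at the `s` in `expires`, while the fixed pattern stops at the end of the token.
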
The datapath curls connected to https://127.0.0.1:<port> (SNI 127.0.0.1),
so Traefik served its default cert and did not bind the router's
client-cert TLS options to that handshake — server verification failed
(curl exit 60) and mTLS enforcement wasn't exercised. Give the mtls
server an ingressHost so the operator emits a Host()-scoped IngressRoute,
reach it via curl --resolve so the SNI selects that router (and its
client-cert options), and use -k because the caller-facing server cert is
Traefik's self-signed default in HTTP test-mode (no default TLSStore).
The scenario still verifies the client-cert mTLS property: no client cert
is rejected, the session-bound cert is accepted, spoofed headers denied.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
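
The `--resolve` approach from the commit above can be sketched as an argument builder. The host name, request path, and certificate file names are assumptions; `--resolve`, `--cert`, `--key`, and `-k` are standard curl options.

```shell
#!/usr/bin/env bash
# Sketch only: host name, path, and file names are assumptions.

# Print the curl arguments for one datapath request, one per line.
# --resolve pins host:port to 127.0.0.1 while keeping the host name in the
# URL, so curl sends it as SNI and Traefik can select the Host()-scoped router.
mtls_curl_args() {
  local host="$1" port="$2" cert="${3:-}" key="${4:-}"
  printf '%s\n' -sS -k --resolve "${host}:${port}:127.0.0.1"
  if [ -n "$cert" ] && [ -n "$key" ]; then
    printf '%s\n' --cert "$cert" --key "$key"
  fi
  printf '%s\n' "https://${host}:${port}/"
}
```

A caller might use it as `mapfile -t args < <(mtls_curl_args mtls.example.test 18443 client.crt client.key); curl "${args[@]}"`. Omitting the cert and key gives the request that is expected to be rejected.
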
…overlay

The mtls terminate-and-re-encrypt datapath is delivered through Traefik
CRDs (IngressRoute/Middleware/TLSOption/ServersTransport) and the
spiffe-identity local plugin, but the http overlay's args patch replaced
the base args and dropped --providers.kubernetescrd, the websecure TLS
default, and the spiffe-identity plugin registration. Traefik therefore
ignored every IngressRoute the operator created and the mtls host 404'd.
Re-add the CRD provider (scoped to the server namespaces), websecure
http.tls, and the spiffe-identity plugin module while keeping web
plaintext for the HTTP test flows. Verified on a live cluster: with the
CRD provider enabled Traefik loads the IngressRoute and routes the host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture the Traefik routing layer (IngressRoute/Middleware/TLSOption/
ServersTransport CRs, the Traefik deployment and controller logs), the
mcp-servers workload/pod describe/events, and a verbose curl replay of the
datapath request (status, headers, body) into the archived artifact
bundle, so a routing failure (404 vs 502 vs deny) is diagnosable from CI
instead of requiring local repro.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Widening the deadline wasn't enough: watch() captures its baseline
modtime in its own goroutine, so under CI load it could start after the
test rewrote the cert, baseline on cert-2's modtime, and never observe a
change (the reload never fired, so it hit the deadline at ~10s). Push the
cert modtime strictly forward on every poll so the watcher sees a fresh
change no matter when its goroutine started, making the reload
deterministic instead of racing the goroutine start.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Traefik v2.10 starts informers for every IngressRoute/Middleware/TLSOption/
ServersTransport/... kind in BOTH traefik.io and the legacy
traefik.containo.us group whenever those CRDs are installed. The
traefik-ingressclass ClusterRole only granted a subset of traefik.io, so
list/watch on traefik.containo.us (and traefik.io udp/tcp kinds) was
forbidden. That stalls the kubernetescrd provider, so no IngressRoute is
loaded and mtls hosts fall through to Traefik's default cert and 404.
Grant get/list/watch on the full kind set across both groups. Verified on
a live cluster: the forbidden errors clear and the provider syncs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
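
One way to confirm the RBAC grant from the commit above is to ask the API server directly. This is a sketch: the service account identity is an assumption, and the kind list reflects the Traefik v2.10 CRD set as I understand it, so adjust it to the CRDs actually installed.

```shell
#!/usr/bin/env bash
# Sketch only: the service account name and namespace are assumptions.

# Traefik v2.10 CRD resources in both the current and the legacy API group.
traefik_crd_resources() {
  local group kind
  for group in traefik.io traefik.containo.us; do
    for kind in ingressroutes ingressroutetcps ingressrouteudps \
                middlewares middlewaretcps serverstransports \
                tlsoptions tlsstores traefikservices; do
      printf '%s.%s\n' "$kind" "$group"
    done
  done
}

# Report every verb/resource pair the Traefik identity is not allowed to use.
check_traefik_rbac() {
  local sa="${1:-system:serviceaccount:traefik:traefik}" res verb failed=0
  while read -r res; do
    for verb in get list watch; do
      if ! kubectl auth can-i "$verb" "$res" --as="$sa" --all-namespaces \
          </dev/null >/dev/null 2>&1; then
        echo "forbidden: $verb $res" >&2
        failed=1
      fi
    done
  done < <(traefik_crd_resources)
  return "$failed"
}
```

Any `forbidden:` line points at a kind whose informer would fail to start and could stall the kubernetescrd provider.
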
@Agent-Hellboy Agent-Hellboy force-pushed the test/mtls_e2e_datapath branch from 505098a to ea70f47 Compare July 1, 2026 20:31
@Agent-Hellboy

Owner Author

Addressed the review comments and rebased on latest main (resolves the merge conflict):

  • spoof-check curl -fsS → now uses -sS -k and captures the body so the deny is validated.
  • secret-wait → now checks both -gateway-mtls and -mtls-ca after the timeout loop.

While getting the datapath green end-to-end on the Kind e2e, several real gaps were found and fixed (each verified on a live cluster):

  • operator checkIngressReady tracked the deleted IngressRouteTCP instead of the IngressRoute → mtls servers stuck PartiallyReady;
  • runtime-api sent the adapter CSR as DER (cert-manager needs PEM) → 503 on adapter enroll;
  • MCP_MTLS_CLUSTER_ISSUER was never propagated into the sentinel ConfigMap;
  • http ingress overlay dropped --providers.kubernetescrd + the spiffe-identity plugin;
  • Traefik ClusterRole lacked RBAC for the traefik.containo.us CRD group (+ udp/tcp kinds), stalling the CRD provider.

Also expanded the e2e failure diagnostics (Traefik CRs/logs, verbose curl replay) so future routing failures are debuggable from the artifact bundle.

Agent-Hellboy and others added 7 commits July 2, 2026 02:52
…er port

Traefik's KubernetesCRD provider resolves IngressRoute service references by
looking up the port in Service.spec.ports. The mTLS IngressRoute was passing
gateway.port (8091 — the container/targetPort) instead of servicePort (80 —
the Service's exposed port). Traefik logged "service port not found: 8091"
and skipped the route entirely, falling back to a 404 with its default cert.

The correct value is ServicePort (80). Traefik discovers pod endpoints from
the Service and connects to each pod at the resolved targetPort (8091)
automatically — the IngressRoute only needs the Service-level port.

Adds a regression assertion to TestReconcileMTLSIngressGeneratesTraefikResources
and seeds ServicePort=80 in the shared mtlsServer() fixture.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When cert-manager issues the CA bundle secrets after MCPServer creation,
Traefik receives a watch event and reloads its TLS config (TLSOption and
ServersTransport). The reload briefly closes the websecure (8443) listener,
which breaks the port-forward with "broken pipe" and leaves port 18443
refusing connections.

Add recover_traefik_tls_port_forward_if_needed() to kind.sh (mirrors
recover_traefik_port_forward_if_needed for the HTTP port) and call it in
mtls.sh before the authenticated mTLS curl, which runs ~30s after the
initial port-forward start (grant + adapter-enroll round-trips in between).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… curls

Traefik applies TLSOption and ServersTransport atomically during a config
reload triggered by cert-manager issuing the CA bundle secret. Until that
reload completes, TLSOption lacks its CAFiles so RequireAndVerifyClientCert
is not enforced: the TLS handshake succeeds without a client cert, Traefik
routes the request via plain HTTP (ServersTransport not yet loaded), and the
TLS-only gateway returns HTTP 400.

Add wait_for_mtls_traefik_ready() in kind.sh that polls until a no-cert curl
exits 35 (TLS handshake failure = certificate_required alert). This proves
TLSOption is active; because both TLSOption and ServersTransport are reloaded
in the same Traefik config snapshot, exit-35 also implies ServersTransport is
using TLS to the backend. Call it in mtls.sh after the port-forward is up and
before the reject/accept curl sequence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ty check

In Kind clusters, kubectl port-forward exits on any TCP error from the upstream
pod — including the RST Traefik sends after a TLS certificate_required alert.
The previous wait_for_mtls_traefik_ready approach probed the websecure port with
curl, which triggered this RST on every iteration, creating an infinite
port-forward restart loop that always timed out (exit 56 each time).

Replace with wait_for_mtls_traefik_stable, which polls kubectl logs until no
level=error lines mentioning the server name appear in a 6s window. This proves
TLSOption (RequireAndVerifyClientCert + CA loaded) and ServersTransport (TLS to
gateway) are both applied without touching the websecure port at all. After the
stability wait, recover the port-forward once (Traefik's initial secret-loading
retries may have broken it) before proceeding to the reject/accept curl sequence.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
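
The log-based stability wait from the commit above might look roughly like this. The Traefik namespace, the deployment name, the default timeout, and the `count_server_errors` helper are assumptions; `kubectl logs --since` is a standard flag.

```shell
#!/usr/bin/env bash
# Sketch only: namespace, deployment name, and log format are assumptions.

# Count log lines on stdin that are errors mentioning the given server.
count_server_errors() {
  local server="$1"
  grep -F 'level=error' | grep -cF -- "$server" || true
}

# wait_for_mtls_traefik_stable <server> [timeout_seconds]
# Succeeds once a 6s log window contains no error lines for the server.
wait_for_mtls_traefik_stable() {
  local server="$1" timeout="${2:-120}" logs
  local deadline=$((SECONDS + timeout))
  while [ "$SECONDS" -lt "$deadline" ]; do
    # A failed kubectl call must not be mistaken for a quiet log window.
    if logs="$(kubectl logs -n traefik deploy/traefik --since=6s 2>/dev/null)" \
        && [ "$(printf '%s\n' "$logs" | count_server_errors "$server")" -eq 0 ]; then
      return 0
    fi
    sleep 6
  done
  return 1
}
```

Because the probe only reads logs, it never opens a connection to the websecure port and so cannot tear down the port-forward.
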
Without scheme: https, Traefik defaults to plain HTTP for backends on
port 80. The ServersTransport TLS config (CA, client cert, serverName)
is only applied when Traefik connects via HTTPS, so omitting the scheme
caused Traefik to forward requests over HTTP and the gateway responded
with 400 "Client sent an HTTP request to an HTTPS server".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The gateway reloads its policy from the volume-mounted ConfigMap every 5s.
wait_for_policy_text confirms the ConfigMap API is updated, but the kubelet
may not have propagated the change to the pod's volume mount yet, so the
next gateway reload can still read the old policy. Previously the accepting
initialize curl ran immediately after wait_for_policy_text, racing the
policy reload cycle and returning 401 session_not_found.

Retry up to 15 times (30s) until the response contains "result", with
port-forward recovery and a 2s sleep between attempts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
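
The retry described in the commit above can be sketched as a small helper. The helper name is hypothetical and the probe command is left to the caller; in the scenario it would be the accepting initialize curl.

```shell
#!/usr/bin/env bash
# Sketch only: the helper name and its argument order are assumptions.

# retry_until_output_contains <needle> <attempts> <sleep_seconds> <cmd...>
# Runs cmd until its stdout contains needle, or the attempts run out.
retry_until_output_contains() {
  local needle="$1" attempts="$2" delay="$3" i out
  shift 3
  for ((i = 1; i <= attempts; i++)); do
    out="$("$@" 2>/dev/null)" || true
    case "$out" in
      *"$needle"*) printf '%s\n' "$out"; return 0 ;;
    esac
    # The scenario would also recover the port-forward here before sleeping.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

With the numbers from the commit message, the call would be `retry_until_output_contains '"result"' 15 2 <curl command>`.
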
wait_for_policy_text checks the ConfigMap via the API server, but the
kubelet propagates ConfigMap updates to volume mounts on its own sync
period. On loaded Kind CI nodes this lag exceeded 30s, so all 15 curl
retries saw a stale policy file in the gateway pod and returned 401
session_not_found.

Add wait_for_gateway_policy_file: kubectl-execs into the gateway pod and
polls /var/run/mcp-runtime/policy/policy.json (the actual file the
gateway reloads every 5s) until the session name appears, with a 180s
deadline. The curl retry loop is kept as a safety net for the gateway's
5s reload tick after the file is updated.

Diagnostics improvements:
- Dump the in-pod policy file and ConfigMap side by side so future
  failures can distinguish kubelet propagation lag from operator bugs.
- Use fully-qualified group names for Traefik CRD dumps
  (traefik.io / traefik.containo.us) to avoid ambiguous bare names that
  return empty lists when both API groups are registered.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
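
A sketch of the in-pod policy file wait described in the commit above. The `poll_until` helper, the argument order, and the way the pod name is obtained are assumptions; the file path is the one quoted in the commit message.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and pod selection are assumptions.

# poll_until <timeout_seconds> <interval_seconds> <cmd...>
# Re-runs cmd until it succeeds or the wall-clock deadline passes.
poll_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$((SECONDS + timeout))
  while :; do
    if "$@"; then return 0; fi
    if [ "$SECONDS" -ge "$deadline" ]; then return 1; fi
    sleep "$interval"
  done
}

# True when the policy file inside the gateway pod mentions the session.
gateway_policy_file_has() {
  local namespace="$1" pod="$2" session="$3"
  kubectl exec -n "$namespace" "$pod" -- \
    cat /var/run/mcp-runtime/policy/policy.json 2>/dev/null \
    | grep -qF -- "$session"
}
```

With the 180s deadline from the commit message, a call could look like `poll_until 180 2 gateway_policy_file_has "$namespace" "$pod" "$session_name"`. Reading the mounted file rather than the ConfigMap is what separates kubelet propagation lag from an operator bug.
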