Skip gRPC DNS lookup when APISERVER_IP env var is set #180
Closed
djsly wants to merge 1 commit into
When STLS bootstrap runs on a node with a broken DNS path (CoreDNS not
ready, systemd-resolved race, custom DNS misconfig), the gRPC built-in
dns:/// resolver re-resolves the apiserver FQDN on every retry and the
client loops forever emitting 'name resolver error: produced zero
addresses' (reference incident VMId b409a057-44a0-4b06-a6de-2e24e59a90a5,
53+ hours of retries on Jun 5-6).
This change adds a pre-resolved IP escape hatch:
* New Config field APIServerIP (json: apiServerIp).
* applyDefaults() falls back to the APISERVER_IP environment variable
when the json field is empty. Invalid IP literals are cleared
silently so a bad value never fails bootstrap.
* getServiceClient() now delegates target selection to a new
getDialParams() helper. When APIServerIP is set we dial via
'passthrough:///<ip>:443', set tls.Config.ServerName to the FQDN so
SAN validation continues to match, and pass WithAuthority(<fqdn>:443)
so the :authority header is unchanged from the FQDN path.
* When APIServerIP is empty the historical FQDN-only dial is preserved
verbatim — full backward compatibility with old AgentBaker CSE that
does not yet set the env var.
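The defaulting rule in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the trimmed-down Config struct and the getenv parameter (injected here so the behavior is easy to exercise) are assumptions; the description only states that applyDefaults() reads APISERVER_IP and clears invalid literals.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// Config is a hypothetical, trimmed-down stand-in for the STLS client config.
type Config struct {
	APIServerFQDN string
	APIServerIP   string `json:"apiServerIp"`
}

// applyDefaults fills APIServerIP from the APISERVER_IP environment variable
// when the JSON field is empty, then clears any value that is not a valid IP
// literal. The getenv parameter is an assumption made for this sketch.
func (c *Config) applyDefaults(getenv func(string) string) {
	if c.APIServerIP == "" {
		c.APIServerIP = getenv("APISERVER_IP")
	}
	// net.ParseIP returns nil for anything that is not a bare IPv4/IPv6
	// literal, including bracketed forms such as "[::1]".
	if c.APIServerIP != "" && net.ParseIP(c.APIServerIP) == nil {
		c.APIServerIP = "" // cleared silently; the caller falls back to the FQDN dial
	}
}

func main() {
	cfg := &Config{APIServerFQDN: "apiserver.example.invalid"}
	cfg.applyDefaults(os.Getenv)
	fmt.Printf("APIServerIP=%q\n", cfg.APIServerIP)
}
```

Because an invalid value is cleared rather than returned as an error, a bad APISERVER_IP leaves the client on the FQDN dial path instead of failing bootstrap.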
Unit tests cover the new env-var pickup, json-precedence, IPv6
acceptance, and silent-clear-on-invalid behavior, plus assertions on the
getDialParams output for FQDN-only, IPv4, and IPv6 inputs.
Companion AgentBaker change (cse_config.sh) writes APISERVER_IP into
/etc/default/secure-tls-bootstrap from IMDS tags / getent.
Tracking: AB#38327357 (parent feature AB#34681743)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR Title Lint Failed ❌ Your PR title doesn't follow the expected format. Please update your PR title to follow the Conventional Commits format; the lint check will run again automatically.
Closing: fix not needed. We've decided not to pursue this approach after further discussion.
Summary
Skip gRPC's built-in dns:/// resolver when a pre-resolved API server IP is provided via the APISERVER_IP environment variable. This eliminates infinite STLS bootstrap retries when DNS is broken on the node at provisioning time.
Why
When DNS is unhealthy at bootstrap time (CoreDNS not ready, systemd-resolved race, custom DNS misconfig), the STLS client logs 'name resolver error: produced zero addresses' and retries forever. Reference incident: VMId b409a057-44a0-4b06-a6de-2e24e59a90a5, 53+ hours of retries on Jun 5–6.
The node already knows the apiserver address at provisioning time (private clusters publish it via IMDS tag aksAPIServerIPAddress, public clusters can resolve it from getent at CSE time). We don't need to ask gRPC to re-resolve it on every retry.
What
* New Config.APIServerIP field (json: apiServerIp).
* applyDefaults() falls back to os.Getenv("APISERVER_IP") when the JSON field is empty. Invalid IP literals are cleared silently so a bad value never fails bootstrap; the caller transparently falls back to the FQDN dial.
* getServiceClient() delegates target selection to a new getDialParams() helper:
  * APIServerIP set: dial passthrough:///<ip>:443, set tls.Config.ServerName = APIServerFQDN (SAN validation still hits the FQDN), and pass grpc.WithAuthority(<fqdn>:443) so the :authority header is unchanged.
  * APIServerIP empty: historical <fqdn>:443 dial preserved verbatim.
Backward compatibility matrix
APISERVER_IP unset → FQDN dial. Identical to old.
All four cells safe across the 6-month VHD support window in both directions.
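The target selection described under What can be sketched as below. This is a hedged illustration using only the standard library: the dialParams struct and the two-argument getDialParams signature are assumptions made for the sketch. In the real client, ServerName would presumably be assigned to tls.Config.ServerName and Authority passed to grpc.WithAuthority.

```go
package main

import (
	"fmt"
	"net"
)

// dialParams is a hypothetical return type for this sketch.
type dialParams struct {
	Target     string // string handed to the gRPC dial call
	ServerName string // TLS server name; set only on the IP path
	Authority  string // :authority override; set only on the IP path
}

// getDialParams picks the dial target. With no IP it returns the historical
// <fqdn>:443 target and no overrides. With an IP it uses the passthrough
// scheme so gRPC does no DNS lookup, and keeps the FQDN for TLS validation
// and the :authority header.
func getDialParams(fqdn, ip string) dialParams {
	fqdnAddr := net.JoinHostPort(fqdn, "443")
	if ip == "" {
		return dialParams{Target: fqdnAddr}
	}
	return dialParams{
		// net.JoinHostPort adds the brackets an IPv6 literal needs.
		Target:     "passthrough:///" + net.JoinHostPort(ip, "443"),
		ServerName: fqdn,
		Authority:  fqdnAddr,
	}
}

func main() {
	const fqdn = "apiserver.example.invalid" // placeholder hostname
	for _, ip := range []string{"", "10.0.0.4", "2001:db8::1"} {
		fmt.Printf("ip=%q -> %+v\n", ip, getDialParams(fqdn, ip))
	}
}
```

Keeping the FQDN in both ServerName and Authority is what lets the IP path present the same identity to the server as the FQDN path; only the address that gets dialed changes.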
Tests
* TestApplyDefaultsAPIServerIP: env-var pickup, JSON precedence, IPv6 accepted, invalid IP silently cleared (env and JSON paths), empty leaves empty, brackets rejected.
* TestGetDialParams: FQDN-only path produces <fqdn>:443 + zero extra dial options; IPv4 produces passthrough:///<ip>:443 + 1 extra option (WithAuthority); IPv6 produces correct [<ip>]:443 bracketing.
* TestToZapFields updated to include apiServerIp.
* go test ./... and go vet ./... clean.
Companion change
AgentBaker PR adds APISERVER_IP=<ip> to /etc/default/secure-tls-bootstrap from configureAndStartSecureTLSBootstrapping. IP source order: IMDS tag aksAPIServerIPAddress (private clusters) → getent ahostsv4 → getent ahostsv6 → empty (fall back to FQDN dial). Link will be added once that PR is opened.
Tracking
ADO repair item: AB#38327357, [STLS] Use node-local apiserver endpoint env var instead of DNS-resolving kube-apiserver-proxy
Parent feature: AB#34681743 (Cameron Meissner, STLS Phase 1)
🤖 Generated by GitHub Copilot