Skip gRPC DNS lookup when APISERVER_IP env var is set#180

Closed
djsly wants to merge 1 commit into main from djsly/apiserver-ip-env-var

@djsly djsly commented Jun 7, 2026

Contributor

Summary

Skip gRPC's built-in dns:/// resolver when a pre-resolved API server IP is provided via the APISERVER_IP environment variable. This eliminates infinite STLS bootstrap retries when DNS is broken on the node at provisioning time.

Why

When DNS is unhealthy at bootstrap time (CoreDNS not ready, systemd-resolved race, custom DNS misconfig), the STLS client logs

rpc error: code = DeadlineExceeded ... name resolver error: produced zero addresses

and retries forever. Reference incident: VMId b409a057-44a0-4b06-a6de-2e24e59a90a5 — 53+ hours of retries on Jun 5–6.

The node already knows the apiserver address at provisioning time (private clusters publish it via IMDS tag aksAPIServerIPAddress, public clusters can resolve it from getent at CSE time). We don't need to ask gRPC to re-resolve it on every retry.

What

  • New Config.APIServerIP field (json: apiServerIp).
  • applyDefaults() falls back to os.Getenv("APISERVER_IP") when the JSON field is empty. Invalid IP literals are silently cleared so that a bad value never fails bootstrap; the caller then falls back to the FQDN dial.
  • getServiceClient() delegates target selection to a new getDialParams() helper:
    • IP set → dial passthrough:///<ip>:443, set tls.Config.ServerName = APIServerFQDN (so SAN validation still runs against the FQDN), and pass grpc.WithAuthority(<fqdn>:443) so the :authority header is unchanged.
    • IP empty → historical <fqdn>:443 dial preserved verbatim.
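
The target selection described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: the dialParams struct, the getDialParams signature, and the example hostname are assumptions made for the sketch. The real helper would hand Authority to grpc.WithAuthority and the TLS config to the transport credentials.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
)

// dialParams is a hypothetical holder for what the dial step needs.
type dialParams struct {
	Target    string      // string handed to the gRPC dialer
	Authority string      // :authority override; empty means no extra dial option
	TLS       *tls.Config // ServerName always carries the FQDN
}

// getDialParams mirrors the selection logic described in the PR text.
// The signature is assumed for illustration.
func getDialParams(fqdn, ip string) dialParams {
	fqdnHostPort := net.JoinHostPort(fqdn, "443")
	p := dialParams{
		Target: fqdnHostPort,
		TLS:    &tls.Config{ServerName: fqdn, MinVersion: tls.VersionTLS12},
	}
	if ip == "" {
		// No pre-resolved IP: keep the historical FQDN dial.
		return p
	}
	// passthrough:/// hands the address to the dialer as-is, so no DNS
	// lookup happens in gRPC. JoinHostPort adds brackets for IPv6 literals.
	p.Target = "passthrough:///" + net.JoinHostPort(ip, "443")
	p.Authority = fqdnHostPort
	return p
}

func main() {
	for _, ip := range []string{"", "10.0.0.1", "fd00::1"} {
		p := getDialParams("apiserver.example.com", ip)
		fmt.Printf("ip=%q target=%s authority=%q sni=%s\n",
			ip, p.Target, p.Authority, p.TLS.ServerName)
	}
}
```

Keeping ServerName and the authority pinned to the FQDN means certificate validation and request routing should behave the same on both paths; only the address the socket connects to changes.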

Backward compatibility matrix

| AgentBaker CSE | STLS client | Behavior |
| --- | --- | --- |
| Old (no env var) | Old (no reader) | Status quo: DNS dial on every retry. |
| Old (no env var) | New (this PR) | APISERVER_IP unset → FQDN dial. Identical to old. |
| New (writes env var) | Old (no reader) | Old binary ignores unknown env vars. FQDN dial. Identical to old. |
| New (writes env var) | New (this PR) | Fix activates: IP dial, no DNS at gRPC time. |

All four cells are safe across the 6-month VHD support window, in both directions.

Tests

  • TestApplyDefaultsAPIServerIP — env-var pickup, JSON precedence, IPv6 accepted, invalid IP silently cleared (env and JSON paths), empty leaves empty, brackets rejected.
  • TestGetDialParams — FQDN-only path produces <fqdn>:443 + zero extra dial options; IPv4 produces passthrough:///<ip>:443 + 1 extra option (WithAuthority); IPv6 produces correct [<ip>]:443 bracketing.
  • TestToZapFields updated to include apiServerIp.
  • Full go test ./... and go vet ./... clean.
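
The defaulting and validation behavior these tests exercise might look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: the resolveAPIServerIP helper is hypothetical, the Config struct is reduced to the one field named in this PR, and net.ParseIP is assumed as the validator (it accepts IPv4 and IPv6 literals and rejects bracketed forms such as [::1]).

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// Config is trimmed to the single field relevant here.
type Config struct {
	APIServerIP string `json:"apiServerIp"`
}

// resolveAPIServerIP is a hypothetical pure helper: the JSON value wins,
// the env value is used only when the JSON value is empty, and anything
// that is not a valid IP literal is cleared to "".
func resolveAPIServerIP(jsonValue, envValue string) string {
	candidate := jsonValue
	if candidate == "" {
		candidate = envValue
	}
	if net.ParseIP(candidate) == nil {
		// Empty, malformed, or bracketed input: clear silently so the
		// caller falls back to the FQDN dial instead of failing.
		return ""
	}
	return candidate
}

// applyDefaults wires the helper to the APISERVER_IP environment variable.
func (c *Config) applyDefaults() {
	c.APIServerIP = resolveAPIServerIP(c.APIServerIP, os.Getenv("APISERVER_IP"))
}

func main() {
	os.Setenv("APISERVER_IP", "fd00::1")
	c := &Config{}
	c.applyDefaults()
	fmt.Printf("resolved APIServerIP=%q\n", c.APIServerIP)
}
```

Splitting the decision into a pure function keeps the precedence and validation rules testable without touching the process environment.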

Companion change

The companion AgentBaker PR adds APISERVER_IP=<ip> to /etc/default/secure-tls-bootstrap from configureAndStartSecureTLSBootstrapping. IP source order: IMDS tag aksAPIServerIPAddress (private clusters) → getent ahostsv4 / getent ahostsv6 → empty (fall back to FQDN dial). A link will be added once that PR is opened.

Tracking

ADO repair item: AB#38327357 [STLS] Use node-local apiserver endpoint env var instead of DNS-resolving kube-apiserver-proxy
Parent feature: AB#34681743 (Cameron Meissner — STLS Phase 1)

🤖 Generated by GitHub Copilot

When STLS bootstrap runs on a node with a broken DNS path (CoreDNS not
ready, systemd-resolved race, custom DNS misconfig), the gRPC built-in
dns:/// resolver re-resolves the apiserver FQDN on every retry and the
client loops forever emitting 'name resolver error: produced zero
addresses' (reference incident VMId b409a057-44a0-4b06-a6de-2e24e59a90a5,
53+ hours of retries on Jun 5-6).

This change adds a pre-resolved IP escape hatch:

  * New Config field APIServerIP (json: apiServerIp).
  * applyDefaults() falls back to the APISERVER_IP environment variable
    when the json field is empty. Invalid IP literals are cleared
    silently so a bad value never fails bootstrap.
  * getServiceClient() now delegates target selection to a new
    getDialParams() helper. When APIServerIP is set we dial via
    'passthrough:///<ip>:443', set tls.Config.ServerName to the FQDN so
    SAN validation continues to match, and pass WithAuthority(<fqdn>:443)
    so the :authority header is unchanged from the FQDN path.
  * When APIServerIP is empty the historical FQDN-only dial is preserved
    verbatim — full backward compatibility with old AgentBaker CSE that
    does not yet set the env var.

Unit tests cover the new env-var pickup, json-precedence, IPv6
acceptance, and silent-clear-on-invalid behavior, plus assertions on the
getDialParams output for FQDN-only, IPv4, and IPv6 inputs.

Companion AgentBaker change (cse_config.sh) writes APISERVER_IP into
/etc/default/secure-tls-bootstrap from IMDS tags / getent.

Tracking: AB#38327357 (parent feature AB#34681743)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot commented Jun 7, 2026


PR Title Lint Failed ❌

Current Title: Skip gRPC DNS lookup when APISERVER_IP env var is set

Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:

Conventional Commits Format:

  • feat: add new feature - for new features
  • fix: resolve bug in component - for bug fixes
  • docs: update README - for documentation changes
  • refactor: improve code structure - for refactoring
  • test: add unit tests - for test additions
  • chore: remove dead code - for maintenance tasks
  • chore(deps): update dependencies - for updating dependencies
  • ci: update build pipeline - for CI/CD changes

Guidelines:

  • Use lowercase for the type and description
  • Keep the description concise but descriptive
  • Use imperative mood (e.g., "add" not "adds" or "added")
  • Don't end with a period

Examples:

  • ✅ feat(windows): add secure TLS bootstrapping for Windows nodes
  • ✅ fix: resolve kubelet certificate rotation issue
  • ✅ docs: update installation guide
  • ❌ Added new feature
  • ❌ Fix bug.
  • ❌ Update docs

Please update your PR title and the lint check will run again automatically.

djsly commented Jun 10, 2026

Contributor Author

Closing — fix not needed. We've decided not to pursue this approach after further discussion.

@djsly djsly closed this Jun 10, 2026
@djsly djsly deleted the djsly/apiserver-ip-env-var branch June 10, 2026 00:01