feat: add nodepool_provisioning_duration metric#40

Open
Asrarfarooq wants to merge 2 commits into
nstogner:mainfrom
Asrarfarooq:nodepool_provisioning_duration
@Asrarfarooq Asrarfarooq commented May 4, 2026

Contributor

Title

feat: add nodepool_provisioning_duration metric

Description

This PR introduces a new metric, nodepool_provisioning_duration, to track and report the time taken for GKE NodePools to become ready after creation. This helps in monitoring provisioning latency and identifying slow scaling events.

Results

We verified the new metric on a live cluster (megamon-test-cluster) by creating test TPU node pools and observing the output.

Raw Metrics Output

Curling the metrics endpoint (:8080/metrics) confirmed that the metric is emitted correctly with the expected labels and values:

# HELP megamon_alpha_nodepool_provisioning_duration_seconds Time spent provisioning.
# TYPE megamon_alpha_nodepool_provisioning_duration_seconds gauge
megamon_alpha_nodepool_provisioning_duration_seconds{nodepool_name="tpu-test-pool-1775690156",otel_scope_name="megamon",otel_scope_version="",provisioning_state="success",tpu_accelerator="tpu-v4-podslice",tpu_topology="2x2x1"} 215.009644856
megamon_alpha_nodepool_provisioning_duration_seconds{nodepool_name="tpu-validation-pool-1763155065",otel_scope_name="megamon",otel_scope_version="",provisioning_state="success",tpu_accelerator="tpu-v4-podslice",tpu_topology="2x2x1"} 190.008424911
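A scraped sample line like the ones above can also be checked programmatically. The following is a minimal sketch using only the Go standard library; `sampleValue` and `labelValue` are hypothetical helpers written for illustration, not part of megamon. They assume a single sample line with no trailing timestamp and no escaped quotes in label values. A real check would more likely use a proper exposition-format parser.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleValue returns the numeric value at the end of one sample line.
// Assumption: the line has no trailing timestamp, so the value is the
// text after the last space.
func sampleValue(line string) (float64, error) {
	i := strings.LastIndex(line, " ")
	if i < 0 {
		return 0, fmt.Errorf("no value found in %q", line)
	}
	return strconv.ParseFloat(line[i+1:], 64)
}

// labelValue naively extracts the value of one label.
// Assumption: label values contain no escaped quotes. The label must be
// preceded by "{" or "," so that "name" does not match "nodepool_name".
func labelValue(line, label string) (string, bool) {
	key := label + `="`
	for _, prefix := range []string{"{", ","} {
		i := strings.Index(line, prefix+key)
		if i < 0 {
			continue
		}
		start := i + len(prefix) + len(key)
		end := strings.Index(line[start:], `"`)
		if end < 0 {
			return "", false
		}
		return line[start : start+end], true
	}
	return "", false
}

func main() {
	// Sample line copied from the metrics output above.
	line := `megamon_alpha_nodepool_provisioning_duration_seconds{nodepool_name="tpu-test-pool-1775690156",otel_scope_name="megamon",otel_scope_version="",provisioning_state="success",tpu_accelerator="tpu-v4-podslice",tpu_topology="2x2x1"} 215.009644856`

	v, err := sampleValue(line)
	if err != nil {
		panic(err)
	}
	state, _ := labelValue(line, "provisioning_state")
	fmt.Printf("state=%s seconds=%.3f\n", state, v)
}
```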

Controller Logs

The controller logs show the internally calculated provisioningDuration (in nanoseconds), which matches the exported metric value of 190.008424911 seconds for the second node pool:

{
  "type": "nodepools",
  "summary": {
    "downTimeProvisioned": 190008424911,
    "provisioningDuration": 190008424911,
    "provisioningState": "success"
  }
}
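The correspondence between the logged nanosecond value and the exported seconds value can be reproduced with the standard `time` package. This is a minimal sketch: `nanosToSeconds` is a hypothetical helper named for illustration, and it assumes the log field is a `time.Duration` serialized as integer nanoseconds.

```go
package main

import (
	"fmt"
	"time"
)

// nanosToSeconds converts a duration logged as integer nanoseconds into
// the floating-point seconds value that a *_seconds metric reports.
func nanosToSeconds(ns int64) float64 {
	return time.Duration(ns).Seconds()
}

func main() {
	// Value copied from the "provisioningDuration" field in the log above.
	const logged int64 = 190008424911
	fmt.Printf("%.9f\n", nanosToSeconds(logged)) // prints 190.008424911
}
```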

These results confirm that:

  • The metric is correctly registered and emitted.
  • It correctly tracks successful provisioning durations (e.g., ~190s and ~215s).
  • It attaches the correct labels for topology and accelerator type.

Key Changes

  • internal/records/events.go: Added logic to calculate ProvisioningDuration and determine ProvisioningState ("provisioning", "success", or "failed").
  • internal/metrics/metrics.go: Registered and implemented emission of the new nodepool_provisioning_duration metric (exported as megamon_alpha_nodepool_provisioning_duration_seconds).
  • internal/records/events_test.go: Added test cases in TestSummarize to verify correct calculation of provisioning duration and state.
  • test/integration/cases_test.go: Updated integration tests to verify metric emission in simulated scenarios.

Context

This branch has been rebased on top of the latest main (which includes the consolidated workload reconciler changes from PR #39) so that the diff contains only this feature. The conflict in events_test.go was resolved by combining the assertions from both branches.


Also updated internal/records/events_test.go so that older test cases expect ProvisioningDuration: 1h and ProvisioningState: "success". These cases already produce those values, but their expected entries in the test table previously left both fields at their zero values.