test: add kwok test and pprof endpoint for diagnosing memory issues #501

Open
jmclong wants to merge 2 commits into main from dev/jlong/bump-csi-provisioner-mem

@jmclong jmclong commented Jun 4, 2026

Contributor

This pull request introduces support for running end-to-end scale tests using a kwok-tuned kind cluster, alongside several improvements to observability and test infrastructure. The most significant changes are the addition of the Kwok cluster type and associated Makefile targets, enhancements to pprof profiling configuration for both the driver and manager, and improvements to test job configuration and cluster management.

Kwok Cluster Support and Test Infrastructure:

  • Added a new KwokCluster class to .github/workflows/scripts/run_tests.py to manage kwok-tuned kind clusters for scale testing, including setup and cleanup logic. The cluster is created with make kwok-bootstrap and is designed for high-scale fake node scenarios.
  • Updated the test matrix and job definitions in .github/workflows/test-e2e-pr.yml to add a "Kwok Scale" test job, which uses the new kwok cluster type and appropriate Makefile targets and environment variables. This ensures scale tests run in CI using the kwok-tuned cluster. [1] [2]
  • Extended the test runner and argument parsing to support the kwok cluster type, allowing selection via the --cluster-type flag. [1] [2]

Makefile Enhancements for Kwok and Cluster Management:

  • Added new Makefile targets for kwok cluster management: kwok-cluster, kwok-bootstrap, kwok, and kwok-uninstall, as well as improved cluster creation logic for kind and kwok clusters. These targets streamline setup, teardown, and installation of kwok and its dependencies. [1] [2]
  • Improved Makefile logic for loading Docker images, deploying with Helm, and managing the PATH for local binaries, ensuring compatibility with kwok and kind clusters. [1] [2] [3]

Observability Improvements:

  • Added support for configuring pprof profiling for both the driver and manager via new Helm values (observability.driver.pprof and observability.manager.pprof, each with enabled and port; disabled by default, port 6060) and a new command-line flag on both binaries (--pprof-bind-address). When enabled, the deployment templates and Go binaries expose pprof endpoints under /debug/pprof/, aiding performance diagnostics. [1] [2] [3] [4] [5] [6] [7] [8] [9]

Other Improvements and Fixes:

  • Updated test and cluster setup scripts to ensure environment consistency, such as setting the PATH for local binaries and improving comments for clarity. [1] [2]
  • Adjusted Helm deployment defaults to use IfNotPresent pull policy for images, optimizing for local development and CI runs.

Together, these changes let CI run scale tests against a kwok-tuned kind cluster, expose pprof endpoints for diagnosing memory issues such as OOMKills, and streamline cluster management for both development and automated testing.

@jmclong jmclong force-pushed the dev/jlong/bump-csi-provisioner-mem branch 10 times, most recently from 9e27808 to fb87a54 Compare June 5, 2026 19:28
@jmclong jmclong changed the title Dev/jlong/bump csi provisioner mem test: add kwok test and pprof endpoint for diagnosing memory issues Jun 8, 2026
@jmclong jmclong marked this pull request as ready for review June 8, 2026 12:53
@jmclong jmclong requested review from a team, croomes and landreasyan as code owners June 8, 2026 12:53
Comment thread internal/csi/controller/controller.go Fixed
jmclong and others added 2 commits June 8, 2026 17:30
Add --pprof-bind-address flag to both binaries, wired into
controller-runtime's PprofBindAddress option. Empty value (default)
disables pprof; setting an address (e.g. :6060) exposes /debug/pprof/*
for heap, goroutine, and CPU profiling.

Helm chart exposes observability.{driver,manager}.pprof.{enabled,port}
(disabled by default, port 6060). Useful for debugging memory issues
such as OOMKill investigations during scale and kwok tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jmclong jmclong force-pushed the dev/jlong/bump-csi-provisioner-mem branch from c6a561c to 27ca839 Compare June 8, 2026 17:30

2 participants