OCPBUGS-64688: use ObservedGeneration to determine Progressing status#1169
OCPBUGS-64688: use ObservedGeneration to determine Progressing status#1169sg00dwin wants to merge 3 commits into
Conversation
Assisted-by: Claude Code (Opus 4.6)
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-64688, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThe change modifies SyncLoopRefresh to treat a Deployment as "Progressing" when its Status.ObservedGeneration is less than Generation by calling a new helper; tests were added to validate behavior across generation/observedGeneration combinations. ChangesDeployment Generation Observation Check
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: sg00dwin The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/jira refresh |
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-64688, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest-required |
|
/cc @Leo6Leo Spinning cluster to do QE currently |
|
Manual Verification Deployed the PR build to a live OCP 4.22 cluster (3 control-plane nodes, Azure). Node drain (should stay Progressing=False): Drained a node hosting a console pod. Pod was evicted and rescheduled. Progressing stayed False throughout — lastTransitionTime never changed. Spec change (should trigger Progressing=True): Changed logLevel to TraceAll, bumping deployment generation 3→4. Progressing correctly flipped to True ("deployment generation 4 not yet observed (current: 3)"), then resolved to False 2s later. Both scenarios behave as expected. Currently just waiting for the CI to past before I give the verify label. |
| // counts — they fluctuate during external disruptions (node reboots) | ||
| // that the operator did not cause. See https://issues.redhat.com/browse/OCPBUGS-64688 | ||
| if actualDeployment.Status.ObservedGeneration < actualDeployment.Generation { | ||
| return fmt.Errorf("deployment generation %d not yet observed (current: %d)", |
There was a problem hiding this comment.
The parenthetical says "current" but passes ObservedGeneration. Consider changing the wording to make it more accurate:
| return fmt.Errorf("deployment generation %d not yet observed (current: %d)", | |
| return fmt.Errorf("deployment generation %d not yet observed (observed: %d)", |
There was a problem hiding this comment.
@sg00dwin Could you add some unit tests for the new behavior?
The old IsAvailableAndUpdated at least had its own unit tests in deployment_test.go. The new inline comparison ObservedGeneration < Generation has no test coverage. A table-driven test with cases like: - ObservedGeneration == Generation → no error (not progressing) - ObservedGeneration < Generation → returns error (progressing) - ObservedGeneration > Generation → no error (edge case, already processed)
would help prevent regressions.
There was a problem hiding this comment.
I'd also be considering adding an e2e to one of the upgrade jobs? If we can write one that isn't a giant pain.
There was a problem hiding this comment.
Unit tests: Added table-driven unit tests for the ObservedGeneration-based Progressing check covering: equal, less than, greater than, both zero, and initial rollout scenarios.
Regarding e2e test: Good idea. There's already a story tracking this under the same epic: CONSOLE-5188 (e2e testing automation). It's currently unassigned and empty description. Added this suggestion for adding a console-owned e2e assertion to one of the upgrade jobs.
- Fix error message wording: "current" → "observed" to accurately describe the ObservedGeneration value being reported - Add table-driven unit tests for the ObservedGeneration-based Progressing check covering: equal, less than, greater than, both zero, and initial rollout scenarios Assisted by: Claude (Opus 4.6)
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-64688, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/console/operator/sync_v400_test.go (1)
243-247: 💤 Low valueConsider whether this test case is necessary.
The case where
ObservedGeneration > Generationshould never occur in a real Kubernetes deployment—the controller sets ObservedGeneration to Generation when it processes a spec change, maintaining the invariant ObservedGeneration ≤ Generation. Testing this scenario adds coverage for an impossible state.If this is intentional defensive coding against malformed data, consider adding a comment explaining the rationale. Otherwise, this test case could be removed to keep the test focused on realistic scenarios.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/console/operator/sync_v400_test.go` around lines 243 - 247, The test case named "ObservedGeneration greater than Generation: not progressing" covers an impossible Kubernetes state (ObservedGeneration > Generation); either remove this test to keep tests realistic or, if you intentionally want defensive coverage, add a short comment above the test case explaining that this is defensive testing against malformed/external data and why it's kept (referencing the test case name and the table entry with generation: 2 and observedGeneration: 3 in sync_v400_test.go) so future readers understand the rationale.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/console/operator/sync_v400_test.go`:
- Around line 274-279: The test duplicates the ObservedGeneration check instead
of exercising the production logic; update tests to call the real check by
either (A) invoking the sync function in sync_v400.go that performs the
ObservedGeneration validation (so the test fails if that logic or error message
changes), or (B) extract the ObservedGeneration validation into a shared helper
(e.g., a new function like validateObservedGeneration or checkObservedGeneration
that takes the deployment and returns the same error) and use that helper from
both the production code in sync_v400.go and from this test; ensure the test
asserts the helper/sync function's returned error message rather than
reimplementing the fmt.Errorf logic and keep references to deployment.Generation
and deployment.Status.ObservedGeneration.
- Around line 3-12: Reorder the import block so it follows the standard groups:
keep standard library imports (fmt, testing) first, third-party imports
(github.com/go-test/deep) next, then Kubernetes/OpenShift imports (appsv1, v1,
metav1), and place the internal import
github.com/openshift/console-operator/pkg/api after the kube imports; update the
import grouping comments if present and ensure the import order matches the
documented convention referenced in the review.
---
Nitpick comments:
In `@pkg/console/operator/sync_v400_test.go`:
- Around line 243-247: The test case named "ObservedGeneration greater than
Generation: not progressing" covers an impossible Kubernetes state
(ObservedGeneration > Generation); either remove this test to keep tests
realistic or, if you intentionally want defensive coverage, add a short comment
above the test case explaining that this is defensive testing against
malformed/external data and why it's kept (referencing the test case name and
the table entry with generation: 2 and observedGeneration: 3 in
sync_v400_test.go) so future readers understand the rationale.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 17d399a0-d433-4f7a-879b-ec40a158724a
📒 Files selected for processing (2)
pkg/console/operator/sync_v400.gopkg/console/operator/sync_v400_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/console/operator/sync_v400.go
📜 Review details
🧰 Additional context used
📓 Path-based instructions (7)
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Follow Go coding standards and patterns documented in CONVENTIONS.md
Organize imports according to conventions documented in CONVENTIONS.md
Usegofmtto format Go code with standard formatting
Rungo vetchecks on all Go packagesFollow Go coding standards and patterns as documented in CONVENTIONS.md, including proper import organization
Organize Go code following the repository structure: main entry point in
cmd/console/main.go, API constants inpkg/api/, operator command setup inpkg/cmd/operator/, and version command inpkg/cmd/version/
**/*.go: Usegofmtfor formatting Go code
Follow standard Go naming conventions
Group imports in order: standard lib, 3rd party, kube/openshift, internal (marked with comments)
Use meaningful error messages with context in Go code
Set status conditions usingstatus.Handle*functions with type prefixes (*Degraded, *Progressing, *Available, *Upgradeable)
Use typed errors and wrap errors to preserve stack contextOpenShift Tests Extension (OTE) binaries communicate with openshift-tests via JSON on stdout. Flag non-JSON stdout from the main binary process in process-level code: main(), init(), TestMain(), BeforeSuite(), AfterSuite(), SynchronizedBeforeSuite(), RunSpecs() setup, or top-level var/const initializers. Common violations include klog writing to stdout (must redirect to stderr), fmt.Print/Println/Printf in main or suite setup, and Ginkgo v2 suite configuration emitting warnings to stdout
Files:
pkg/console/operator/sync_v400_test.go
⚙️ CodeRabbit configuration file
**/*.go: Review Go code following OpenShift operator patterns.
See CONVENTIONS.md for coding standards and patterns.Refer to the following skills based on CODE PATTERNS, not just file paths:
Refer to /controller-review when code contains:
- Controller struct types (e.g.,
type *Controller struct)func New*Controller(factory functionsfactory.New().WithFilteredEventsInformers(pattern.ToController(method callsSync(ctx context.Context, controllerContext factory.SyncContext)methodsoperatorConfig.Spec.ManagementStatechecksstatus.NewStatusHandlerorstatus.Handle*functionsRefer to /sync-handler-review when code contains:
- Main operator sync functions (e.g.,
sync_v400.gocontent)- Sequential resource syncing with early returns
- Incremental reconciliation loops
- Multiple
resourceapply.Apply*()calls in sequence- Dependency ordering of ConfigMaps → Secrets → Service Accounts → RBAC → Services → Deployments → Routes
- Feature gate conditional logic
Refer to /go-quality-review for all Go code to check:
- Deprecated imports:
ioutil.ReadFile,ioutil.WriteFile,ioutil.ReadAll- Deprecated patterns:
DialwithoutDialContext- Error handling: missing
%win fmt.Errorf- Code smells: deep nesting (4+ levels), functions >100 lines
- Magic values: unexplained numbers/strings
- Context propagation:
context.Background()instead of passed ctx- Missing godoc on exported functions
Files:
pkg/console/operator/sync_v400_test.go
**/*_test.go
📄 CodeRabbit inference engine (AGENTS.md)
Follow testing patterns and commands documented in TESTING.md
Follow testing patterns and commands as documented in TESTING.md, including running unit tests with 'make test-unit' and checks with 'make check'
**/*_test.go: Use table-driven tests for comprehensive coverage
Usehttptestfor HTTP handler testing in Go
Include proper cleanup functions in tests
Test both success and failure paths
Files:
pkg/console/operator/sync_v400_test.go
⚙️ CodeRabbit configuration file
**/*_test.go: Review test code for quality and patterns.Refer to /unit-test-review when test is in pkg//*_test.go:**
- Table-driven test structure with test cases
- Use of
go-test/deepfor struct comparisons- Test naming conventions (TestFunctionName)
- Error handling with
wantErrpattern- Edge case coverage (nil, empty, boundary values)
- Proper assertions with helpful error messages
- Test isolation (no shared mutable state)
Refer to /e2e-test-review when test contains:
framework.MustNewClientset(t, nil)or similar e2e framework usagewait.Pollorwait.PollImmediatepatternsretry.RetryOnConflictfor updates- Cleanup via
deferfunctions- Console/operator CR manipulations
- Test assertions on cluster state
Suggest to use /e2e-test-review when:
- PR adds new feature requiring e2e coverage
- Test file is empty or skeleton
- Comments indicate "TODO: add test"
Review for common issues:
- Missing cleanup (defer statements)
- Using
time.Sleepinstead ofwait.Poll- Missing context timeouts
- Vague error messages in assertions
- Tests without table-driven structure when testing multiple cases
- Ignoring errors with
_- Tests without assertions
Files:
pkg/console/operator/sync_v400_test.go
{pkg,cmd}/**/*.go
📄 CodeRabbit inference engine (CLAUDE.md)
Use gofmt for code formatting on pkg and cmd directories
{pkg,cmd}/**/*.go: Format code usinggofmt -w ./pkg ./cmd
Rungo vetchecks on all Go packages in ./pkg and ./cmd
Files:
pkg/console/operator/sync_v400_test.go
**/*sync*.go
📄 CodeRabbit inference engine (CONVENTIONS.md)
Implement sync loops (
sync_v400) incrementally: start from zero, create/update missing requirements, and return to continue on next loop
Files:
pkg/console/operator/sync_v400_test.go
**/*.{go,java,py,js,ts,jsx,tsx,cs,cpp,c,rb,php}
📄 CodeRabbit inference engine (Custom checks)
**/*.{go,java,py,js,ts,jsx,tsx,cs,cpp,c,rb,php}: Flag logging that may expose passwords, tokens, API keys, PII (email, SSN, credit card), session IDs, internal hostnames, or customer data
Flag MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB mode usage. Flag custom crypto implementations. Flag non-constant-time comparison of secrets or tokens
Files:
pkg/console/operator/sync_v400_test.go
**/{deploy,config,operator,controllers}/**/*.{yaml,yml,go}
📄 CodeRabbit inference engine (Custom checks)
**/{deploy,config,operator,controllers}/**/*.{yaml,yml,go}: When deployment manifests, operator code, or controllers are added or modified, check for scheduling constraints that assume a standard HA topology without topology-awareness. Flag required pod anti-affinity (requiredDuringSchedulingIgnoredDuringExecution with topologyKey: kubernetes.io/hostname) that breaks on SNO, TNF with replicas > 2, and TNA
When deployment manifests, operator code, or controllers are added or modified, check for pod topology spread constraints with whenUnsatisfiable: DoNotSchedule and hostname topology key that breaks on SNO and is problematic when maxSkew exceeds the number of schedulable nodes on TNF and TNA
When deployment manifests, operator code, or controllers are added or modified, check for replica count derived from node count without considering SNO, TNF, and TNA topology variations where fewer than 3 control-plane nodes exist, and TNA's arbiter node should not run general workloads
When deployment manifests, operator code, or controllers are added or modified, check for nodeSelector or node affinity targeting control-plane nodes (node-role.kubernetes.io/master or node-role.kubernetes.io/control-plane) that will fail on HyperShift where no nodes carry these labels
When deployment manifests, operator code, or controllers are added or modified, check for scheduling workloads to all control-plane nodes equally without excluding arbiter nodes. Flag broad or wildcard tolerations that inadvertently schedule to resource-constrained arbiter nodes on TNA clusters
When deployment manifests, operator code, or controllers are added or modified, check for assumptions that dedicated worker nodes exist. Code targeting only worker nodes via node-role.kubernetes.io/worker nodeSelector may need to also consider nodes that carry both control-plane and worker roles on SNO and TNF
When deployment manifests, operator code, or controllers are added or modified, check PodDisruptionBudgets designe...
Files:
pkg/console/operator/sync_v400_test.go
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}
⚙️ CodeRabbit configuration file
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}: Injection prevention (prodsec-skills):
- SQL: parameterized queries only; no string concatenation
- Command: no shell=True, os.system, or backtick exec with user input
- LDAP/XPath: escape special characters in filters
- Path traversal: canonicalize paths, reject ../
- Deserialization: no pickle/yaml.load()/eval on untrusted data
- Prototype pollution: no recursive merge of untrusted objects
- Validate at trust boundaries with allow-lists, not deny-lists
- Normalize Unicode and anchor regexes (^$); watch for ReDoS
Files:
pkg/console/operator/sync_v400_test.go
- Fix error message wording: "current" → "observed" - Extract checkDeploymentGenerationProgress helper for testability - Add unit tests for ObservedGeneration-based Progressing check - Fix import ordering per project conventions Assisted by: Claude (Opus 4.6)
|
@sg00dwin: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
IsAvailableAndUpdatedwithObservedGeneration < Generationcheck for the Progressing conditionIsAvailableAndUpdatedusedUpdatedReplicas == Replicas, which falsely triggers during node reboots when pods are temporarily disrupted — even though the operator made no spec changesObservedGeneration < Generationonly fires when the operator has updated the deployment spec and the deployment controller hasn't processed it yetcheckDeploymentGenerationProgresshelper for testability and add unit testsRoot Cause
During node reboots (MCO draining nodes), console pods are terminated and replacements are created. During the ~30-second gap,
UpdatedReplicas != Replicas, causingIsAvailableAndUpdatedto return false and the operator to reportProgressing=True. The operator made no changes — the disruption is external.Why This Fix
ObservedGeneration < Generationprecisely detects operator-initiated spec changes:The Available condition (
IsAvailableat line 231) independently monitors pod readiness and is unaffected.Test plan
makebuilds successfullymake test-unitall tests passgofmt/go vetcleancheckDeploymentGenerationProgress: ObservedGeneration equal, less than, greater than Generation, both zero, and initial rollout[Monitor:legacy-cvo-invariants][bz-Management Console] clusteroperator/console should stay Progressing=False while MCO is Progressing=Trueshould stop failing (~8% failure rate on Sippy currently)Assisted-by: Claude Code (Opus 4.6)
Summary by CodeRabbit
Bug Fixes
Tests