OCPBUGS-62629: reflect workload controller changes in the deployment … #2196
tjungblu wants to merge 1 commit into openshift:master from
Conversation
@tjungblu: This pull request references Jira Issue OCPBUGS-62629, which is valid. 3 validation(s) were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/hold
Walkthrough

Reworks deployment degraded-condition computation and diagnostics: exports the container-message builder, refactors container-status helpers, rewrites syncManaged to compute Available/Progressing/Degraded from the applied deployment and matched pods, changes sync wiring for sync-error degraded behavior, updates deletion routing, and expands tests to assert degraded reasons and messages.

Changes: Degraded Condition Handling for Deployments
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Controller
participant KubeAPI as "Kube API (Deployment)"
participant PodAPI as "Kube API (Pods)"
participant OperatorStatus
Controller->>KubeAPI: Apply deployment (returns deploymentObj)
Controller->>PodAPI: List pods matching deploymentObj.Spec.Selector
PodAPI-->>Controller: Pod list
Controller->>Controller: compute Available / Progressing (hasDeploymentProgressed / timed out)
Controller->>Controller: filter non-deleting pods, detect failing pods, build container messages (ContainerMessagesForPods)
alt progressing timed out
Controller->>OperatorStatus: Set Degraded=True (Reason=ProgressDeadlineExceeded)
else failing/unavailable pods (not mid-rollout)
Controller->>OperatorStatus: Set Degraded=True (Reason=UnavailablePod) with container messages
else no errors
Controller->>OperatorStatus: Set Available/Progressing accordingly (Degraded=False)
end
    Note right of Controller: If sync returns an error and the controller owns Degraded, applySyncErrorDegraded runs to mark Degraded=SyncError
```
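To make the flow above concrete, here is a minimal Go sketch of the condition computation. The `timedOut` and `hasFailing` inputs stand in for the PR's helpers (progress-deadline detection and `hasFailingPods`); this is an illustration of the decision flow, not the PR's exact code.

```go
import appsv1 "k8s.io/api/apps/v1"

// computeConditions mirrors the decision flow in the diagram above.
// Available comes from the applied deployment; Degraded is set either on a
// progress-deadline timeout or on unavailable/failing pods outside a rollout.
func computeConditions(dep *appsv1.Deployment, timedOut, hasFailing bool) (available, degraded bool, reason string) {
	desired := int32(1)
	if dep.Spec.Replicas != nil {
		desired = *dep.Spec.Replicas
	}
	available = dep.Status.AvailableReplicas > 0
	// A rollout is still in flight while the new ReplicaSet has not yet
	// produced every desired replica.
	updating := dep.Status.UpdatedReplicas < desired

	switch {
	case timedOut:
		degraded, reason = true, "ProgressDeadlineExceeded"
	case !updating && dep.Status.AvailableReplicas < desired && (hasFailing || dep.Status.AvailableReplicas == 0):
		degraded, reason = true, "UnavailablePod"
	}
	return available, degraded, reason
}
```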
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 9 | ❌ 3
❌ Failed checks (1 warning, 2 inconclusive)
✅ Passed checks (9 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: tjungblu
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/operator/deploymentcontroller/deployment_controller_test.go (1)
456-671: ⚡ Quick win: Add one regression case that asserts the new reason/message fields.
These table updates prove the boolean transitions, but the behavior added here is really the new UnavailablePod/ProgressDeadlineExceeded/SyncError condition semantics. One focused case that keeps and asserts Reason/Message would catch the exact regressions this PR is trying to prevent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/deploymentcontroller/deployment_controller_test.go` around lines 456 - 671, Add a single regression test case to the TestSync testCases table that specifically asserts Reason and Message on operator conditions (e.g. UnavailablePod, ProgressDeadlineExceeded, SyncError). Create a case (name like "assert condition reason/message") that sets up initialObjects using makeDeployment with appropriate withDeploymentStatus and withDeploymentConditions to trigger the target condition, and an operator via makeFakeOperatorInstance; then in expectedObjects.operator include an assertion that the condition entries include the exact Reason and Message (extend or use a helper similar to withTrueConditions/withFalseConditions but that checks Reason/Message, or add a new helper e.g. withConditionsIncludingReasonMessage) so the test fails if Reason/Message are not set as intended; insert this new case into the existing testCases slice inside TestSync.
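A sketch of the helper such a case could use, assuming the harness's builder style (the `operatorModifier` type and `fakeDriverInstance` struct are assumptions modeled on the existing withTrueConditions/withFalseConditions helpers; `opv1` is the openshift/api operator/v1 alias):

```go
// operatorModifier matches the assumed builder style of the test harness;
// the real test file may already define an equivalent type.
type operatorModifier func(*fakeDriverInstance) *fakeDriverInstance

// withConditionReasonMessage pins Reason and Message on a condition so a
// regression in either field fails the test (hypothetical helper).
func withConditionReasonMessage(condType, reason, message string) operatorModifier {
	return func(i *fakeDriverInstance) *fakeDriverInstance {
		for idx := range i.Status.Conditions {
			if i.Status.Conditions[idx].Type == condType {
				i.Status.Conditions[idx].Reason = reason
				i.Status.Conditions[idx].Message = message
				return i
			}
		}
		// Condition not present yet: add it with the expected fields.
		i.Status.Conditions = append(i.Status.Conditions, opv1.OperatorCondition{
			Type: condType, Status: opv1.ConditionTrue, Reason: reason, Message: message,
		})
		return i
	}
}
```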
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/apps/deployment/status.go`:
- Line 41: The loop currently appends pods[i].Status.InitContainerStatuses onto
pods[i].Status.ContainerStatuses which can mutate cached pod objects; instead
create a fresh combined slice (e.g., combined := make([]v1.ContainerStatus, 0,
len(pods[i].Status.ContainerStatuses)+len(pods[i].Status.InitContainerStatuses))
and append both slices into it) or iterate pods[i].Status.ContainerStatuses and
pods[i].Status.InitContainerStatuses in two separate loops; update the loop that
references the combined view to use this new local variable (referencing pods,
pods[i].Status.ContainerStatuses, pods[i].Status.InitContainerStatuses) so you
never modify the original pod object.
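A minimal sketch of the non-mutating variant this describes (`corev1` is k8s.io/api/core/v1; the function name is illustrative):

```go
import corev1 "k8s.io/api/core/v1"

// combinedStatuses builds a fresh slice of regular and init container
// statuses so the cached pod object is never modified.
func combinedStatuses(pod *corev1.Pod) []corev1.ContainerStatus {
	out := make([]corev1.ContainerStatus, 0,
		len(pod.Status.ContainerStatuses)+len(pod.Status.InitContainerStatuses))
	out = append(out, pod.Status.ContainerStatuses...)
	out = append(out, pod.Status.InitContainerStatuses...)
	return out
}
```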
In `@pkg/operator/deploymentcontroller/deployment_controller.go`:
- Around line 381-388: The degraded message is built from the full operandPods
slice while hasFailingPods() intentionally ignores terminating/deleting pods,
causing mismatched details; modify the code around the decision branch (when
!workloadIsBeingUpdated && deploymentObj.Status.AvailableReplicas <
desiredReplicas) to first filter operandPods to remove deleting/terminating pods
into a new slice (e.g., livePods), pass that filtered slice into
hasFailingPods(deploymentObj, livePods, now) and into
deployment.ContainerMessagesForPods(deploymentObj, livePods) so both the failure
check and the generated container messages use the identical pod set.
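The filter itself is small; here is a sketch of a `nonDeletingPods` helper both call sites could share (the name matches the later diff in this review, the body is an assumption):

```go
import corev1 "k8s.io/api/core/v1"

// nonDeletingPods drops terminating pods so the failure check and the
// degraded message are computed from the identical pod set.
func nonDeletingPods(pods []corev1.Pod) []corev1.Pod {
	live := make([]corev1.Pod, 0, len(pods))
	for _, p := range pods {
		if p.DeletionTimestamp == nil {
			live = append(live, p)
		}
	}
	return live
}
```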
---
Nitpick comments:
In `@pkg/operator/deploymentcontroller/deployment_controller_test.go`:
- Around line 456-671: Add a single regression test case to the TestSync
testCases table that specifically asserts Reason and Message on operator
conditions (e.g. UnavailablePod, ProgressDeadlineExceeded, SyncError). Create a
case (name like "assert condition reason/message") that sets up initialObjects
using makeDeployment with appropriate withDeploymentStatus and
withDeploymentConditions to trigger the target condition, and an operator via
makeFakeOperatorInstance; then in expectedObjects.operator include an assertion
that the condition entries include the exact Reason and Message (extend or use a
helper similar to withTrueConditions/withFalseConditions but that checks
Reason/Message, or add a new helper e.g. withConditionsIncludingReasonMessage)
so the test fails if Reason/Message are not set as intended; insert this new
case into the existing testCases slice inside TestSync.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 77692bf7-5ee0-4146-af3f-307cedb5c776
📒 Files selected for processing (4)
- pkg/apps/deployment/status.go
- pkg/operator/csi/csidrivercontrollerservicecontroller/csi_driver_controller_service_controller.go
- pkg/operator/deploymentcontroller/deployment_controller.go
- pkg/operator/deploymentcontroller/deployment_controller_test.go
Force-pushed from 07ea5a5 to 012e48a
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/operator/deploymentcontroller/deployment_controller_test.go (1)
673-709: ⚡ Quick win: Add one regression case for the new SyncError degraded path.
The suite now covers timeout and unavailable-pod degradation, but the wiring change in this PR is the deferred applySyncErrorDegraded() path in sync(). Please add a case that forces getDeployment() or the new pod lookup to fail and asserts conditionDegraded=True with Reason="SyncError", so the WithSyncDegradedOnError inversion stays protected.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/deploymentcontroller/deployment_controller_test.go` around lines 673 - 709, Add a regression test case in deployment_controller_test.go that exercises the deferred applySyncErrorDegraded() path by forcing getDeployment() or the pod lookup to return an error so sync() triggers the SyncError degraded branch; create a new test entry (parallel to the existing "ProgressDeadlineExceeded" case) that sets up initialObjects to produce a failure (e.g., mock getDeployment/pod list to error or supply a nil/erroneous object) and in assertOperatorStatus verify that conditionDegraded is present with Status True and Reason "SyncError" (and an appropriate message), ensuring the WithSyncDegradedOnError inversion remains protected and referencing sync(), applySyncErrorDegraded(), and WithSyncDegradedOnError for locating the logic to target.
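One way to force that failure against a client-go fake clientset, as a sketch (assumes the harness's fake `kubeClient`; `clienttesting` is k8s.io/client-go/testing and `runtime` is k8s.io/apimachinery/pkg/runtime):

```go
// Induce a pod-list failure so sync() returns an error and the deferred
// applySyncErrorDegraded() path marks Degraded with Reason "SyncError".
kubeClient.Fake.PrependReactor("list", "pods",
	func(action clienttesting.Action) (handled bool, ret runtime.Object, err error) {
		return true, nil, errors.New("induced pod list failure")
	})
// After sync() runs, look up the condition and assert its fields:
//   cond := v1helpers.FindOperatorCondition(status.Conditions, conditionDegraded)
//   cond.Status == opv1.ConditionTrue && cond.Reason == "SyncError"
```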
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/deploymentcontroller/deployment_controller.go`:
- Around line 296-313: The current reconcile eagerly lists pods (using
c.kubeClient.CoreV1().Pods(...).List) whenever the controller has the Degraded
condition (slices.Contains(c.conditions, opv1.OperatorStatusTypeDegraded)),
causing transient list failures to surface as hard sync errors; move the pod
fetch logic (creation of operandPods from podList.Items and
metav1.LabelSelectorAsSelector on deploymentObj.Spec.Selector) into the specific
branch that handles UnavailablePod/container-failure diagnostics (where
container details are inspected), and change error handling so that pod list
failures are treated as best-effort diagnostics (log/record the lookup failure
but do not return error from syncManaged or the reconcile path). Ensure you
still validate deploymentObj.Spec.Selector before using it and only attempt the
pods.List when you actually need pod/container details.
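A sketch of the best-effort shape this suggests, assuming the controller holds a `kubeClient` and that a nil result simply means "no pod-level detail available" (names follow the review thread, not necessarily the merged code):

```go
// listOperandPodsForDiagnostics returns matching pods, or nil when the
// selector is invalid or the list fails; callers treat nil as missing
// diagnostics instead of failing the sync.
func (c *DeploymentController) listOperandPodsForDiagnostics(ctx context.Context, dep *appsv1.Deployment) []corev1.Pod {
	sel, err := metav1.LabelSelectorAsSelector(dep.Spec.Selector)
	if err != nil {
		klog.V(2).Infof("invalid selector on %s/%s: %v", dep.Namespace, dep.Name, err)
		return nil
	}
	podList, err := c.kubeClient.CoreV1().Pods(dep.Namespace).List(ctx,
		metav1.ListOptions{LabelSelector: sel.String()})
	if err != nil {
		klog.V(2).Infof("pod diagnostics unavailable for %s/%s: %v", dep.Namespace, dep.Name, err)
		return nil
	}
	return podList.Items
}
```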
---
Nitpick comments:
In `@pkg/operator/deploymentcontroller/deployment_controller_test.go`:
- Around line 673-709: Add a regression test case in
deployment_controller_test.go that exercises the deferred
applySyncErrorDegraded() path by forcing getDeployment() or the pod lookup to
return an error so sync() triggers the SyncError degraded branch; create a new
test entry (parallel to the existing "ProgressDeadlineExceeded" case) that sets
up initialObjects to produce a failure (e.g., mock getDeployment/pod list to
error or supply a nil/erroneous object) and in assertOperatorStatus verify that
conditionDegraded is present with Status True and Reason "SyncError" (and an
appropriate message), ensuring the WithSyncDegradedOnError inversion remains
protected and referencing sync(), applySyncErrorDegraded(), and
WithSyncDegradedOnError for locating the logic to target.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 884898b7-e452-48ae-8bbd-18cbb0f259c8
📒 Files selected for processing (5)
- pkg/apps/deployment/status.go
- pkg/operator/csi/csidrivercontrollerservicecontroller/csi_driver_controller_service_controller.go
- pkg/operator/deploymentcontroller/deployment_controller.go
- pkg/operator/deploymentcontroller/deployment_controller_test.go
- pkg/operator/deploymentcontroller/helpers_test.go
✅ Files skipped from review due to trivial changes (2)
- pkg/operator/csi/csidrivercontrollerservicecontroller/csi_driver_controller_service_controller.go
- pkg/operator/deploymentcontroller/helpers_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/apps/deployment/status.go
Force-pushed from 012e48a to 7d80de2
…controller

Contains the degraded/progressing condition changes made to the workload controller from openshift#2128.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
Force-pushed from 7d80de2 to c44884e
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/operator/deploymentcontroller/deployment_controller_test.go (1)
676-739: ⚡ Quick win: Add one regression case for pod-diagnostics failure in the unavailable-pod path.
The new assertions cover ProgressDeadlineExceeded and deferred SyncError, but not the branch where Pods().List() fails while the deployment is partially unavailable. That path now affects whether Degraded stays true or gets cleared, so it should be locked down with assertOperatorStatus.
Also applies to: 796-798
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/deploymentcontroller/deployment_controller_test.go` around lines 676 - 739, Add a regression test case that simulates a failure from Pods().List() in the unavailable-pod path and asserts operator Degraded remains set (or not cleared) accordingly; specifically, add a new test entry in the table (similar style to the existing "ProgressDeadlineExceeded" and "sync error" cases) that injects a fake client error for Pods().List() when getDeployment()/the unavailable-pod handling runs, then provide an assertOperatorStatus closure that checks v1helpers.FindOperatorCondition(status.Conditions, conditionDegraded) is non-nil and has the expected Status/Reason/Message (matching the desired behavior for the unavailable-pod error path). Ensure the test uses the same test harness helpers (makeDeployment, makeFakeOperatorInstance, optionalManifestHooks if needed) and names the case to indicate "pod-diagnostics failure in unavailable-pod" so it covers the branch referenced in getDeployment()/applySyncErrorDegraded logic.
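The assertion half of that case could be a small helper, sketched here with library-go's condition lookup (`conditionDegraded` follows the harness naming in the prompt and is assumed; `opv1` is operator/v1 and `v1helpers` is library-go's operator/v1helpers):

```go
// assertDegradedTrue fails the test unless the Degraded condition is set,
// locking down the partial-outage behavior when Pods().List() errors out.
func assertDegradedTrue(t *testing.T, status *opv1.OperatorStatus) {
	t.Helper()
	cond := v1helpers.FindOperatorCondition(status.Conditions, conditionDegraded)
	if cond == nil || cond.Status != opv1.ConditionTrue {
		t.Fatalf("expected Degraded=True after pod-list failure, got %+v", cond)
	}
}
```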
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/deploymentcontroller/deployment_controller.go`:
- Around line 362-367: When AvailableReplicas < desiredReplicas and rollout is
not updating, don't let a nil/failed pod lookup from
listOperandPodsForDiagnostics clear Degraded; treat missing operandPods as a
failure. In the branch around listOperandPodsForDiagnostics -> operandPods, set
hasFailing to true if operandPods is nil or the diagnostics list indicates a
selector/list error (in addition to the existing hasFailingPods(deployment,
livePods, now) check), and still build containerMessages with whatever
diagnostics are available via dpm.ContainerMessagesForPods(deployment, livePods)
so the Degraded condition reflects the deployment shortage rather than
best‑effort pod lookup success. Ensure you reference
listOperandPodsForDiagnostics, nonDeletingPods, hasFailingPods,
dpm.ContainerMessagesForPods, deployment.Status.AvailableReplicas and
desiredReplicas when making the change.
---
Nitpick comments:
In `@pkg/operator/deploymentcontroller/deployment_controller_test.go`:
- Around line 676-739: Add a regression test case that simulates a failure from
Pods().List() in the unavailable-pod path and asserts operator Degraded remains
set (or not cleared) accordingly; specifically, add a new test entry in the
table (similar style to the existing "ProgressDeadlineExceeded" and "sync error"
cases) that injects a fake client error for Pods().List() when
getDeployment()/the unavailable-pod handling runs, then provide an
assertOperatorStatus closure that checks
v1helpers.FindOperatorCondition(status.Conditions, conditionDegraded) is non-nil
and has the expected Status/Reason/Message (matching the desired behavior for
the unavailable-pod error path). Ensure the test uses the same test harness
helpers (makeDeployment, makeFakeOperatorInstance, optionalManifestHooks if
needed) and names the case to indicate "pod-diagnostics failure in
unavailable-pod" so it covers the branch referenced in
getDeployment()/applySyncErrorDegraded logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c402234d-f742-451e-ab6d-12b7d0280738
📒 Files selected for processing (5)
- pkg/apps/deployment/status.go
- pkg/operator/csi/csidrivercontrollerservicecontroller/csi_driver_controller_service_controller.go
- pkg/operator/deploymentcontroller/deployment_controller.go
- pkg/operator/deploymentcontroller/deployment_controller_test.go
- pkg/operator/deploymentcontroller/helpers_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/operator/deploymentcontroller/helpers_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/operator/csi/csidrivercontrollerservicecontroller/csi_driver_controller_service_controller.go
- pkg/apps/deployment/status.go
```go
case !workloadIsBeingUpdated && deployment.Status.AvailableReplicas < desiredReplicas:
	operandPods := c.listOperandPodsForDiagnostics(ctx, deployment, syncContext)
	livePods := nonDeletingPods(operandPods)
	hasFailing := hasFailingPods(deployment, livePods, now)
	if hasFailing || deployment.Status.AvailableReplicas == 0 {
		containerMessages := dpm.ContainerMessagesForPods(deployment, livePods)
```
Don’t let a failed pod lookup clear Degraded for partial outages.
When AvailableReplicas < desiredReplicas and rollout is already done, listOperandPodsForDiagnostics() collapses selector/list failures into nil, so hasFailingPods() becomes false and this branch reports Degraded=False unless availability dropped all the way to zero. That makes operator health depend on best-effort diagnostics instead of the deployment shortage itself.
Suggested direction:

```diff
-	operandPods := c.listOperandPodsForDiagnostics(ctx, deployment, syncContext)
+	operandPods, diagnosticsOK := c.listOperandPodsForDiagnostics(ctx, deployment, syncContext)
 	livePods := nonDeletingPods(operandPods)
-	hasFailing := hasFailingPods(deployment, livePods, now)
-	if hasFailing || deployment.Status.AvailableReplicas == 0 {
-		containerMessages := dpm.ContainerMessagesForPods(deployment, livePods)
+	hasFailing := diagnosticsOK && hasFailingPods(deployment, livePods, now)
+	if deployment.Status.AvailableReplicas < desiredReplicas && (!diagnosticsOK || hasFailing || deployment.Status.AvailableReplicas == 0) {
+		var containerMessages []string
+		if diagnosticsOK {
+			containerMessages = dpm.ContainerMessagesForPods(deployment, livePods)
+		}
```

Also applies to: 401-417
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/operator/deploymentcontroller/deployment_controller.go` around lines 362
- 367, When AvailableReplicas < desiredReplicas and rollout is not updating,
don't let a nil/failed pod lookup from listOperandPodsForDiagnostics clear
Degraded; treat missing operandPods as a failure. In the branch around
listOperandPodsForDiagnostics -> operandPods, set hasFailing to true if
operandPods is nil or the diagnostics list indicates a selector/list error (in
addition to the existing hasFailingPods(deployment, livePods, now) check), and
still build containerMessages with whatever diagnostics are available via
dpm.ContainerMessagesForPods(deployment, livePods) so the Degraded condition
reflects the deployment shortage rather than best‑effort pod lookup success.
Ensure you reference listOperandPodsForDiagnostics, nonDeletingPods,
hasFailingPods, dpm.ContainerMessagesForPods,
deployment.Status.AvailableReplicas and desiredReplicas when making the change.
@tjungblu: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
…controller
Contains the degraded/progressing condition changes made to the workload controller from #2128.
Summary by CodeRabbit
Refactor
Documentation
Tests