Gear for OpenShift CI#285
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Jakob-Naucke The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Reviewer's GuideRefactors test utilities and operator logic to better support OpenShift CI by switching from Route-based exposure to LoadBalancer services with certificate-driven endpoints, replaces custom polling logic with kube::runtime::wait::await_condition-based waits, tightens cleanup/resource lifecycle handling, and simplifies dependencies/RBAC for OpenShift and other platforms. Sequence diagram for updated reconcile component installation flowsequenceDiagram
participant Reconciler as reconcile
participant Cluster as TrustedExecutionCluster
participant Installer as install_components
participant Trustee as install_trustee_configuration
participant RegServer as install_register_server
participant AttestKeyReg as install_attestation_key_register
Reconciler->>Installer: install_components(&Client, &TrustedExecutionCluster)
activate Installer
Installer->>Trustee: install_trustee_configuration(Client, &TrustedExecutionCluster)
activate Trustee
Trustee-->>Installer: Result<()> (propagated error on failure)
deactivate Trustee
Installer->>RegServer: install_register_server(Client, &TrustedExecutionCluster)
activate RegServer
RegServer-->>Installer: Result<()> (propagated error on failure)
deactivate RegServer
Installer->>AttestKeyReg: install_attestation_key_register(Client, &TrustedExecutionCluster)
activate AttestKeyReg
AttestKeyReg-->>Installer: Result<()> (propagated error on failure)
deactivate AttestKeyReg
alt all components installed successfully
Installer-->>Reconciler: Ok(())
Reconciler->>Reconciler: installed_condition(...)
Reconciler-->>Reconciler: Action::await_change()
else any installation error
Installer-->>Reconciler: Err(e)
Reconciler->>Reconciler: warn!("Installation of a component failed: {e:?}")
Reconciler-->>Reconciler: Action::requeue(Duration::from_secs(60))
end
deactivate Installer
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In
OpenShift::get_urlthe chain ofunwrap()calls onstatus,load_balancer, andingresswill panic if the Service exists but those fields are not yet populated; consider handling these asOptions (and possibly reusing the same readiness condition used inexpose) to fail gracefully instead of crashing tests. - The new
OpenShift::exposeimplementation uses fixed 60s timeouts for both LoadBalancer ingress readiness and certificate reissuance, whereas other helpers usescaled_duration; you might want to align these timeouts with the scaling logic to reduce spurious failures on slower CI clusters.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `OpenShift::get_url` the chain of `unwrap()` calls on `status`, `load_balancer`, and `ingress` will panic if the Service exists but those fields are not yet populated; consider handling these as `Option`s (and possibly reusing the same readiness condition used in `expose`) to fail gracefully instead of crashing tests.
- The new `OpenShift::expose` implementation uses fixed 60s timeouts for both LoadBalancer ingress readiness and certificate reissuance, whereas other helpers use `scaled_duration`; you might want to align these timeouts with the scaling logic to reduce spurious failures on slower CI clusters.
## Individual Comments
### Comment 1
<location path="Makefile" line_range="19" />
<code_context>
YQ_VERSION ?= $(shell go list -m -f '{{.Version}}' github.com/mikefarah/yq/v4)
YQ ?= $(LOCALBIN)/yq-$(YQ_VERSION)
-KOPIUM_VERSION ?= $(shell grep kopium lib/Cargo.toml | sed -E 's/.*"(.*)"/\1/')
+KOPIUM_VERSION ?= $(shell cargo metadata --format-version 1 | jq -r '.resolve.nodes[] | select(.deps[]?.name == "kopium") | .deps[] | select(.name == "kopium") | .pkg | split("@")[1]')
KOPIUM ?= $(LOCALBIN)/kopium-$(KOPIUM_VERSION)
</code_context>
<issue_to_address>
**suggestion:** The new KOPIUM_VERSION detection adds a hard dependency on `jq` and a relatively complex `cargo metadata` query.
This approach is more robust than grepping `Cargo.toml`, but it now relies on `jq` being available everywhere the Makefile runs and on the stability of the `cargo metadata` layout. If that’s not guaranteed (e.g., minimal CI images, some dev setups), please either document/enforce the `jq` requirement with a clear failure message, or simplify how the version is resolved (e.g., restrict the query scope or move the logic into a small helper script/binary) to reduce fragility.
Suggested implementation:
```
JQ ?= jq
ifndef KOPIUM_VERSION
ifeq (, $(shell command -v $(JQ) 2>/dev/null))
$(error jq is required to auto-detect KOPIUM_VERSION; install jq or set KOPIUM_VERSION explicitly, e.g. 'make KOPIUM_VERSION=...')
endif
endif
KOPIUM_VERSION ?= $(shell cargo metadata --format-version 1 | $(JQ) -r '.resolve.nodes[] | select(.deps[]?.name == "kopium") | .deps[] | select(.name == "kopium") | .pkg | split("@")[1]')
KOPIUM ?= $(LOCALBIN)/kopium-$(KOPIUM_VERSION)
```
If the Makefile already defines a `JQ` variable elsewhere (not visible in the provided snippet), you should remove one of the definitions or consolidate them to avoid redefinition. Otherwise, this change will:
- Fail early with a clear error if `jq` is missing and `KOPIUM_VERSION` is not provided.
- Still allow CI/devs to bypass the `jq` requirement by explicitly supplying `KOPIUM_VERSION` on the command line or in the environment.
</issue_to_address>
### Comment 2
<location path="tests/trusted_execution_cluster.rs" line_range="301-303" />
<code_context>
- // Poll for the AttestationKey to be approved, have owner reference, and have a Secret created
+ // Timeout for the AttestationKey to be approved, have owner reference, and have a Secret created
+ let approved = await_condition(attestation_keys.clone(), &ak_name, ak_approved);
+ let ctx = format!("waiting for AttestationKey {ak_name} to be approved");
+ timeout(Duration::from_secs(30), approved).await.context(ctx)??;
+ let chk_machine_owner = |ak: Option<&AttestationKey>| {
+ let chk_owner = |owner: &OwnerReference| owner.kind == "Machine" && owner.name == machine_name;
</code_context>
<issue_to_address>
**suggestion (testing):** Use scaled_duration for timeouts to keep tests robust across slower environments
This and a few other tests in this file hard-code `Duration::from_secs(30)` instead of using `scaled_duration(..)`. That can make tests flaky on slower CI or busy clusters, particularly for controller-driven operations. Please switch to `timeout(scaled_duration(30), ...)` here for consistency and to respect the global test timeout scaling knob.
Suggested implementation:
```rust
// Timeout for the AttestationKey to be approved, have owner reference, and have a Secret created
let approved = await_condition(attestation_keys.clone(), &ak_name, ak_approved);
let ctx = format!("waiting for AttestationKey {ak_name} to be approved");
timeout(scaled_duration(30), approved).await.context(ctx)??;
let chk_machine_owner = |ak: Option<&AttestationKey>| {
```
```rust
let has_machine_owner = await_condition(attestation_keys.clone(), &ak_name, chk_machine_owner);
let ctx = format!("waiting for AttestationKey {ak_name} to be owned by Machine {machine_name}");
timeout(scaled_duration(30), has_machine_owner).await.context(ctx)??;
let secrets_api: Api<Secret> = Api::namespaced(client.clone(), namespace);
```
</issue_to_address>
### Comment 3
<location path="tests/trusted_execution_cluster.rs" line_range="106-107" />
<code_context>
// Wait until it disappears
- wait_for_resource_deleted(&api, TEC_NAME, scaled_timeout(120), 5).await?;
+ wait_for_resource_deleted(&api, TEC_NAME, 120).await?;
let deployments_api: Api<Deployment> = Api::namespaced(client.clone(), namespace);
</code_context>
<issue_to_address>
**suggestion:** Consider using scaled_timeout for resource deletion waits instead of fixed seconds
This changes the deletion wait from `scaled_timeout(120)` to a fixed `120`, while other calls here still use `scaled_timeout(..)`. To stay consistent and reduce flakiness on slower clusters, pass a scaled value (e.g. `scaled_timeout(120)`) into `wait_for_resource_deleted` instead of a hard-coded timeout.
```suggestion
// Wait until it disappears
wait_for_resource_deleted(&api, TEC_NAME, scaled_timeout(120)).await?;
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| YQ_VERSION ?= $(shell go list -m -f '{{.Version}}' github.com/mikefarah/yq/v4) | ||
| YQ ?= $(LOCALBIN)/yq-$(YQ_VERSION) | ||
| KOPIUM_VERSION ?= $(shell grep kopium lib/Cargo.toml | sed -E 's/.*"(.*)"/\1/') | ||
| KOPIUM_VERSION ?= $(shell cargo metadata --format-version 1 | jq -r '.resolve.nodes[] | select(.deps[]?.name == "kopium") | .deps[] | select(.name == "kopium") | .pkg | split("@")[1]') |
There was a problem hiding this comment.
suggestion: The new KOPIUM_VERSION detection adds a hard dependency on jq and a relatively complex cargo metadata query.
This approach is more robust than grepping Cargo.toml, but it now relies on jq being available everywhere the Makefile runs and on the stability of the cargo metadata layout. If that’s not guaranteed (e.g., minimal CI images, some dev setups), please either document/enforce the jq requirement with a clear failure message, or simplify how the version is resolved (e.g., restrict the query scope or move the logic into a small helper script/binary) to reduce fragility.
Suggested implementation:
JQ ?= jq
ifndef KOPIUM_VERSION
ifeq (, $(shell command -v $(JQ) 2>/dev/null))
$(error jq is required to auto-detect KOPIUM_VERSION; install jq or set KOPIUM_VERSION explicitly, e.g. 'make KOPIUM_VERSION=...')
endif
endif
KOPIUM_VERSION ?= $(shell cargo metadata --format-version 1 | $(JQ) -r '.resolve.nodes[] | select(.deps[]?.name == "kopium") | .deps[] | select(.name == "kopium") | .pkg | split("@")[1]')
KOPIUM ?= $(LOCALBIN)/kopium-$(KOPIUM_VERSION)
If the Makefile already defines a JQ variable elsewhere (not visible in the provided snippet), you should remove one of the definitions or consolidate them to avoid redefinition. Otherwise, this change will:
- Fail early with a clear error if
jqis missing andKOPIUM_VERSIONis not provided. - Still allow CI/devs to bypass the
jqrequirement by explicitly supplyingKOPIUM_VERSIONon the command line or in the environment.
6aebf10 to
44865dd
Compare
| const TRUSTEE_VERSION: &str = "v0.17.0"; | ||
| /// Default version tag for operator-managed component images | ||
| const COMPONENT_VERSION: &str = "0.2.0"; | ||
| const COMPONENT_VERSION: &str = "v0.2.0"; |
There was a problem hiding this comment.
Not related to this PR, but the versioning is actually wrong since we already have the release 0.2.1. This should be latest and we should set dynmically when we do the release. Filed #286
| async fn expose( | ||
| &self, | ||
| _: &Client, | ||
| client: &Client, | ||
| namespace: &str, | ||
| service: &str, | ||
| deployment: &str, | ||
| cert_name: &str, | ||
| _: &str, |
There was a problem hiding this comment.
can we simplify the signature of this function? parameter seems a lot to me
| let info = format!("Waiting for deployment {depl} to be ready"); | ||
| test_info!(&self.test_name, "{info}"); | ||
| let done = await_condition(depls.clone(), depl, depl_ready); | ||
| timeout(scaled_duration(300), done).await??; |
There was a problem hiding this comment.
nit:
| timeout(scaled_duration(300), done).await??; | |
| timeout(scaled_duration(300), done).await.context(ctx)??; |
| // warn with `:?` to also get context | ||
| warn!( | ||
| "Installation of a component failed: {e:?}\n \ | ||
| This can occur due to owner references to incompletely propagated CRDs. Requeueing..." |
There was a problem hiding this comment.
what if it is another type of error? You are returning an error also on creation for example
There was a problem hiding this comment.
This is intended for on creation. Do you mean that we should have a maximum failure and installation failed status?
There was a problem hiding this comment.
No, just if the only cause for the error can be the owner reference or maybe something like create the one of the object because it already exists
There was a problem hiding this comment.
for "if exists", we have create_or_info_if_exists, but there are more possible errors here. ideally, the second line of the warning would only be present in the specific case I saw this, but then you're getting into error parsing.
There was a problem hiding this comment.
Maybe just keep it generic and just print the error
44865dd to
eef31f6
Compare
2271a31 to
ae237e7
Compare
Kind CI, if we make one, will not run on Azure. This partially reverts commit cc6bf85. Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
This reverts commit 2e4f6d5. CI requires cargo anyhow. Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
should be v0.2.0, not 0.2.0 Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
following trusted-execution-clusters#266 Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
also works for deployments that don't exist yet Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
does not need fields to come or parameters Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
to avoid too many parameters Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
Patch services to LoadBalancer ones to expose on OpenShift. Patch certificates once address is known. Depending on platform, ingress may be an IP address and thus entirely unpredictable. Therefore, patch TrustedExecutionCluster with Trustee address after service creation. This also reverts commit f2ad17b. Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
following trusted-execution-clusters#246 Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
which can also be set for `make integration-tests`. Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
Custom data is sometimes not yet available at Ignition fetch stage, i.e. fetch skips our merge even though other parts of config are available. Use user-data (IMDS) instead. Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
which can occur because creation of resources with owner refereces to resources of incompletely propagated CRDs can fail Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
PR trusted-execution-clusters#278 correctly recognized that Machines sometimes stayed behind when deleting the namespace immediately after TEC. This extends to ApprovedImages and AttestationKeys, but because all these resources are owned, wait for their removal before namespace removal instead of deleting them. This also reverts commit ed64522. Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
ae237e7 to
30ef74a
Compare
important especially for OpenShift exposure, but can matter on all platforms Signed-off-by: Jakob Naucke <jnaucke@redhat.com>
| }; | ||
| let ingress_ready = await_condition(services, service, has_ingress); | ||
| let ctx = format!("waiting for ingress on {service} to be ready"); | ||
| let duration = scaled_duration(60); |
There was a problem hiding this comment.
How exactly are we estimating the multiplier scale env and scale timeout duration across all files? Is it a ballpark estimate or are we using some heuristic?
There was a problem hiding this comment.
These are not backed by measurements or anything; they're set based on experience and can be increased (if we see a failure being an issue of merely a timeout that's too short) or decreased (if a test frequently takes more time to fail than it would need to).
Similar story with the multipliers, just a tradeoff of how fast is your environment/how long are you willing to wait
await_conditioninstead of polling when possible. Is also more flexible for deployments that don't exist yet.Summary by Sourcery
Adapt the operator, tests, and tooling for running on OpenShift-based CI while modernizing Kubernetes waiting logic and simplifying dependencies.
New Features:
Bug Fixes:
Enhancements:
Tests: