feat(api): wire container healthcheck observer end to end #16
Open
chrisgeo wants to merge 3 commits into
Adds a new public enum HealthStatus { none, starting, healthy, unhealthy }
and a new optional 'health: HealthStatus?' field on ContainerSnapshot,
defaulting to nil at all construction sites.
Motivation
----------
External orchestrators that drive the API server (the canonical use
case is the compose-spec depends_on: condition: service_healthy gate)
need to know whether a container is up AND healthy, not just up. Today
ContainerSnapshot reports .running for any started container, so
consumers have no choice but to treat liveness as health.
Real workloads (databases that take seconds to accept connections,
queue brokers that warm up an in-memory state) hit this regularly and
end up either waiting too long or proceeding too early.
Scope of this PR (deliberately minimal)
---------------------------------------
This PR is data-shape only. It adds the enum and the field to the SDK.
It does NOT wire a healthcheck observer into the daemon: at runtime
the field is always nil, so the on-the-wire behavior is unchanged
modulo one new Codable key on ContainerSnapshot.
Why ship a nil-only field?
~~~~~~~~~~~~~~~~~~~~~~~~~~
A container-level healthcheck observer is a non-trivial design
discussion (where does the spec live? does the API server exec into
the container, or does the runtime drive it? does it leak into the
sandbox boundary?) and we'd rather have that discussion separately,
referencing a concrete companion issue. Reserving the SDK shape now
lets downstream tools start coding against the field with the
'always nil today' guarantee documented inline; flipping the
implementation on later does not require another SDK-shape PR.
Wire compatibility
------------------
ContainerSnapshot is marshaled as Codable JSON over XPC. Adding an
optional field is compatible in both directions:
- Older clients reading from a newer server: ignore the new key.
- Newer clients reading from an older server: decode health as nil.
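
A minimal sketch of both directions, assuming Foundation's JSONDecoder and
synthesized Codable conformance. `OldSnapshot` and `NewSnapshot` are
hypothetical pared-down stand-ins for ContainerSnapshot, and the String raw
values on the enum are an assumption; the real types may differ.

```swift
import Foundation

// Case names come from the PR description; the String raw type is an
// assumption made here so the enum gets synthesized Codable conformance.
enum HealthStatus: String, Codable {
    case none, starting, healthy, unhealthy
}

// Hypothetical stand-in for ContainerSnapshot before the change.
struct OldSnapshot: Codable {
    var id: String
}

// Hypothetical stand-in for ContainerSnapshot after the change.
struct NewSnapshot: Codable {
    var id: String
    var health: HealthStatus? = nil
}

let decoder = JSONDecoder()

// Newer client, older server: the key is absent, so health decodes as nil.
let fromOldServer = Data(#"{"id":"db"}"#.utf8)
let newer = try! decoder.decode(NewSnapshot.self, from: fromOldServer)

// Older client, newer server: the unknown "health" key is ignored.
let fromNewServer = Data(#"{"id":"db","health":"healthy"}"#.utf8)
let older = try! decoder.decode(OldSnapshot.self, from: fromNewServer)
```

Synthesized decoding treats an Optional property as decode-if-present, which
is why no hand-written init(from:) is needed in this sketch.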
Files
-----
- Sources/ContainerResource/Container/HealthStatus.swift (new):
the enum, with cases documented and a note on the daemon-side
observer caveat.
- Sources/ContainerResource/Container/ContainerSnapshot.swift:
new optional field + init parameter (default nil).
Companion issue
---------------
Filed at apple/container with the design proposal for the eventual
healthcheck observer; this PR is deliberately the smaller surface so
the data shape can land independently of that discussion.
Implements the full healthcheck observer that populates
`ContainerSnapshot.health` (the read-only field reserved by CHAOS-1319)
by running the configured probe inside the running container,
interpreting exit codes through a Docker-compatible state machine, and
writing the result back through the `ContainersService` actor under a
generation-gated update path.
Motivation
----------
CHAOS-1319 reserved the SDK shape (`HealthStatus` enum + optional
`health` field on `ContainerSnapshot`) but the daemon never populated
it; the field is always `nil` today, so external orchestrators (the
canonical use case is a compose-spec orchestrator implementing
`depends_on.condition: service_healthy`) can only block on
image-baked healthchecks and only when the underlying runtime owns
the probe loop. Real workloads (databases that take seconds to accept
connections, queue brokers that warm up an in-memory state) need a
container-level healthcheck observer that the daemon owns. This PR
adds it.
What this PR changes
--------------------
- Sources/ContainerResource/Container/Healthcheck.swift (new):
public Codable / Sendable struct mirroring the Docker / compose-spec
schema (`test`, `interval`, `timeout`, `retries`, `start_period`,
`start_interval`, `disable`). Validates the probe shape (`NONE` /
`CMD` / `CMD-SHELL`) and rejects malformed inputs with actionable
error messages.
- Sources/ContainerResource/Container/ContainerConfiguration.swift:
new optional `healthcheck: Healthcheck?` field, `decodeIfPresent`
on the wire so legacy on-disk configurations decode unchanged.
- Sources/Services/ContainerAPIService/Server/Containers/
HealthStateMachine.swift (new): pure value type that maps probe
outcomes to `HealthStatus`. Implements the Docker-compatible flow:
initial `.starting`; immediate transition to `.healthy` on the
first successful probe (including during the `start_period` grace
window); transition to `.unhealthy` after `retries` consecutive
failures past the grace window; recovery to `.healthy` without restart.
- Sources/Services/ContainerAPIService/Server/Containers/
HealthProber.swift (new): `HealthProber` protocol plus production
`SandboxClientHealthProber` that drives an existing `SandboxClient`
to spawn a fresh `__container_healthcheck_<UUID>` synthetic process
per probe, races `wait()` against a per-probe timeout, and signals
`SIGKILL` on timeout to unblock the synthetic wait task before
draining the task group.
- Sources/Services/ContainerAPIService/Server/Containers/
HealthMonitor.swift (new): per-container observer manager actor
that mirrors `ExitMonitor`. `register(id:generation:startedAt:
healthcheck:prober:onUpdate:)` cancels any prior observer, fires
the initial `.starting` (or `.none` for disabled checks) callback,
and runs the probe loop. `unregister(id:)` is idempotent and
triggers cooperative cancellation.
- Sources/Services/ContainerAPIService/Server/Containers/
ContainersService.swift: new private `healthMonitor: HealthMonitor`
field; new `healthGeneration: UInt64` token on `ContainerState`
bumped on every transition into `.running`; observer registered
inside `startProcess` once the init process is up; unregister wired
into `handleContainerExit`. New private `applyHealthUpdate(id:
generation:status:)` is the single mutation entry; it drops updates
whose generation no longer matches the live container or whose
status is no longer `.running`, closing the late-callback /
restart race.
- Sources/Services/ContainerAPIService/Client/Flags.swift: seven new
flags on `Flags.Management` covering `--health-cmd`,
`--health-interval`, `--health-timeout`, `--health-retries`,
`--health-start-period`, `--health-start-interval`, and
`--no-healthcheck`.
- Sources/Services/ContainerAPIService/Client/Utility.swift: new
private `makeHealthcheck(management:)` that translates the flag
bag into a `Healthcheck`. Rejects orphan `--health-*` flags
without `--health-cmd` to catch typos at submit time.
- Package.swift: `ContainerAPIServiceTests` gains a dependency on
the `ContainerAPIService` target so the new tests can use the
`@testable` import.
- Tests:
- Tests/ContainerResourceTests/HealthcheckTest.swift: 12 tests
covering shape parsing (`CMD` / `CMD-SHELL` / `NONE`), validation
error paths, the `disable` flag, the `probeInterval` selection
rule (start-interval inside the grace window only), and a
legacy-config Codable round-trip regression.
- Tests/ContainerAPIServiceTests/HealthStateMachineTest.swift: 10
tests exercising every transition documented in the design:
initial state, success during grace, failure during grace,
failures past grace toward `retries`, success resets the counter,
`unhealthy` recovers without restart, disabled machine ignores
inputs, retries=0 corner case.
- Tests/ContainerAPIServiceTests/HealthMonitorTest.swift: 4 tests
against a `ScriptedProber` actor (deterministic probe outcomes)
and a `StatusRecorder` (ordered update capture). Covers the
disabled-check single-callback path, the `.starting` -> `.healthy`
transition, the consecutive-failure -> `.unhealthy` path, and
the unregister-cancels-loop guarantee.
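
The transitions listed for HealthStateMachine can be sketched as a small
value type. This is an illustration, not the code in the PR:
`HealthStateMachineSketch`, `record(success:inGracePeriod:)` and the
`inGracePeriod` flag are hypothetical names. Ignoring failures only while the
status is still `.starting` is an assumption based on Docker's documented
behavior, and the retries=0 corner case is not modeled.

```swift
enum HealthStatus { case none, starting, healthy, unhealthy }

// Hypothetical sketch of a pure state machine over probe outcomes.
struct HealthStateMachineSketch {
    let retries: Int
    let disabled: Bool
    private(set) var status: HealthStatus
    private(set) var consecutiveFailures = 0

    init(retries: Int, disabled: Bool = false) {
        self.retries = retries
        self.disabled = disabled
        // A disabled check reports .none; an enabled one begins in .starting.
        self.status = disabled ? .none : .starting
    }

    mutating func record(success: Bool, inGracePeriod: Bool) {
        // A disabled machine ignores every input.
        guard !disabled else { return }
        if success {
            // Any success resets the counter and reports healthy,
            // including recovery from .unhealthy.
            consecutiveFailures = 0
            status = .healthy
            return
        }
        // Assumption: failures are not counted while the container is
        // still starting inside the grace window.
        if inGracePeriod && status == .starting { return }
        consecutiveFailures += 1
        if consecutiveFailures >= retries {
            status = .unhealthy
        }
    }
}
```

Keeping the machine free of timers and I/O is what lets the transition tests
run deterministically; the caller decides whether a probe fell inside the
grace window.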
Design notes
------------
The implementation follows the architecture recommendation produced
during a design consult (see CHAOS-1381 thread): observer placement
in a dedicated actor (mirroring `ExitMonitor`), probe execution
through the existing `createProcess` / `startProcess` / `wait` path
(no new XPC route added), Docker-compatible state machine semantics,
and generation-gated snapshot updates rather than relying on
cancellation alone to suppress stale callbacks.
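
A minimal sketch of the generation gate, to show why a late callback from a
previous container instance cannot overwrite the current one. The PR places
this logic inside the `ContainersService` actor; the plain struct below,
`ContainerStateSketch`, `RuntimeStatus`, `markRunning()` and the Bool return
value are all hypothetical simplifications.

```swift
enum HealthStatus { case none, starting, healthy, unhealthy }
enum RuntimeStatus { case stopped, running }

// Hypothetical stand-in for the per-container state held by the service.
struct ContainerStateSketch {
    var status: RuntimeStatus = .stopped
    var healthGeneration: UInt64 = 0
    var health: HealthStatus? = nil

    // Every transition into .running starts a new generation.
    mutating func markRunning() {
        status = .running
        healthGeneration += 1
    }

    // Single mutation entry point. Returns true if the update was applied,
    // false if it was dropped because the container is not running or the
    // callback belongs to an earlier generation.
    mutating func applyHealthUpdate(generation: UInt64, status newHealth: HealthStatus) -> Bool {
        guard status == .running, generation == healthGeneration else {
            return false
        }
        health = newHealth
        return true
    }
}
```

An observer captures the generation it was registered with and passes it back
on every update, so cancellation only has to be cooperative: anything that
slips through after a restart fails the comparison and is discarded.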
Wire compatibility
------------------
`ContainerConfiguration.healthcheck` is a new optional field,
decoded with `decodeIfPresent`. Containers persisted by older
daemons round-trip cleanly (covered by
`testLegacyContainerConfigurationDecodesWithoutHealthcheck`). New
CLI flags are independent and have no effect when omitted, so older
clients hitting a newer daemon and vice versa both behave
identically to today.
Known limitations (intentional, follow-up work)
-----------------------------------------------
- The `--health-cmd` CLI shape currently accepts only the shell
form (translated to `["CMD-SHELL", cmd]`). The richer
`["CMD", "exec", "arg1", ...]` form is reachable via API clients
that build `Healthcheck` directly (e.g. compose orchestrators).
Adding a CLI surface for CMD-form probes is a follow-up.
- Daemon restart does not rehydrate health state. On daemon launch,
observers are restarted from `.starting` rather than persisting
probe counters. Per the design consult this is deliberate scope
for v1.
- Probe intervals use Foundation `TimeInterval` (Double seconds).
Compose-spec duration strings (`30s`, `1m30s`) are parsed by the
client (e.g. container-compose) before reaching the API.
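
For the last limitation, a client-side conversion from a compose-style
duration string to `TimeInterval` might look like the sketch below.
`parseComposeDuration` is a hypothetical helper, not part of this PR or of
container-compose, and the accepted unit set (`h`, `m`, `s`, `ms`, `us`) is an
assumption.

```swift
import Foundation

// Hypothetical helper: converts a duration such as "1m30s" to seconds.
// Returns nil for empty or malformed input.
func parseComposeDuration(_ text: String) -> TimeInterval? {
    // Two-letter units come first so "ms" is not read as "m" followed by "s".
    let units: [(suffix: String, seconds: Double)] = [
        ("ms", 0.001), ("us", 0.000_001), ("h", 3600), ("m", 60), ("s", 1),
    ]
    var rest = Substring(text)
    guard !rest.isEmpty else { return nil }

    var total: TimeInterval = 0
    while !rest.isEmpty {
        // Read the numeric part of the next component.
        let digits = rest.prefix { ("0"..."9").contains($0) || $0 == "." }
        guard let value = Double(digits) else { return nil }
        rest = rest.dropFirst(digits.count)

        // Every number must be followed by a known unit suffix.
        guard let unit = units.first(where: { rest.hasPrefix($0.suffix) }) else {
            return nil
        }
        total += value * unit.seconds
        rest = rest.dropFirst(unit.suffix.count)
    }
    return total
}
```

The result can be passed straight to the `interval`, `timeout`,
`start_period` and `start_interval` fields, which the API keeps as plain
seconds.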
Pairs with CHAOS-1319
---------------------
CHAOS-1319 reserved the SDK shape (`ContainerSnapshot.health`).
This PR is the runtime that populates it, closing the loop for
compose-spec `depends_on.condition: service_healthy` against
container-compose orchestrators. CHAOS-1319's PR (#13) should land
first or be batched with this one.
Verification
------------
- `swift build -c release` clean on macOS 26 / Apple silicon.
- `swift test --filter 'HealthcheckTest|HealthStateMachineTest|
HealthMonitorTest'` passes 26/26: 12 Healthcheck data shape +
Codable + validation, 10 pure HealthStateMachine transitions, 4
HealthMonitor actor lifecycle / cancellation tests.
7cdcd25 to be4aee0
Summary
-------
Implements the full healthcheck observer that populates
`ContainerSnapshot.health` (the read-only field reserved by CHAOS-1319)
by running the configured probe inside the running container,
interpreting exit codes through a Docker-compatible state machine, and
writing the result back through the `ContainersService` actor under a
generation-gated update path.

This is the upstream-shaped staging PR for CHAOS-1381. Together with
CHAOS-1319 (which reserves the SDK shape) it closes the loop for
compose-spec `depends_on.condition: service_healthy` against
compose-spec orchestrators.

Stacking
--------
This branch is based on `feat/chaos-1319-health-status` so the
`HealthStatus` enum and `ContainerSnapshot.health` field exist in the
diff. CHAOS-1319 (PR #13) should land first or be batched with this
one. The combined change is reviewable in a single pass; once #13
merges, this PR rebases cleanly onto `main`.

Architecture (per design consult)
---------------------------------
Key decisions:
- `HealthMonitor` actor (mirrors `ExitMonitor`); keeps
  `ContainersService` from growing further and provides a clean
  cancellation boundary.
- Existing `createProcess` / `startProcess` / `wait` path; no new XPC
  route added. Synthetic process id is
  `__container_healthcheck_<UUID>`. Stdio is intentionally not
  forwarded.
- `wait()` is raced against a per-probe timeout (`Task.sleep`).
  Timed-out probes are killed with SIGKILL before the group drains so
  the `wait()` task can return.
- Each transition into `.running` bumps
  `ContainerState.healthGeneration`. Late callbacks from a previous
  container instance are dropped at `applyHealthUpdate` (gen mismatch
  + status check).

CLI surface
-----------
- `--health-cmd <shell>`: translated to `["CMD-SHELL", cmd]`; runs via
  `/bin/sh -c` inside the container.
- `--health-interval <s>`
- `--health-timeout <s>`
- `--health-retries <n>`
- `--health-start-period <s>`
- `--health-start-interval <s>`
- `--no-healthcheck`: `test=["NONE"]`; bypasses any image-baked
  healthcheck.

The richer `["CMD", "exec", "arg1", ...]` form is reachable via API
clients that build `Healthcheck` directly (e.g. compose orchestrators);
CLI surface for CMD-form probes is follow-up work called out in the
commit.

Wire compatibility
------------------
`ContainerConfiguration.healthcheck` is a new optional field, decoded
with `decodeIfPresent`. Containers persisted by older daemons
round-trip cleanly (covered by
`testLegacyContainerConfigurationDecodesWithoutHealthcheck`). New CLI
flags are independent and have no effect when omitted.

Known limitations (intentional, follow-up work)
-----------------------------------------------
- `--health-cmd` accepts only the shell form. CMD-form CLI surface is
  follow-up.
- Daemon restart does not rehydrate health state: observers are
  restarted from `.starting` rather than persisting counters across
  daemon launches. Deliberate v1 scope per the design consult.
- Probe intervals use `TimeInterval` (Double seconds). Compose-spec
  duration strings ("30s", "1m30s") are parsed by the client.

Verification
------------