feat(api): RestartPolicy + --restart flag (data-shape, upstream-aligned to apple/container#1258)#13
Open
chrisgeo wants to merge 2 commits into
Open
feat(api): RestartPolicy + --restart flag (data-shape, upstream-aligned to apple/container#1258)#13chrisgeo wants to merge 2 commits into
chrisgeo wants to merge 2 commits into
Conversation
chrisgeo
added a commit
that referenced
this pull request
May 24, 2026
Implements the full healthcheck observer that populates
`ContainerSnapshot.health` (the read-only field reserved by CHAOS-1319)
by running the configured probe inside the running container,
interpreting exit codes through a Docker-compatible state machine, and
writing the result back through the `ContainersService` actor under a
generation-gated update path.
Motivation
----------
CHAOS-1319 reserved the SDK shape (`HealthStatus` enum + optional
`health` field on `ContainerSnapshot`) but the daemon never populated
it; the field is always `nil` today, so external orchestrators (the
canonical use case is a compose-spec orchestrator implementing
`depends_on.condition: service_healthy`) can only block on
image-baked healthchecks and only when the underlying runtime owns
the probe loop. Real workloads (databases that take seconds to accept
connections, queue brokers that warm up an in-memory state) need a
container-level healthcheck observer that the daemon owns. This PR
adds it.
What this PR changes
--------------------
- Sources/ContainerResource/Container/Healthcheck.swift (new):
public Codable / Sendable struct mirroring the Docker / compose-spec
schema (`test`, `interval`, `timeout`, `retries`, `start_period`,
`start_interval`, `disable`). Validates the probe shape (`NONE` /
`CMD` / `CMD-SHELL`) and rejects malformed inputs with actionable
error messages.
- Sources/ContainerResource/Container/ContainerConfiguration.swift:
new optional `healthcheck: Healthcheck?` field, `decodeIfPresent`
on the wire so legacy on-disk configurations decode unchanged.
- Sources/Services/ContainerAPIService/Server/Containers/
HealthStateMachine.swift (new): pure value type that maps probe
outcomes to `HealthStatus`. Implements the Docker-compatible flow:
initial `.starting`, immediate transition to `.healthy` on the
first successful probe (including during the `start_period` grace
window), `retries` consecutive failures post-grace transition to
`.unhealthy`, recovery to `.healthy` without restart.
- Sources/Services/ContainerAPIService/Server/Containers/
HealthProber.swift (new): `HealthProber` protocol plus production
`SandboxClientHealthProber` that drives an existing `SandboxClient`
to spawn a fresh `__container_healthcheck_<UUID>` synthetic process
per probe, races `wait()` against a per-probe timeout, and signals
`SIGKILL` on timeout to unblock the synthetic wait task before
draining the task group.
- Sources/Services/ContainerAPIService/Server/Containers/
HealthMonitor.swift (new): per-container observer manager actor
that mirrors `ExitMonitor`. `register(id:generation:startedAt:
healthcheck:prober:onUpdate:)` cancels any prior observer, fires
the initial `.starting` (or `.none` for disabled checks) callback,
and runs the probe loop. `unregister(id:)` is idempotent and
triggers cooperative cancellation.
- Sources/Services/ContainerAPIService/Server/Containers/
ContainersService.swift: new private `healthMonitor: HealthMonitor`
field; new `healthGeneration: UInt64` token on `ContainerState`
bumped on every transition into `.running`; observer registered
inside `startProcess` once the init process is up; unregister wired
into `handleContainerExit`. New private `applyHealthUpdate(id:
generation:status:)` is the single mutation entry; it drops updates
whose generation no longer matches the live container or whose
status is no longer `.running`, closing the late-callback /
restart race.
- Sources/Services/ContainerAPIService/Client/Flags.swift: seven new
flags on `Flags.Management` covering `--health-cmd`,
`--health-interval`, `--health-timeout`, `--health-retries`,
`--health-start-period`, `--health-start-interval`, and
`--no-healthcheck`.
- Sources/Services/ContainerAPIService/Client/Utility.swift: new
private `makeHealthcheck(management:)` that translates the flag
bag into a `Healthcheck`. Rejects orphan `--health-*` flags
without `--health-cmd` to catch typos at submit time.
- Package.swift: `ContainerAPIServiceTests` gains a dependency on
the `ContainerAPIService` target so the new tests can use the
`@testable` import.
- Tests:
- Tests/ContainerResourceTests/HealthcheckTest.swift: 12 tests
covering shape parsing (`CMD` / `CMD-SHELL` / `NONE`), validation
error paths, the `disable` flag, the `probeInterval` selection
rule (start-interval inside the grace window only), and a
legacy-config Codable round-trip regression.
- Tests/ContainerAPIServiceTests/HealthStateMachineTest.swift: 10
tests exercising every transition documented in the design:
initial state, success during grace, failure during grace,
failures past grace toward `retries`, success resets the counter,
`unhealthy` recovers without restart, disabled machine ignores
inputs, retries=0 corner case.
- Tests/ContainerAPIServiceTests/HealthMonitorTest.swift: 4 tests
against a `ScriptedProber` actor (deterministic probe outcomes)
and a `StatusRecorder` (ordered update capture). Covers the
disabled-check single-callback path, the `.starting` -> `.healthy`
transition, the consecutive-failure -> `.unhealthy` path, and
the unregister-cancels-loop guarantee.
Design notes
------------
The implementation follows the architecture recommendation produced
during a design consult (see CHAOS-1381 thread): observer placement
in a dedicated actor (mirroring `ExitMonitor`), probe execution
through the existing `createProcess` / `startProcess` / `wait` path
(no new XPC route added), Docker-compatible state machine semantics,
and generation-gated snapshot updates rather than relying on
cancellation alone to suppress stale callbacks.
Wire compatibility
------------------
`ContainerConfiguration.healthcheck` is a new optional field,
decoded with `decodeIfPresent`. Containers persisted by older
daemons round-trip cleanly (covered by
`testLegacyContainerConfigurationDecodesWithoutHealthcheck`). New
CLI flags are independent and have no effect when omitted, so older
clients hitting a newer daemon and vice versa both behave
identically to today.
Known limitations (intentional, follow-up work)
-----------------------------------------------
- The `--health-cmd` CLI shape currently accepts only the shell
form (translated to `["CMD-SHELL", cmd]`). The richer
`["CMD", "exec", "arg1", ...]` form is reachable via API clients
that build `Healthcheck` directly (e.g. compose orchestrators).
Adding a CLI surface for CMD-form probes is a follow-up.
- Daemon restart does not rehydrate health state. On daemon launch,
observers are restarted from `.starting` rather than persisting
probe counters. Per the design consult this is deliberate scope
for v1.
- Probe intervals use Foundation `TimeInterval` (Double seconds).
Compose-spec duration strings (`30s`, `1m30s`) are parsed by the
client (e.g. container-compose) before reaching the API.
Pairs with CHAOS-1319
---------------------
CHAOS-1319 reserved the SDK shape (`ContainerSnapshot.health`).
This PR is the runtime that populates it, closing the loop for
compose-spec `depends_on.condition: service_healthy` against
container-compose orchestrators. CHAOS-1319's PR
(#13) should land first or be batched with
this one.
Verification
------------
- `swift build -c release` clean on macOS 26 / Apple silicon.
- `swift test --filter 'HealthcheckTest|HealthStateMachineTest|
HealthMonitorTest'` passes 26/26: 12 Healthcheck data shape +
Codable + validation, 10 pure HealthStateMachine transitions, 4
HealthMonitor actor lifecycle / cancellation tests.
Adds a new public 'RestartPolicy' type, a 'restartPolicy' field on
'ContainerCreateOptions', and a '--restart' CLI flag on
'container run'.
Motivation
----------
External orchestrators that drive the API server (the canonical use
case is a Compose-spec orchestrator implementing 'service.restart:
always|on-failure|unless-stopped|no') need to declare a restart
policy at container creation time. Today there is no contract: the
client cannot tell the daemon what to do when a container exits, and
even if the client polls for exits and re-launches itself, that
behavior cannot survive a client restart.
Surfacing the policy at the API boundary lets a future restart
manager honor the policy without another wire-format break.
Scope of this PR (deliberately minimal)
---------------------------------------
This PR is data-shape + CLI parsing only. It adds:
- RestartPolicy { mode: { no, always, on-failure, unless-stopped },
maxRetries }, a Sendable Codable struct.
- 'restartPolicy: RestartPolicy?' optional field on
ContainerCreateOptions, defaulted nil at all existing
construction sites.
- '--restart <policy>' option on Flags.Management.
- 'parseRestartPolicy(_:)' helper on Application.Run that converts
the string spec to RestartPolicy?, with on-failure[:N] support.
It does NOT add a restart manager. The daemon stores the policy on
ContainerCreateOptions and never observes container exits to enforce
it. Wiring the actual observer + re-launch loop is a deferred
follow-up that will land as a separate PR.
Why ship a record-only field?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A restart manager is non-trivial: it needs to subscribe to exit
notifications, track per-container retry counts, distinguish 'we
asked it to stop' from 'it crashed', integrate with auto-remove,
and handle daemon restarts. We would rather have that discussion
separately, against a concrete companion design issue, than couple
it to the SDK shape PR. Reserving the SDK shape now lets downstream
tools start coding against the field with the documented 'not yet
enforced' guarantee inline; flipping the implementation on later
does not require another SDK shape break.
Wire compatibility
------------------
ContainerCreateOptions is marshaled as Codable JSON over XPC. Adding
an optional field is forward-compatible:
- Older clients reading from a newer server: ignore the new key.
- Newer clients reading from an older server: decode restartPolicy
as nil.
Files
-----
- Sources/ContainerResource/Container/RestartPolicy.swift (new):
the RestartPolicy struct + Mode enum, with each case and the
'data-shape only' caveat documented inline.
- Sources/ContainerResource/Container/ContainerCreateOptions.swift:
new optional field + defaulted init parameter; existing
construction sites compile unchanged.
- Sources/Services/ContainerAPIService/Client/Flags.swift:
new '--restart <policy>' Option on Flags.Management.
- Sources/ContainerCommands/Container/ContainerRun.swift:
parseRestartPolicy(_:) helper + invocation in run() before
building ContainerCreateOptions.
Verification
------------
Full 'swift build' clean on macOS 26 / Apple silicon (release config,
all targets including downstream consumers of
ContainerCreateOptions).
While this PR was sitting in draft, apple#1258 (by JaewonHur, reviewed by saehejkang) appeared with a substantial in-flight restart manager implementation. To avoid a future merge conflict on every field of this contract, the fork's SDK shape is reshaped to match apple#1258 verbatim before any reviewer sees this PR. Changes ------- RestartPolicy: struct-with-mode-and-maxRetries -> bare String-backed enum { no, onFailure, always }. Matches apple#1258 line-for-line. Drops the 'unless-stopped' mode and the 'maxRetries' field; both are intentionally deferred so this PR mirrors upstream's deliberately conservative initial scope. Follow-ups will land them after apple#1258 merges (CHAOS-1321c). ContainerCreateOptions: restartPolicy goes from optional to non-optional, defaulting to .no. Custom 'init(from:)' uses 'decodeIfPresent ?? .no' so older 'options.json' blobs (no field) still decode cleanly. This is the forward-compat guarantee the original PR description promised but did not actually implement. Flags.swift: '--restart' goes from 'String?' + a hand-rolled 'parseRestartPolicy' helper to '@option var restart: RestartPolicy = .no'. ArgumentParser handles validation natively via a new 'extension RestartPolicy: ExpressibleByArgument {}'; the parser is deleted entirely. As a side benefit, invalid input now produces ArgumentParser's standard 'Invalid value for --restart' error instead of being silently dropped. ContainerRun.swift / ContainerCreate.swift: thread 'managementFlags .restart' through directly. ContainerCreate was missing the wiring entirely in the original PR (the flag declared but never read on the create path) -- fixed here. Tests/ContainerResourceTests/RestartPolicyTests.swift (new): 8 tests covering bare-string encode/decode, every-case round-trip, unknown- string rejection, and the legacy 'options.json' forward-compat invariant (legacy blob without restartPolicy decodes with restartPolicy == .no). Verification ------------ swift build -> Build complete! (21.99s), exit 0. swift test --filter RestartPolicyTests -> 8 tests passed in 0.001s. LSP diagnostics -> clean across all 6 files. Upstream alignment matrix (this PR vs apple#1258) ----------------------------------------------------------- RestartPolicy shape bare enum match unless-stopped mode absent match on-failure:N retries absent match restartPolicy optionality non-optional match decodeIfPresent default .no match Flag binding ExpressibleByArg match RuntimeStatus -> ContainerStatus rename NOT in scope defer The RuntimeStatus rename and the .restarting/.bootstrapped state additions stay deferred -- they belong to the daemon-side enforcement work, which is what apple#1258 actually does. This PR remains data-shape only on the fork side; enforcement arrives when apple#1258 lands upstream and the fork bumps its sync. Refs CHAOS-1321, CHAOS-1385. Sponsors apple#1258, apple#286.
4bccd8e to
43ba70f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Staging branch — fork-internal review before any apple/container upstream filing.
Linear: CHAOS-1321
Upstream: apple/container#1258 (in-flight restart manager) · apple/container#286 (original FR)
Type of Change
Motivation
External orchestrators that drive the API server (canonical use case: a Compose-spec orchestrator implementing
service.restart: always|on-failure|unless-stopped|no) need a way to declare a restart policy at container creation time. Today there is no API contract — the client cannot tell the daemon what to do when a container exits, and even client-side polling cannot survive a client restart.This PR adds the SDK shape (type + flag + option field) so downstream consumers can target a stable contract while daemon-side enforcement lands in apple/container#1258 (in-flight upstream PR by @JaewonHur, reviewed by @saehejkang).
Scope: data-shape + CLI binding only, deliberately upstream-aligned
The shape exactly matches the open upstream PR apple#1258, so when apple#1258 merges, this fork can converge with zero diff in the SDK layer.
Sources/ContainerResource/Container/RestartPolicy.swift(new)public enum RestartPolicy: String, Sendable, Codable, Equatable, CaseIterable { case no, onFailure, always }— bare String-backed enum, matches apple#1258 verbatim.Sources/ContainerResource/Container/ContainerCreateOptions.swiftrestartPolicy: RestartPolicydefaulting to.no. Custominit(from:)usesdecodeIfPresent ?? .nosooptions.jsonblobs written by older daemons still decode.Sources/Services/ContainerAPIService/Client/Flags.swift@Option public var restart: RestartPolicy = .noonFlags.Management, plusextension RestartPolicy: ExpressibleByArgument {}at file scope. ArgumentParser handles validation natively.Sources/ContainerCommands/Container/ContainerRun.swiftmanagementFlags.restartintoContainerCreateOptions(no parsing helper —ExpressibleByArgumentcovers it).Sources/ContainerCommands/Container/ContainerCreate.swiftcreatepath (parity withrun).Tests/ContainerResourceTests/RestartPolicyTests.swift(new)restartPolicy→.no), encode shape,.defaultstatic.Total: 6 files (2 new + 4 modified), +148/−51.
What this PR does NOT include — and why
unless-stoppedmodeon-failure:Nbounded retriesRuntimeStatus → ContainerStatusrename +.restartingstateWire compatibility
ContainerCreateOptionsis marshaled asCodableJSON over XPC. Additive change withdecodeIfPresentdefault:decodeIfPresent(... ) ?? .no→ field present with.nopolicy. Covered bytestDecodesLegacyOptionsWithoutRestartPolicy.Testing
swift buildclean:Build complete! (21.99s)on macOS 26 / Apple silicon (release config, all targets).swift test --filter RestartPolicyTests: 8/8 pass in 0.001s.Strategy / sequencing
unless-stoppedandon-failure:Nupstream as additive extensions after Add support for container restart policy apple/container#1258 stabilizes.container-composealready gates--restartemission behind a runtime capability probe (container-compose#151, merged) — it will auto-resume emission once apple/container ships Add support for container restart policy apple/container#1258. No compose change required from this PR.Previous attempt
PR #6 (
tier2-fork-patchesbranch, merged 2026-04-29) bundled five Tier-2 tickets including a struct-shapedRestartPolicyfor CHAOS-1321. PR #13 is the clean-room re-slice of just the CHAOS-1321 portion against currentmain, and this commit on top reshapes the type to match upstream apple#1258 — so a future fork->upstream rebase is a no-op on the SDK files.