fix(onboard): recover openshell gateway bootstrap startup#1824
fix(onboard): recover openshell gateway bootstrap startup#1824cv merged 3 commits intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds gateway bootstrap secret discovery and repair, container inspection and exec helpers, dynamic gateway health-poll wait calculation, in-cluster repair and metadata reattachment during gateway start/recovery flows, exports new helpers, and adds Vitest coverage for wait config and bootstrap script generation. Changes
Sequence DiagramsequenceDiagram
participant Caller as Caller
participant Onboard as Onboard
participant Docker as DockerContainer
participant Health as ClusterHealthcheck
participant Repair as RepairScript
participant CLI as OpenShellCLI
Caller->>Onboard: startGatewayWithOptions / recoverGatewayRuntime
Onboard->>Docker: getGatewayClusterContainerState()
Docker-->>Onboard: containerState
Onboard->>Onboard: getGatewayHealthWaitConfig(startExitCode, containerState)
loop polling iterations
Onboard->>Docker: listMissingGatewayBootstrapSecrets()
Docker-->>Onboard: missingSecrets[]
Onboard->>Onboard: getGatewayBootstrapRepairPlan(missingSecrets)
alt needsRepair
Onboard->>Repair: buildGatewayBootstrapSecretsScript(plan)
Repair-->>Onboard: script
Onboard->>Docker: runGatewayCluster / runGatewayClusterCapture (exec script)
Docker-->>Onboard: exec result
end
Onboard->>Health: gatewayClusterHealthcheckPassed()
Health-->>Onboard: healthy / unhealthy
alt healthy and metadata stale
Onboard->>CLI: attachGatewayMetadataIfNeeded(--local / --force)
CLI-->>Onboard: attach result
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
1832-1845: Consider adding a clarifying comment for the early return.The logic returns early if
hasStaleGateway(gwInfo)is truthy, meaning metadata already exists (even if the container is gone). This is correct—we only need to add metadata when it's completely missing—but the function name and condition could be clearer.💡 Optional: Add a clarifying comment
function attachGatewayMetadataIfNeeded() { const gwInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], { ignoreError: true }); + // If gateway metadata already exists (even if stale), skip re-adding. if (hasStaleGateway(gwInfo)) return true;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 1832 - 1845, The early return in attachGatewayMetadataIfNeeded when hasStaleGateway(gwInfo) is truthy isn't obvious from the function name; add a brief clarifying comment above that line (inside attachGatewayMetadataIfNeeded, near the call to runCaptureOpenshell and the hasStaleGateway check) explaining that a truthy hasStaleGateway means gateway metadata already exists (even if the container is gone) so we should not reattach and therefore return true early; reference the functions runCaptureOpenshell, hasStaleGateway, and runOpenshell in the comment to make intent explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 1832-1845: The early return in attachGatewayMetadataIfNeeded when
hasStaleGateway(gwInfo) is truthy isn't obvious from the function name; add a
brief clarifying comment above that line (inside attachGatewayMetadataIfNeeded,
near the call to runCaptureOpenshell and the hasStaleGateway check) explaining
that a truthy hasStaleGateway means gateway metadata already exists (even if the
container is gone) so we should not reattach and therefore return true early;
reference the functions runCaptureOpenshell, hasStaleGateway, and runOpenshell
in the comment to make intent explicit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9f10324f-afa4-4a93-9528-7c6d0972cbaf
📒 Files selected for processing (2)
src/lib/onboard.tstest/gateway-start-wait.test.ts
|
✨ Thanks for submitting this PR, which proposes a fix for a bug with the OpenShell gateway startup during NemoClaw onboarding. |
prekshivyas
left a comment
There was a problem hiding this comment.
Good recovery logic — self-heals missing bootstrap secrets during first-time startup instead of tearing down. Tests cover wait config, repair planning, and script generation.
One issue: GATEWAY_LOCAL_ENDPOINT hardcodes port 8080 but GATEWAY_PORT (from ports.ts) is already imported and supports NEMOCLAW_GATEWAY_PORT overrides. Should be:
const GATEWAY_LOCAL_ENDPOINT = `https://127.0.0.1:${GATEWAY_PORT}`;Otherwise metadata reattachment breaks for users who override the port.
CI needs rebase. @hungryboy1025 can you fix the port and rebase onto main?
17c2c76 to
59d1ea9
Compare
@prekshivyas sure. I updated the local gateway endpoint to use GATEWAY_PORT, added a regression test for the port override, and rebased the branch onto latest main. |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2719-2722: The repairGatewayBootstrapSecrets() call may recreate
mTLS materials but its result is ignored, leaving the host gateway metadata
stale; change repairGatewayBootstrapSecrets() to return a success/changed
boolean (or otherwise detect whether secrets were recreated) and, after calling
it, if it indicates changes then ensure attachGatewayMetadataIfNeeded() is
invoked unconditionally or with a force/refresh flag so metadata is reattached
even if gatewayClusterHealthcheckPassed() would otherwise gate it; apply the
same fix to the other identical call site using the same functions
(repairGatewayBootstrapSecrets, gatewayClusterHealthcheckPassed,
attachGatewayMetadataIfNeeded).
- Around line 2094-2106: The getGatewayBootstrapRepairPlan function should
ignore stray stderr/command text by filtering normalized secret names against
the canonical list GATEWAY_BOOTSTRAP_SECRET_NAMES: build a Set from
GATEWAY_BOOTSTRAP_SECRET_NAMES and replace normalized with normalized.filter(n
=> allowed.has(n)) (or compute a new filtered array) before creating the missing
Set and computing needsClientBundle/needsHandshake/needsServerTls; ensure you
still trim and dedupe first, then intersect with the allowed names and return
that filtered list as missingSecrets so only known secret names keep the repair
path active.
- Around line 2068-2083: The decision to use the extended wait currently only
checks startStatus and thus treats any non-zero exit as "wait long" even when
the container is gone, and treats zero exits as "short wait" even when the
container is present-but-not-ready; change getGatewayHealthWaitConfig to drive
useExtendedWait from the normalized containerState instead: after computing
normalizedState (from containerState), set useExtendedWait = normalizedState !==
"missing" (i.e., container present) and then choose count/interval/extended
based on that flag so live-but-unready containers get the extended wait and
absent containers always get the short wait.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: e08901d8-ac86-43f6-aeb1-8b412b0ff60a
📒 Files selected for processing (2)
src/lib/onboard.tstest/gateway-start-wait.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- test/gateway-start-wait.test.ts
Signed-off-by: hungryboy1025 <gulaer44@gmail.com> Signed-off-by: shangyu.li <gulaer44@gmail.com>
59d1ea9 to
c286ca0
Compare
cv
left a comment
There was a problem hiding this comment.
Merged latest main into this branch, resolved the onboard.ts conflict, incorporated the remaining gateway bootstrap review fixes, and re-ran the focused gateway startup wait test.
## Summary Bumps the published doc version to `0.0.22` and documents the user-visible CLI behavior changes to `nemoclaw <name> connect` that landed since v0.0.21. Drafted via the `nemoclaw-contributor-update-docs` skill against commits in `v0.0.21..origin/main`, filtered through `docs/.docs-skip`. ## Changes - **`docs/project.json`** and **`docs/versions1.json`**: bump the published version from `0.0.20` to `0.0.22`; insert a `0.0.21` entry into the version list so the history stays contiguous. - **`docs/reference/commands.md`** → `nemoclaw <name> connect`: document two new behaviors. - Readiness poll with `NEMOCLAW_CONNECT_TIMEOUT` (integer seconds; default `120`) that replaces the silent hang when the sandbox is not yet `Ready` — right after onboarding, while the 2.4 GB image is still pulling (#466). - Post-connect hint is now agent-aware, names the correct TUI command for the sandbox's agent, and tells you to use `/exit` to leave the chat before `exit` returns you to the host shell (#2080). Feature PRs that shipped their own docs in the same commit are intentionally not re-documented here: - `channels list/add/remove` (#2139) — command reference and the "`openclaw channels` blocked inside the sandbox" troubleshooting entry landed with the feature. - `nemoclaw gc` (#2176) — documented as part of the destroy/rebuild image cleanup PR. Skipped per `docs/.docs-skip`: - `e6bad533 fix(shields): verify config lock and fail hard on re-lock failure (#2066)` — matched `skip-features: src/lib/shields.ts`. Other commits in the range (#2141 OpenShell version bump, #1819 plugin banner live inference probe, #2085 / #2146 Slack Socket Mode fixes, #2110 axios proxy fix, #1818 NIM curl timeouts, #1824 onboard gateway bootstrap recovery, and assorted CI / test / install plumbing) are internal behavior refinements with no doc-relevant surface change. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes for the modified files via the pre-commit hook, including `Regenerate agent skills from docs` (source ↔ generated parity confirmed) - [ ] `npm test` passes — skipped; the one pre-existing `test/cli.test.ts > unknown command exits 1` failure on `origin/main` is unrelated to these markdown/JSON-only changes - [ ] Tests added or updated for new or changed behavior — n/a, doc-only - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) — not run locally - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) — n/a, no new pages ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * `connect` now displays the sandbox phase while waiting for readiness and honors a configurable timeout via NEMOCLAW_CONNECT_TIMEOUT (default 120s). * TTY hints are agent-aware and instruct using `/exit` before returning to the host shell. * **Documentation** * Command docs updated to describe polling, timeout, and TTY guidance. * Project/docs metadata updated for versions 0.0.21 and 0.0.22 (package version bumped to 0.0.22). <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fix NemoClaw onboarding when
openshell gateway start --name nemoclawreturns before the embedded cluster is actually healthy.The failure mode reproduced during
./install.shatStarting OpenShell gateway: the OpenShell chart referenced bootstrap TLS/handshake secrets that were not present yet, which leftopenshell-0blocked in Kubernetes and caused onboarding to tear the gateway down too early.Changes
openshell gateway startexits non-zero, instead of immediately treating that as a hard failureopenshell-cluster-nemoclawcontainer during startup to detect live-but-not-ready cluster statesopenshell-server-tlsopenshell-server-client-caopenshell-client-tlsopenshell-ssh-handshakeRoot Cause
During first-time startup, the embedded K3s cluster can still be converging after
openshell gateway startexits. In the reproduced failure, the OpenShell statefulset was created, but required bootstrap secrets were missing, soopenshell-0could not become ready. NemoClaw then used a short health wait and destroyed the gateway before the cluster had a chance to recover.This PR makes onboarding resilient to that startup race and self-heals the missing bootstrap secret condition that previously made installation fail.
Type of Change
Testing
npx prek run --all-filespasses (or equivalentlymake check).npm testpasses.make docsbuilds without warnings. (for doc-only changes)npm run build:cli./node_modules/.bin/vitest run test/gateway-start-wait.test.ts./install.shfailure locally and verified onboarding now passesStarting OpenShell gatewaydocker inspect openshell-cluster-nemoclawreportsrunning healthyopenshell statusreports thenemoclawgateway asConnectedNotes
npm testis currently failing in this workspace due to unrelated pre-existing test failures outside this change set.Checklist
General
Code Changes
npx prek run --all-filesauto-fixes formatting (ormake formatfor targeted runs).Doc Changes
Signed-off-by: shangyu.li gulaer44@gmail.com
Summary by CodeRabbit
New Features
Tests