fix(onboard): recover openshell gateway bootstrap startup #1824

Merged
cv merged 3 commits into NVIDIA:main from hungryboy1025:fix/openshell-gateway-bootstrap-start
Apr 21, 2026

Conversation

@hungryboy1025
Contributor

@hungryboy1025 hungryboy1025 commented Apr 13, 2026

Summary

Fix NemoClaw onboarding when openshell gateway start --name nemoclaw returns before the embedded cluster is actually healthy.

The failure reproduced during ./install.sh at the Starting OpenShell gateway step: the OpenShell chart referenced bootstrap TLS and handshake secrets that did not exist yet, which left the openshell-0 pod blocked in Kubernetes and caused onboarding to tear down the gateway too early.

Changes

  • extend gateway health polling whenever openshell gateway start exits non-zero, instead of immediately treating that as a hard failure
  • inspect the embedded openshell-cluster-nemoclaw container during startup to detect live-but-not-ready cluster states
  • automatically repair missing OpenShell bootstrap secrets inside the cluster:
    • openshell-server-tls
    • openshell-server-client-ca
    • openshell-client-tls
    • openshell-ssh-handshake
  • generate the client CA and client certificate bundle together so the gateway server and client secrets stay consistent
  • reattach local OpenShell gateway metadata once the cluster healthcheck passes
  • add regression coverage for the extended wait behavior and bootstrap secret repair planning
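
The secret handling in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from `src/lib/onboard.ts`: the four secret names come from this description, while `planBootstrapRepair`, the `RepairPlan` shape, and the exact grouping rules are assumptions made for the example.

```typescript
// Secret names are taken from the PR description; everything else is illustrative.
const BOOTSTRAP_SECRET_NAMES = [
  "openshell-server-tls",
  "openshell-server-client-ca",
  "openshell-client-tls",
  "openshell-ssh-handshake",
] as const;

// Hypothetical plan shape for this sketch.
interface RepairPlan {
  missingSecrets: string[];
  needsServerTls: boolean;
  needsClientBundle: boolean;
  needsHandshake: boolean;
  needsRepair: boolean;
}

// Hypothetical helper: turn a raw list of reported names into a repair plan.
function planBootstrapRepair(reportedMissing: string[]): RepairPlan {
  const allowed = new Set<string>(BOOTSTRAP_SECRET_NAMES);
  // Trim and dedupe first, then keep only known names so stray command
  // output cannot trigger a repair.
  const deduped = Array.from(new Set(reportedMissing.map((name) => name.trim())));
  const missingSecrets = deduped.filter((name) => allowed.has(name));
  const missing = new Set(missingSecrets);

  return {
    missingSecrets,
    needsServerTls: missing.has("openshell-server-tls"),
    // Assumption: the client CA and client certificate must match, so losing
    // either one regenerates both as a bundle.
    needsClientBundle:
      missing.has("openshell-server-client-ca") || missing.has("openshell-client-tls"),
    needsHandshake: missing.has("openshell-ssh-handshake"),
    needsRepair: missingSecrets.length > 0,
  };
}

const plan = planBootstrapRepair([
  " openshell-client-tls ",
  "openshell-client-tls",
  "Error from server",
]);
console.log(plan.missingSecrets, plan.needsClientBundle);
```

Grouping the client CA with the client certificate reflects the consistency point above: regenerating only one half would leave the server trusting a CA that did not sign the client certificate.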

Root Cause

During first-time startup, the embedded K3s cluster can still be converging after openshell gateway start exits. In the reproduced failure, the OpenShell statefulset was created, but required bootstrap secrets were missing, so openshell-0 could not become ready. NemoClaw then used a short health wait and destroyed the gateway before the cluster had a chance to recover.

This PR makes onboarding resilient to that startup race and self-heals the missing bootstrap secret condition that previously made installation fail.
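
The startup race described above comes down to how long onboarding keeps probing before it gives up. A minimal sketch of a repair-then-probe loop is shown below, assuming synchronous helpers. `pollUntilHealthy`, `WaitConfig`, and `PollHooks` are hypothetical names for illustration and do not appear in the NemoClaw source; the attempt count and interval are placeholders.

```typescript
// Hypothetical shapes: none of these names are taken from the NemoClaw source.
interface WaitConfig {
  attempts: number; // maximum number of health probes
  intervalMs: number; // pause between probes
}

interface PollHooks {
  repair: () => void; // e.g. recreate missing bootstrap secrets
  isHealthy: () => boolean; // e.g. wrap a cluster healthcheck
  sleep: (ms: number) => void; // injected so the loop can be tested without timers
}

// On each iteration: attempt a repair, then probe. Returns true as soon as a
// probe succeeds, or false once every attempt has been used.
function pollUntilHealthy(config: WaitConfig, hooks: PollHooks): boolean {
  for (let attempt = 0; attempt < config.attempts; attempt++) {
    hooks.repair();
    if (hooks.isHealthy()) return true;
    // No need to wait after the final failed probe.
    if (attempt < config.attempts - 1) hooks.sleep(config.intervalMs);
  }
  return false;
}

// Usage: a fake cluster that becomes healthy on the third probe.
let probes = 0;
const sleeps: number[] = [];
const recovered = pollUntilHealthy(
  { attempts: 5, intervalMs: 2000 },
  {
    repair: () => {},
    isHealthy: () => ++probes >= 3,
    sleep: (ms) => sleeps.push(ms),
  },
);
console.log(recovered, probes, sleeps.length);
```

With a short wait (for example two attempts), the same fake cluster would be reported as failed even though it recovers one probe later, which mirrors the premature teardown described above.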

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only. Prose changes without code sample modifications.
  • Doc only. Includes code sample changes.

Testing

  • npx prek run --all-files passes (or equivalently make check).
  • npm test passes.
  • make docs builds without warnings. (for doc-only changes)
  • npm run build:cli
  • ./node_modules/.bin/vitest run test/gateway-start-wait.test.ts
  • Reproduced ./install.sh failure locally and verified onboarding now passes Starting OpenShell gateway
  • Verified docker inspect openshell-cluster-nemoclaw reports running healthy
  • Verified openshell status reports the nemoclaw gateway as Connected

Notes

  • Full npm test is currently failing in this workspace due to unrelated pre-existing test failures outside this change set.
  • No doc page was updated because this fixes startup robustness without changing documented user workflows or CLI semantics.

Checklist

General

  • I have read and followed the contributing guide.
  • I have read and followed the style guide. (for doc-only changes)

Code Changes

  • Formatters applied — npx prek run --all-files auto-fixes formatting (or make format for targeted runs).
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Doc pages updated for any user-facing behavior changes (new commands, changed defaults, new features, bug fixes that contradict existing docs).

Doc Changes

  • Follows the style guide.
  • New pages include SPDX license header and frontmatter, if creating a new page.
  • Cross-references and links verified.

Signed-off-by: shangyu.li <gulaer44@gmail.com>

Summary by CodeRabbit

  • New Features

    • Gateway startup/recovery now auto-detects and repairs missing bootstrap secrets, reattaches gateway metadata when healthy, and runs repairs during health polling.
    • Adaptive health-poll behavior: longer waits for slow or atypical container states, shorter waits after successful starts.
  • Tests

    • Added tests covering startup/wait scenarios, extended polling logic, local endpoint resolution, and bootstrap-secret repair/script generation.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds gateway bootstrap secret discovery and repair, container inspection and exec helpers, dynamic gateway health-poll wait calculation, in-cluster repair and metadata reattachment during gateway start/recovery flows, exports new helpers, and adds Vitest coverage for wait config and bootstrap script generation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Gateway Onboard Logic**<br>src/lib/onboard.ts | Adds gateway bootstrap secret allowlist and helpers: container name/state (getGatewayClusterContainerName, getGatewayClusterContainerState), health wait config (getGatewayHealthWaitConfig), missing-secret discovery/plan (listMissingGatewayBootstrapSecrets, getGatewayBootstrapRepairPlan), bootstrap script generation (buildGatewayBootstrapSecretsScript), in-container exec helpers (runGatewayCluster, runGatewayClusterCapture), in-cluster healthcheck/repair/reattach flows, integrates these into startGatewayWithOptions() and recoverGatewayRuntime(), and exports new functions. |
| **Tests: Start/Wait Behavior**<br>test/gateway-start-wait.test.ts | Adds Vitest suite that loads built module dynamically, isolates process.env and require cache, and verifies getGatewayHealthWaitConfig across start/exit and container states, getGatewayLocalEndpoint, getGatewayBootstrapRepairPlan, and buildGatewayBootstrapSecretsScript output and normalization. |

Sequence Diagram

sequenceDiagram
    participant Caller as Caller
    participant Onboard as Onboard
    participant Docker as DockerContainer
    participant Health as ClusterHealthcheck
    participant Repair as RepairScript
    participant CLI as OpenShellCLI

    Caller->>Onboard: startGatewayWithOptions / recoverGatewayRuntime
    Onboard->>Docker: getGatewayClusterContainerState()
    Docker-->>Onboard: containerState
    Onboard->>Onboard: getGatewayHealthWaitConfig(startExitCode, containerState)

    loop polling iterations
        Onboard->>Docker: listMissingGatewayBootstrapSecrets()
        Docker-->>Onboard: missingSecrets[]
        Onboard->>Onboard: getGatewayBootstrapRepairPlan(missingSecrets)
        alt needsRepair
            Onboard->>Repair: buildGatewayBootstrapSecretsScript(plan)
            Repair-->>Onboard: script
            Onboard->>Docker: runGatewayCluster / runGatewayClusterCapture (exec script)
            Docker-->>Onboard: exec result
        end
        Onboard->>Health: gatewayClusterHealthcheckPassed()
        Health-->>Onboard: healthy / unhealthy
        alt healthy and metadata stale
            Onboard->>CLI: attachGatewayMetadataIfNeeded(--local / --force)
            CLI-->>Onboard: attach result
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through secrets, mended certs by lantern-light,
Inspected containers, nudged healthchecks into sight,
I stitched the scripts and whispered endpoints true,
Reattached the memory — a tiny rabbit’s do! 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(onboard): recover openshell gateway bootstrap startup' directly reflects the main change: adding gateway bootstrap secret recovery and extended health polling during OpenShell gateway startup to prevent premature teardown due to incomplete K3s cluster initialization. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

1832-1845: Consider adding a clarifying comment for the early return.

The logic returns early if hasStaleGateway(gwInfo) is truthy, meaning metadata already exists (even if the container is gone). This is correct—we only need to add metadata when it's completely missing—but the function name and condition could be clearer.

💡 Optional: Add a clarifying comment
 function attachGatewayMetadataIfNeeded() {
   const gwInfo = runCaptureOpenshell(["gateway", "info", "-g", GATEWAY_NAME], { ignoreError: true });
+  // If gateway metadata already exists (even if stale), skip re-adding.
   if (hasStaleGateway(gwInfo)) return true;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 1832 - 1845, The early return in
attachGatewayMetadataIfNeeded when hasStaleGateway(gwInfo) is truthy isn't
obvious from the function name; add a brief clarifying comment above that line
(inside attachGatewayMetadataIfNeeded, near the call to runCaptureOpenshell and
the hasStaleGateway check) explaining that a truthy hasStaleGateway means
gateway metadata already exists (even if the container is gone) so we should not
reattach and therefore return true early; reference the functions
runCaptureOpenshell, hasStaleGateway, and runOpenshell in the comment to make
intent explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 1832-1845: The early return in attachGatewayMetadataIfNeeded when
hasStaleGateway(gwInfo) is truthy isn't obvious from the function name; add a
brief clarifying comment above that line (inside attachGatewayMetadataIfNeeded,
near the call to runCaptureOpenshell and the hasStaleGateway check) explaining
that a truthy hasStaleGateway means gateway metadata already exists (even if the
container is gone) so we should not reattach and therefore return true early;
reference the functions runCaptureOpenshell, hasStaleGateway, and runOpenshell
in the comment to make intent explicit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9f10324f-afa4-4a93-9528-7c6d0972cbaf

📥 Commits

Reviewing files that changed from the base of the PR and between d4aac4c and ac73184.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/gateway-start-wait.test.ts

@wscurran added the bug (Something isn't working) and NemoClaw CLI (Use this label to identify issues with the NemoClaw command-line interface (CLI).) labels Apr 13, 2026
@wscurran
Contributor

✨ Thanks for submitting this PR, which proposes a fix for a bug with the OpenShell gateway startup during NemoClaw onboarding.

@wscurran added the OpenShell (Support for OpenShell, a safe, private runtime for autonomous AI agents) label Apr 13, 2026
@cv cv added the v0.0.18 Release target label Apr 16, 2026
@prekshivyas prekshivyas self-assigned this Apr 16, 2026
Contributor

@prekshivyas prekshivyas left a comment

Good recovery logic — self-heals missing bootstrap secrets during first-time startup instead of tearing down. Tests cover wait config, repair planning, and script generation.

One issue: GATEWAY_LOCAL_ENDPOINT hardcodes port 8080 but GATEWAY_PORT (from ports.ts) is already imported and supports NEMOCLAW_GATEWAY_PORT overrides. Should be:

const GATEWAY_LOCAL_ENDPOINT = `https://127.0.0.1:${GATEWAY_PORT}`;

Otherwise metadata reattachment breaks for users who override the port.

CI needs rebase. @hungryboy1025 can you fix the port and rebase onto main?
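
The port concern above can be illustrated with a small sketch. This is not the contents of `ports.ts`: `resolveGatewayPort` and `buildLocalEndpoint` are hypothetical names, the validation rules are assumptions, and `8080` is used as the default only because the comment above names it as the hardcoded value.

```typescript
// Assumption: 8080 is the default, as implied by the review comment.
const DEFAULT_GATEWAY_PORT = 8080;

// Hypothetical helper: read the override and fall back to the default when it
// is unset or not a valid TCP port. The environment is passed in explicitly so
// the function stays pure and easy to test.
function resolveGatewayPort(env: Record<string, string | undefined>): number {
  const raw = env["NEMOCLAW_GATEWAY_PORT"];
  if (raw === undefined) return DEFAULT_GATEWAY_PORT;
  const parsed = Number(raw.trim());
  const valid = Number.isInteger(parsed) && parsed >= 1 && parsed <= 65535;
  return valid ? parsed : DEFAULT_GATEWAY_PORT;
}

// Hypothetical helper: derive the local endpoint from the resolved port
// instead of embedding a literal port in the URL.
function buildLocalEndpoint(env: Record<string, string | undefined>): string {
  return `https://127.0.0.1:${resolveGatewayPort(env)}`;
}

const defaultEndpoint = buildLocalEndpoint({});
const overriddenEndpoint = buildLocalEndpoint({ NEMOCLAW_GATEWAY_PORT: "9443" });
console.log(defaultEndpoint, overriddenEndpoint);
```

Deriving the URL from a single resolved port keeps metadata reattachment pointed at the same port the gateway actually listens on when the override is set.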

@ericksoa ericksoa added v0.0.19 Release target and removed v0.0.18 Release target labels Apr 17, 2026
@hungryboy1025 hungryboy1025 force-pushed the fix/openshell-gateway-bootstrap-start branch from 17c2c76 to 59d1ea9 Compare April 17, 2026 03:35
@hungryboy1025
Contributor Author

hungryboy1025 commented Apr 17, 2026

@prekshivyas sure. I updated the local gateway endpoint to use GATEWAY_PORT, added a regression test for the port override, and rebased the branch onto latest main.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2719-2722: The repairGatewayBootstrapSecrets() call may recreate
mTLS materials but its result is ignored, leaving the host gateway metadata
stale; change repairGatewayBootstrapSecrets() to return a success/changed
boolean (or otherwise detect whether secrets were recreated) and, after calling
it, if it indicates changes then ensure attachGatewayMetadataIfNeeded() is
invoked unconditionally or with a force/refresh flag so metadata is reattached
even if gatewayClusterHealthcheckPassed() would otherwise gate it; apply the
same fix to the other identical call site using the same functions
(repairGatewayBootstrapSecrets, gatewayClusterHealthcheckPassed,
attachGatewayMetadataIfNeeded).
- Around line 2094-2106: The getGatewayBootstrapRepairPlan function should
ignore stray stderr/command text by filtering normalized secret names against
the canonical list GATEWAY_BOOTSTRAP_SECRET_NAMES: build a Set from
GATEWAY_BOOTSTRAP_SECRET_NAMES and replace normalized with normalized.filter(n
=> allowed.has(n)) (or compute a new filtered array) before creating the missing
Set and computing needsClientBundle/needsHandshake/needsServerTls; ensure you
still trim and dedupe first, then intersect with the allowed names and return
that filtered list as missingSecrets so only known secret names keep the repair
path active.
- Around line 2068-2083: The decision to use the extended wait currently only
checks startStatus and thus treats any non-zero exit as "wait long" even when
the container is gone, and treats zero exits as "short wait" even when the
container is present-but-not-ready; change getGatewayHealthWaitConfig to drive
useExtendedWait from the normalized containerState instead: after computing
normalizedState (from containerState), set useExtendedWait = normalizedState !==
"missing" (i.e., container present) and then choose count/interval/extended
based on that flag so live-but-unready containers get the extended wait and
absent containers always get the short wait.
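
The third suggestion above, driving the wait from container state rather than the start exit code, might look something like the sketch below. It is an illustration only: `healthWaitConfigFor` is a hypothetical name, the probe counts and intervals are placeholder values, and treating an empty or absent state as `"missing"` is an assumption about how the state is normalized.

```typescript
// Hypothetical result shape for this sketch.
interface HealthWaitConfig {
  count: number; // number of health probes
  intervalSeconds: number; // pause between probes
  extended: boolean; // whether the longer wait was selected
}

// Placeholder values; the real counts and intervals are not shown in this thread.
const SHORT_WAIT = { count: 5, intervalSeconds: 2 };
const EXTENDED_WAIT = { count: 60, intervalSeconds: 5 };

// Hypothetical helper: a container that exists in any state gets the extended
// wait, and a missing container always gets the short one.
function healthWaitConfigFor(containerState: string | null | undefined): HealthWaitConfig {
  // Assumption: an empty or absent state means the container does not exist.
  const normalizedState = (containerState ?? "").trim().toLowerCase() || "missing";
  const extended = normalizedState !== "missing";
  const base = extended ? EXTENDED_WAIT : SHORT_WAIT;
  return { count: base.count, intervalSeconds: base.intervalSeconds, extended };
}

const live = healthWaitConfigFor("running");
const gone = healthWaitConfigFor("");
console.log(live, gone);
```

Under this rule a non-zero start exit no longer buys a long wait when there is nothing left to wait for, and a zero exit no longer cuts the wait short while a live container is still converging.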

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: e08901d8-ac86-43f6-aeb1-8b412b0ff60a

📥 Commits

Reviewing files that changed from the base of the PR and between 17c2c76 and 59d1ea9.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/gateway-start-wait.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/gateway-start-wait.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts
Comment thread src/lib/onboard.ts Outdated
Signed-off-by: hungryboy1025 <gulaer44@gmail.com>
Signed-off-by: shangyu.li <gulaer44@gmail.com>
@hungryboy1025 hungryboy1025 force-pushed the fix/openshell-gateway-bootstrap-start branch from 59d1ea9 to c286ca0 Compare April 17, 2026 03:55
@ericksoa ericksoa added v0.0.20 Release target and removed v0.0.19 Release target labels Apr 18, 2026
@cv cv added v0.0.21 Release target v0.0.22 Release target and removed v0.0.20 Release target v0.0.21 Release target labels Apr 20, 2026

Contributor

@cv cv left a comment

Merged latest main into this branch, resolved the onboard.ts conflict, incorporated the remaining gateway bootstrap review fixes, and re-ran the focused gateway startup wait test.

@cv cv merged commit 360583d into NVIDIA:main Apr 21, 2026
@miyoungc miyoungc mentioned this pull request Apr 22, 2026
13 tasks
miyoungc added a commit that referenced this pull request Apr 22, 2026
## Summary

Bumps the published doc version to `0.0.22` and documents the
user-visible CLI behavior changes to `nemoclaw <name> connect` that
landed since v0.0.21. Drafted via the `nemoclaw-contributor-update-docs`
skill against commits in `v0.0.21..origin/main`, filtered through
`docs/.docs-skip`.

## Changes

- **`docs/project.json`** and **`docs/versions1.json`**: bump the
published version from `0.0.20` to `0.0.22`; insert a `0.0.21` entry
into the version list so the history stays contiguous.
- **`docs/reference/commands.md`** → `nemoclaw <name> connect`: document
  two new behaviors.
  - Readiness poll with `NEMOCLAW_CONNECT_TIMEOUT` (integer seconds;
    default `120`) that replaces the silent hang when the sandbox is not
    yet `Ready`, such as right after onboarding while the 2.4 GB image is
    still pulling (#466).
  - Post-connect hint is now agent-aware, names the correct TUI command
    for the sandbox's agent, and tells you to use `/exit` to leave the
    chat before `exit` returns you to the host shell (#2080).

Feature PRs that shipped their own docs in the same commit are
intentionally not re-documented here:

- `channels list/add/remove` (#2139) — command reference and the
"`openclaw channels` blocked inside the sandbox" troubleshooting entry
landed with the feature.
- `nemoclaw gc` (#2176) — documented as part of the destroy/rebuild
image cleanup PR.

Skipped per `docs/.docs-skip`:

- `e6bad533 fix(shields): verify config lock and fail hard on re-lock
failure (#2066)` — matched `skip-features: src/lib/shields.ts`.

Other commits in the range (#2141 OpenShell version bump, #1819 plugin
banner live inference probe, #2085 / #2146 Slack Socket Mode fixes,
#2110 axios proxy fix, #1818 NIM curl timeouts, #1824 onboard gateway
bootstrap recovery, and assorted CI / test / install plumbing) are
internal behavior refinements with no doc-relevant surface change.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes for the modified files via the
pre-commit hook, including `Regenerate agent skills from docs` (source ↔
generated parity confirmed)
- [ ] `npm test` passes — skipped; the one pre-existing
`test/cli.test.ts > unknown command exits 1` failure on `origin/main` is
unrelated to these markdown/JSON-only changes
- [ ] Tests added or updated for new or changed behavior — n/a, doc-only
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only) — not run
locally
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)
— n/a, no new pages

## AI Disclosure

- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * `connect` now displays the sandbox phase while waiting for readiness
    and honors a configurable timeout via NEMOCLAW_CONNECT_TIMEOUT (default
    120s).
  * TTY hints are agent-aware and instruct using `/exit` before returning
    to the host shell.

* **Documentation**
  * Command docs updated to describe polling, timeout, and TTY guidance.
  * Project/docs metadata updated for versions 0.0.21 and 0.0.22 (package
    version bumped to 0.0.22).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

  • bug: Something isn't working
  • NemoClaw CLI: Use this label to identify issues with the NemoClaw command-line interface (CLI).
  • OpenShell: Support for OpenShell, a safe, private runtime for autonomous AI agents
  • v0.0.22: Release target

Projects

None yet

5 participants