Skip to content

fix(nim): bound curl health probes with connect and max timeouts#1818

Merged
cv merged 3 commits intoNVIDIA:mainfrom
WuKongAI-CMU:fix/nim-curl-timeouts-v2
Apr 21, 2026
Merged

fix(nim): bound curl health probes with connect and max timeouts#1818
cv merged 3 commits intoNVIDIA:mainfrom
WuKongAI-CMU:fix/nim-curl-timeouts-v2

Conversation

@WuKongAI-CMU
Copy link
Copy Markdown
Contributor

@WuKongAI-CMU WuKongAI-CMU commented Apr 13, 2026

Summary

The NIM health probe loop and `nimStatusByName()` use `curl -sf` without any timeout in `src/lib/nim.ts`. On a hung or silently-dropping NIM endpoint, the curl call can block indefinitely, stalling `onboard`/`waitForNimHealth` and CLI `status`.

Add `--connect-timeout 5 --max-time 5` to both probe calls so each attempt fails fast on unresponsive endpoints and the outer retry loop can make progress.

Changes

  • `src/lib/nim.ts:212` — `waitForNimHealth` loop probe
  • `src/lib/nim.ts:265` — `nimStatusByName` health probe

Test plan

  • Minimal diff, existing retry/interval behavior preserved
  • 5s timeout chosen to fit under the 5s inter-retry sleep interval
  • No new dependencies

(Resubmitting — prior PR #1813 was auto-closed by the 10-PR-cap check when our open PR count briefly exceeded the limit; we're now under the cap.)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Health readiness probes now enforce 5-second connection and overall timeouts for local HTTP checks, preventing probes from hanging indefinitely. This yields faster failure detection, quicker recovery, and reduced resource waits during startup and health checks.
  • Tests

    • Added and updated tests to verify health probes include the expected timeout behavior so regressions are caught early.

Signed-off-by: Intern Dev dev@wukongai.io

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaced unbounded curl probes with bounded probes by adding --connect-timeout 5 --max-time 5 to health-check invocations in nim health functions; tests were updated and expanded to assert the presence of these timeout flags and a new waitForNimHealth test was added.

Changes

Cohort / File(s) Summary
NIM health logic
src/lib/nim.ts
Added --connect-timeout 5 and --max-time 5 to curl invocations in waitForNimHealth and nimStatusByName; preserved retry loop, runCapture({ ignoreError: true }), output handling, and port-resolution logic.
Tests for NIM health
src/lib/nim.test.ts
Moved hasArg helper to file scope, added hasCurlTimeoutArgs helper, added a waitForNimHealth test asserting a true return and that at least one captured curl includes the timeout flags, and extended nimStatusByName test to assert the same timeout flags are present.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I counted beats, five seconds through,
A tiny hop, a timed-through view.
Curl on my whiskers, probe on the run,
Quick timeout set, the job is done.
I twitch my nose and nibble sun.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding timeout bounds (connect and max) to curl health probes in the NIM module.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/nim.ts (1)

212-215: Add regression assertions for timeout flags in probe tests.

Line 213 correctly hardens the probe, but current src/lib/nim.test.ts coverage (around Lines 176 and 241) only validates URL/flow and not the presence of --connect-timeout 5 --max-time 5. Please add explicit command-assertions so this doesn’t regress silently.

Proposed test assertion pattern
 const commands = runCapture.mock.calls.map(([cmd]: [string]) => cmd);
+expect(
+  commands.some((cmd: string) => cmd.includes("curl -sf --connect-timeout 5 --max-time 5")),
+).toBe(true);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/nim.ts` around lines 212 - 215, The probe invocation builds a curl
command via runCapture that should include the timeout flags but tests only
assert URL/flow; update the probe tests in nim.test.ts to explicitly assert the
generated command string contains "--connect-timeout 5" and "--max-time 5"
(e.g., by matching the runCapture call or its captured command output used in
those probe tests around the probe assertions) so the test fails if those flags
are ever removed or changed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/lib/nim.ts`:
- Around line 212-215: The probe invocation builds a curl command via runCapture
that should include the timeout flags but tests only assert URL/flow; update the
probe tests in nim.test.ts to explicitly assert the generated command string
contains "--connect-timeout 5" and "--max-time 5" (e.g., by matching the
runCapture call or its captured command output used in those probe tests around
the probe assertions) so the test fails if those flags are ever removed or
changed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b2780c27-00ce-43d9-b243-ebc54e5cd72d

📥 Commits

Reviewing files that changed from the base of the PR and between d4aac4c and ce84e5a.

📒 Files selected for processing (1)
  • src/lib/nim.ts

@wscurran wscurran added bug Something isn't working fix labels Apr 13, 2026
@wscurran
Copy link
Copy Markdown
Contributor


Possibly related open PRs:


Possibly related open PRs:

@cjagwani cjagwani self-assigned this Apr 14, 2026
@cjagwani cjagwani requested a review from cv April 14, 2026 22:59
@cjagwani
Copy link
Copy Markdown
Contributor

@WuKongAI-CMU Correction on my earlier comment — DCO check on this repo looks at the PR body, not the commits. Just add this line anywhere in your PR description:

Signed-off-by: Your Name <your@email.com>

Edit the PR body, add the line, and DCO should pass.

@brandonpelfrey
Copy link
Copy Markdown
Collaborator

Automated PR review summary

Reviewed PR #1818: fix(nim): bound curl health probes with connect and max timeouts

Recommendation

  • Recommendation: PASS
  • Highest observed severity: low
  • Block merge: no
  • Why: The highest-risk claim in this PR is that unresponsive NIM endpoints no longer stall onboarding/status indefinitely. I simulated silently hanging localhost endpoints and observed bounded failures in both waitForNimHealth and nimStatusByName. The only notable nuance is pre-existing behavior: waitForNimHealth still sleeps 5s between attempts, so total wall time can exceed the caller timeout by roughly one sleep interval, but that is consistent with the stated goal of preserving retry behavior and is not introduced by this PR.
  • Reviewer summary: I reviewed the PR with PR-specific adversarial probes against hung localhost endpoints. Both modified code paths now fail fast instead of blocking indefinitely, and the 5s curl cap was confirmed from inside the OpenShell sandbox as well.

Scope and PR goals

  • Repository: NVIDIA/nemoclaw
  • Author: WuKongAI-CMU
  • Base branch: main
  • Head SHA: 6e499fc
  • Changed files observed: 1
  • PR description summary: ## Summary The NIM health probe loop and `nimStatusByName()` use `curl -sf` without any timeout in `src/lib/nim.ts`. On a hung or silently-dropping NIM endpoint, the curl call can block indefinitely, stalling `onboard`/`waitForNimHealth` and CLI `status`. Add `--connect-timeout 5 --max-time 5` to both probe calls so each attempt fails fast on unresponsive endpoints and the outer retry loop can make progress. ## Changes - `src/lib/nim.ts:212` — `waitForNimHealth` loop probe - `src/lib/nim.ts:265` — `nimStatusByName` health probe ## Test plan - [x] Minimal diff, existing retry/interval behavior preserved - [x] 5s timeout chosen to fit under the 5s inter-retry sleep interval - [x] No new dependencies (Resubmitting — prior PR fix(nim): bound curl health probes with --max-time #1813 was auto-closed by the 10-PR-cap check w
    ...[truncated]

Installation and setup findings

  • Local installer flow completed and was verified end-to-end. I used install.sh against the checked-out repo, resumed an interrupted onboarding run when the first sandbox build path stalled, confirmed NemoClaw created sandbox 'my-assistant', SSHed into that existing sandbox, observed '2+2=4' from an in-sandbox command, and got '4' from an in-sandbox OpenClaw probe.

What was validated

  • The PR revision was checked out in an isolated review environment.
  • The local checkout was installed using the repository installer flow as closely as the environment allowed.
  • Adversarial, PR-specific probes were then run against the installed environment and relevant repository context.
  • Diff summary:

.agents/skills/nemoclaw-maintainer-triage/SKILL.md | 145 -------
.github/workflows/nightly-e2e.yaml | 62 +--
agents/hermes/manifest.yaml | 2 -
agents/openclaw/manifest.yaml | 2 -
docs/triage-instructions.md | 150 --------
nemoclaw/src/blueprint/state.ts | 4 -
nemoclaw/src/commands/slash.ts | 7 -
scripts/install.sh | 22 --
src/lib/agent-defs.ts | 12 -
src/lib/credential-filter.test.ts | 132 -------
src/lib/credential-filter.ts | 92 -----
src/lib/debug.ts | 17 -
src/lib/nim.ts | 9 +-
src/lib/onboard.ts | 3 -
src/lib/registry.ts | 2 -
src/lib/sandbox-state.ts | 427 ---------------------
src/lib/sandbox-version.test.ts | 212 ----------
src/lib/sandbox-version.ts | 145 -------
src/nemoclaw.ts | 280
...[truncated]

Failing tests and unresolved impact

  • No failing adversarial tests were captured.

Passing tests and why they mattered

Passing test 1: waitForNimHealth stops retrying instead of hanging forever on a silent endpoint

  • What was tested: The added curl flags make each health probe fail within about 5 seconds, so waitForNimHealth returns false rather than blocking indefinitely when localhost accepts connections but never responds.
  • Why it mattered: If false, onboard/waitForNimHealth can still hang forever against a wedged NIM endpoint, defeating the main purpose of the PR.
  • Observed result: Host-level integration probe against installed dist/lib/nim.js with a localhost server that accepts and hangs. Function returned {ok:false} in about 10.01s total, showing bounded curl plus the existing 5s retry sleep rather than an indefinite stall.
  • Command: bash /tmp/pr1818_probe.sh
  • Recommended follow-up coverage: Add an integration/regression test that stubs runCapture or points to a controlled hanging localhost socket and asserts the function returns within a bounded window rather than hanging.

Passing test 2: nimStatusByName fails fast on a running container whose health endpoint silently hangs

  • What was tested: The status path now uses the same 5s curl limit, so status for a running container with a hung /v1/models endpoint returns healthy:false promptly.
  • Why it mattered: If false, nemoclaw status could still block indefinitely on a bad NIM endpoint even after this PR.
  • Observed result: Host-level integration probe with PATH-prepended fake docker responses plus a hanging localhost server. nimStatusByName('fake-nim') returned running:true, healthy:false in about 5.01s.
  • Command: bash /tmp/pr1818_probe2.sh
  • Recommended follow-up coverage: Add an integration/regression test for nimStatusByName that mocks docker inspect/port and exercises a hanging health endpoint, asserting a bounded runtime and healthy:false result.

Passing test 3: OpenShell sandbox curl honors the 5s bound expected by this fix

  • What was tested: In the real OpenShell sandbox runtime, a hung localhost endpoint causes curl with --connect-timeout 5 --max-time 5 to abort in about 5 seconds, matching the PR's operational assumption.
  • Why it mattered: If false, the PR could be relying on timeout semantics that do not hold in the actual sandbox environment where NemoClaw runs.
  • Observed result: End-to-end OpenShell sandbox probe. Inside sandbox, a localhost server accepted and hung; curl exited rc=28 after elapsed=5s, confirming the runtime timeout behavior the PR depends on.
  • Command: ssh -F /tmp/sshcfg openshell-my-assistant 'bash /tmp/pr1818_sandbox_test.sh'
  • Recommended follow-up coverage: No additional unit test needed for curl itself, but an end-to-end regression test around sandboxed NIM health checking would be valuable if this path has caused operational hangs before.

Bottom line

  • Based on the install evidence and adversarial probes, this PR looks reasonable to approve.

Automatically functional review above. Please see previous comment to make sure DCO is setup.

@WuKongAI-CMU WuKongAI-CMU force-pushed the fix/nim-curl-timeouts-v2 branch from 6e499fc to 1717782 Compare April 15, 2026 01:36
The NIM health probe loop and nimStatusByName() use `curl -sf` without
any timeout. On a hung or silently-dropping NIM endpoint the curl call
can block indefinitely, stalling onboard/waitForNimHealth and CLI status.

Add `--connect-timeout 5 --max-time 5` to both probe calls so each
attempt fails fast on unresponsive endpoints and the outer retry loop
can make progress.

Signed-off-by: Intern Dev <dev@wukongai.io>
@WuKongAI-CMU WuKongAI-CMU force-pushed the fix/nim-curl-timeouts-v2 branch from 1717782 to 01b81fe Compare April 15, 2026 17:13
@cv cv added the v0.0.18 Release target label Apr 16, 2026
Resolve upstream main into the NIM curl timeout branch while preserving argv-array health probes, loopback-only URLs, and bounded curl connect/total timeouts.

Constraint: PR NVIDIA#1818 was CONFLICTING/DIRTY after upstream changed NIM probes to 127.0.0.1 argv arrays.

Rejected: Keep localhost probes | upstream intentionally narrowed probes to loopback literals.

Confidence: high

Scope-risk: narrow

Tested: npm run build:cli

Tested: npm test -- src/lib/nim.test.ts

Tested: git diff --check

Not-tested: full npm test suite
Signed-off-by: Intern Dev <dev@wukongai.io>
@WuKongAI-CMU
Copy link
Copy Markdown
Contributor Author

Refreshed this branch against current main and resolved the NIM health-probe conflict in 6bd4f96. The merge keeps upstream's 127.0.0.1 argv-array probe shape while preserving the bounded curl flags (--connect-timeout 5 and --max-time 5).\n\nLocal verification:\n- npm run build:cli\n- npm test -- src/lib/nim.test.ts (17 passed)\n- git diff --check

@ericksoa ericksoa added v0.0.19 Release target v0.0.20 Release target and removed v0.0.18 Release target v0.0.19 Release target labels Apr 17, 2026
@cv cv added v0.0.21 Release target and removed v0.0.20 Release target labels Apr 20, 2026
Copy link
Copy Markdown
Contributor

@cv cv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

  • adds bounded curl connect/overall timeouts in both NIM health probe call sites
  • keeps the behavior narrow: healthy endpoints still pass, hung endpoints now fail fast so the outer retry logic can keep moving
  • regression coverage checks the timeout args are present in both paths

I also ran local targeted validation on the PR branch:

  • npm run build:cli
  • npx vitest run src/lib/nim.test.ts
  • npm run typecheck:cli

@cv cv enabled auto-merge (squash) April 21, 2026 01:15
@cv cv added v0.0.22 Release target and removed v0.0.21 Release target labels Apr 21, 2026
@cv cv merged commit c6a0c79 into NVIDIA:main Apr 21, 2026
13 checks passed
@miyoungc miyoungc mentioned this pull request Apr 22, 2026
13 tasks
miyoungc added a commit that referenced this pull request Apr 22, 2026
## Summary

Bumps the published doc version to `0.0.22` and documents the
user-visible CLI behavior changes to `nemoclaw <name> connect` that
landed since v0.0.21. Drafted via the `nemoclaw-contributor-update-docs`
skill against commits in `v0.0.21..origin/main`, filtered through
`docs/.docs-skip`.

## Changes

- **`docs/project.json`** and **`docs/versions1.json`**: bump the
published version from `0.0.20` to `0.0.22`; insert a `0.0.21` entry
into the version list so the history stays contiguous.
- **`docs/reference/commands.md`** → `nemoclaw <name> connect`: document
two new behaviors.
- Readiness poll with `NEMOCLAW_CONNECT_TIMEOUT` (integer seconds;
default `120`) that replaces the silent hang when the sandbox is not yet
`Ready` — right after onboarding, while the 2.4 GB image is still
pulling (#466).
- Post-connect hint is now agent-aware, names the correct TUI command
for the sandbox's agent, and tells you to use `/exit` to leave the chat
before `exit` returns you to the host shell (#2080).

Feature PRs that shipped their own docs in the same commit are
intentionally not re-documented here:

- `channels list/add/remove` (#2139) — command reference and the
"`openclaw channels` blocked inside the sandbox" troubleshooting entry
landed with the feature.
- `nemoclaw gc` (#2176) — documented as part of the destroy/rebuild
image cleanup PR.

Skipped per `docs/.docs-skip`:

- `e6bad533 fix(shields): verify config lock and fail hard on re-lock
failure (#2066)` — matched `skip-features: src/lib/shields.ts`.

Other commits in the range (#2141 OpenShell version bump, #1819 plugin
banner live inference probe, #2085 / #2146 Slack Socket Mode fixes,
#2110 axios proxy fix, #1818 NIM curl timeouts, #1824 onboard gateway
bootstrap recovery, and assorted CI / test / install plumbing) are
internal behavior refinements with no doc-relevant surface change.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes for the modified files via the
pre-commit hook, including `Regenerate agent skills from docs` (source ↔
generated parity confirmed)
- [ ] `npm test` passes — skipped; the one pre-existing
`test/cli.test.ts > unknown command exits 1` failure on `origin/main` is
unrelated to these markdown/JSON-only changes
- [ ] Tests added or updated for new or changed behavior — n/a, doc-only
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only) — not run
locally
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)
— n/a, no new pages

## AI Disclosure

- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* `connect` now displays the sandbox phase while waiting for readiness
and honors a configurable timeout via NEMOCLAW_CONNECT_TIMEOUT (default
120s).
* TTY hints are agent-aware and instruct using `/exit` before returning
to the host shell.

* **Documentation**
  * Command docs updated to describe polling, timeout, and TTY guidance.
* Project/docs metadata updated for versions 0.0.21 and 0.0.22 (package
version bumped to 0.0.22).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working fix v0.0.22 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants