fix(nim): bound curl health probes with connect and max timeouts#1818
fix(nim): bound curl health probes with connect and max timeouts#1818cv merged 3 commits intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaced unbounded curl probes with bounded probes by adding Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/nim.ts (1)
212-215: Add regression assertions for timeout flags in probe tests.Line 213 correctly hardens the probe, but current
src/lib/nim.test.tscoverage (around Lines 176 and 241) only validates URL/flow and not the presence of--connect-timeout 5 --max-time 5. Please add explicit command-assertions so this doesn’t regress silently.Proposed test assertion pattern
const commands = runCapture.mock.calls.map(([cmd]: [string]) => cmd); +expect( + commands.some((cmd: string) => cmd.includes("curl -sf --connect-timeout 5 --max-time 5")), +).toBe(true);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/nim.ts` around lines 212 - 215, The probe invocation builds a curl command via runCapture that should include the timeout flags but tests only assert URL/flow; update the probe tests in nim.test.ts to explicitly assert the generated command string contains "--connect-timeout 5" and "--max-time 5" (e.g., by matching the runCapture call or its captured command output used in those probe tests around the probe assertions) so the test fails if those flags are ever removed or changed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/lib/nim.ts`:
- Around line 212-215: The probe invocation builds a curl command via runCapture
that should include the timeout flags but tests only assert URL/flow; update the
probe tests in nim.test.ts to explicitly assert the generated command string
contains "--connect-timeout 5" and "--max-time 5" (e.g., by matching the
runCapture call or its captured command output used in those probe tests around
the probe assertions) so the test fails if those flags are ever removed or
changed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b2780c27-00ce-43d9-b243-ebc54e5cd72d
📒 Files selected for processing (1)
src/lib/nim.ts
|
✨ Possibly related open PRs: |
|
@WuKongAI-CMU Correction on my earlier comment — DCO check on this repo looks at the PR body, not the commits. Just add this line anywhere in your PR description: Edit the PR body, add the line, and DCO should pass. |
Automated PR review summaryReviewed PR #1818: fix(nim): bound curl health probes with connect and max timeouts Recommendation
Scope and PR goals
Installation and setup findings
What was validated
.agents/skills/nemoclaw-maintainer-triage/SKILL.md | 145 ------- Failing tests and unresolved impact
Passing tests and why they matteredPassing test 1: waitForNimHealth stops retrying instead of hanging forever on a silent endpoint
Passing test 2: nimStatusByName fails fast on a running container whose health endpoint silently hangs
Passing test 3: OpenShell sandbox curl honors the 5s bound expected by this fix
Bottom line
Automatically functional review above. Please see previous comment to make sure DCO is setup. |
6e499fc to
1717782
Compare
The NIM health probe loop and nimStatusByName() use `curl -sf` without any timeout. On a hung or silently-dropping NIM endpoint the curl call can block indefinitely, stalling onboard/waitForNimHealth and CLI status. Add `--connect-timeout 5 --max-time 5` to both probe calls so each attempt fails fast on unresponsive endpoints and the outer retry loop can make progress. Signed-off-by: Intern Dev <dev@wukongai.io>
1717782 to
01b81fe
Compare
Resolve upstream main into the NIM curl timeout branch while preserving argv-array health probes, loopback-only URLs, and bounded curl connect/total timeouts. Constraint: PR NVIDIA#1818 was CONFLICTING/DIRTY after upstream changed NIM probes to 127.0.0.1 argv arrays. Rejected: Keep localhost probes | upstream intentionally narrowed probes to loopback literals. Confidence: high Scope-risk: narrow Tested: npm run build:cli Tested: npm test -- src/lib/nim.test.ts Tested: git diff --check Not-tested: full npm test suite Signed-off-by: Intern Dev <dev@wukongai.io>
|
Refreshed this branch against current main and resolved the NIM health-probe conflict in 6bd4f96. The merge keeps upstream's 127.0.0.1 argv-array probe shape while preserving the bounded curl flags (--connect-timeout 5 and --max-time 5).\n\nLocal verification:\n- npm run build:cli\n- npm test -- src/lib/nim.test.ts (17 passed)\n- git diff --check |
cv
left a comment
There was a problem hiding this comment.
Looks good to me.
- adds bounded curl connect/overall timeouts in both NIM health probe call sites
- keeps the behavior narrow: healthy endpoints still pass, hung endpoints now fail fast so the outer retry logic can keep moving
- regression coverage checks the timeout args are present in both paths
I also ran local targeted validation on the PR branch:
npm run build:clinpx vitest run src/lib/nim.test.tsnpm run typecheck:cli
## Summary Bumps the published doc version to `0.0.22` and documents the user-visible CLI behavior changes to `nemoclaw <name> connect` that landed since v0.0.21. Drafted via the `nemoclaw-contributor-update-docs` skill against commits in `v0.0.21..origin/main`, filtered through `docs/.docs-skip`. ## Changes - **`docs/project.json`** and **`docs/versions1.json`**: bump the published version from `0.0.20` to `0.0.22`; insert a `0.0.21` entry into the version list so the history stays contiguous. - **`docs/reference/commands.md`** → `nemoclaw <name> connect`: document two new behaviors. - Readiness poll with `NEMOCLAW_CONNECT_TIMEOUT` (integer seconds; default `120`) that replaces the silent hang when the sandbox is not yet `Ready` — right after onboarding, while the 2.4 GB image is still pulling (#466). - Post-connect hint is now agent-aware, names the correct TUI command for the sandbox's agent, and tells you to use `/exit` to leave the chat before `exit` returns you to the host shell (#2080). Feature PRs that shipped their own docs in the same commit are intentionally not re-documented here: - `channels list/add/remove` (#2139) — command reference and the "`openclaw channels` blocked inside the sandbox" troubleshooting entry landed with the feature. - `nemoclaw gc` (#2176) — documented as part of the destroy/rebuild image cleanup PR. Skipped per `docs/.docs-skip`: - `e6bad533 fix(shields): verify config lock and fail hard on re-lock failure (#2066)` — matched `skip-features: src/lib/shields.ts`. Other commits in the range (#2141 OpenShell version bump, #1819 plugin banner live inference probe, #2085 / #2146 Slack Socket Mode fixes, #2110 axios proxy fix, #1818 NIM curl timeouts, #1824 onboard gateway bootstrap recovery, and assorted CI / test / install plumbing) are internal behavior refinements with no doc-relevant surface change. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes for the modified files via the pre-commit hook, including `Regenerate agent skills from docs` (source ↔ generated parity confirmed) - [ ] `npm test` passes — skipped; the one pre-existing `test/cli.test.ts > unknown command exits 1` failure on `origin/main` is unrelated to these markdown/JSON-only changes - [ ] Tests added or updated for new or changed behavior — n/a, doc-only - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) — not run locally - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) — n/a, no new pages ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * `connect` now displays the sandbox phase while waiting for readiness and honors a configurable timeout via NEMOCLAW_CONNECT_TIMEOUT (default 120s). * TTY hints are agent-aware and instruct using `/exit` before returning to the host shell. * **Documentation** * Command docs updated to describe polling, timeout, and TTY guidance. * Project/docs metadata updated for versions 0.0.21 and 0.0.22 (package version bumped to 0.0.22). <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
The NIM health probe loop and `nimStatusByName()` use `curl -sf` without any timeout in `src/lib/nim.ts`. On a hung or silently-dropping NIM endpoint, the curl call can block indefinitely, stalling `onboard`/`waitForNimHealth` and CLI `status`.
Add `--connect-timeout 5 --max-time 5` to both probe calls so each attempt fails fast on unresponsive endpoints and the outer retry loop can make progress.
Changes
Test plan
(Resubmitting — prior PR #1813 was auto-closed by the 10-PR-cap check when our open PR count briefly exceeded the limit; we're now under the cap.)
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests
Signed-off-by: Intern Dev dev@wukongai.io