feat(router): add model router integration with complexity-based routing#2202
Conversation
Add support for the NVIDIA LLM Router v3 (prefill-based) as an inference provider in NemoClaw. The router automatically selects the most efficient model for each query based on a learned complexity signal. Architecture: - Router runs on the host (started by onboard, not inside the sandbox) - Sandbox reaches it through the OpenShell gateway's L7 proxy - Credentials flow through the OpenShell provider system Changes: - Add llm-router v3 as a git submodule at nemoclaw-blueprint/router/llm-router/ - Add pool-config.yaml defining the model pool and routing parameters - Add "Model Router" as a provider option in the onboard wizard - Add routed inference profile to blueprint.yaml (model: nvidia-routed) - Remove router lifecycle code from blueprint runner (onboard owns it) - Remove model-router-toolkit install from Dockerfile (host-only) - Add model-router-toolkit install to scripts/install.sh - Update README with Model Router documentation Made-with: Cursor
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a host-side complexity-based Model Router (blueprint, schema, pool config, submodule, installer, onboarding, runner, tests, and docs) and a new runtime-ready NemoClaw GitHub triage skill instruction document. ChangesModel Router Feature
Triage Skill Documentation
Sequence DiagramsequenceDiagram
participant Sandbox
participant OpenShell_Gateway
participant Model_Router
participant Provider_API
Sandbox->>OpenShell_Gateway: inference request
OpenShell_Gateway->>Model_Router: forward request (http://localhost:4000/v1)
Model_Router->>Model_Router: encode complexity & select model from pool
Model_Router->>Provider_API: proxy request to chosen provider (api_base + model)
Provider_API-->>Model_Router: model response
Model_Router-->>OpenShell_Gateway: return routed response
OpenShell_Gateway-->>Sandbox: relay final response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
Keep both our model router additions (loadBlueprintProfile, startModelRouter) and upstream's new pullAndResolveBaseImageDigest function in onboard.ts. Take upstream Dockerfile as-is (router installs on host, not in sandbox). Made-with: Cursor
# Conflicts: # nemoclaw/src/blueprint/runner.ts # scripts/install.sh # src/lib/onboard.ts
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (2)
nemoclaw-blueprint/router/llm-router (2)
1-1: Establish a submodule update and governance strategy.Consider documenting:
- Update policy: Who is responsible for tracking upstream llm-router releases and determining when to update the submodule reference?
- Version alignment: Should the submodule track tagged releases (e.g., v3.x.x) rather than branch HEAD to ensure predictable behavior?
- Security monitoring: How will security advisories for the llm-router dependency be tracked and addressed?
- CI/CD considerations: Ensure CI workflows include
git submodule update --init --recursiveto avoid build failures when submodules are not initialized.The current pinning to a specific commit provides stability, but a defined process for evaluating and applying upstream updates will help maintain security and feature parity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemoclaw-blueprint/router/llm-router` at line 1, Add a short governance and update policy for the llm-router submodule: document an owner responsible for tracking upstream llm-router releases, require pinning to tagged releases (e.g., v3.x.x) rather than branch HEAD for predictable behavior, and define a cadence/process for evaluating and applying upstream commits; include a security monitoring step (subscribe to upstream advisories or use Dependabot/Snyk) and a CI change to run git submodule update --init --recursive (ensure workflows that build or test reference the llm-router submodule perform this initialization) so builds won’t fail when submodules are uninitialized.
1-1: Consider pinning to a tagged release instead of an arbitrary commit.The submodule commit
2bd8dfaa751efb60aa4e7e49b270490dfbc0a68aexists in the upstream NVIDIA-AI-Blueprints/llm-router repository on branch v3. However, this commit is not associated with any release tag. Pinning to tagged releases (e.g., v3.0.0, v3.1.0) provides clearer version boundaries, improves maintainability, and makes dependency updates more traceable. If upstream repository publishes release tags for the v3 branch, update .gitmodules to reference the latest stable tag instead of this untagged commit.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemoclaw-blueprint/router/llm-router` at line 1, The submodule is pinned to an untagged commit (2bd8dfaa751efb60aa4e7e49b270490dfbc0a68a) from the NVIDIA-AI-Blueprints/llm-router v3 branch; update .gitmodules to point to a released tag (e.g., v3.0.0 or v3.1.0) instead of the commit SHA, and update any corresponding submodule init/update steps (git submodule set-url or the submodule "branch"/"tag" reference) so the repo uses the stable release tag for llm-router rather than the arbitrary commit.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemoclaw-blueprint/router/pool-config.yaml`:
- Around line 24-29: The pool entry currently has a mismatched identity: the
fields name: gpt-5.2 and litellm_model:
"openai/nvidia/nvidia/Nemotron-3-Nano-30B-A3B" must be aligned; either change
the name to match the routed model (e.g., name: nemotron-3-nano) or change
litellm_model to the correct GPT-5.2 model identifier so that the router’s
checkpoint (which uses the name field) maps accuracy predictions to the actual
model; update the first pool entry’s name or litellm_model to mirror the pattern
used in the working entry (nemotron-3-super → nemotron-3-super-v3).
In `@nemoclaw/src/blueprint/runner.ts`:
- Around line 149-161: The router validation currently allows components.router
= { enabled: true } without a pool_config_path; update the validation in
runner.ts (the block handling components.router in isBlueprint / validation) so
that when router.enabled === true you require a present, non-empty
pool_config_path (use the existing string validation helper rather than
isOptionalString) and keep the existing port checks; ensure the condition
rejects router objects with enabled true and pool_config_path missing/invalid so
broken routed blueprints fail validation instead of at startup.
In `@README.md`:
- Line 145: Replace the GitHub repo link in the sentence that mentions "NVIDIA
LLM Router v3" with either the official documentation URL for the NVIDIA LLM
Router (if available) or remove the hyperlink and keep the plain text name
"NVIDIA LLM Router v3"; ensure the README.md line referencing the router does
not link to a third-party code repo per guidelines and that the rest of the
sentence ("prefill routing engine" / "runs on the host as a LiteLLM proxy" /
"sandbox reaches it through the OpenShell gateway") remains unchanged.
In `@scripts/install.sh`:
- Around line 1341-1343: The routed install path silently degrades: instead of
no-op when pip3 is missing or swallowing failures with "|| true", make routed
installs fail fast by checking command_exists pip3 and testing the pip3 install
return code (the spin "Installing model router" pip3 install ... call) and
exiting non‑zero with an error message if pip3 is missing or the install fails;
likewise remove or replace any "|| true" used around submodule fetches for the
llm-router submodule so git fetch/checkout failures are checked and cause an
immediate error/exit; apply the same fail-fast checks to the other occurrences
around the llm-router installs mentioned (the blocks referencing
NEMOCLAW_SOURCE_ROOT/nemoclaw-blueprint/router/llm-router and any uses of
command_exists pip3, spin, and NEMOCLAW_PROVIDER=routed).
In `@src/lib/onboard.ts`:
- Around line 6620-6673: The provider recovery logic needs a reverse mapping for
the routed provider: update the providerNameToOptionKey function so that when
name === "nvidia-router" it returns "routed" (respecting the existing signature
providerNameToOptionKey(name, opts) and any opts.hasNimContainer logic); this
ensures the persisted provider set in onboard.ts (provider = bp.provider_name ||
"nvidia-router") is correctly mapped back to the "routed" selection during
non-interactive recovery rather than falling through to "build".
- Around line 854-887: The current health-check loop can mistakenly accept a
pre-existing /health on the same port while the spawned child has already
exited; capture the spawned child's lifecycle and only treat /health as success
if the child is still running. After calling spawn("model-router", ... ) and
before returning pid, attach an 'exit' handler or poll child.exitCode (or use
process.kill(pid, 0) safely) to detect early termination and save that state; in
the health loop, when a healthy HTTP response is seen, verify the child has not
exited (child.exitCode === null/undefined and not killed) and only then return
pid; if the child exited with a non-zero code or is not alive, throw an error
and continue retries. Ensure to reference the existing symbols: spawn call,
child, child.pid, child.on('exit') / child.exitCode, ROUTER_HEALTH_RETRIES,
ROUTER_HEALTH_INTERVAL_MS, litellmConfigPath, poolConfigPath, port,
blueprintDir, and buildSubprocessEnv when implementing this check.
- Around line 6627-6644: Resolve the routed credential by calling
hydrateCredentialEnv(routerCredentialEnv) (or equivalent) before the
non-interactive/interactive gate and use its returned value to populate
process.env[routerCredentialEnv]; ensure that hydrateCredentialEnv performs
legacy staging fallback so saved/staged credentials count in non-interactive
mode. Update the conditional checks that reference
process.env[routerCredentialEnv] (and bp.credential_default) to instead use the
hydrated result so callers like startModelRouter() / upsertProvider() receive
the credential, and keep the ensureNamedCredential(routerCredentialEnv, "Model
Router API key", null) call only when the hydrated value is absent and
interactive.
---
Nitpick comments:
In `@nemoclaw-blueprint/router/llm-router`:
- Line 1: Add a short governance and update policy for the llm-router submodule:
document an owner responsible for tracking upstream llm-router releases, require
pinning to tagged releases (e.g., v3.x.x) rather than branch HEAD for
predictable behavior, and define a cadence/process for evaluating and applying
upstream commits; include a security monitoring step (subscribe to upstream
advisories or use Dependabot/Snyk) and a CI change to run git submodule update
--init --recursive (ensure workflows that build or test reference the llm-router
submodule perform this initialization) so builds won’t fail when submodules are
uninitialized.
- Line 1: The submodule is pinned to an untagged commit
(2bd8dfaa751efb60aa4e7e49b270490dfbc0a68a) from the
NVIDIA-AI-Blueprints/llm-router v3 branch; update .gitmodules to point to a
released tag (e.g., v3.0.0 or v3.1.0) instead of the commit SHA, and update any
corresponding submodule init/update steps (git submodule set-url or the
submodule "branch"/"tag" reference) so the repo uses the stable release tag for
llm-router rather than the arbitrary commit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c5c311e1-1fc1-433f-832e-4571b576e60e
📒 Files selected for processing (15)
.agents/skills/nemoclaw-user-triage-instructions/SKILL.md.claude/skills.gitmodulesCLAUDE.mdCLAUDE.mdREADME.mdnemoclaw-blueprint/blueprint.yamlnemoclaw-blueprint/router/llm-routernemoclaw-blueprint/router/pool-config.yamlnemoclaw/src/blueprint/runner.test.tsnemoclaw/src/blueprint/runner.tsschemas/blueprint.schema.jsonscripts/install.shsrc/lib/onboard-providers.tssrc/lib/onboard.ts
💤 Files with no reviewable changes (1)
- .claude/skills
| const router = components.router; | ||
| if (router !== undefined) { | ||
| if (!isObjectLike(router)) { | ||
| return false; | ||
| } | ||
| if ( | ||
| !isOptionalBoolean(router.enabled) || | ||
| !(router.port === undefined || isValidPort(router.port)) || | ||
| !isOptionalString(router.pool_config_path) | ||
| ) { | ||
| return false; | ||
| } | ||
| } |
There was a problem hiding this comment.
Require pool_config_path when the router is enabled.
isBlueprint() currently accepts components.router = { enabled: true }, and actionPlan() will later emit router.enabled: true with pool_config_path: undefined. That defers a broken routed blueprint until startup instead of rejecting it during validation.
🛡️ Suggested validation
if (
!isOptionalBoolean(router.enabled) ||
!(router.port === undefined || isValidPort(router.port)) ||
- !isOptionalString(router.pool_config_path)
+ !isOptionalString(router.pool_config_path) ||
+ (router.enabled === true && typeof router.pool_config_path !== "string")
) {
return false;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const router = components.router; | |
| if (router !== undefined) { | |
| if (!isObjectLike(router)) { | |
| return false; | |
| } | |
| if ( | |
| !isOptionalBoolean(router.enabled) || | |
| !(router.port === undefined || isValidPort(router.port)) || | |
| !isOptionalString(router.pool_config_path) | |
| ) { | |
| return false; | |
| } | |
| } | |
| const router = components.router; | |
| if (router !== undefined) { | |
| if (!isObjectLike(router)) { | |
| return false; | |
| } | |
| if ( | |
| !isOptionalBoolean(router.enabled) || | |
| !(router.port === undefined || isValidPort(router.port)) || | |
| !isOptionalString(router.pool_config_path) || | |
| (router.enabled === true && typeof router.pool_config_path !== "string") | |
| ) { | |
| return false; | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemoclaw/src/blueprint/runner.ts` around lines 149 - 161, The router
validation currently allows components.router = { enabled: true } without a
pool_config_path; update the validation in runner.ts (the block handling
components.router in isBlueprint / validation) so that when router.enabled ===
true you require a present, non-empty pool_config_path (use the existing string
validation helper rather than isOptionalString) and keep the existing port
checks; ensure the condition rejects router objects with enabled true and
pool_config_path missing/invalid so broken routed blueprints fail validation
instead of at startup.
|
|
||
| NemoClaw includes an optional model router that automatically picks the most efficient model for each query. Instead of sending every request to a single large model, the router uses a lightweight encoder to predict which model in a pool can handle each query correctly, then routes to the cheapest one that meets an accuracy threshold. | ||
|
|
||
| The router uses the [NVIDIA LLM Router v3](https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/v3) prefill routing engine and runs on the host as a LiteLLM proxy. The sandbox reaches it through the OpenShell gateway. |
There was a problem hiding this comment.
Replace the GitHub repo link with docs or plain text.
This sentence links directly to the llm-router repository. The Markdown rules here only allow official documentation links, so please swap it for the corresponding docs page or keep the reference as plain text.
As per coding guidelines, "Do not add links to third-party code repositories or community collections; only link to official tool documentation (Node.js, Python, uv)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` at line 145, Replace the GitHub repo link in the sentence that
mentions "NVIDIA LLM Router v3" with either the official documentation URL for
the NVIDIA LLM Router (if available) or remove the hyperlink and keep the plain
text name "NVIDIA LLM Router v3"; ensure the README.md line referencing the
router does not link to a third-party code repo per guidelines and that the rest
of the sentence ("prefill routing engine" / "runs on the host as a LiteLLM
proxy" / "sandbox reaches it through the OpenShell gateway") remains unchanged.
|
/ok to test e2e36d9 |
…tegration' into pr-2202 Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
/ok to test 8316fc779463db1804f333b0400f1ff5b6ebc15b |
@ericksoa, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
/ok to test 8316fc7 |
Keep both our model router additions (loadBlueprintProfile, startModelRouter) and upstream's new pullAndResolveBaseImageDigest function in onboard.ts. Take upstream Dockerfile as-is (router installs on host, not in sandbox). Made-with: Cursor Co-authored-by: Cursor <cursoragent@cursor.com>
8316fc7 to
f32b45d
Compare
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # Dockerfile # src/lib/onboard.ts
Signed-off-by: Aaron Erickson <aerickson@nvidia.com> # Conflicts: # test/onboard.test.ts
|
/ok to test 58bb18b |
There was a problem hiding this comment.
♻️ Duplicate comments (4)
src/lib/onboard.ts (3)
6631-6647:⚠️ Potential issue | 🟠 Major | ⚡ Quick winResolve routed credential before the interactive/non-interactive gate.
Lines 6637-6647 only check
process.env, so saved/staged credentials are ignored here;bp.credential_defaultalso bypasses prompting without being written into env for downstream router/provider setup.Suggested fix
const routerCredentialEnv = bp.credential_env || "OPENAI_API_KEY"; credentialEnv = routerCredentialEnv; + const routedCredential = + hydrateCredentialEnv(routerCredentialEnv) || + normalizeCredentialValue(bp.credential_default || ""); + if (routedCredential) { + process.env[routerCredentialEnv] = routedCredential; + } const _providerKeyHint = (process.env.NEMOCLAW_PROVIDER_KEY || "").trim(); if (_providerKeyHint && !process.env[routerCredentialEnv]) { process.env[routerCredentialEnv] = _providerKeyHint; } if (isNonInteractive()) { - if (!process.env[routerCredentialEnv] && !bp.credential_default) { + if (!process.env[routerCredentialEnv]) { console.error( ` ${routerCredentialEnv} (or NEMOCLAW_PROVIDER_KEY) is required for Model Router in non-interactive mode.`, ); process.exit(1); } } else { - if (!process.env[routerCredentialEnv] && !bp.credential_default) { + if (!process.env[routerCredentialEnv]) { await ensureNamedCredential(routerCredentialEnv, "Model Router API key", null); } }Based on learnings,
hydrateCredentialEnv(envName)must resolve the specific key before fallback staging and should prevent stale credentials from overriding explicit env credentials.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 6631 - 6647, The code validates router credentials by only checking process.env and bp.credential_default, which skips staged or saved credentials; call hydrateCredentialEnv(routerCredentialEnv) (or the existing hydrateCredentialEnv function) before the interactive/non-interactive checks so any stored/staged key is resolved into the env and prevents stale fallbacks from overriding explicit env values; specifically, before using routerCredentialEnv/_providerKeyHint and before the isNonInteractive() branch, invoke hydrateCredentialEnv(routerCredentialEnv), then proceed to set credentialEnv and run the existing logic (including ensureNamedCredential) so downstream router/provider setup sees the resolved credential.
6666-6667:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd reverse mapping for persisted routed provider.
This branch persists
provider = "nvidia-router", but recovery mapping still lacks"nvidia-router" -> "routed", so non-interactive reruns can silently fall back to"build".Suggested fix (in
providerNameToOptionKey)function providerNameToOptionKey( name: string | null | undefined, opts: { hasNimContainer?: boolean } = {}, ): string | null { if (!name) return null; + if (name === "nvidia-router") return "routed"; if (name === "ollama-local") return "ollama";🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 6666 - 6667, The persisted provider value "nvidia-router" is not mapped back to the routed option, causing recovery to fall back to "build"; update the reverse mapping in the providerNameToOptionKey function to include an entry mapping "nvidia-router" (and any equivalent persisted keys) to "routed" so that when provider is read (e.g., provider variable set from bp.provider_name) it correctly resolves to the routed option; locate providerNameToOptionKey and add the "nvidia-router" -> "routed" case along with any related string variants to ensure non-interactive reruns recover the routed selection.
857-891:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPrevent false-positive router startup when port 4000 is already serving
/health.At Line 881, success can come from a pre-existing listener; plus the spawned child can exit before the loop still returns success.
Suggested fix
+ const preexistingHealthy = await new Promise((resolve) => { + http + .get(`http://127.0.0.1:${port}/health`, (res: import("node:http").IncomingMessage) => + resolve((res.statusCode || 0) >= 200 && (res.statusCode || 0) < 300), + ) + .on("error", () => resolve(false)); + }); + if (preexistingHealthy) { + throw new Error( + `Port ${port} already has a healthy router endpoint; refusing to start a second router.`, + ); + } + const child = spawn( @@ child.unref(); + let exited = false; + child.once("exit", () => { + exited = true; + }); @@ - if (healthy) return pid; + if (healthy && !exited) return pid; + if (exited) break; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 857 - 891, The health-check loop can return a false positive if another process already serves /health or if the spawned child exited; update the logic around spawn/child and the loop (symbols: spawn call that creates child, child, pid, ROUTER_HEALTH_RETRIES, ROUTER_HEALTH_INTERVAL_MS) so that after receiving a healthy response you also verify the spawned process is still running (e.g., ensure child.exitCode === null and !child.killed or use process.kill(pid, 0) for existence) before returning pid; if the child has exited, treat it as not healthy and continue retries (and surface/log the child.exitCode/error), and consider failing after retries by killing/stopping the child if necessary.scripts/install.sh (1)
1334-1338:⚠️ Potential issue | 🟠 MajorFail routed installs when router prerequisites are unavailable.
install_model_router_if_present()still returns success whenpip3is missing or thellm-routercheckout is absent, and both submodule-init call sites still downgrade failures to warnings. WithNEMOCLAW_PROVIDER=routed, that lets step 2 succeed without a usable host router and pushes the failure into onboard.Suggested guard
install_model_router_if_present() { local root="$1" local router_dir="$root/nemoclaw-blueprint/router/llm-router" + local routed=0 + [[ "${NEMOCLAW_PROVIDER:-}" == "routed" ]] && routed=1 - command_exists pip3 || return 0 + if ! command_exists pip3; then + [[ "$routed" -eq 1 ]] && error "pip3 is required for routed inference." + return 0 + fi [[ -d "$router_dir" ]] || return 0 if [[ ! -f "$router_dir/pyproject.toml" && ! -f "$router_dir/setup.py" ]]; then - warn "Skipping model router install — llm-router submodule is not initialized." + [[ "$routed" -eq 1 ]] && error "llm-router is required for routed inference but is not initialized." + warn "Skipping model router install — llm-router submodule is not initialized." return 0 fi spin "Installing model router" pip3 install --quiet --user "${router_dir}[prefill,proxy]" } ... - spin "Initializing ${_CLI_DISPLAY} submodules" init_nemoclaw_submodules "$NEMOCLAW_SOURCE_ROOT" \ - || warn "Submodule initialization failed — model router support may be unavailable" + if ! spin "Initializing ${_CLI_DISPLAY} submodules" init_nemoclaw_submodules "$NEMOCLAW_SOURCE_ROOT"; then + [[ "${NEMOCLAW_PROVIDER:-}" == "routed" ]] && error "Failed to initialize the llm-router submodule required for routed inference." + warn "Submodule initialization failed — model router support may be unavailable" + fiAlso applies to: 1358-1359, 1382-1383
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/install.sh` around lines 1334 - 1338, install_model_router_if_present() currently returns success when prerequisites are missing (checks like command_exists pip3, [[ -d "$router_dir" ]], and existence of "$router_dir/pyproject.toml" or "$router_dir/setup.py"), allowing routed provider flows to continue without a router; change those early-return paths to propagate failure (non-zero exit) when NEMOCLAW_PROVIDER is "routed" so installs fail fast: replace the noop/return 0 behavior in the checks for pip3, router_dir, and missing pyproject/setup with a non-zero return (or exit) when the environment expects a routed provider, and update the places that call install_model_router_if_present() (the call sites that currently downgrade errors to warnings) to treat the function’s non-zero return as a fatal error instead of a warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@scripts/install.sh`:
- Around line 1334-1338: install_model_router_if_present() currently returns
success when prerequisites are missing (checks like command_exists pip3, [[ -d
"$router_dir" ]], and existence of "$router_dir/pyproject.toml" or
"$router_dir/setup.py"), allowing routed provider flows to continue without a
router; change those early-return paths to propagate failure (non-zero exit)
when NEMOCLAW_PROVIDER is "routed" so installs fail fast: replace the
noop/return 0 behavior in the checks for pip3, router_dir, and missing
pyproject/setup with a non-zero return (or exit) when the environment expects a
routed provider, and update the places that call
install_model_router_if_present() (the call sites that currently downgrade
errors to warnings) to treat the function’s non-zero return as a fatal error
instead of a warning.
In `@src/lib/onboard.ts`:
- Around line 6631-6647: The code validates router credentials by only checking
process.env and bp.credential_default, which skips staged or saved credentials;
call hydrateCredentialEnv(routerCredentialEnv) (or the existing
hydrateCredentialEnv function) before the interactive/non-interactive checks so
any stored/staged key is resolved into the env and prevents stale fallbacks from
overriding explicit env values; specifically, before using
routerCredentialEnv/_providerKeyHint and before the isNonInteractive() branch,
invoke hydrateCredentialEnv(routerCredentialEnv), then proceed to set
credentialEnv and run the existing logic (including ensureNamedCredential) so
downstream router/provider setup sees the resolved credential.
- Around line 6666-6667: The persisted provider value "nvidia-router" is not
mapped back to the routed option, causing recovery to fall back to "build";
update the reverse mapping in the providerNameToOptionKey function to include an
entry mapping "nvidia-router" (and any equivalent persisted keys) to "routed" so
that when provider is read (e.g., provider variable set from bp.provider_name)
it correctly resolves to the routed option; locate providerNameToOptionKey and
add the "nvidia-router" -> "routed" case along with any related string variants
to ensure non-interactive reruns recover the routed selection.
- Around line 857-891: The health-check loop can return a false positive if
another process already serves /health or if the spawned child exited; update
the logic around spawn/child and the loop (symbols: spawn call that creates
child, child, pid, ROUTER_HEALTH_RETRIES, ROUTER_HEALTH_INTERVAL_MS) so that
after receiving a healthy response you also verify the spawned process is still
running (e.g., ensure child.exitCode === null and !child.killed or use
process.kill(pid, 0) for existence) before returning pid; if the child has
exited, treat it as not healthy and continue retries (and surface/log the
child.exitCode/error), and consider failing after retries by killing/stopping
the child if necessary.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4b5a3a35-469e-4830-ac89-6b306e72e939
📒 Files selected for processing (5)
README.mdscripts/install.shsrc/lib/onboard-providers.tssrc/lib/onboard.tstest/onboard.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard-providers.ts
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
/ok to test 02a4ac4 |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/install.sh`:
- Around line 1336-1354: install_model_router_if_present currently calls pip3
install directly which can cause the whole script to exit on failure even when a
routed provider isn't requested; change the final step so the pip3 install (the
spin "Installing model router" pip3 install --quiet --user
"${router_dir}[prefill,proxy]") is executed and its exit status is checked, and
if it fails call error "..." only when is_routed_provider_requested returns
true, otherwise call warn "Skipping model router install — pip install failed"
(and do not hard-exit); implement this by replacing the direct spin+pip3 line
with an if ! spin ...; then if is_routed_provider_requested; error "pip3 install
of llm-router failed"; else warn "Skipping model router install — pip3 install
failed"; fi; fi so the script only hard-fails for routed installs.
- Around line 1330-1334: The is_routed_provider_requested function uses
Bash-4-only lowercase expansion (${provider,,}) which fails on macOS Bash 3.2;
change the normalization to a POSIX-compatible approach (e.g., pipe provider
through tr '[:upper:]' '[:lower:]' or use printf + tr) before comparing, so
replace the ${provider,,} usage in is_routed_provider_requested with a
POSIX-safe lowercase conversion and then test [[ "$provider" == "routed" ]] as
before.
In `@src/lib/onboard.ts`:
- Around line 6653-6692: The router is being started prematurely inside the
selected.key === "routed" provider-selection branch (see
loadBlueprintProfile("routed"), ensureNamedCredential, and startModelRouter
usage); move the actual startModelRouter(...) call and the onboarding state
mutation (onboardSession.updateSession(...).routerPid) out of this branch and
instead invoke a reconcile/start step at a committed lifecycle point for routed
sandboxes (e.g., after sandbox creation completes or in the resume/reuse path),
and add a verification/restart routine that runs on resume/reuse (and before
marking routed inference ready) to ensure the router process is present and
restarted if needed; keep credential collection
(hydrateCredentialEnv/ensureNamedCredential) in the selection branch but defer
starting or persisting the router PID until the centralized reconcile/start
logic (and call startModelRouter from there).
- Around line 835-843: The isRouterHealthy probe lacks a request timeout so
http.get can hang indefinitely; update the isRouterHealthy function to abort or
timeout the request after a short period (e.g., a few seconds) and
resolve(false) on timeout or error. Locate isRouterHealthy and the http.get call
(using the port variable) and attach a timeout handler—either call
request.setTimeout(...) to destroy/abort the socket or use an AbortController
with a timer—to ensure the promise resolves false on timeout and cleans up
listeners so await isRouterHealthy() cannot block forever.
In `@test/onboard.test.ts`:
- Line 109: Add a runtime guard check inside isOnboardTestInternals to validate
the newly exposed member providerNameToOptionKey: ensure the internals object
has own property "providerNameToOptionKey" and that typeof
internals.providerNameToOptionKey === "function" (or allow null/undefined if
your type permits) so the upfront type guard fails clearly if a stale/dist drift
removes or changes this helper; update the guard logic used by any callers of
providerNameToOptionKey (e.g., tests referenced in onboard.test.ts) to rely on
this enhanced check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e0559819-9be4-4f08-87bc-57f6ec981541
📒 Files selected for processing (3)
scripts/install.shsrc/lib/onboard.tstest/onboard.test.ts
|
/ok to test 56e68b6 |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard-session.ts (1)
118-134:⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
routerPidinSessionUpdatesis never persisted via update flows.
filterSafeUpdatesdoes not handlerouterPid, so passing it tomarkStepComplete,completeSession, etc. will silently drop the value. IfrouterPidupdates are expected to flow throughSessionUpdates, add a case infilterSafeUpdates:if (typeof updates.nimContainer === "string") safe.nimContainer = updates.nimContainer; + if (typeof updates.routerPid === "number" && Number.isInteger(updates.routerPid) && updates.routerPid > 0) { + safe.routerPid = updates.routerPid; + }If
routerPidis intentionally only set at session creation, consider removing it fromSessionUpdatesto avoid confusion.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard-session.ts` around lines 118 - 134, SessionUpdates includes routerPid but filterSafeUpdates currently omits routerPid so any routerPid passed into update flows (e.g., markStepComplete, completeSession) gets dropped; either add routerPid to the allowed fields in filterSafeUpdates so it is propagated through update flows, or if routerPid should only be set at creation remove it from the SessionUpdates interface to avoid misuse—update the filterSafeUpdates implementation and adjust call sites (markStepComplete, completeSession) accordingly to ensure routerPid is preserved when intended.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 805-808: When constructing the router object (variable router
built from parsed?.components?.router and profile), avoid unconditionally
overwriting an existing parsed.components.router.credential_env with
profile.credential_env; instead only apply profile.credential_env when it is
explicitly set (not undefined/null), and otherwise preserve
parsed?.components?.router.credential_env. Update the router creation logic (the
spread of parsed?.components?.router and the assignment of credential_env) so
that credential_env is conditionally included from profile (or falls back to the
existing parsed value) to prevent silently clobbering a valid router-level
credential_env.
- Around line 961-971: reconcileModelRouter currently calls
hydrateCredentialEnv(routerCredentialEnv) but proceeds to startModelRouter even
if the resolved credential is absent; update reconcileModelRouter to read the
resolved credential name (routerCredentialEnv or bp.router.credential_env) after
hydrateCredentialEnv and assert the corresponding value is present (e.g.,
process.env[routerCredentialEnv] or the module's credential getter) and if
missing log an explicit error and fail fast (return/throw) before calling
startModelRouter(bp.router); this check should reference hydrateCredentialEnv,
routerCredentialEnv, and startModelRouter to locate the code to change.
---
Outside diff comments:
In `@src/lib/onboard-session.ts`:
- Around line 118-134: SessionUpdates includes routerPid but filterSafeUpdates
currently omits routerPid so any routerPid passed into update flows (e.g.,
markStepComplete, completeSession) gets dropped; either add routerPid to the
allowed fields in filterSafeUpdates so it is propagated through update flows, or
if routerPid should only be set at creation remove it from the SessionUpdates
interface to avoid misuse—update the filterSafeUpdates implementation and adjust
call sites (markStepComplete, completeSession) accordingly to ensure routerPid
is preserved when intended.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2156c618-74c5-4b75-b79e-29b1a6c52fcf
📒 Files selected for processing (4)
scripts/install.shsrc/lib/onboard-session.tssrc/lib/onboard.tstest/onboard.test.ts
|
/ok to test 0c26b52 |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 869-878: The code builds credEnvVars (using buildSubprocessEnv,
routerCfg.credential_env, and environment keys like
NVIDIA_API_KEY/OPENAI_API_KEY/_providerKey) and injects it when starting a
router in startModelRouter(), but reconcileModelRouter() may reuse an existing
healthy listener without verifying those credentials, allowing stale API keys to
remain after rotations; update reconcileModelRouter() to detect changes to the
effective routed credential (compare the current process.env-derived credential
value for routerCfg.credential_env or the resolved provider key against the
listener's last-started credential) and force-restart or replace the child
listener if the credential differs, or alternatively ensure the child is always
started with the new credEnvVars (propagate the new buildSubprocessEnv output)
so listeners never continue with revoked keys; apply the same check/fix to the
analogous logic around the other block noted (lines ~958-973).
- Around line 6714-6736: The code selects routerCredentialEnv from
bp.credential_env causing mismatch when a blueprint sets
components.router.credential_env; change the selection to prefer
bp.router?.credential_env first (e.g., compute routerCredentialEnv =
bp.router?.credential_env || bp.credential_env || "OPENAI_API_KEY") so all
subsequent uses (hydrateCredentialEnv, process.env assignment, the
_providerKeyHint check, isNonInteractive handling and ensureNamedCredential
calls) consistently use the router-specific env key; also update the other
occurrence (the similar block referenced around reconcileModelRouter) to use the
same router-first lookup so loadBlueprintProfile/reconcileModelRouter paths are
aligned.
- Around line 886-913: The spawn() child process may emit an 'error' before pid
is available; add a child.once('error', (err) => { ... }) listener immediately
after creating child (and before calling child.unref() and before checking const
pid) to capture spawn failures and convert them to a thrown or logged onboarding
error (include err.message), then remove or ignore the listener if the process
starts successfully; update the block around the spawn call, child.unref(),
const pid check, and the existing child.once('exit', ...) so the 'error' event
is handled deterministically for the model-router child.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 75ce3224-787c-41d6-a7c2-9aba6cbd816d
📒 Files selected for processing (2)
src/lib/onboard-session.tssrc/lib/onboard.ts
|
/ok to test 89aebda |
ericksoa
left a comment
There was a problem hiding this comment.
Approved after resolving the remaining router pool model-identity issue. Current head is mergeable, CodeRabbit is passing, and focused local validation passed: npm run build:cli; npx vitest run nemoclaw/src/blueprint/runner.test.ts test/onboard.test.ts.
## Summary - Bump docs metadata to 0.0.36 and refresh generated NemoClaw user skills. - Document Model Router onboarding, validation retries, Ollama tool checks, Hermes policy behavior, and deployment verification updates. - Remove suppressed experimental command references from public docs per `docs/.docs-skip`. ## Source summary - #2202 -> `docs/get-started/quickstart.md`, `docs/inference/inference-options.md`, `docs/reference/architecture.md`: Document Model Router setup and routed inference architecture. - #3128 -> `docs/get-started/quickstart.md`, `docs/reference/commands.md`: Document deployment verification and HTTP 401 health handling. - #3104 -> `docs/inference/inference-options.md`: Document retry behavior for transient provider validation failures. - #3121 -> `docs/reference/architecture.md`: Document agent-scoped model/provider compatibility manifests. - #3046 -> `docs/reference/architecture.md`: Tie model-specific compatibility setup to known model/provider behavior. - #3097 -> `docs/inference/use-local-inference.md`: Document Ollama tool-calling capability validation. - #3082 -> `docs/reference/commands.md`: Document `NEMOCLAW_SANDBOX_NAME` as the interactive sandbox-name default. - f586cc5, 3442adf -> `docs/get-started/quickstart-hermes.md`, `docs/reference/network-policies.md`: Document Hermes agent-specific baseline policy endpoints. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `npm run build:cli` - `rg` skip-term scan for `docs/` and generated user skills Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Model Router provider for complexity-based routed inference. * Ollama/local inference onboarding now validates tool-calling capability. * Added `local-inference` network policy preset. * **Documentation** * New integration policy examples (Outlook, Telegram, Slack, Discord, GitHub, Jira, etc.). * Clarified config immutability workflow and sandbox writable paths. * Hermes baseline network policy documented. * **Improvements** * Health checks treat device-auth responses as live; transient validation retries. * Installer performs pre-install reachability checks; CLI onboarding gained a --fresh option. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Changes
Verification
Test plan
Made with Cursor
Signed-off-by: Vinay Bagade vbagade@nvidia.com
Summary by CodeRabbit
New Features
Documentation
Tests