Add resource profiles to nemoclaw onboard and nemoclaw resources command #3336

@nvshridhar

Problem Statement

Summary

Add percentage-based CPU/RAM resource profiles to nemoclaw onboard and a new nemoclaw resources command for hardware inventory. Resource limits are applied to sandbox pods via OpenShell's --cpu-request/--cpu-limit/--memory-request/--memory-limit flags (OpenShell PR #1063).

Motivation

Currently, sandbox pods are created without resource requests or limits, so Kubernetes places them in the BestEffort QoS class. On shared machines (DGX Spark, laptops running other workloads), this leads to:

  • Sandbox consuming all available CPU/RAM, starving host-side processes (Ollama, IDE)
  • No predictable resource budgeting for multi-sandbox scenarios
  • No visibility into hardware capacity vs. sandbox allocation

Proposed Design

Proposed Changes

1. nemoclaw resources command (hardware inventory)

  • Reports CPU cores, RAM, GPU VRAM, and K8s allocatable capacity
  • Supports --json output for scripting
  • Uses K8s allocatable (not host totals) when available
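
As a rough illustration of the "allocatable over host totals" rule, the sketch below converts Kubernetes quantity strings into plain numbers and falls back to host totals when no allocatable value is available. The function names are hypothetical and are not taken from the nemoclaw code base, and only a subset of the Kubernetes quantity grammar is handled (binary memory suffixes and the "m" millicore suffix).

```python
from typing import Optional

# Binary suffixes used by Kubernetes memory quantities (subset only).
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_memory_bytes(quantity: str) -> int:
    """Convert a quantity such as "15Gi" or "32874468Ki" to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a byte count


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "22" or "500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def effective_capacity(host_total: int, allocatable: Optional[int]) -> int:
    """Prefer the K8s allocatable value; fall back to the host total."""
    return allocatable if allocatable is not None else host_total


print(parse_memory_bytes("15Gi"))
print(parse_cpu_millicores("22"))
print(effective_capacity(host_total=24000, allocatable=22000))
```

Allocatable is usually lower than the host total because the kubelet reserves capacity for system daemons, which is why percentages resolved against it are safer on a shared machine.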

2. Resource profile integration in nemoclaw onboard

  • Interactive picker with pre-defined profiles (creator, gamer, developer, etc.)
  • Custom profile option with percentage or absolute values
  • Environment variable overrides: NEMOCLAW_RESOURCE_PROFILE, NEMOCLAW_CPU_LIMIT, etc.
  • Percentage resolution against K8s allocatable capacity (e.g., "25%" of 22 cores = 5.5, resolved to 5 whole cores)
  • Graceful degradation: when OpenShell CLI lacks resource flags, displays resolved values but skips enforcement
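
The percentage resolution described above can be sketched as follows. This is a minimal illustration, not the proposed implementation: `resolve_cpu` is a hypothetical name, and the rounding rule (round down, never below one core) is an assumption inferred from the "25% of 22 cores" example and the pod spec values reported under Testing.

```python
def resolve_cpu(spec: str, allocatable_cores: int) -> int:
    """Resolve a CPU spec such as "25%" or "4" to a whole number of cores."""
    spec = spec.strip()
    if spec.endswith("%"):
        percent = int(spec[:-1])
        if not 1 <= percent <= 100:
            raise ValueError(f"percentage out of range: {spec}")
        # Assumption: round down with integer arithmetic, minimum one core.
        return max(1, allocatable_cores * percent // 100)
    # Absolute value: taken as a whole number of cores.
    return int(spec)


print(resolve_cpu("25%", 22))  # 5
print(resolve_cpu("50%", 22))  # 11
print(resolve_cpu("4", 22))    # 4
```

With 22 allocatable cores, a 25% request and a 50% limit resolve to 5 and 11 cores, which matches the pod spec in the Testing section.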

3. Blueprint schema update

  • resource_profiles section in blueprint.yaml with percentage-based patterns (e.g., cpu_limit: "25%")
  • JSON schema validation for 1%–100% whole-number percentages
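
One way to express the "1%–100% whole-number" constraint is a regular expression, which could also serve as a `pattern` in a JSON schema. The pattern and the helper name below are assumptions for illustration; the actual schema in the PR may differ.

```python
import re

# Hypothetical pattern: "1%" through "100%", no leading zeros, no decimals.
PERCENT_PATTERN = re.compile(r"(100|[1-9][0-9]?)%")


def is_valid_percentage(value: str) -> bool:
    """Return True only for whole-number percentages from 1% to 100%."""
    return PERCENT_PATTERN.fullmatch(value) is not None


for candidate in ["25%", "100%", "0%", "101%", "2.5%", "25"]:
    print(candidate, is_valid_percentage(candidate))
```

Rejecting "0%" and fractional values at validation time keeps the resolver simple, since it never has to handle a zero or sub-core allocation.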

Testing

Verified end-to-end on cgroup v2:

  • Pod spec: resources.requests.cpu=5, resources.limits.cpu=11, resources.requests.memory=7Gi, resources.limits.memory=15Gi
  • cgroup enforcement: cpu.max="1100000 100000" (quota and period in µs, i.e. 11 cores), memory.max=16106127360 (15Gi)
  • CPU burn test: 100 throttle events in 10s (22 threads capped at 11 cores)
  • Memory OOM test: exit 137 when exceeding 15Gi limit
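
The cgroup figures above follow from simple arithmetic, sketched below. The helper names are hypothetical; the 100 ms period is the default CPU period, and cgroup v2 stores `cpu.max` as the quota and period separated by a space.

```python
CPU_PERIOD_US = 100_000  # default CPU period: 100 ms, in microseconds
GIB = 1024**3


def expected_cpu_max(cores: int, period_us: int = CPU_PERIOD_US) -> str:
    """Return the expected cpu.max content for a whole-core limit."""
    return f"{cores * period_us} {period_us}"


def expected_memory_max(gib: int) -> int:
    """Return the expected memory.max value in bytes for a Gi limit."""
    return gib * GIB


def periods_in(seconds: int, period_us: int = CPU_PERIOD_US) -> int:
    """Number of enforcement periods that elapse in the given time."""
    return seconds * 1_000_000 // period_us


print(expected_cpu_max(11))     # 1100000 100000
print(expected_memory_max(15))  # 16106127360
print(periods_in(10))           # 100
```

A 10-second window contains 100 periods, so 100 throttle events is consistent with the 22-thread burn test hitting the 11-core quota in every period.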

Dependencies

  • OpenShell PR #1063: adds --cpu-request/--cpu-limit/--memory-request/--memory-limit flags to openshell sandbox create
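
A minimal sketch of how the resolved values might be turned into these flags, including the graceful-degradation path from the onboard design. The flag names come from this proposal; detecting support by searching the CLI help text, the function name, and the value formats passed to the flags are all assumptions made for illustration.

```python
# Flag names as listed in this proposal (OpenShell PR #1063).
RESOURCE_FLAGS = {
    "cpu_request": "--cpu-request",
    "cpu_limit": "--cpu-limit",
    "memory_request": "--memory-request",
    "memory_limit": "--memory-limit",
}


def resource_flag_args(resolved: dict[str, str], help_text: str) -> list[str]:
    """Build the resource flag arguments, or none if the CLI lacks them.

    Assumption: support is detected by looking for every flag in the
    help output of the sandbox create command, passed in as help_text.
    """
    if not all(flag in help_text for flag in RESOURCE_FLAGS.values()):
        return []  # degrade gracefully: report values, skip enforcement
    args: list[str] = []
    for key, flag in RESOURCE_FLAGS.items():
        if key in resolved:
            args += [flag, resolved[key]]
    return args


resolved = {
    "cpu_request": "5",
    "cpu_limit": "11",
    "memory_request": "7Gi",
    "memory_limit": "15Gi",
}
new_help = "--cpu-request --cpu-limit --memory-request --memory-limit"
print(resource_flag_args(resolved, new_help))
print(resource_flag_args(resolved, "older CLI without resource flags"))
```

Returning an empty list for an older CLI lets the caller still print the resolved values while creating the sandbox without limits.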

Scope

This is PR 1/2. A follow-up PR will add:

  • nemoclaw <sandbox> resize (live resource adjustment)
  • nemoclaw <sandbox> verify (cgroup validation)

Alternatives Considered

No response

Category

enhancement: feature

Checklist

  • I searched existing issues and this is not a duplicate
  • This is a design proposal, not a "please build this" request

Metadata

Assignees

No one assigned

Labels

  • area: cli (Command line interface, flags, terminal UX, or output)
  • area: inference (Inference routing, serving, model selection, or outputs)
  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • platform: dgx-spark (Affects DGX Spark hardware or workflows)