[macOS][Onboard] Model Router venv build fails when latest host python is broken; no fallback #3781

@hulynn

Description

Choosing Option 8 "Model Router (experimental)" in nemoclaw onboard crashes at step [4/8] Setting up inference provider because the host-side Model Router venv setup unconditionally picks the highest-version python3.X on PATH and runs python3.X -m ensurepip --upgrade --default-pip. If that python is broken at the stdlib level (e.g. Homebrew python@3.14 whose pyexpat extension links a libexpat symbol the system libexpat does not export), the bootstrap fails and onboarding aborts before any sandbox is built.

Three NemoClaw-side issues compound the host environment problem:

  1. No health probe — NemoClaw does not run a smoke test on the candidate interpreter (e.g. python3.X -c "import pyexpat, ensurepip, ssl") before adopting it.
  2. No fallback — even though a healthy python3.11 was present on the same PATH, NemoClaw never tried it.
  3. Error message hides the real cause — the user only sees "Failed to create Model Router virtual environment" with a one-line ensurepip exit-status reference; nothing points at "your host python is broken, here is the import error". A new user will reasonably blame NemoClaw.
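The health probe from item 1 could look roughly like the sketch below. It is not NemoClaw code: the function name `probe_interpreter`, the `timeout` default, and the return shape are assumptions made for illustration. Only the module list (pyexpat, ensurepip, ssl) comes from this report.

```python
import subprocess
import sys

# Modules named in this report as the minimum smoke test.
REQUIRED_MODULES = ("pyexpat", "ensurepip", "ssl")


def probe_interpreter(python_path, modules=REQUIRED_MODULES, timeout=15):
    """Return (ok, detail): ok is True if every module imports cleanly.

    detail carries the interpreter's stderr so a caller can show the
    real import error instead of a generic failure message.
    """
    code = "import " + ", ".join(modules)
    try:
        result = subprocess.run(
            [python_path, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, permission problem, or a hung interpreter.
        return False, str(exc)
    return result.returncode == 0, result.stderr.strip()


if __name__ == "__main__":
    # Probe the interpreter running this script as a quick demonstration.
    ok, detail = probe_interpreter(sys.executable)
    print("healthy" if ok else f"broken: {detail}")
```

Running the import in a child process matters here: a dlopen failure in the candidate interpreter cannot be observed from inside a different, healthy interpreter.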

Environment

Device:        MacBook (M4, Apple Silicon)
OS:            macOS 26.1 (Darwin 25.1.0)
Architecture:  arm64
Node.js:       v23.10.0
npm:           11.3.0
Docker:        27.4.0 (Colima context)
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.44
OpenClaw:      2026.4.24 (cbcfdf6) (sandbox build was never reached)
Pythons on PATH:
  /opt/homebrew/bin/python3.11  — healthy, pyexpat OK, ensurepip OK
  /opt/homebrew/bin/python3.14  — Homebrew python@3.14 3.14.5, pyexpat broken

Steps to Reproduce

  1. Start from a macOS host where python3.14 exists on PATH but its stdlib is broken (current Homebrew python@3.14 when the system libexpat is older than the libexpat that python@3.14's pyexpat was built against). To set up and verify this state quickly:

    brew install python@3.14
    /opt/homebrew/bin/python3.14 -c "import pyexpat"
    # If this raises "Symbol not found: _XML_SetAllocTrackerActivationThreshold", the host reproduces the bug.

  2. Run:

    nemoclaw onboard

  3. At the Choose [1]: prompt, type 8 to select Model Router (experimental).

  4. Pick the default sandbox name and confirm with Y at the Review step.

  5. Onboard advances to [4/8] Setting up inference provider, then prints "Starting model router...", "Initializing Model Router source...", and "Preparing Model Router environment: /Users/<you>/.nemoclaw/model-router-venv".

  6. ensurepip exits non-zero, onboard aborts.

Expected Result

  • NemoClaw smoke-tests each candidate python interpreter (at minimum import ensurepip, pyexpat, ssl) before adopting it for venv creation.
  • If the highest-version python fails the probe, NemoClaw falls back to the next-highest healthy python (python3.11 in this case is right there on PATH).
  • If no candidate is healthy, the error surfaced to the user names the actual failing import (pyexpat dlopen error, missing symbol) and points at the broken host python — not just "Failed to create Model Router virtual environment".
  • The docs pin a known-supported python version range (e.g. 3.11–3.13) so users know what to install.
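The probe-then-fall-back behavior described above can be sketched as follows. This is an illustration under assumptions, not the NemoClaw implementation: `sort_candidates`, `pick_interpreter`, and the injected `probe` callable are hypothetical names, and discovery of the candidates on PATH is left out.

```python
import re

# Matches names such as "python3.11"; the capture group is the minor version.
_NAME = re.compile(r"^python3\.(\d+)$")


def sort_candidates(names):
    """Keep python3.X names and order them newest first."""
    versioned = []
    for name in names:
        match = _NAME.match(name)
        if match:
            versioned.append((int(match.group(1)), name))
    return [name for _, name in sorted(versioned, reverse=True)]


def pick_interpreter(names, probe):
    """Return the newest candidate that passes probe.

    probe is a hypothetical callable taking a name and returning
    (ok, detail). If nothing passes, raise with every failure listed.
    """
    failures = []
    for name in sort_candidates(names):
        ok, detail = probe(name)
        if ok:
            return name
        failures.append(f"{name}: {detail}")
    raise RuntimeError(
        "No healthy python3.X found on PATH:\n  " + "\n  ".join(failures)
    )


if __name__ == "__main__":
    # Stand-in probe mirroring the host in this report.
    def fake_probe(name):
        if name == "python3.14":
            return False, "ImportError: pyexpat: Symbol not found"
        return True, ""

    # Prints python3.11: the newest candidate fails, the next one passes.
    print(pick_interpreter(["python3", "python3.11", "python3.14"], fake_probe))
```

Comparing the minor version as an integer avoids the string-ordering trap where "3.9" would sort above "3.11".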

Actual Result

Onboard output (Option 8) before the crash:

Inference options:
  8) Model Router (experimental)
Choose [1]: 8
✓ Using Model Router: nvidia-router / nvidia-routed
Sandbox name (...) [my-assistant]: lynntest

Review configuration
Provider:      nvidia-router
Model:         nvidia-routed
Apply this configuration? [Y/n]: Y

[4/8] Setting up inference provider
✓ Active gateway set to 'nemoclaw'
Starting model router...
Initializing Model Router source...
Submodule path 'nemoclaw-blueprint/router/llm-router': checked out '2bd8dfaa751efb60aa4e7e49b270490dfbc0a68a'
Cloning into '/Users/lynnh/.nemoclaw/source/nemoclaw-blueprint/router/llm-router'...
Preparing Model Router environment: /Users/lynnh/.nemoclaw/model-router-venv
Error: Command '['/Users/lynnh/.nemoclaw/model-router-venv/bin/python3.14', '-m', 'ensurepip', '--upgrade', '--default-pip']' returned non-zero exit status 1.
✗ Failed to start model router: Failed to create Model Router virtual environment.

Onboard exits non-zero. No sandbox is created. The user cannot use Model Router at all on this host until they manually fix Homebrew python@3.14 — even though python3.11 is right there and would work.

Logs

Running the same ensurepip command by hand reveals the real root cause:

$ /opt/homebrew/bin/python3.14 -m ensurepip --upgrade --default-pip --verbose
...
  File "/opt/homebrew/Cellar/python@3.14/3.14.5/Frameworks/Python.framework/Versions/3.14/lib/python3.14/xml/parsers/expat.py", line 4, in <module>
    from pyexpat import *
ImportError: dlopen(/opt/homebrew/Cellar/python@3.14/3.14.5/Frameworks/Python.framework/Versions/3.14/lib/python3.14/lib-dynload/pyexpat.cpython-314-darwin.so, 0x0002):
  Symbol not found: _XML_SetAllocTrackerActivationThreshold
  Referenced from: pyexpat.cpython-314-darwin.so
  Expected in:     /usr/lib/libexpat.1.dylib

Smoke-testing the two pythons NemoClaw could have picked:

$ /opt/homebrew/bin/python3.11 -c "import pyexpat; print('OK', pyexpat.version_info)"
OK (2, 7, 1)

$ /opt/homebrew/bin/python3.14 -c "import pyexpat"
ImportError: dlopen(...pyexpat.cpython-314-darwin.so): Symbol not found: _XML_SetAllocTrackerActivationThreshold

NemoClaw selected python3.14 anyway.
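To surface the root cause shown in these logs, the onboard error could be built from the interpreter's stderr. The sketch below is one possible approach, not existing NemoClaw behavior: `summarize_failure` and its message wording are hypothetical, and the regular expressions only assume the dyld output format quoted above.

```python
import re

# Patterns taken from the dyld output quoted in this report.
_IMPORT_ERROR = re.compile(r"^ImportError: (.*)$", re.MULTILINE)
_MISSING_SYMBOL = re.compile(r"Symbol not found: (\S+)")
_EXPECTED_IN = re.compile(r"Expected in:\s+(\S+)")


def summarize_failure(python_path, stderr):
    """Build a short user-facing message from interpreter stderr."""
    lines = [f"Host interpreter {python_path} is broken."]
    symbol = _MISSING_SYMBOL.search(stderr)
    library = _EXPECTED_IN.search(stderr)
    if symbol and library:
        lines.append(f"Missing symbol {symbol.group(1)} in {library.group(1)}.")
    elif symbol:
        lines.append(f"Missing symbol {symbol.group(1)}.")
    else:
        # No dyld detail available; fall back to the ImportError text.
        error = _IMPORT_ERROR.search(stderr)
        if error:
            lines.append(f"Import failed: {error.group(1)}")
    return " ".join(lines)
```

Fed the stderr from the python3.14 run above, this yields a message that names both the missing symbol and /usr/lib/libexpat.1.dylib, which points the user at the host python rather than at NemoClaw.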

Related Bugs

Adjacent Model Router bugs (different stages, not duplicates of this):

  • NVB#6180064: Model Router accepts nvapi- key but LiteLLM proxy rejects it (post-onboard config).
  • NVB#6158321: Model Router HTTP 503 after successful onboard (post-onboard runtime).
  • NVB#6158324: "Model Router API key:" prompt didn't document where to get the key (closed/fixed).

This one fails earlier than all three — venv setup, before any sandbox is built.


NVB#6189271

Metadata

Labels

NV QA (Bugs found by the NVIDIA QA Team)
needs: triage (Awaiting maintainer classification)
