Model performance / capability audit across supported agents

## Purpose

This issue is the source of truth for the model performance and capability audit across the models and agent surfaces NemoClaw supports.

The goal is not just to confirm that each model can answer a one-shot prompt. The goal is to verify that each supported model works well as an agent model in the NemoClaw/OpenShell environment: tool calls, shell execution, multi-turn tool-result continuation, sub-agent delegation where applicable, and the provider-specific response shapes that our agents consume.

This is related to, but separate from, #3120:

- This issue tracks the audit matrix, evidence, pass/fail state, and follow-up work.
- #3120 tracks the architecture for organizing model-specific sandbox/OpenClaw setup records once an intervention is justified.

## Background

PR #3046 fixed a concrete Kimi K2.6/OpenClaw incompatibility where `moonshotai/kimi-k2.6` could emit a combined shell command such as `hostname; date; uptime` as one `exec` call. OpenClaw needs separate tool-call boundaries for persistence, replay, and tool-result correlation. The individual Kimi issue is closed as fixed in #2620.

That fix exposed the broader product requirement: every model exposed through onboarding should be validated as an agent model, not merely as a chat model. Some models need model-aware or provider-aware affordances to work correctly in shell-agent loops. Those affordances must be discovered, documented, tested, and either captured in the model-specific setup registry proposed by #3120 or classified as provider-class transport policy.

Initial audit artifact:

- `model-affordance-audit.md` generated from `main` at `f5b8144d577ccd680875291d33eaabb656509d5a`

## Agent surfaces in scope

Audit the model behavior against the agent surfaces NemoClaw currently supports:

- [ ] OpenClaw primary `main` agent through the default NemoClaw sandbox path
- [ ] OpenClaw CLI prompt path, including shell/tool execution trajectories
- [ ] OpenClaw browser/gateway path when it changes request/response behavior from the CLI path
- [ ] OpenClaw sub-agent delegation through `sessions_spawn` / `agents.list`
- [ ] NemoHermes / Hermes sandbox path and OpenAI-compatible API surface
- [ ] Task-specific auxiliary models documented by NemoClaw examples, such as the Omni vision sub-agent pattern, when credentials and runnable test coverage are available

Messaging integrations are not separate model-capability targets unless the message channel changes model routing or response handling. The core model audit should run at the agent/runtime boundary first.

## Supported model inventory to audit

### NVIDIA Endpoints

- [x] `nvidia/nemotron-3-super-120b-a12b`
- [x] `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning`
- [x] `z-ai/glm-5.1`
- [x] `minimaxai/minimax-m2.7`
- [x] `moonshotai/kimi-k2.6`
- [x] `openai/gpt-oss-120b`
- [x] `deepseek-ai/deepseek-v4-pro`

### OpenAI

- [ ] `gpt-5.4`
- [ ] `gpt-5.4-mini`
- [ ] `gpt-5.4-nano`
- [ ] `gpt-5.4-pro-2026-03-05`

### Anthropic

- [x] `claude-sonnet-4-6`
- [x] `claude-haiku-4-5`
- [x] `claude-opus-4-6`

### Gemini

- [x] `gemini-3.1-pro-preview`
- [ ] `gemini-3.1-flash-lite-preview`
- [ ] `gemini-3-flash-preview`
- [x] `gemini-2.5-pro`
- [ ] `gemini-2.5-flash`
- [ ] `gemini-2.5-flash-lite`

### Local and experimental providers

- [ ] Local Ollama default path, including `qwen2.5:7b`
- [ ] Local Ollama default path, including `nemotron-3-nano:30b` when hardware permits
- [ ] Local Ollama arbitrary installed model path, gated by declared `tools` capability
- [ ] Local vLLM managed DGX Spark/Station profile: `Qwen/Qwen3.6-27B-FP8`
- [ ] Local vLLM managed Linux NVIDIA GPU profile: `nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8`
- [ ] Local NVIDIA NIM experimental path
- [ ] Other OpenAI-compatible endpoint path
- [ ] Other Anthropic-compatible endpoint path

## Audit results

Completed rows:

- [x] OpenClaw / Anthropic / `claude-sonnet-4-6` — `pass`. Validated on 2026-05-07 UTC on `main` `d98dd8c97d1ddddfd7b6d82962934493dd6e139f` with local sandbox `anth-sonnet-openclaw-audit-0507`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider key `anthropic`, primary model `anthropic/claude-sonnet-4-6`, and API `anthropic-messages` via `https://inference.local`. Workflow: `ANTHROPIC_API_KEY=<redacted> NEMOCLAW_PROVIDER=anthropic NEMOCLAW_MODEL=claude-sonnet-4-6 ./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name anth-sonnet-openclaw-audit-0507 --agent openclaw --fresh --recreate-sandbox`, then `openshell sandbox exec -n anth-sonnet-openclaw-audit-0507 --timeout 900 -- /usr/local/bin/nemoclaw-start openclaw agent --agent main --json --thinking off --session-id ... -m <standard and multi-turn prompts>`. Evidence: `/sandbox/.openclaw/agents/main/sessions/anth-sonnet-openclaw-oneshot-1778118974.trajectory.jsonl`, `.jsonl`, `/sandbox/.openclaw/agents/main/sessions/anth-sonnet-openclaw-multiturn-1778119436.trajectory.jsonl`, and `.jsonl`. One-shot recorded `finalStatus: success`, `timedOut: false`, no prompt error, three structured `exec` calls (`hostname`, `date`, `uptime`), correlated Anthropic `tool_use` IDs to `toolResult` entries, and a final assistant summary. Multi-turn reused the same OpenClaw session: turn 1 returned `HOSTNAME=anth-sonnet-openclaw-audit-0507`; turn 2 ran `echo "seen:anth-sonnet-openclaw-audit-0507"` without re-running `hostname` and summarized. Latency: `43.899s` model duration one-shot, `38.971s` turn 1, `47.677s` turn 2. Required affordance: none; registry decision: no #3121 v1 manifest.
- [x] Hermes / Anthropic / `claude-sonnet-4-6` — `blocked`. Validated on 2026-05-07 UTC on `main` `d98dd8c97d1ddddfd7b6d82962934493dd6e139f` with sandbox `anth-sonnet-hermes-audit-0507`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, and Hermes API server `127.0.0.1:18642`. Workflow: `ANTHROPIC_API_KEY=<redacted> NEMOCLAW_PROVIDER=anthropic NEMOCLAW_MODEL=claude-sonnet-4-6 ./bin/nemohermes.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name anth-sonnet-hermes-audit-0507 --agent hermes --fresh --recreate-sandbox`, then Hermes' own API `POST http://127.0.0.1:18642/v1/chat/completions` with `model: hermes-agent` and the standard shell-loop prompt. Evidence: `/sandbox/.hermes/config.yaml`, `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`, `/sandbox/.hermes/sessions/request_dump_api-9816e26b83c423bc_20260507_023806_692889.json`, `/sandbox/.hermes/logs/agent.log`, and `/sandbox/.hermes/logs/errors.log`. NemoClaw generated `model.provider: custom`, `model.base_url: "https://inference.local"`, and no `api_mode`; Hermes therefore called `https://inference.local/chat/completions`. The API returned HTTP 200 in about `2s` with assistant text `Error code: 403 - {'error': 'connection not allowed by policy'}`. Tool-call count: `0`; no final model summary; multi-turn not attempted because one-shot fails before tool use. Required affordance: Hermes provider-config/transport behavior for Anthropic Messages, not model-specific setup. Registry decision: #3121 v1 cannot express this runtime API-mode/provider transport fix cleanly; no manifest.
- [x] OpenClaw / Anthropic / `claude-haiku-4-5` — `pass`. Validated on 2026-05-07 UTC on `main` `d98dd8c97d1ddddfd7b6d82962934493dd6e139f` with sandbox `anth-haiku-openclaw-audit-0507`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider key `anthropic`, primary model `anthropic/claude-haiku-4-5`, and API `anthropic-messages` via `https://inference.local`. Workflow: `ANTHROPIC_API_KEY=<redacted> NEMOCLAW_PROVIDER=anthropic NEMOCLAW_MODEL=claude-haiku-4-5 ./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name anth-haiku-openclaw-audit-0507 --agent openclaw --fresh --recreate-sandbox`, then `nemoclaw-start openclaw agent --agent main --json --thinking off --session-id ... -m <standard and multi-turn prompts>`. Evidence: `/sandbox/.openclaw/agents/main/sessions/anth-haiku-openclaw-oneshot-1778120014.trajectory.jsonl`, `.jsonl`, `/sandbox/.openclaw/agents/main/sessions/anth-haiku-openclaw-multiturn-1778120085.trajectory.jsonl`, and `.jsonl`. One-shot recorded three structured `exec` calls and a final assistant summary. Multi-turn turn 1 returned `HOSTNAME=anth-haiku-openclaw-audit-0507`; turn 2 ran `echo "seen:anth-haiku-openclaw-audit-0507"`, did not re-run `hostname`, and summarized. Latency: `39.982s` model duration one-shot, `43.531s` turn 1, `39.409s` turn 2. Tool/result correlation used native Anthropic `tool_use` IDs mapped to OpenClaw `toolResult` entries; no prompt error or timeout observed. Required affordance: none; registry decision: no #3121 v1 manifest.
- [x] Hermes / Anthropic / `claude-haiku-4-5` — `blocked`. Validated on 2026-05-07 UTC on `main` `d98dd8c97d1ddddfd7b6d82962934493dd6e139f` with sandbox `anth-haiku-hermes-audit-0507`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, and Hermes API server `127.0.0.1:18642`. Workflow: same Hermes onboarding/API path as Sonnet with `NEMOCLAW_MODEL=claude-haiku-4-5`. Evidence: `/sandbox/.hermes/config.yaml`, `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`, `/sandbox/.hermes/sessions/request_dump_api-9816e26b83c423bc_20260507_024058_994265.json`, `/sandbox/.hermes/logs/agent.log`, and `/sandbox/.hermes/logs/errors.log`. Generated config was `model.provider: custom`, `model.base_url: "https://inference.local"`, no `api_mode`; request dump showed upstream URL `https://inference.local/chat/completions`. Hermes API returned HTTP 200 in about `1s` with assistant text `Error code: 403 - {'error': 'connection not allowed by policy'}`. Tool-call count: `0`; no final model summary; multi-turn not attempted. Required affordance: Hermes Anthropic Messages provider-config/transport behavior, not model-specific setup. Registry decision: no #3121 v1 manifest.
- [x] OpenClaw / Anthropic / `claude-opus-4-6` — `pass`. Validated on 2026-05-07 UTC on `main` `d98dd8c97d1ddddfd7b6d82962934493dd6e139f` with sandbox `anth-opus-openclaw-audit-0507`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider key `anthropic`, primary model `anthropic/claude-opus-4-6`, and API `anthropic-messages` via `https://inference.local`. Workflow: same OpenClaw onboarding/agent path as Sonnet with `NEMOCLAW_MODEL=claude-opus-4-6`; a transient OpenShell `tls handshake eof` during the first create was cleared by restarting the intended `nemoclaw` gateway and resuming onboarding. Evidence: `/sandbox/.openclaw/agents/main/sessions/5795ee4f-ec16-4c6c-9c12-dcf0c0988096.trajectory.jsonl`, `.jsonl`, `/sandbox/.openclaw/agents/main/sessions/d00fc1b7-1f89-416e-bbbf-daafd363db77.trajectory.jsonl`, and `.jsonl`; session keys were `anth-opus-openclaw-oneshot-1778121149` and `anth-opus-openclaw-multiturn-1778121149`. One-shot recorded three structured `exec` calls and a final assistant summary; multi-turn turn 1 returned `HOSTNAME=anth-opus-openclaw-audit-0507`, and turn 2 ran `echo "seen:anth-opus-openclaw-audit-0507"` without re-running `hostname`. Tool-result correlation was correct (`toolu_01QnTcTFxoYqgJUc6ZNunTMf` -> `toolResult`, then `toolu_01RcMUVDog12AhCd6BVkvJN3` -> `toolResult`). Latency: `36.108s` model duration one-shot, `9.253s` turn 1, `6.322s` turn 2. Required affordance: none; registry decision: no #3121 v1 manifest.
- [x] Hermes / Anthropic / `claude-opus-4-6` — `blocked`. Validated on 2026-05-07 UTC on `main` `d98dd8c97d1ddddfd7b6d82962934493dd6e139f` with sandbox `anth-opus-hermes-audit-0507`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, and Hermes API server `127.0.0.1:18642`. Workflow: same Hermes onboarding/API path as Sonnet with `NEMOCLAW_MODEL=claude-opus-4-6`. Evidence: `/sandbox/.hermes/config.yaml`, `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`, `/sandbox/.hermes/sessions/request_dump_api-9816e26b83c423bc_20260507_024350_714854.json`, `/sandbox/.hermes/logs/agent.log`, and `/sandbox/.hermes/logs/errors.log`. Generated config was `model.provider: custom`, `model.base_url: "https://inference.local"`, no `api_mode`; request dump showed upstream URL `https://inference.local/chat/completions`. Hermes API returned HTTP 200 in about `2s` with assistant text `Error code: 403 - {'error': 'connection not allowed by policy'}`. Tool-call count: `0`; no final model summary; multi-turn not attempted. Required affordance: Hermes Anthropic Messages provider-config/transport behavior, not model-specific setup. Registry decision: no #3121 v1 manifest.

Additional Anthropic setup audit evidence:

- Anthropic direct API preflight succeeded for all three curated IDs with the supplied temporary key; the agent-surface failures above are therefore NemoClaw/Hermes routing behavior, not unavailable models. Anthropic's tool-use docs require clients to parse `tool_use` blocks and return `tool_result` blocks whose `tool_use_id` matches the original tool-use `id`, with tool-result blocks immediately following the assistant tool-use message. See https://platform.claude.com/docs/en/agents-and-tools/tool-use/handle-tool-calls and https://platform.claude.com/docs/en/agents-and-tools/tool-use/define-tools.
- Extended-thinking docs matter for future Claude 4 agent work: tool use with thinking requires preserving returned thinking blocks, and Sonnet 4.6 / Opus 4.6 have interleaved-thinking behavior under Anthropic's current docs. This audit ran OpenClaw with `--thinking off`, so no new thinking-state preservation affordance was required for these pass rows. See https://platform.claude.com/docs/en/build-with-claude/extended-thinking.
- Static NemoClaw inspection matched the runtime results: `src/lib/onboard-providers.ts` maps Anthropic sandbox config to provider key `anthropic`, primary model `anthropic/<model>`, base URL `https://inference.local`, and `inferenceApi: anthropic-messages`; OpenClaw consumes that route correctly. `agents/hermes/generate-config.ts` / `agents/hermes/config/hermes-config.ts` currently emit Hermes `provider: custom` plus `base_url: https://inference.local` with no Anthropic `api_mode`, so Hermes cannot express the native Anthropic Messages route today. This is provider-adapter/config-path follow-up work, not a per-model registry entry.
- [x] OpenClaw / NVIDIA Endpoints / `minimaxai/minimax-m2.7` — `pass`. Validated on 2026-05-07 UTC on `main` `fa99a37065664f2a4c2af16a0bfc3bb4fac2d605` with local sandbox `minimax-openclaw-audit-0507`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider/model `inference/minimaxai/minimax-m2.7`, and API `openai-completions` via `https://inference.local/v1`. Workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=minimaxai/minimax-m2.7 NEMOCLAW_PREFERRED_API=openai-completions ./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name minimax-openclaw-audit-0507 --agent openclaw --fresh --recreate-sandbox`, then `openshell sandbox exec -n minimax-openclaw-audit-0507 --timeout 900 -- /usr/local/bin/nemoclaw-start openclaw agent --agent main --json --thinking off --session-id ... -m <standard and multi-turn prompts>`. Evidence: `/sandbox/.openclaw/agents/main/sessions/minimax-openclaw-oneshot-1778116787.trajectory.jsonl`, `/sandbox/.openclaw/agents/main/sessions/minimax-openclaw-oneshot-1778116787.jsonl`, `/sandbox/.openclaw/agents/main/sessions/minimax-openclaw-multiturn-1778116939.trajectory.jsonl`, and `/sandbox/.openclaw/agents/main/sessions/minimax-openclaw-multiturn-1778116939.jsonl`. One-shot recorded `finalStatus: success`, `timedOut: false`, no prompt error, three structured `exec` tool calls (`hostname`, `date`, `uptime`), and a final assistant summary. Multi-turn reused the same OpenClaw session: turn 1 ran one `exec` call for `hostname` and returned `HOSTNAME=minimax-openclaw-audit-0507`; turn 2 ran one new `exec` call `echo "seen:minimax-openclaw-audit-0507" > /tmp/seen_hostname.txt`, did not re-run `hostname`, wrote the expected value, and summarized successfully. Latency was inside timeout: `122.074s` one-shot, `81.676s` turn 1, and `49.817s` turn 2. Tool-call shape was structured OpenAI-compatible tool calls, not raw tool text; MiniMax thinking was present as OpenClaw `thinking` blocks with `thinkingSignature: reasoning_content`, and final assistant text was non-empty after tool results. Operational note: the CLI printed a gateway websocket `1006` close and used OpenClaw's embedded runner, but the model/provider run completed successfully and persisted the normal trajectory. Required affordance: none beyond the generic OpenClaw `--thinking off`/`thinkingDefault: off` path already used for these sandbox smoke runs; no MiniMax-specific request mutation, response parser, shell rewriter, or plugin is justified. Registry decision: do not add a #3121 v1 manifest because there is no concrete MiniMax-specific setup behavior to express; if a future MiniMax issue required request-body mutation such as explicit `reasoning_split`, registry v1 would not express that class cleanly. External docs checked: [NVIDIA MiniMax M2.7 model card](https://docs.api.nvidia.com/nim/reference/minimaxai-minimax-m2.7), [NVIDIA MiniMax M2.7 infer reference](https://docs.api.nvidia.com/nim/reference/minimaxai-minimax-m2.7-infer), [MiniMax Tool Use & Interleaved Thinking guide](https://platform.minimax.io/docs/guides/text-m2-function-call), [MiniMax OpenAI-compatible chat docs](https://platform.minimax.io/docs/api-reference/text-chat-openai), and [MiniMax-M2 GitHub README](https://github.com/MiniMax-AI/MiniMax-M2).
- [x] Hermes / NVIDIA Endpoints / `minimaxai/minimax-m2.7` — `pass`. Validated on 2026-05-07 UTC on `main` `fa99a37065664f2a4c2af16a0bfc3bb4fac2d605` with local sandbox `minimax-hermes-audit-0507`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, Hermes config `provider: custom`, `base_url: https://inference.local/v1`, and model `minimaxai/minimax-m2.7`. Workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=minimaxai/minimax-m2.7 NEMOCLAW_PREFERRED_API=openai-completions ./bin/nemohermes.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name minimax-hermes-audit-0507 --agent hermes --fresh --recreate-sandbox`, then Hermes' own OpenAI-compatible API inside the sandbox: `POST http://127.0.0.1:18642/v1/chat/completions` with `model: hermes-agent` for the standard shell-loop prompt and `POST http://127.0.0.1:18642/v1/responses` with `previous_response_id` for server-side multi-turn continuation. Evidence: `/sandbox/.hermes/config.yaml`, `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`, `/sandbox/.hermes/sessions/session_efda2305-58cd-487a-aecc-aba1b0a646b3.json`, `/sandbox/.hermes/logs/agent.log`, and `/sandbox/.hermes/logs/errors.log`. Chat-completions one-shot returned `HTTP 200` in `29.814s`, recorded three structured `terminal` function calls (`hostname`, `date`, `uptime`) with successful tool results, and returned a final assistant summary. Responses multi-turn returned `HTTP 200` in `13.774s` for turn 1 and `17.686s` for turn 2; turn 1 stored one structured `terminal` `hostname` call and `HOSTNAME=minimax-hermes-audit-0507`, while turn 2 chained from `resp_a88cc8e84c644fb2a918a3b037d0`, made one new `terminal` call `echo "seen:minimax-hermes-audit-0507"`, did not make a new `hostname` call in the persisted session, and summarized successfully. Tool-call shape was structured OpenAI-compatible function calling, not raw tool text; MiniMax reasoning was stored in Hermes `reasoning_content` fields and did not break tool-result continuation. `agent.log` contained non-blocking context-length autodetect warnings that defaulted the model to 128,000 tokens; `errors.log` contained only startup warnings about no API-server key/user allowlist. Required affordance: none; no Hermes MiniMax manifest, runtime shim, request mutation, generic parser, or shell rewrite is justified. Registry decision: #3121 v1 could express declarative Hermes compat if a concrete behavior existed, but this audit found none to record.
- [x] OpenClaw / NVIDIA Endpoints / `z-ai/glm-5.1` — `pass`. Validated on 2026-05-07 UTC on `main` `09b66c68384e16e828917b8d7afdbc61893cd4a4` with local sandbox `glm-openclaw-audit-0507`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider/model `inference/z-ai/glm-5.1`, and API `openai-completions` via `https://inference.local/v1`. Workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=z-ai/glm-5.1 NEMOCLAW_PREFERRED_API=openai-completions ./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name glm-openclaw-audit-0507 --agent openclaw --fresh --recreate-sandbox`, then `nemoclaw-start openclaw agent --agent main --json --thinking off --session-id ... -m <standard and multi-turn prompts>`. Evidence: `/sandbox/.openclaw/agents/main/sessions/glm-openclaw-oneshot-1778114960.trajectory.jsonl`, `/sandbox/.openclaw/agents/main/sessions/glm-openclaw-oneshot-1778114960.jsonl`, `/sandbox/.openclaw/agents/main/sessions/glm-openclaw-multiturn-1778115140.trajectory.jsonl`, and `/sandbox/.openclaw/agents/main/sessions/glm-openclaw-multiturn-1778115140.jsonl`. One-shot recorded `finalStatus: success`, `timedOut: false`, no prompt error, three structured `exec` tool calls (`hostname`, `date`, `uptime`), and a final assistant summary; no raw tool-call text was persisted as assistant prose. Multi-turn reused the same OpenClaw session: turn 1 ran one `exec` call for `hostname` and returned `HOSTNAME=glm-openclaw-audit-0507`; turn 2 ran one `exec` call `echo 'seen:glm-openclaw-audit-0507'`, did not re-run `hostname`, and summarized successfully. Latency was high but inside timeout: about 69s one-shot, about 147s turn 1, and about 104s turn 2. Required affordance: none beyond generic OpenClaw `--thinking off`/`thinkingDefault: off` behavior already used for sandbox smoke paths; no GLM-specific request mutation, plugin, shell rewriter, or manifest is justified. Registry decision: do not add a #3121 v1 manifest for GLM because there is no concrete GLM-specific behavior to express. External docs checked: [NVIDIA GLM-5.1 model card](https://docs.api.nvidia.com/nim/reference/z-ai-glm5.1), [NVIDIA GLM-5.1 infer reference](https://docs.api.nvidia.com/nim/reference/z-ai-glm5.1-infer), [Z.ai GLM-5.1 overview](https://docs.z.ai/guides/llm/glm-5.1), and [Z.ai thinking mode/tool-result guidance](https://docs.z.ai/guides/capabilities/thinking-mode).
- [x] Hermes / NVIDIA Endpoints / `z-ai/glm-5.1` — `pass`. Validated on 2026-05-07 UTC on `main` `09b66c68384e16e828917b8d7afdbc61893cd4a4` with local sandbox `glm-hermes-audit-0507`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, Hermes config `provider: custom`, `base_url: https://inference.local/v1`, and model `z-ai/glm-5.1`. Workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=z-ai/glm-5.1 NEMOCLAW_PREFERRED_API=openai-completions ./bin/nemohermes.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name glm-hermes-audit-0507 --agent hermes --fresh --recreate-sandbox`, then Hermes own OpenAI-compatible API inside the sandbox, `POST http://127.0.0.1:18642/v1/chat/completions` with `model: hermes-agent` for the standard shell-loop prompt and `POST http://127.0.0.1:18642/v1/responses` with `previous_response_id` for server-side multi-turn continuation. Evidence: `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`, `/sandbox/.hermes/sessions/session_0441870a-3401-4475-a959-0928220483d5.json`, `/sandbox/.hermes/logs/agent.log`, and `/sandbox/.hermes/logs/errors.log`. Chat-completions one-shot returned `HTTP 200` in `88.693816s`, recorded three structured `terminal` function calls (`hostname`, `date`, `uptime`) with successful tool results, and returned a final assistant summary. Responses multi-turn returned `HTTP 200` in `82.152558s` for turn 1 and `65.737139s` for turn 2; turn 1 stored a structured `terminal` `hostname` call and function-call output, and turn 2 chained from `resp_f44117b7f73e4ac788e3429d2be4`, made one new `terminal` call `echo 'seen:glm-hermes-audit-0507'`, did not make a new `hostname` call, and summarized successfully. `errors.log` contained only startup warnings about no API-server key/user allowlist; no model prompt/runtime errors were observed. Required affordance: none; no Hermes GLM manifest, runtime shim, request mutation, generic parser, or shell rewrite is justified. Registry decision: #3121 v1 could express declarative Hermes compat if a concrete behavior existed, but this audit found none to record.
- [x] OpenClaw / NVIDIA Endpoints / `moonshotai/kimi-k2.6` — `pass-with-affordance`. Fixed by #3046; PR #3121 moves the activation into the agent-scoped model-specific setup registry.
- [x] Hermes / NVIDIA Endpoints / `moonshotai/kimi-k2.6` — `pass`. Validated on PR #3121 head `be8c398bdaba7e1b9d86501515f5ec1ece6a4f3f` using a rebuilt local Hermes sandbox (`hermes-kimi-audit-0506`) and Hermes own OpenAI-compatible API on `127.0.0.1:18642`, not a direct `inference.local` curl. The acceptance prompt produced separate terminal tool calls for `hostname`, `date`, and `uptime`, then a final response. No Hermes Kimi manifest or runtime shim is justified from this evidence. PR evidence: https://github.com/NVIDIA/NemoClaw/pull/3121#issuecomment-4390646818
- [x] OpenClaw / NVIDIA Endpoints / `deepseek-ai/deepseek-v4-pro` — `pass-with-affordance`. Validated on PR #3121 head `be8c398bdaba7e1b9d86501515f5ec1ece6a4f3f` (merged into `main` by `97ae39d4a16472eabb81d0c2e82e36eb6a62d6e9`) with local OpenClaw sandbox `deepseek-openclaw-audit-0506`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider/model `inference/deepseek-ai/deepseek-v4-pro`, and API `openai-completions` via `https://inference.local/v1`. Workflow: `node bin/nemoclaw.js onboard ... --agent openclaw`, then `nemoclaw-start openclaw agent --agent main --json --thinking off --session-id deepseek-openclaw-audit-1778090935 -m <standard shell prompt>`. Evidence: `/sandbox/.openclaw/agents/main/sessions/f7d14bbc-0312-4f5a-b1be-ca17e20a0612.trajectory.jsonl` recorded `finalStatus: success`, `timedOut: false`, `toolMetas` for `hostname`, `date`, and `uptime`, and a final assistant summary. Duration: `99,016ms`. Required affordance remains the existing OpenClaw startup preload request mutation that injects `chat_template_kwargs.thinking = false` for exact model `deepseek-ai/deepseek-v4-pro`; #3121 registry v1 cannot express request mutation, so no DeepSeek manifest was added.
- [x] Hermes / NVIDIA Endpoints / `deepseek-ai/deepseek-v4-pro` — `pass`. Validated on `main` `97ae39d4a16472eabb81d0c2e82e36eb6a62d6e9` with local Hermes sandbox `deepseek-hermes-audit-0506`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, config `provider: custom`, `base_url: https://inference.local/v1`, and model `deepseek-ai/deepseek-v4-pro`. Workflow: `node bin/nemohermes.js onboard ... --agent hermes`, then Hermes own API `POST http://127.0.0.1:18642/v1/chat/completions` with `model: hermes-agent` and the standard shell prompt. Evidence: `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json` recorded three separate `terminal` tool calls (`hostname`, `date`, `uptime`) with successful tool results and final assistant summary; `/sandbox/.hermes/logs/agent.log` recorded main provider `custom (deepseek-ai/deepseek-v4-pro)`. API returned `200`, `finish_reason: stop`, usage `48,431` tokens. No Hermes DeepSeek affordance is justified.
- [x] OpenClaw / NVIDIA Endpoints / `nvidia/nemotron-3-super-120b-a12b` - `pass-with-affordance`. Validated on 2026-05-06 after the local OpenShell DiskPressure condition cleared, on `main` `3477ab7da13c51749eedef1662aa4e998ae0feb2` with local sandbox `nemotron-super-openclaw-audit2-0506`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider/model `inference/nvidia/nemotron-3-super-120b-a12b`, and API `openai-completions` via `https://inference.local/v1`. Workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=nvidia/nemotron-3-super-120b-a12b ./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name nemotron-super-openclaw-audit2-0506 --agent openclaw --fresh --recreate-sandbox`, then `nemoclaw-start openclaw agent --agent main --json --thinking off --session-id nemotron-super-openclaw-audit2-1778103869 -m <standard shell prompt>`. Evidence: `/sandbox/.openclaw/agents/main/sessions/aa8473de-504f-4fe3-aaf5-554dd13042a4.trajectory.jsonl` recorded `finalStatus: success`; the session recorded three separate `exec` tool calls for `hostname`, `date`, and `uptime`, followed by a final assistant summary. Duration: `44,400ms`. Required affordance remains the existing OpenClaw startup preload request mutation that injects `chat_template_kwargs.force_nonempty_content = true` for Nemotron chat-completions requests; #3121 registry v1 cannot express request mutation, so no Nemotron manifest should be added yet.
- [x] Hermes / NVIDIA Endpoints / `nvidia/nemotron-3-super-120b-a12b` - `pass`. Validated on 2026-05-06 on `main` `3477ab7da13c51749eedef1662aa4e998ae0feb2` with local Hermes sandbox `nemotron-super-hermes-audit2-0506`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, provider `custom`, base URL `https://inference.local/v1`, and model `nvidia/nemotron-3-super-120b-a12b`. Workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=nvidia/nemotron-3-super-120b-a12b ./bin/nemohermes.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name nemotron-super-hermes-audit2-0506 --agent hermes --fresh --recreate-sandbox`, then Hermes own OpenAI-compatible API inside the sandbox, `POST http://127.0.0.1:18642/v1/chat/completions` with `model: hermes-agent` and the standard shell prompt. Evidence: `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json` recorded three actual `terminal` tool calls for `hostname`, `date`, and `uptime`; the API returned `HTTP 200` in `34.680551s` with a final assistant summary. No Hermes Nemotron affordance is justified by this row.
- [x] OpenClaw / NVIDIA Endpoints / `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` - `blocked` by observed OpenClaw agent behavior, not by local infrastructure. Validated on 2026-05-06 on `main` `3477ab7da13c51749eedef1662aa4e998ae0feb2` with sandbox `nemotron-omni-openclaw-audit2-0506`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider/model `inference/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning`, and API `openai-completions`. Onboard workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=nvidia/nemotron-3-nano-omni-30b-a3b-reasoning ./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name nemotron-omni-openclaw-audit2-0506 --agent openclaw --fresh --recreate-sandbox`. First standard-prompt run, session `nemotron-omni-openclaw-audit2-1778104704`, evidence `/sandbox/.openclaw/agents/main/sessions/573b888d-639f-4440-956e-8f0788d176d5.trajectory.jsonl`, made malformed or unsupported tool attempts around `exec` host selection, read `/etc/hostname`, and stopped without completing `date` or `uptime`. Clean retry, session `nemotron-omni-openclaw-retry-1778104885`, evidence `/sandbox/.openclaw/agents/main/sessions/8a4cdc8a-0765-457f-a5ac-2432be5d4820.trajectory.jsonl`, made three separate successful `exec` calls for `hostname`, `date`, and `uptime` with `toolSummary.failures: 0` and duration `31,821ms`, but the final assistant response was `NO_REPLY`/thinking-only instead of the requested summary. The existing `force_nonempty_content` request mutation is still relevant but insufficient to pass the full OpenClaw shell-loop acceptance scenario for this Omni reasoning model.
- [x] Hermes / NVIDIA Endpoints / `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` - `pass`. Validated on 2026-05-06 on `main` `3477ab7da13c51749eedef1662aa4e998ae0feb2` with local Hermes sandbox `nemotron-omni-hermes-audit2-0506`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, provider `custom`, base URL `https://inference.local/v1`, and model `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning`. Workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=nvidia/nemotron-3-nano-omni-30b-a3b-reasoning ./bin/nemohermes.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name nemotron-omni-hermes-audit2-0506 --agent hermes --fresh --recreate-sandbox`, then Hermes own OpenAI-compatible API inside the sandbox, `POST http://127.0.0.1:18642/v1/chat/completions` with `model: hermes-agent` and the standard shell prompt. Evidence: `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json` recorded three actual `terminal` tool calls for `hostname`, `date`, and `uptime`; `/sandbox/.hermes/logs/agent.log` recorded main provider `custom (nvidia/nemotron-3-nano-omni-30b-a3b-reasoning)`. The API returned `HTTP 200` in `28.026483s` with a final assistant summary. No Hermes Nemotron affordance is justified by this row.

- [x] OpenClaw / NVIDIA Endpoints / `openai/gpt-oss-120b` - `degraded`. Validated on 2026-05-06 on current `main` `ca1d6b84a5c938611be412239718f1e46963d8d0` after #3121 was already merged, with local sandbox `gpt-oss-openclaw-audit-0506`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider route `nvidia-prod`, model `inference/openai/gpt-oss-120b`, and API `openai-completions` via `https://inference.local/v1` (NVIDIA Endpoints route to `https://integrate.api.nvidia.com/v1`). Onboard workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=openai/gpt-oss-120b ./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name gpt-oss-openclaw-audit-0506 --agent openclaw --fresh --recreate-sandbox`. One-shot workflow: `openshell sandbox exec -n gpt-oss-openclaw-audit-0506 --timeout 900 -- /usr/local/bin/nemoclaw-start openclaw agent --agent main --json --thinking off --session-id gpt-oss-openclaw-oneshot-1778106366 -m <standard shell prompt>`. Evidence: `/sandbox/.openclaw/agents/main/sessions/8cc780fa-23ac-41dd-80c3-146393e39e00.trajectory.jsonl` recorded `finalStatus: success`, `timedOut: false`, no `promptError`, `finishReason: stop`, and final assistant text. Tool behavior: structured OpenAI-compatible tool calls, not raw Harmony/tool text; `thinking` content was stored as thinking metadata, not assistant prose. Tool count: 4 `exec` attempts in one-shot (`hostname` with `security: allowlist` denied, then successful `hostname`, `date`, `uptime` with `security: full`), so the target commands completed but with one extra denied retry; duration `31,922ms`. Multi-turn workflow used session `gpt-oss-openclaw-multiturn-1778106429`; evidence `/sandbox/.openclaw/agents/main/sessions/7fb1d1b4-967c-45e9-b3bd-3e05a5989292.trajectory.jsonl` recorded turn 1 `hostname` -> `HOSTNAME=gpt-oss-openclaw-audit-0506` in `4,767ms`, then turn 2 did not re-run `hostname`, made one shell call `echo "seen:gpt-oss-openclaw-audit-0506" > seen.txt`, and finalized successfully in `4,104ms`. Required affordance: none. Registry decision: do not add a #3121 v1 manifest; there is no model-specific setup effect to express, and the only observed limitation is an OpenClaw tool-argument/security retry rather than request mutation, response normalization, or Harmony parsing.
- [x] Hermes / NVIDIA Endpoints / `openai/gpt-oss-120b` - `pass`. Validated on 2026-05-06 on current `main` `ca1d6b84a5c938611be412239718f1e46963d8d0` with local sandbox `gpt-oss-hermes-audit-0506`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, config `provider: custom`, `base_url: https://inference.local/v1`, and model `openai/gpt-oss-120b`. Onboard workflow: `NEMOCLAW_PROVIDER=build NEMOCLAW_MODEL=openai/gpt-oss-120b ./bin/nemohermes.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name gpt-oss-hermes-audit-0506 --agent hermes --fresh --recreate-sandbox`; config evidence: `/sandbox/.hermes/config.yaml`. API workflow: Hermes own OpenAI-compatible API, `POST http://127.0.0.1:18642/v1/chat/completions` with `model: hermes-agent`, not a direct `inference.local` curl. One-shot evidence: `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`, `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log`, and `/tmp/gateway.log`; API returned `HTTP 200` in `7.222s`, `finish_reason: stop`, three structured `terminal` tool calls (`hostname`, `date`, `uptime`), successful tool results, and a final assistant summary. Multi-turn evidence: session `api-92e4a54a694502a8` in `/sandbox/.hermes/sessions/session_api-92e4a54a694502a8.json` plus `/sandbox/.hermes/state.db`; turn 1 returned `HOSTNAME=gpt-oss-hermes-audit-0506` in `2.196s`, and turn 2 returned `HTTP 200` in `4.011s` with one structured `terminal` call `echo "seen:gpt-oss-hermes-audit-0506"` and no second `hostname` call. State DB recorded `message_count: 8`, `tool_call_count: 2`, model `openai/gpt-oss-120b`, source `api_server`; final response summarized the `seen:` output. Raw Harmony markers (`<|...|>`) were absent from persisted message/tool fields; reasoning content was stored separately in `reasoning`/`reasoning_content`. Latency/operability note: the first Hermes sandbox start was interrupted by local OpenShell gateway TLS/ephemeral-storage recovery and a local-image reimport, but the agent/API run passed after the sandbox reached `Ready`; this was local infrastructure, not model behavior. Required affordance: none. Registry decision: do not add a #3121 v1 manifest; no Hermes-specific setup, parser, request mutation, or response normalization is justified by this evidence.

Additional GPT-OSS setup audit evidence:

- 2026-05-06 source/docs audit on `main` `ca1d6b84a5c938611be412239718f1e46963d8d0`: `openai/gpt-oss-120b` is a curated NVIDIA Endpoints model in `src/lib/inference-config.ts`. The NVIDIA Endpoints provider path resolves to chat completions (`openai-completions`) through `https://inference.local/v1` inside the sandbox, while the gateway route targets NVIDIA Endpoints. `src/lib/onboard-inference-probes.ts` uses the generic chat-completions probe for this model; `scripts/nemoclaw-start.sh` only preloads the current Nemotron/DeepSeek request mutations; `agents/hermes/start.sh` and `agents/hermes/generate-config.ts` add no GPT-OSS-specific handling; and `nemoclaw-blueprint/model-specific-setup/**` contains no GPT-OSS manifest.
- External source notes: OpenAI's GPT-OSS/Harmony documentation describes Harmony as the chat/reasoning/tool-call format for raw GPT-OSS serving, and the OpenAI vLLM cookbook documents GPT-OSS serving with `--tool-call-parser openai` and `--reasoning-parser openai_gptoss` for OpenAI-compatible Chat Completions. NVIDIA NIM reasoning-model documentation similarly treats reasoning/parser/template behavior as serving-stack configuration. Relevant docs inspected: https://platform.openai.com/docs/models/gpt-oss, https://cookbook.openai.com/articles/openai-harmony/, https://cookbook.openai.com/articles/gpt-oss/run-vllm/, and https://docs.nvidia.com/nim/large-language-models/1.15.0/reasoning-model.html.
- NemoClaw runtime evidence did not show raw Harmony text from NVIDIA Endpoints for `openai/gpt-oss-120b`. Both OpenClaw and Hermes received structured OpenAI-compatible tool calls with separate reasoning fields and final assistant answers after tool results. Therefore, a generic Harmony parser, shell command rewriter, or #3121 v1 registry manifest should not be added for this model/provider based on this audit. If a future provider returns raw Harmony/tool text, that would be response normalization or serving-template/parser policy, not a v1 manifest effect.

- [x] OpenClaw / Gemini / `gemini-3.1-pro-preview` - `blocked`. Re-run on 2026-05-06/2026-05-07 UTC on current `main` `f586cc59131ec396cfcaab3b915ad76f001210ca` after #3121 was merged, with OpenShell `0.0.36` and OpenClaw `2026.4.24` (`cbcfdf6`). Provider path: NemoClaw provider `gemini`, OpenShell provider `gemini-api`, Google OpenAI-compatible base URL `https://generativelanguage.googleapis.com/v1beta/openai/`, sandbox route `https://inference.local/v1`, model ref `inference/gemini-3.1-pro-preview`, API `openai-completions`, `supportsStore: false`, and OpenClaw config `thinkingDefault: off`. Onboard workflow: `NEMOCLAW_PROVIDER=gemini NEMOCLAW_MODEL=gemini-3.1-pro-preview ./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name gemini31-pro-openclaw-audit-0506b --agent openclaw --fresh --recreate-sandbox`. Standard one-shot workflow: `nemoclaw-start openclaw agent --agent main --json --session-id gemini31-openclaw-oneshot-1778111108 -m <standard shell prompt>`. Evidence captured before a later gateway restart removed that sandbox: `/sandbox/.openclaw/agents/main/sessions/16c5d6c7-5096-4ab4-80c9-494d73d74c42.jsonl` and `.trajectory.jsonl`. Result: the model emitted structured OpenAI-compatible `exec` tool calls for `hostname`, `date`, and `uptime` and all three tool results completed, but the continuation/final-answer request after tool results returned `400 status code (no body)`, leaving no final assistant summary. The persisted OpenClaw session contained no `thought_signature` or `extra_content` fields. Multi-turn turn 1 in session `gemini31-openclaw-multiturn-1778111208` similarly made one `hostname` tool call and then failed with `400 status code (no body)`, so turn 2 could not run. Follow-up retry on 2026-05-07 in sandbox `gemini31-pro-openclaw-audit-0506c` first saw stale sandbox DNS/proxy causing `503 "inference service unavailable"`; after `./bin/nemoclaw.js internal dns setup-proxy nemoclaw gemini31-pro-openclaw-audit-0506c` rewired DNS to `10.200.0.1:53 -> 10.42.0.17`, the route was reachable again. Retry session `gemini31-openclaw-retry-dnsfix-1778114081`, log `/tmp/gemini31-openclaw-retry-dnsfix-1778114081.log`, evidence `/sandbox/.openclaw/agents/main/sessions/gemini31-openclaw-retry-dnsfix-1778114081.jsonl` and `.trajectory.jsonl`, emitted three structured `exec` tool calls (`hostname`, `date`, `uptime`) and completed all tool results, then reproduced `400 status code (no body)` with an empty final assistant message; duration `32,162ms`; no `thought_signature` or `extra_content` persisted. Final classification remains blocked by the observed OpenClaw Gemini 3.1 tool-result continuation/state-preservation failure, not by provider availability. Required affordance/fix class: OpenClaw/provider adapter response-history preservation for `tool_calls[].extra_content.google.thought_signature` or equivalent Gemini 3 function-call state handling. Registry decision: do not add a #3121 v1 manifest; this is response/history preservation or adapter behavior, not declarative setup.
- [x] Hermes / Gemini / `gemini-3.1-pro-preview` - `pass`. Validated on 2026-05-07 UTC on `main` `f586cc59131ec396cfcaab3b915ad76f001210ca` with sandbox `gemini31-pro-hermes-audit-0506b`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, config `provider: custom`, `base_url: https://inference.local/v1`, model `gemini-3.1-pro-preview`, and API server forwarded at `http://127.0.0.1:8642/v1`. Config evidence: `/sandbox/.hermes/config.yaml`; logs: `/sandbox/.hermes/logs/agent.log` and `/sandbox/.hermes/logs/errors.log`. API workflow used Hermes' own OpenAI-compatible API, not direct `inference.local`: `POST http://127.0.0.1:8642/v1/chat/completions` with `model: hermes-agent`. One-shot prompt returned `HTTP 200` in `22.920s`, `finish_reason: stop`, and session header `X-Hermes-Session-Id: api-9816e26b83c423bc`; evidence `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json` recorded three structured `terminal` tool calls (`hostname`, `date`, `uptime`) plus final summary. That session persisted `tool_calls[].extra_content.google.thought_signature` on the first Gemini 3.1 tool call, proving Hermes preserved the Google-specific state that OpenClaw dropped. Multi-turn used the same API with explicit message history because this sandbox had no API server key configured and therefore rejects `X-Hermes-Session-Id` continuation; turn 1 returned `HOSTNAME=gemini31-pro-hermes-audit-0506b` in `7.933s`, and turn 2 returned `HTTP 200` in `12.486s` with the same derived session header `api-92e4a54a694502a8`. Evidence `/sandbox/.hermes/sessions/session_api-92e4a54a694502a8.json` recorded one `terminal` call `echo 'seen:gemini31-pro-hermes-audit-0506b' > seen.txt && cat seen.txt`, no second `hostname` call, a successful tool result, and a final summary. Required affordance: none for Hermes. Registry decision: no #3121 v1 manifest.
- [x] OpenClaw / Gemini / `gemini-2.5-pro` - `pass`. Validated on 2026-05-07 UTC on `main` `f586cc59131ec396cfcaab3b915ad76f001210ca` with live sandbox `gemini25-pro-openclaw-audit-0506c`, OpenShell `0.0.36`, OpenClaw `2026.4.24` (`cbcfdf6`), provider route `gemini-api`, model `inference/gemini-2.5-pro`, API `openai-completions`, and `thinkingDefault: off`. Onboard workflow: `NEMOCLAW_PROVIDER=gemini NEMOCLAW_MODEL=gemini-2.5-pro ./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --name gemini25-pro-openclaw-audit-0506c --agent openclaw --fresh --recreate-sandbox`. One-shot workflow: `nemoclaw-start openclaw agent --agent main --local --json --session-id gemini25-openclaw-oneshot-c-1778113536 -m <standard shell prompt>`. Evidence: `/sandbox/.openclaw/agents/main/sessions/gemini25-openclaw-oneshot-c-1778113536.jsonl` and `.trajectory.jsonl` recorded `finalStatus: success`, `timedOut: false`, no prompt error, three separate structured `exec` tool calls (`hostname`, `date`, `uptime`), `toolMetas` for those commands, and a final summary; duration `26.930s`. Multi-turn workflow used session `gemini25-openclaw-multiturn-c-1778113583`; evidence `/sandbox/.openclaw/agents/main/sessions/gemini25-openclaw-multiturn-c-1778113583.jsonl` and `.trajectory.jsonl` recorded turn 1 `hostname` -> `HOSTNAME=gemini25-pro-openclaw-audit-0506c` in `25.913s`, then turn 2 made one `exec` call `echo 'seen:gemini25-pro-openclaw-audit-0506c' > hostname.txt` without re-running `hostname`, and returned a final summary in `28.515s`. No raw function-call text, `thought_signature`, or `extra_content` fields were persisted for Gemini 2.5 in OpenClaw. Required affordance: none. Registry decision: no #3121 v1 manifest.
- [x] Hermes / Gemini / `gemini-2.5-pro` - `pass`. Validated on 2026-05-07 UTC on `main` `f586cc59131ec396cfcaab3b915ad76f001210ca` with sandbox `gemini25-pro-hermes-audit-0506b`, OpenShell `0.0.36`, Hermes Agent `v0.11.0 (2026.4.23)`, config `provider: custom`, `base_url: https://inference.local/v1`, model `gemini-2.5-pro`, and API server forwarded at `http://127.0.0.1:8642/v1`. Config evidence: `/sandbox/.hermes/config.yaml`; logs: `/sandbox/.hermes/logs/agent.log` and `/sandbox/.hermes/logs/errors.log`. API workflow used Hermes' own OpenAI-compatible API, `POST http://127.0.0.1:8642/v1/chat/completions` with `model: hermes-agent`. One-shot returned `HTTP 200` in `12.366s`, `finish_reason: stop`, and session header `api-9816e26b83c423bc`; evidence `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json` recorded three structured `terminal` tool calls (`hostname`, `date`, `uptime`) and a final summary. Multi-turn used explicit message history; retry evidence `/sandbox/.hermes/sessions/session_api-92e4a54a694502a8.json` recorded turn 1 `HOSTNAME=gemini25-pro-hermes-audit-0506b`, then turn 2 made one `terminal` call `echo 'seen:gemini25-pro-hermes-audit-0506b' > hostname.log` without re-running `hostname`, and returned a final summary. Turn latencies on the passing retry were `5.985s` and `8.550s`. No `thought_signature` or `extra_content` fields were observed for Gemini 2.5 Hermes sessions. Operational note: an earlier 2.5 Hermes multi-turn attempt returned `HTTP 200` but wrote a typoed `seen:gemini2s...` command while the final text claimed the correct hostname; the immediate retry passed, and this was a model/tool-argument accuracy hiccup rather than a continuation or thought-signature failure. Required affordance: none. Registry decision: no #3121 v1 manifest.

Additional Gemini setup audit evidence:

- 2026-05-06/2026-05-07 source/docs audit on `main` `f586cc59131ec396cfcaab3b915ad76f001210ca`: Gemini curated onboarding models live in `src/lib/model-prompts.ts`, including `gemini-3.1-pro-preview` and `gemini-2.5-pro`. `src/lib/onboard-providers.ts` wires Google Gemini as OpenAI-compatible provider `gemini-api` with `GEMINI_API_KEY` and `https://generativelanguage.googleapis.com/v1beta/openai/`. `src/lib/onboard-providers.ts` maps Gemini sandbox config to provider key `inference`, primary model `inference/<model>`, `https://inference.local/v1`, `openai-completions`, and `inferenceCompat.supportsStore = false`. `src/lib/validation.ts` skips the Responses API for `gemini-api`, and `src/lib/onboard-inference-probes.ts` sends Bearer auth to the OpenAI-compatible endpoint. `scripts/nemoclaw-start.sh`, `scripts/generate-openclaw-config.py`, `agents/hermes/generate-config.ts`, and `agents/hermes/start.sh` currently add no Gemini-specific thought-signature handling. No Gemini manifests exist under `nemoclaw-blueprint/model-specific-setup/**`.
- Credential/runtime state during this audit: the first pass had no Gemini key, the second pass reached Google validation but was quota-blocked, and this re-run used a funded personal Gemini key that passed onboarding validation. A later OpenClaw 3.1 recreate initially saw `503 "inference service unavailable"` from `https://inference.local/v1/chat/completions`; this was cleared by refreshing the sandbox DNS/proxy with `./bin/nemoclaw.js internal dns setup-proxy`, after which the route was reachable and the same post-tool `400 status code (no body)` reproduced. Gemini 2.5 remained available during the final OpenClaw and Hermes runs.
- External source notes: Google documents Gemini OpenAI compatibility with base URL `https://generativelanguage.googleapis.com/v1beta/openai/`, Bearer `GEMINI_API_KEY`, and `/chat/completions` function calling. Google thought-signature docs state that thinking models in the Gemini 3 and 2.5 series may return thought signatures, that signatures should be passed back exactly in conversation history, and that Gemini 3 models require thought signatures during function calling or a 4xx validation error can result. For OpenAI-compatible chat completions, Google represents signatures under `tool_calls[].extra_content.google.thought_signature`; Gemini 3 requires the first function call signature to be returned, while Gemini 2.5 signature return is documented as optional for function calls. Docs inspected: https://ai.google.dev/gemini-api/docs/openai, https://ai.google.dev/gemini-api/docs/function-calling, https://ai.google.dev/gemini-api/docs/thought-signatures, and https://ai.google.dev/gemini-api/docs/thinking.
- Agent-surface conclusion: Gemini 3.1 Pro is the concrete thought-signature risk. Hermes preserved `extra_content.google.thought_signature` and passed; OpenClaw did not persist `thought_signature`/`extra_content` and failed the post-tool continuation with Google-route `400 status code (no body)`; a later stale-DNS/proxy `503` recreate was cleared and the same post-tool `400` reproduced. Gemini 2.5 Pro passed both OpenClaw and Hermes without a model-specific affordance. If OpenClaw Gemini 3.1 is fixed later, the fix should be scoped to Google/Gemini OpenAI-compatible tool-call state preservation or agent adapter behavior. #3121 registry v1 cannot express that class cleanly, so no manifest should be added based on this audit.

Additional Nemotron setup audit evidence:

- 2026-05-06 architecture audit on `main` `3477ab7da13c51749eedef1662aa4e998ae0feb2`: current OpenClaw behavior remains a runtime request mutation in `scripts/nemoclaw-start.sh`, not a #3121 v1 manifest. The preload wraps Node HTTP(S) `POST /v1/chat/completions` calls and injects `chat_template_kwargs.force_nonempty_content = true` for model IDs matching `/nemotron/i`. Registry v1 can match exact route metadata and apply config/plugin effects, but it cannot express request-body mutations, so no Nemotron registry manifest should be added yet. Future registry support should model request mutations as an explicit OpenClaw-owned effect and should prefer exact supported IDs or provider-class policy once the provider boundary is proven.
- Current onboarding-supported Nemotron IDs affected by the OpenClaw preload are the NVIDIA Endpoints dropdown models `nvidia/nemotron-3-super-120b-a12b` and `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning`, the managed local vLLM Linux profile `nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8`, and local/custom routes whose selected model ID contains `nemotron`, including the Ollama default `nemotron-3-nano:30b` when OpenClaw sends chat-completions for that model. The broad regex is not ideal long-term, but narrowing it now would drop documented compatible-endpoint/NIM/vLLM Nemotron routes and the historical `nvidia/llama-3.3-nemotron-super-49b-v1` failure family without a replacement request-mutation registry capability.
- Direct NVIDIA endpoint probes during this audit show the affordance is still relevant for at least one supported model shape: `nvidia/nemotron-3-nano-omni-30b-a3b-reasoning` returned HTTP 200 with `content: null` and reasoning-only output without `force_nonempty_content`, then returned non-null `content` with `chat_template_kwargs.force_nonempty_content = true`. `nvidia/nemotron-3-super-120b-a12b` returned HTTP 200 with non-empty content in this simple raw probe, but historical issues #1193 and #2051 plus the OpenClaw tool-bearing request shape still justify retaining the mutation until a full agent run proves it obsolete.
- Hermes does not load this Node preload and currently has no Nemotron-specific manifest or runtime shim. Based on code architecture, Hermes passes through its custom chat-completions provider path without this affordance; no Hermes-specific Nemotron handling is justified until Hermes runtime evidence shows a failure.
- Local Ollama was not runtime-tested here: this host only had `qwen2.5:0.5b` and `nemotron-mini:latest` installed, not `nemotron-3-nano:30b`, and no NVIDIA GPU was available for local vLLM. Code inspection shows OpenClaw would currently send the extra `chat_template_kwargs` field for Ollama model IDs containing `nemotron`; future exact manifests/request-mutation metadata should exclude Ollama unless a local Ollama run proves it both accepts and needs the field.

Additional DeepSeek follow-up evidence:

- 2026-05-06 multi-turn continuation attempt — `blocked` by NVIDIA Endpoints rate limiting, not by observed agent/session request-shape behavior. OpenClaw sandbox `deepseek-openclaw-audit-0506` on `main` `97ae39d4a16472eabb81d0c2e82e36eb6a62d6e9` completed turn 1 in persistent session `deepseek-openclaw-multiturn-1778091696` with one `exec` tool call (`hostname`) and `HOSTNAME=deepseek-openclaw-audit-0506`; turn 2 and retry both failed before any model output/tool call with provider `429 status code (no body)`. Evidence: `/sandbox/.openclaw/agents/main/sessions/21f437f3-5c42-4b49-b2d1-08d3def4b6b2.trajectory.jsonl`, `/tmp/gateway.log`, `/tmp/nemoclaw-start.log`. Hermes sandbox `deepseek-hermes-audit-0506` retried the same multi-turn shape through Hermes own API with conversation `deepseek-hermes-multiturn-1778092277`; turn 1 failed before terminal tool use with `HTTP 429: Too Many Requests` after 3 retries. Evidence: `/sandbox/.hermes/sessions/request_dump_api-57283730231debee_20260506_183130_753804.json`, `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log`. Re-run needed after endpoint quota resets to prove multi-turn continuation.

- 2026-05-06 19:19-19:26 UTC retry — OpenClaw multi-turn continuation `pass`; Hermes multi-turn still `blocked` by NVIDIA Endpoints 429. Readiness first cleared at `2026-05-06T19:19:49Z` with raw `POST https://inference.local/v1/chat/completions` returning `HTTP 200`, `nvcf-status: fulfilled`, and content `OK`. OpenClaw sandbox `deepseek-openclaw-audit-0506` then completed persistent session `deepseek-openclaw-multiturn-pass-1778095200`: turn 1 made one `exec` tool call for `hostname` and returned `HOSTNAME=deepseek-openclaw-audit-0506`; turn 2 reused that hostname without re-running `hostname`, made one `exec` tool call `printf 'seen:deepseek-openclaw-audit-0506\n'`, and finalized successfully. Evidence: `/sandbox/.openclaw/agents/main/sessions/550bbc05-d91d-4c54-a127-25a61f9c24e3.jsonl` and `.trajectory.jsonl` (`finalStatus: success`, `toolMetas` contains the `printf` command). Hermes sandbox `deepseek-hermes-audit-0506` turn 1 through Hermes own API returned `HOSTNAME=deepseek-hermes-audit-0506`; turn 2 and a retry failed before final continuation with `HTTP 429: Too Many Requests`, and raw one-token readiness was also back to `HTTP 429` at `2026-05-06T19:26:05Z`. Evidence: `/sandbox/.hermes/sessions/session_api-755da3cf323317c1.json`, `/sandbox/.hermes/sessions/request_dump_api-f14936ec1fec0f20_20260506_192147_850678.json`, `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log`.

Additional OpenAI runtime audit evidence (refreshed 2026-05-07 UTC):

- Repo state: current `main` at `3351fbdd4eb7d9b80ec471545083956327da2b10`; PR #3121 is merged, so current `main` was used. Checkout was clean on `main...origin/main`. Identity/signing audit before any commit path: global `user.name Aaron Erickson`, `user.email aerickson@nvidia.com`, SSH signing key configured, `commit.gpgsign true`. No repo code changes were made.
- Runtime versions observed: NemoClaw CLI `v0.0.35-40-g3351fbdd`; OpenShell `0.0.36`; OpenClaw `2026.4.24 (cbcfdf6)`; Hermes Agent `v0.11.0 (2026.4.23)` with OpenAI SDK `2.24.0`.
- Official OpenAI docs checked: https://developers.openai.com/api/docs/models/gpt-5.4, https://developers.openai.com/api/docs/models/gpt-5.4-mini, https://developers.openai.com/api/docs/models/gpt-5.4-nano, https://developers.openai.com/api/docs/models/gpt-5.4-pro, https://developers.openai.com/api/docs/guides/migrate-to-responses, https://developers.openai.com/api/docs/guides/function-calling, https://developers.openai.com/api/docs/guides/reasoning. Material points: GPT-5.4/mini/nano document both `v1/responses` and `v1/chat/completions` with function calling; OpenAI documents `previous_response_id` with `store: true`, and stateless reasoning requires encrypted reasoning items; direct live probe confirmed `gpt-5.4-pro-2026-03-05` works on `/v1/responses` but returns 404 on `/v1/chat/completions`.
- Common workflows: OpenClaw used `./bin/nemoclaw.js onboard --non-interactive --yes --yes-i-accept-third-party-software --no-gpu --agent openclaw --fresh --recreate-sandbox` with `NEMOCLAW_PROVIDER=openai`, then `openshell sandbox exec ... /usr/local/bin/nemoclaw-start openclaw agent --agent main --json --thinking off --session-id <id> -m <prompt>`. Hermes used `./bin/nemohermes.js onboard ... --agent hermes`, then Hermes' own local API at `http://127.0.0.1:18642`: `/v1/chat/completions` with `model: hermes-agent` for the one-shot prompt, and `/v1/responses` + `previous_response_id` for same-conversation continuation.
- OpenClaw selected NemoClaw/OpenShell route `openai-api` -> sandbox provider `openai/<model>` via `https://inference.local/v1`, with generated `openclaw.json` `api: openai-responses` for all four models. Hermes generated `model.provider: custom`, `model.base_url: https://inference.local/v1`, and `model.default: <model>`; Hermes did not receive an OpenAI-specific provider/API-mode config from NemoClaw.
- Infra notes: OpenShell gateway had intermittent `tls handshake eof` and later k3s disk-pressure/image-pull recovery during sandbox creation; those were resolved by gateway restart and pruning stale generated sandbox images. They are not counted as model failures. OpenClaw mini had one denied allowlist attempt before retrying `hostname` with full exec in the multi-turn first turn; it still completed correctly and is not a model-specific blocker.

| Agent surface | Provider/model | API path selected | State | Evidence paths | Tool-call and final-response behavior | Multi-turn behavior | Required affordance / registry decision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OpenClaw | OpenAI / `gpt-5.4` | NemoClaw/OpenShell `openai-api`; sandbox `openai/gpt-5.4`; `openai-responses` | `pass` | Sandbox `oa54-openclaw-audit-0507`; one-shot `/sandbox/.openclaw/agents/main/sessions/92728641-7ca3-4898-bae5-6620d5e2c1eb.trajectory.jsonl`; multi `/sandbox/.openclaw/agents/main/sessions/0a61dbbb-217f-4e23-8376-328e964fa07c.trajectory.jsonl` | Structured `exec` calls, not raw tool text. One-shot issued 3 shell calls (`hostname`, `date`, `uptime`) and produced final summary. Observed one-shot duration about 33.1s from CLI output. | Turn 1 ran `hostname` and replied `HOSTNAME=oa54-openclaw-audit-0507`; turn 2 did not re-run `hostname`, ran a shell command for `seen:oa54-openclaw-audit-0507`, then summarized. | No OpenAI-model affordance needed. No registry manifest. |
| Hermes | OpenAI / `gpt-5.4` | Hermes local `/v1/chat/completions` and `/v1/responses`; upstream config `custom` -> `https://inference.local/v1`, model `gpt-5.4` | `pass` | Sandbox `oa54-hermes-audit-0507`; one-shot `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`; multi `/sandbox/.hermes/sessions/session_44c94f31-940a-4887-919d-40f01d0328ad.json`; logs `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log` | Structured Hermes `terminal` calls. One-shot issued 3 terminal calls (`hostname`, `date`, `uptime`) and returned a final assistant summary. Latency about 12.1s. | `/v1/responses` continuation issued 2 terminal calls total: `hostname`, then `printf ... seen:oa54-hermes-audit-0507`; no hostname re-run in turn 2; final summary present. | No model-specific affordance. Hermes chat session header continuation requires `API_SERVER_KEY`, but Responses continuation works. No registry manifest. |
| OpenClaw | OpenAI / `gpt-5.4-mini` | NemoClaw/OpenShell `openai-api`; sandbox `openai/gpt-5.4-mini`; `openai-responses` | `pass` | Sandbox `oa54mini-openclaw-audit-0507`; one-shot `/sandbox/.openclaw/agents/main/sessions/7e0d64e2-89df-41c4-8100-c884f95c159f.trajectory.jsonl`; multi `/sandbox/.openclaw/agents/main/sessions/46776da6-ee31-4513-9091-bc6dd9d6ebe0.trajectory.jsonl` | Structured `exec` calls. One-shot issued 3 shell calls and summarized hostname/date/uptime. Runtime about 6.3s. | Turn 1 first tried restricted `hostname` and got allowlist denial, retried full `hostname`, then turn 2 ran `printf 'seen:%s\n' 'oa54mini-openclaw-audit-0507'` without re-running hostname; final summary present. | No model-specific affordance; allowlist retry is OpenClaw policy behavior. No registry manifest. |
| Hermes | OpenAI / `gpt-5.4-mini` | Hermes local `/v1/chat/completions` and `/v1/responses`; upstream config `custom` -> `https://inference.local/v1`, model `gpt-5.4-mini` | `pass` | Sandbox `oa54mini-hermes-audit-0507`; one-shot `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`; multi `/sandbox/.hermes/sessions/session_122350b7-f803-4776-9a68-947ee6d78231.json`; logs `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log` | One-shot made 3 terminal calls plus one `skill_view` prelude, then final summary. Latency about 9.1s. | Multi-turn made `hostname`, then wrote `seen:oa54mini-hermes-audit-0507` to `/tmp/seen_hostname.txt` without re-running hostname; final summary present. | No model-specific affordance. No registry manifest. |
| OpenClaw | OpenAI / `gpt-5.4-nano` | NemoClaw/OpenShell `openai-api`; sandbox `openai/gpt-5.4-nano`; `openai-responses` | `pass` | Sandbox `oa54nano-openclaw-audit-0507`; one-shot `/sandbox/.openclaw/agents/main/sessions/303e744e-7934-4a13-99cd-cc59dab18478.trajectory.jsonl`; multi `/sandbox/.openclaw/agents/main/sessions/3b89aa65-e644-4c15-bc49-ec1dc0a2adf3.trajectory.jsonl` | Structured `exec` calls. One-shot issued 3 shell calls and summarized hostname/date/uptime. Runtime about 6.9s. | Turn 1 ran `hostname`; turn 2 ran `echo seen:oa54nano-openclaw-audit-0507` without re-running hostname; final summary present. | No model-specific affordance. No registry manifest. |
| Hermes | OpenAI / `gpt-5.4-nano` | Hermes local `/v1/chat/completions` and `/v1/responses`; upstream config `custom` -> `https://inference.local/v1`, model `gpt-5.4-nano` | `pass` | Sandbox `oa54nano-hermes-audit-0507`; one-shot `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`; multi `/sandbox/.hermes/sessions/session_1769f4fb-5ef0-4bed-ba6b-47d57c6d3366.json`; logs `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log` | Structured `terminal` calls. One-shot issued 3 terminal calls and summarized hostname/date/uptime. Latency about 8.3s. | Multi-turn made 2 terminal calls total: `hostname`, then `printf 'seen:%s\n' 'oa54nano-hermes-audit-0507'`; no hostname re-run in turn 2; final summary present. | No model-specific affordance. No registry manifest. |
| OpenClaw | OpenAI / `gpt-5.4-pro-2026-03-05` | NemoClaw/OpenShell `openai-api`; sandbox `openai/gpt-5.4-pro-2026-03-05`; `openai-responses` | `blocked` | Sandbox `oa54pro-openclaw-audit-0507`; one-shot `/sandbox/.openclaw/agents/main/sessions/73b6a9aa-80b4-4991-91fa-7ce77c56e880.trajectory.jsonl`; multi turn-1 `/sandbox/.openclaw/agents/main/sessions/fd0ac825-0b24-4280-8245-b3c2e6fe45ad.trajectory.jsonl` | The model emitted a structured `exec hostname` tool call, not raw text, but after the tool result OpenAI returned `404 Item with id ... not found. Items are not persisted when store is set to false`. One-shot stopped after 1 tool call; no `date`, no `uptime`, no final assistant summary. CLI returned nonzero; latency about 44.2s. | Turn 1 of continuation failed the same way after one structured `hostname` call, so turn 2 was not meaningful. | Needs OpenAI Responses statefulness/stateless-reasoning affordance in the OpenClaw OpenAI transport/agent adapter: e.g. `store:true` when carrying response item ids, or encrypted reasoning items when using `store:false`. This is provider transport/request-response behavior, not a JSON setup manifest. #3121 registry v1 cannot express it. |
| Hermes | OpenAI / `gpt-5.4-pro-2026-03-05` | Hermes local API accepted requests, but upstream `custom` path called `https://inference.local/v1/chat/completions` with model `gpt-5.4-pro-2026-03-05` | `blocked` | Sandbox `oa54pro-hermes-audit-0507`; one-shot `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`; multi `/sandbox/.hermes/sessions/session_fbb5c5f1-3ee5-4929-85b1-07ca0bd0f953.json`; request dumps under `/sandbox/.hermes/sessions/request_dump_*042156*.json`, `*042204*.json`, `*042215*.json`; logs `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log` | No tool calls. Hermes returned an assistant error string after upstream retry exhaustion: `HTTP 404: This is not a chat model and thus not supported in the v1/chat/completions endpoint`. One-shot local API status was 200 but semantic result is blocked. Latency about 9.8s. | `/v1/responses` continuation also failed before any tool call because the Hermes agent's upstream model call still used `/v1/chat/completions`; both turns returned the same 404 error text. | Needs Hermes provider/API-mode selection for OpenAI Responses when the model is Responses-only. This mirrors the Anthropic Hermes config-path class (`custom` against `https://inference.local`) but the failure mode differs: OpenAI pro returns model/endpoint 404 rather than provider-policy 403. This is provider/agent config work, not a per-model setup manifest. #3121 registry v1 cannot express it. |

Overall OpenAI verdict: `gpt-5.4`, `gpt-5.4-mini`, and `gpt-5.4-nano` are valid curated agent models for both OpenClaw and Hermes with no model-specific registry affordance. `gpt-5.4-pro-2026-03-05` is blocked on both surfaces for different OpenAI Responses integration reasons: OpenClaw reaches Responses and tool calls but mishandles reasoning/state continuation with `store:false`; Hermes routes through a custom chat-completions upstream path that the pro snapshot rejects. No code change was made in this audit; if fixed later, the work should be scoped to provider transport/API mode or agent adapter behavior, not registry v1 manifests.

## Initial risk classification

### Already has model-aware behavior in code

- [x] `moonshotai/kimi-k2.6`
  - OpenClaw: `pass-with-affordance`; fixed by #3046 and tracked for registry refactor in #3121.
  - Hermes: `pass`; validated on PR #3121 head `be8c398bdaba7e1b9d86501515f5ec1ece6a4f3f` with no Hermes-specific affordance needed.
- [x] `deepseek-ai/deepseek-v4-pro`
  - OpenClaw: `pass-with-affordance`; exact-model runtime request mutation in `scripts/nemoclaw-start.sh` injects `chat_template_kwargs.thinking = false` for `/v1/chat/completions`.
  - Hermes: `pass`; custom chat-completions path works with no DeepSeek-specific Hermes manifest or runtime shim.
  - Registry decision: do not add a manifest under #3121 v1. The live behavior is request mutation plus onboarding validation policy, while registry v1 only expresses agent config/plugin effects. Move the OpenClaw mutation later only after a declarative, OpenClaw-owned request-mutation effect exists; keep onboarding timeout/streaming policy in `src/lib/onboard-inference-probes.ts` or adjacent validation-policy metadata.
- [x] Nemotron-family models
  - Current OpenClaw runtime preload injects `chat_template_kwargs.force_nonempty_content = true` for model IDs matching `nemotron`.
  - Registry decision for #3121 v1: do not add a manifest yet. The behavior is a request-body mutation, while v1 manifests only express route matching plus config/plugin effects. Move later only after request-mutation support exists.
  - Runtime agent validation after the gateway recovered: OpenClaw Super is `pass-with-affordance`, Hermes Super is `pass`, Hermes Omni is `pass`, and OpenClaw Omni is `blocked` by model/runtime response behavior (`NO_REPLY`/thinking-only final after tool results), not by infrastructure.

### High-priority discovery targets

- [ ] `Qwen/Qwen3.6-27B-FP8` through managed local vLLM
  - Managed profile already uses Qwen-specific vLLM parser flags.
- [x] `openai/gpt-oss-120b` through NVIDIA Endpoints
  - Reasoning/tool-capable OSS model behind an OpenAI-compatible route.
- [x] Gemini Pro models through Google's OpenAI-compatible endpoint
  - Re-run with funded Gemini key on 2026-05-06/2026-05-07 UTC. Gemini 3.1 Pro: OpenClaw `blocked` by tool-result continuation/state handling (`400 status code (no body)` after structured tool calls; stale-DNS/proxy `503` cleared on retry), Hermes `pass` with `extra_content.google.thought_signature` preserved. Gemini 2.5 Pro: OpenClaw `pass`, Hermes `pass`.
- [x] `z-ai/glm-5.1`
  - Audited on 2026-05-07 UTC for OpenClaw and Hermes through NVIDIA Endpoints; no GLM-specific setup behavior justified.
- [x] `minimaxai/minimax-m2.7`
  - Discovery required before adding any setup behavior.

### Provider-class policy, not necessarily model-specific setup

- [ ] Other OpenAI-compatible endpoints defaulting to `/v1/chat/completions`
- [ ] Local vLLM forcing `/v1/chat/completions`
- [ ] Local NIM forcing `/v1/chat/completions`
- [ ] Local Ollama forcing `/v1/chat/completions`
- [ ] Local Ollama `tools` capability gate
- [ ] Anthropic-compatible provider adapter behavior

## Required audit scenarios

Each model/provider/agent combination should be classified with evidence from a repeatable scenario set.

### Baseline chat

- [ ] Simple deterministic response works.
- [ ] Provider validation succeeds or fails with a clear actionable reason.
- [ ] No provider credentials leak into sandbox-visible files, logs, or prompts.

### Shell tool loop

Use a standard prompt such as:

```text
Run hostname, then run date, then run uptime. Use separate shell tool calls for each command, and after the tool results, summarize what you saw.
```

Required checks:

- [ ] Structured tool calls are emitted.
- [ ] Tool calls are persisted in the trajectory with non-empty metadata.
- [ ] The expected tool-call count is visible.
- [ ] Tool-call IDs/names correlate with tool results.
- [ ] No raw function-call text is persisted as assistant prose.
- [ ] No combined `hostname; date; uptime` command appears unless that is the explicit expected behavior for the test.
- [ ] No `promptError`.
- [ ] No empty assistant stop.
- [ ] No reasoning-only stop.
- [ ] A final assistant response appears after all tool results.

### Multi-turn continuation

- [ ] The model can use a tool result from turn 1 to decide a dependent tool call in turn 2.
- [ ] Reasoning/thinking state, if present, does not break the next tool turn.
- [ ] The model does not ask the user to continue after a complete tool result.

### Sub-agent delegation

- [ ] Primary agent can decide to delegate.
- [ ] `sessions_spawn` request is structured correctly.
- [ ] Sub-agent receives the intended task, model config, and workspace path.
- [ ] Sub-agent can use tools if the role requires tools.
- [ ] Primary agent can consume the sub-agent result and continue.

### Hermes path

- [ ] Hermes sandbox starts with the selected provider/model.
- [ ] Hermes OpenAI-compatible API returns the expected response shape.
- [ ] Tool/capability expectations are explicit for Hermes, even if Hermes does not exercise the same OpenClaw tool stack.
- [ ] Failures are separated from OpenClaw-only request-shape issues.

### Performance and operability

Track at least:

- [ ] Validation time
- [ ] Time to first token or first streamed event when available
- [ ] Total scenario duration
- [ ] Retry behavior
- [ ] Timeout budget used
- [ ] Whether the model needs streaming to be reliable
- [ ] Whether the model needs a model-specific request mutation
- [ ] Whether the model needs provider-specific API path forcing
- [ ] Whether cold-start behavior differs from warm behavior

## Result states

Every row in the audit matrix should end in one of these states:

- `pass`: works without model-specific changes
- `pass-with-affordance`: works with a documented model/provider affordance
- `degraded`: usable but has documented limitations or performance concerns
- `blocked`: cannot complete required scenarios; follow-up issue required
- `unsupported`: not a supported target for this agent surface
- `not-yet-run`: still pending

## Evidence requirements

Every completed row should include:

- model ID
- provider path
- agent surface
- NemoClaw commit SHA
- OpenClaw/OpenShell version if available
- endpoint/API path selected
- exact command or workflow used
- pass/fail state
- trajectory/log path or CI job link
- observed tool-call count
- observed final-response behavior
- latency/timeout notes
- required affordance, if any
- linked follow-up issue or PR when remediation is needed

## Acceptance criteria for this tracker

- [ ] A maintained audit matrix exists in repo docs or test artifacts, not only in issue comments.
- [ ] Every current onboarding model has an explicit state for OpenClaw primary-agent use.
- [ ] Every current onboarding provider class has an explicit state for the supported local/custom paths.
- [ ] Hermes support is classified separately from OpenClaw support.
- [ ] Sub-agent behavior is tested for at least one primary model and one auxiliary/specialist model path.
- [ ] Each discovered model-specific intervention is either:
  - linked to #3120 as a registry/setup task, or
  - documented as provider-class transport policy outside the registry.
- [ ] No model-specific behavior is added without a focused repro and acceptance test.
- [ ] Kimi K2.6 remains covered as the regression seed case from #3046.

## Non-goals

- Do not turn this into a generic benchmark suite divorced from agent behavior.
- Do not rank models by subjective answer quality unless it affects agent task completion.
- Do not add broad shell-command rewriting as a substitute for provider/model compatibility.
- Do not make model-specific behavior global across providers or endpoints.
- Do not block model onboarding solely because an optional advanced agent surface has not yet been audited; instead classify the gap explicitly.

## Related work

- #2620
- #3046
- #3120















Agent surface	Provider/model	API path selected	State	Evidence paths	Tool-call and final-response behavior	Multi-turn behavior	Required affordance / registry decision
OpenClaw	OpenAI / `gpt-5.4`	NemoClaw/OpenShell `openai-api`; sandbox `openai/gpt-5.4`; `openai-responses`	`pass`	Sandbox `oa54-openclaw-audit-0507`; one-shot `/sandbox/.openclaw/agents/main/sessions/92728641-7ca3-4898-bae5-6620d5e2c1eb.trajectory.jsonl`; multi `/sandbox/.openclaw/agents/main/sessions/0a61dbbb-217f-4e23-8376-328e964fa07c.trajectory.jsonl`	Structured `exec` calls, not raw tool text. One-shot issued 3 shell calls (`hostname`, `date`, `uptime`) and produced final summary. Observed one-shot duration about 33.1s from CLI output.	Turn 1 ran `hostname` and replied `HOSTNAME=oa54-openclaw-audit-0507`; turn 2 did not re-run `hostname`, ran a shell command for `seen:oa54-openclaw-audit-0507`, then summarized.	No OpenAI-model affordance needed. No registry manifest.
Hermes	OpenAI / `gpt-5.4`	Hermes local `/v1/chat/completions` and `/v1/responses`; upstream config `custom` -> `https://inference.local/v1`, model `gpt-5.4`	`pass`	Sandbox `oa54-hermes-audit-0507`; one-shot `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`; multi `/sandbox/.hermes/sessions/session_44c94f31-940a-4887-919d-40f01d0328ad.json`; logs `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log`	Structured Hermes `terminal` calls. One-shot issued 3 terminal calls (`hostname`, `date`, `uptime`) and returned a final assistant summary. Latency about 12.1s.	`/v1/responses` continuation issued 2 terminal calls total: `hostname`, then `printf ... seen:oa54-hermes-audit-0507`; no hostname re-run in turn 2; final summary present.	No model-specific affordance. Hermes chat session header continuation requires `API_SERVER_KEY`, but Responses continuation works. No registry manifest.
OpenClaw	OpenAI / `gpt-5.4-mini`	NemoClaw/OpenShell `openai-api`; sandbox `openai/gpt-5.4-mini`; `openai-responses`	`pass`	Sandbox `oa54mini-openclaw-audit-0507`; one-shot `/sandbox/.openclaw/agents/main/sessions/7e0d64e2-89df-41c4-8100-c884f95c159f.trajectory.jsonl`; multi `/sandbox/.openclaw/agents/main/sessions/46776da6-ee31-4513-9091-bc6dd9d6ebe0.trajectory.jsonl`	Structured `exec` calls. One-shot issued 3 shell calls and summarized hostname/date/uptime. Runtime about 6.3s.	Turn 1 first tried restricted `hostname` and got allowlist denial, retried full `hostname`, then turn 2 ran `printf 'seen:%s\n' 'oa54mini-openclaw-audit-0507'` without re-running hostname; final summary present.	No model-specific affordance; allowlist retry is OpenClaw policy behavior. No registry manifest.
Hermes	OpenAI / `gpt-5.4-mini`	Hermes local `/v1/chat/completions` and `/v1/responses`; upstream config `custom` -> `https://inference.local/v1`, model `gpt-5.4-mini`	`pass`	Sandbox `oa54mini-hermes-audit-0507`; one-shot `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`; multi `/sandbox/.hermes/sessions/session_122350b7-f803-4776-9a68-947ee6d78231.json`; logs `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log`	One-shot made 3 terminal calls plus one `skill_view` prelude, then final summary. Latency about 9.1s.	Multi-turn made `hostname`, then wrote `seen:oa54mini-hermes-audit-0507` to `/tmp/seen_hostname.txt` without re-running hostname; final summary present.	No model-specific affordance. No registry manifest.
OpenClaw	OpenAI / `gpt-5.4-nano`	NemoClaw/OpenShell `openai-api`; sandbox `openai/gpt-5.4-nano`; `openai-responses`	`pass`	Sandbox `oa54nano-openclaw-audit-0507`; one-shot `/sandbox/.openclaw/agents/main/sessions/303e744e-7934-4a13-99cd-cc59dab18478.trajectory.jsonl`; multi `/sandbox/.openclaw/agents/main/sessions/3b89aa65-e644-4c15-bc49-ec1dc0a2adf3.trajectory.jsonl`	Structured `exec` calls. One-shot issued 3 shell calls and summarized hostname/date/uptime. Runtime about 6.9s.	Turn 1 ran `hostname`; turn 2 ran `echo seen:oa54nano-openclaw-audit-0507` without re-running hostname; final summary present.	No model-specific affordance. No registry manifest.
Hermes	OpenAI / `gpt-5.4-nano`	Hermes local `/v1/chat/completions` and `/v1/responses`; upstream config `custom` -> `https://inference.local/v1`, model `gpt-5.4-nano`	`pass`	Sandbox `oa54nano-hermes-audit-0507`; one-shot `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`; multi `/sandbox/.hermes/sessions/session_1769f4fb-5ef0-4bed-ba6b-47d57c6d3366.json`; logs `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log`	Structured `terminal` calls. One-shot issued 3 terminal calls and summarized hostname/date/uptime. Latency about 8.3s.	Multi-turn made 2 terminal calls total: `hostname`, then `printf 'seen:%s\n' 'oa54nano-hermes-audit-0507'`; no hostname re-run in turn 2; final summary present.	No model-specific affordance. No registry manifest.
OpenClaw	OpenAI / `gpt-5.4-pro-2026-03-05`	NemoClaw/OpenShell `openai-api`; sandbox `openai/gpt-5.4-pro-2026-03-05`; `openai-responses`	`blocked`	Sandbox `oa54pro-openclaw-audit-0507`; one-shot `/sandbox/.openclaw/agents/main/sessions/73b6a9aa-80b4-4991-91fa-7ce77c56e880.trajectory.jsonl`; multi turn-1 `/sandbox/.openclaw/agents/main/sessions/fd0ac825-0b24-4280-8245-b3c2e6fe45ad.trajectory.jsonl`	The model emitted a structured `exec hostname` tool call, not raw text, but after the tool result OpenAI returned `404 Item with id ... not found. Items are not persisted when store is set to false`. One-shot stopped after 1 tool call; no `date`, no `uptime`, no final assistant summary. CLI returned nonzero; latency about 44.2s.	Turn 1 of continuation failed the same way after one structured `hostname` call, so turn 2 was not meaningful.	Needs OpenAI Responses statefulness/stateless-reasoning affordance in the OpenClaw OpenAI transport/agent adapter: e.g. `store:true` when carrying response item ids, or encrypted reasoning items when using `store:false`. This is provider transport/request-response behavior, not a JSON setup manifest. #3121 registry v1 cannot express it.
Hermes	OpenAI / `gpt-5.4-pro-2026-03-05`	Hermes local API accepted requests, but upstream `custom` path called `https://inference.local/v1/chat/completions` with model `gpt-5.4-pro-2026-03-05`	`blocked`	Sandbox `oa54pro-hermes-audit-0507`; one-shot `/sandbox/.hermes/sessions/session_api-9816e26b83c423bc.json`; multi `/sandbox/.hermes/sessions/session_fbb5c5f1-3ee5-4929-85b1-07ca0bd0f953.json`; request dumps under `/sandbox/.hermes/sessions/request_dump_042156.json`, `042204.json`, `042215.json`; logs `/sandbox/.hermes/logs/agent.log`, `/sandbox/.hermes/logs/errors.log`	No tool calls. Hermes returned an assistant error string after upstream retry exhaustion: `HTTP 404: This is not a chat model and thus not supported in the v1/chat/completions endpoint`. One-shot local API status was 200 but semantic result is blocked. Latency about 9.8s.	`/v1/responses` continuation also failed before any tool call because the Hermes agent's upstream model call still used `/v1/chat/completions`; both turns returned the same 404 error text.	Needs Hermes provider/API-mode selection for OpenAI Responses when the model is Responses-only. This mirrors the Anthropic Hermes config-path class (`custom` against `https://inference.local`) but the failure mode differs: OpenAI pro returns model/endpoint 404 rather than provider-policy 403. This is provider/agent config work, not a per-model setup manifest. #3121 registry v1 cannot express it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Model performance / capability audit across supported agents #3123

Purpose

Background

Agent surfaces in scope

Supported model inventory to audit

NVIDIA Endpoints

OpenAI

Anthropic

Gemini

Local and experimental providers

Audit results

Initial risk classification

Already has model-aware behavior in code

High-priority discovery targets

Provider-class policy, not necessarily model-specific setup

Required audit scenarios

Baseline chat

Shell tool loop

Multi-turn continuation

Sub-agent delegation

Hermes path

Performance and operability

Result states

Evidence requirements

Acceptance criteria for this tracker

Non-goals

Related work

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Model performance / capability audit across supported agents #3123

Description

Purpose

Background

Agent surfaces in scope

Supported model inventory to audit

NVIDIA Endpoints

OpenAI

Anthropic

Gemini

Local and experimental providers

Audit results

Initial risk classification

Already has model-aware behavior in code

High-priority discovery targets

Provider-class policy, not necessarily model-specific setup

Required audit scenarios

Baseline chat

Shell tool loop

Multi-turn continuation

Sub-agent delegation

Hermes path

Performance and operability

Result states

Evidence requirements

Acceptance criteria for this tracker

Non-goals

Related work

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions