Summary
The netclaw -p (headless single-prompt) mode and netclaw chat --resume <id> (interactive TUI resume) are separate code paths that don't compose. This blocks multi-turn eval cases, KV cache benchmarking, compaction regression testing, and any scripted conversation that needs more than one turn.
What Changes
1. Move -p under chat as a flag
Current:
netclaw -p "hello" # top-level shortcut, headless
netclaw chat # interactive TUI
netclaw chat --resume <id> # resume, interactive TUI
Proposed:
netclaw chat -p "hello" # new session, headless
netclaw chat -p --resume <id> "follow-up" # resume session, headless
netclaw chat --resume <id> # resume session, interactive TUI
netclaw chat # new session, interactive TUI
Keep netclaw -p as a backward-compat alias that delegates to netclaw chat -p.
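The alias can be thought of as a pure argument rewrite. The sketch below is only an illustration in shell: the function name netclaw_rewrite_args is hypothetical, and it prints the rewritten command line rather than executing it. How the real CLI dispatches the alias is not specified here.

```shell
# Hypothetical shim: rewrite a leading `-p` into `chat -p` and print the
# resulting command line. A real shim would exec the command instead.
netclaw_rewrite_args() {
  if [ "${1:-}" = "-p" ]; then
    shift
    set -- chat -p "$@"
  fi
  # Any other invocation is passed through unchanged.
  printf '%s\n' "netclaw $*"
}
```

Under this model, netclaw_rewrite_args -p hello prints netclaw chat -p hello, while invocations that already start with chat are left alone.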
2. --resume with "ensure session" semantics
When --resume <id> specifies a session ID that doesn't exist, create a new session with that ID instead of failing. This gives callers deterministic session naming without needing to capture IDs from prior turns.
Semantics:
- Session exists → resume it (append to existing conversation)
- Session doesn't exist → create it with the given ID
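The ensure step is idempotent: the same command works whether or not the session already exists (each call still appends a new turn). This makes scripting trivial: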
# Eval runner: deterministic session names, no capture/parse between turns
netclaw chat -p --resume "eval/grounding-test-1" "hello"
netclaw chat -p --resume "eval/grounding-test-1" "what did I just say?"
netclaw chat -p --resume "eval/grounding-test-1" "what is your session id?"
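The create-or-resume rule can be modelled without netclaw at all. The following is a toy sketch, assuming one transcript file per session ID. SESSION_DIR, ensure_session, append_turn and the .log layout are all hypothetical and say nothing about how the daemon actually persists sessions.

```shell
# Toy model of "ensure session": one transcript file per session ID.
# SESSION_DIR and the file layout are assumptions for illustration only.
SESSION_DIR="${SESSION_DIR:-$(mktemp -d)}"

ensure_session() {
  local id="$1"
  local path="$SESSION_DIR/$id.log"
  # IDs such as "eval/grounding-test-1" contain a slash, so create parents.
  mkdir -p "$(dirname "$path")"
  # Create the transcript only when it is missing; never truncate it.
  [ -e "$path" ] || : > "$path"
  printf '%s\n' "$path"
}

append_turn() {
  local path
  path="$(ensure_session "$1")" || return 1
  printf '%s\n' "$2" >> "$path"
}
```

Calling append_turn twice with the same ID leaves a single transcript holding both turns, which is the behaviour the three commands above rely on.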
3. Output format for headless resume
chat -p --resume should output the same format as current -p:
- Default: plain text (assistant's response)
- --json: structured JSON including sessionId, response text, tool calls, usage
The sessionId field in --json output is how callers discover the ID when they DON'T use --resume (auto-generated session). When they DO use --resume, it echoes back the ID they specified.
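A caller that starts with an auto-generated session might capture the ID like this. This is a sketch under several assumptions: that --json can be combined with chat -p in this argument order, that sessionId is a top-level string field, and that a dependency-free sed extraction is acceptable. A real JSON parser such as jq would be more robust. The function name is hypothetical.

```shell
# Start an auto-generated session, capture its ID from --json output,
# then resume that session headlessly with a second prompt.
# Assumption: the output contains a field of the form "sessionId": "<id>".
two_turns_with_captured_id() {
  local first id
  first="$(netclaw chat -p --json "$1")" || return 1
  id="$(printf '%s\n' "$first" \
    | sed -n 's/.*"sessionId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    | head -n 1)"
  # Refuse to continue if no ID could be extracted.
  [ -n "$id" ] || return 1
  netclaw chat -p --resume "$id" "$2"
}
```

With --resume and a caller-chosen ID this capture step disappears, which is the main argument for ensure semantics.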
Motivation
Multi-turn evals
The eval suite (evals/run-evals.sh) is single-turn only: every case runs netclaw -p, which creates a fresh session. This means we can't test:
- Compaction behavior (needs 10+ turns to trigger)
- Post-compaction grounding (does the agent remember context after compaction?)
- KV cache performance (does turn N respond faster than turn 1?)
- Conversation continuity (does the agent maintain coherence across turns?)
- Session ID self-awareness (does the agent know its own session ID after compaction?)
All of these were real production failures during the compaction rework (PR #597, #598).
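With the proposed flags, a multi-turn eval case could be driven from a plain list of prompts. A minimal sketch, assuming the chat -p --resume syntax proposed above; run_turns is a hypothetical helper, not part of evals/run-evals.sh.

```shell
# Send each non-empty line of stdin as one headless turn to the same
# named session. Stops at the first failing turn.
run_turns() {
  local session="$1" prompt
  while IFS= read -r prompt; do
    [ -n "$prompt" ] || continue
    # Redirect stdin so the CLI cannot consume the remaining prompts.
    netclaw chat -p --resume "$session" "$prompt" < /dev/null || return 1
  done
}
```

A compaction case would then be a prompt file with enough lines to cross the threshold, for example run_turns "eval/compaction-1" < prompts.txt, where both the session name and the file are placeholders.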
KV cache benchmarking
Session-sticky LLM routing (PR #610 / issue #609) pins same-session requests to the same GPU for KV cache reuse. Measuring the impact requires multi-turn conversations where turn 2+ should be measurably faster than turn 1. Single-turn evals can't observe this.
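A rough way to observe this from the outside is to time each turn of one session. The sketch below assumes GNU date (for %N nanoseconds) and measures wall-clock time for the whole command, so it includes process startup and generation time and is only a coarse proxy for KV cache reuse. time_turns is a hypothetical helper.

```shell
# Print "<turn number><TAB><elapsed ms>" for each prompt sent to one session.
# Requires GNU date for nanosecond resolution (%N).
time_turns() {
  local session="$1"; shift
  local turn=0 prompt start end
  for prompt in "$@"; do
    turn=$((turn + 1))
    start="$(date +%s%N)"
    netclaw chat -p --resume "$session" "$prompt" > /dev/null || return 1
    end="$(date +%s%N)"
    printf '%d\t%d\n' "$turn" "$(( (end - start) / 1000000 ))"
  done
}
```

Comparing the first row against later rows over several runs gives a first-order signal. Responses of different lengths will add noise, so prompts that elicit similarly sized answers are preferable.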
Scripted test scenarios
QA workflows, regression tests, and demo scripts all benefit from scripted multi-turn conversations without needing the interactive TUI.
Docker Smoke Test
The Smoke Sandbox CI check (scripts/docker/smoke-test.sh or equivalent) should gain a basic multi-turn validation:
# Turn 1: create named session
netclaw chat -p --resume "smoke/multi-turn" "hello"
# Turn 2: resume and verify continuity
RESPONSE=$(netclaw chat -p --resume "smoke/multi-turn" "what was my first message?")
# Assert the agent references "hello" in some form
echo "$RESPONSE" | grep -qi "hello"
This validates that --resume creates, resumes, and maintains conversation state through the daemon's persistence layer.
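One possible hardening, offered as a sketch rather than a requirement: because ensure semantics resume any existing session with the same ID, a fixed name can pick up state from an earlier run, and a greeting such as "hello" can appear in a reply by coincidence. The variant below uses a per-run session name and a unique token. smoke_multi_turn and the token format are hypothetical.

```shell
# Multi-turn smoke check with a per-run session ID and a unique token,
# so the assertion cannot pass by accident or through stale state.
smoke_multi_turn() {
  local nonce session response
  nonce="token-$$-$(date +%s)"
  session="smoke/multi-turn-$nonce"
  # Turn 1: create the session and plant the token.
  netclaw chat -p --resume "$session" "Remember this code: $nonce" > /dev/null || return 1
  # Turn 2: resume and ask for the token back.
  response="$(netclaw chat -p --resume "$session" "What code did I ask you to remember?")" || return 1
  # Succeeds only if the reply contains the exact token.
  printf '%s\n' "$response" | grep -qF "$nonce"
}
```

The function returns non-zero on any failed turn or missing token, so a CI script can call it directly.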
Acceptance Criteria
- netclaw chat -p "prompt" works identically to current netclaw -p "prompt"
- netclaw chat -p --resume <id> "prompt" sends a headless prompt to an existing or new session with the given ID
- netclaw -p remains as a backward-compat alias
- --resume with a non-existent ID creates the session with that ID (ensure semantics)
- --json output includes sessionId field
- … chat -p --resume
- … -p tests continue to pass
Out of scope
- … (chat --resume already works)