Summary
waza run silently produces meaningless results when the task prompt references
a path that isn't present in the agent's sandbox. Two compounding behaviours:
1. --context-dir does not bulk-copy a directory into the agent's sandboxed
   workspace. It only acts as the resolution base for task inputs.files.
2. With no inputs.files and the default grader (regex_match: ["\\w+"]),
   any reply, including an "I cannot find the path you mentioned, can you
   confirm it?" apology, passes the run.
So when a user writes a prompt like
"explore the repository at ./my-repo and describe its architecture",
the agent lands in an empty /tmp/waza-<id>/, can't find ./my-repo,
apologises, and the run is marked passed.
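To see why the apology passes, here is a minimal Python sketch. It assumes the regex_match grader behaves like an unanchored regular-expression search over the reply text; waza's actual implementation may differ, and `passes_regex_grader` is a hypothetical stand-in, not part of waza.

```python
import re

# Pattern taken from the default grader config: regex_match: ["\\w+"]
DEFAULT_PATTERN = r"\w+"


def passes_regex_grader(reply: str, pattern: str = DEFAULT_PATTERN) -> bool:
    """Hypothetical stand-in for the grader: pass if the pattern occurs anywhere."""
    return re.search(pattern, reply) is not None


apology = "I cannot find the path you mentioned, can you confirm it?"
print(passes_regex_grader(apology))  # True: the apology contains word characters
print(passes_regex_grader(""))       # False: only a reply with no word characters fails
```

Under that assumption, the grader distinguishes "said something" from "said nothing", which tells you nothing about whether the task was done.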
What I observed
Running a 20-trial sweep (two arms, single architecture-description prompt,
copilot-sdk, claude-sonnet-4.5) with a relative path in the prompt:
- Most runs: the agent reports the path doesn't exist, asks the user to confirm,
  and exits after 2-3 turns and ~50-100k tokens (all wasted on the dance).
- A small number of runs: the agent runs find / (escaping the sandbox),
  stumbles onto the real path on the host, and completes. One such run
  consumed 445k tokens.
- Every single one of these was marked passed because the agent's reply
  matched \w+.
Switching the prompt to an absolute path resolved at run time produced
clean, comparable runs (20/20 success, ~480k mean tokens, normal stdev).
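The workaround can be sketched in Python as follows. `build_prompt` is a hypothetical helper in the script that launches the sweep, not a waza API; it resolves the repository path on the host before the prompt text is written into the task.

```python
from pathlib import Path


def build_prompt(repo: str) -> str:
    """Hypothetical helper: embed an absolute path so the prompt does not
    depend on the agent's working directory."""
    # resolve() makes the path absolute relative to the caller's cwd,
    # i.e. where the sweep is launched, not the agent's /tmp sandbox.
    repo_path = Path(repo).resolve()
    if not repo_path.is_dir():
        # Fail before the run starts instead of letting the agent discover it.
        raise FileNotFoundError(f"repository not found: {repo_path}")
    return f"explore the repository at {repo_path} and describe its architecture"
```

Failing fast on a missing directory keeps a typo in the path from turning into another batch of "passed" apology runs.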
Why this is bad
- Silent data contamination. Token-usage / cost / behaviour comparisons
  are meaningless when most runs are the agent giving up early and
  a few are it spelunking outside the sandbox.
- No log/warning indicates the workspace is empty or that the prompt
  refers to a missing path.
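A pre-flight check along the following lines would surface the problem before any tokens are spent. This is a sketch in Python, not waza code: `missing_relative_paths` and the regular expression used to spot relative paths are assumptions made for illustration, and the heuristic only catches references that start with ./ or ../.

```python
import re
from pathlib import Path

# Heuristic (an assumption): treat tokens starting with ./ or ../ as path references.
RELATIVE_PATH = re.compile(r"(?<![\w/.])\.{1,2}/[\w./-]+")


def missing_relative_paths(prompt: str, workspace: Path) -> list[str]:
    """Return relative paths mentioned in the prompt that do not exist in workspace."""
    missing = []
    for ref in RELATIVE_PATH.findall(prompt):
        ref = ref.rstrip(".")  # drop sentence-ending punctuation
        if not (workspace / ref).exists():
            missing.append(ref)
    return missing


prompt = "explore the repository at ./my-repo and describe its architecture"
# /tmp/waza-example is a placeholder workspace path used only for this illustration.
for ref in missing_relative_paths(prompt, Path("/tmp/waza-example")):
    print(f"warning: prompt references {ref}, which is not in the workspace")
```

A warning like this costs one directory lookup per referenced path and would have flagged every run in the sweep described above.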
Suggested fixes (any subset would help)
- Document --context-dir behaviour explicitly: it is a fixtures-file
  base, not a workspace populator. README/INTEGRATION-TESTING examples
  using ./fixtures reinforce the misconception.
- Add a context_dir (or workspace) eval/task setting that bulk-copies
  (or bind-mounts) a directory into the sandbox.
- Log the sandbox contents at run start when -v is set, or warn if the
  sandbox is empty and the prompt contains relative path references.
- Make the default grader stricter, or warn when the only configured
  grader is a permissive regex like \w+.
- Consider sandboxing more aggressively: the agent shouldn't be able to
  reach arbitrary host paths via find /.
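The bulk-copy suggestion above could be as small as the sketch below. It is written in Python purely for illustration and makes no assumption about waza's internals; `stage_workspace` and its parameters are hypothetical names.

```python
import shutil
from pathlib import Path


def stage_workspace(context_dir: Path, sandbox_dir: Path) -> int:
    """Copy everything under context_dir into sandbox_dir.

    Returns the number of files present in the sandbox afterwards.
    """
    if not context_dir.is_dir():
        raise NotADirectoryError(f"context dir not found: {context_dir}")
    # dirs_exist_ok lets us copy into a sandbox directory that already exists
    # (requires Python 3.8+).
    shutil.copytree(context_dir, sandbox_dir, dirs_exist_ok=True)
    return sum(1 for p in sandbox_dir.rglob("*") if p.is_file())
```

Returning the file count gives the caller a natural hook for the empty-sandbox warning: a result of 0 means the agent is about to start in an empty workspace.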
Related: the default grader lets almost any reply pass, which is too permissive for tool-using evals. This is the same theme as #266 (After 0.33.0 tool calls are getting rejected with "unexpected user permission response"), where 0.33.0 broke all tool calls but runs still passed.
Happy to share a minimal repro spec if useful.