
Empty sandbox + permissive default grader silently produces passing-but-useless runs #273

Description

@JayDoubleu

Summary

waza run silently produces meaningless results when the task prompt references
a path that isn't present in the agent's sandbox. Two compounding behaviours:

  1. --context-dir does not bulk-copy a directory into the agent's sandboxed
    workspace. It only acts as the resolution base for task inputs.files.
  2. With no inputs.files and the default grader (regex_match: ["\\w+"]),
    any reply, including an "I cannot find the path you mentioned, can you
    confirm it?" apology, passes the run.

So when a user writes a prompt like
"explore the repository at ./my-repo and describe its architecture",
the agent lands in an empty /tmp/waza-<id>/, can't find ./my-repo,
apologises, and the run is marked passed.
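The second behaviour is easy to reproduce outside waza. The snippet below is a minimal Python illustration, not waza's grader code: it assumes the default grader behaves like an unanchored regex search for the pattern quoted above, and `passes_default_grader` is a hypothetical name used only for this sketch.

```python
import re

# Assumption: the default grader is equivalent to an unanchored search
# for the pattern quoted in this issue (regex_match: ["\\w+"]).
DEFAULT_PATTERN = r"\w+"


def passes_default_grader(reply: str) -> bool:
    """Hypothetical stand-in: pass if the pattern matches anywhere in the reply."""
    return re.search(DEFAULT_PATTERN, reply) is not None


apology = "I cannot find the path you mentioned, can you confirm it?"
real_answer = "The repository is a layered service with an API and a worker."

print(passes_default_grader(apology))      # True
print(passes_default_grader(real_answer))  # True
print(passes_default_grader(""))           # False
```

Under that assumption, only an empty (or punctuation-only) reply fails, so the grader cannot tell an apology from a real answer.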

What I observed

Running a 20-trial sweep (two arms, single architecture-description prompt,
copilot-sdk, claude-sonnet-4.5) with a relative path in the prompt:

  • Most runs: agent reports the path doesn't exist, asks the user to confirm,
    exits with 2-3 turns and ~50-100k tokens (all wasted on the dance).
  • A small number of runs: the agent runs find / (escaping the sandbox),
    stumbles onto the real path on the host, and completes. One such run
    consumed 445k tokens.
  • Every single one of these was marked passed because the agent's reply
    matched \w+.

Switching the prompt to an absolute path resolved at run time produced
clean, comparable runs (20/20 success, ~480k mean tokens, normal stdev).
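For reference, the workaround amounts to something like the following sketch. It is Python written for illustration only: `PROMPT_TEMPLATE` and `render_prompt` are hypothetical names, and how the rendered prompt is then handed to waza is not shown here.

```python
from pathlib import Path

# Hypothetical template; the wording mirrors the example prompt in this issue.
PROMPT_TEMPLATE = "explore the repository at {repo} and describe its architecture"


def render_prompt(repo: str) -> str:
    """Resolve the repo path against the current directory before it enters the prompt."""
    absolute = Path(repo).expanduser().resolve()
    return PROMPT_TEMPLATE.format(repo=absolute)


print(render_prompt("./my-repo"))
```

This makes the runs comparable, but it works only because the agent can read host paths outside its sandbox, which is the same gap noted in the observations above.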

Why this is bad

  • Silent data contamination. Token-usage / cost / behaviour comparisons
    are meaningless when half the runs are the agent giving up early and
    half are it spelunking outside the sandbox.
  • No log/warning indicates the workspace is empty or that the prompt
    refers to a missing path.
  • The default grader effectively makes "the model produced any text" a
    pass, which is too permissive for tool-using evals. (Related theme to
    #266, "After 0.33.0 tool calls are getting rejected with 'unexpected
    user permission response'", where 0.33.0 broke all tool calls but runs
    still passed.)

Suggested fixes (any subset would help)

  • Document --context-dir behaviour explicitly: it is a fixtures-file
    base, not a workspace populator. README/INTEGRATION-TESTING examples
    using ./fixtures reinforce the misconception.
  • Add a context_dir (or workspace) eval/task setting that bulk-copies
    (or bind-mounts) a directory into the sandbox.
  • Log the sandbox contents at run start when -v, or warn if the
    sandbox is empty and the prompt contains relative path references.
  • Make the default grader stricter, or warn when the only configured
    grader is a permissive regex like \w+.
  • Consider sandboxing more aggressively: the agent shouldn't be able to
    reach arbitrary host paths via find /.
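To make the warning-style suggestions concrete, here is a rough Python sketch of a pre-run check. Nothing in it is waza code or a waza API: `find_relative_paths`, `preflight_warnings`, the path-matching regex, and the set of "permissive" patterns are all assumptions chosen for illustration.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical heuristics; none of these names exist in waza.
# Matches tokens such as ./my-repo or ../data/file.txt in free text.
RELATIVE_PATH_RE = re.compile(r"(?<![\w/.])\.{1,2}/[\w.\-/]+")
# Patterns that match practically any non-empty reply.
PERMISSIVE_PATTERNS = {r"\w+", r".+", r".*", r"\S+"}


def find_relative_paths(prompt: str) -> list[str]:
    """Return relative path references found in the prompt text."""
    # Strip a trailing full stop picked up from sentence punctuation.
    return [m.rstrip(".") for m in RELATIVE_PATH_RE.findall(prompt)]


def preflight_warnings(prompt: str, sandbox: Path, grader_patterns: list[str]) -> list[str]:
    """Collect warnings about conditions that make a run meaningless."""
    warnings = []
    if not any(sandbox.iterdir()):
        warnings.append("sandbox is empty at run start")
    for rel in find_relative_paths(prompt):
        if not (sandbox / rel).exists():
            warnings.append(f"prompt references {rel!r}, which is missing from the sandbox")
    if grader_patterns and all(p in PERMISSIVE_PATTERNS for p in grader_patterns):
        warnings.append("only permissive regex graders are configured")
    return warnings


prompt = "explore the repository at ./my-repo and describe its architecture"
with tempfile.TemporaryDirectory() as tmp:
    for warning in preflight_warnings(prompt, Path(tmp), [r"\w+"]):
        print("WARN:", warning)
```

For the scenario in this issue (empty sandbox, relative path in the prompt, default grader), a check like this would emit all three warnings before any tokens are spent.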

Environment

Happy to share a minimal repro spec if useful.
