Empty sandbox + permissive default grader silently produces passing-but-useless runs

## Summary

`waza run` silently produces meaningless results when the task prompt references
a path that isn't present in the agent's sandbox. Two compounding behaviours:

1. `--context-dir` does not bulk-copy a directory into the agent's sandboxed
   workspace. It only acts as the resolution base for task `inputs.files`.
2. With no `inputs.files` and the default grader (`regex_match: ["\\w+"]`),
   any reply, including an "I cannot find the path you mentioned, can you
   confirm it?" apology, **passes** the run.

So when a user writes a prompt like
"explore the repository at `./my-repo` and describe its architecture",
the agent lands in an empty `/tmp/waza-<id>/`, can't find `./my-repo`,
apologises, and the run is marked `passed`.
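
The second behaviour can be illustrated outside waza. This is a minimal Python sketch, assuming the grader applies each `regex_match` pattern with search semantics; `passes_default_grader` is a hypothetical helper for illustration, not waza's code:

```python
import re

# Assumption: the default grader behaves like re.search with this pattern
# (the YAML value "\\w+" unescapes to the regex \w+).
DEFAULT_PATTERN = r"\w+"


def passes_default_grader(reply: str) -> bool:
    """Return True if the reply contains at least one word character."""
    return re.search(DEFAULT_PATTERN, reply) is not None


apology = "I cannot find the path you mentioned, can you confirm it?"
print(passes_default_grader(apology))  # True: the apology counts as a pass
print(passes_default_grader(""))       # False: only an empty reply fails
```

Under that assumption, the only reply that fails is one with no letters, digits, or underscores at all.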

## What I observed

Running a 20-trial sweep (two arms, single architecture-description prompt,
copilot-sdk, claude-sonnet-4.5) with a relative path in the prompt:

- Most runs: the agent reports that the path doesn't exist, asks the user to
  confirm it, and exits after 2-3 turns and ~50-100k tokens, all of them
  wasted.
- A small number of runs: the agent runs `find /` (escaping the sandbox),
  stumbles onto the real path on the host, and completes. One such run
  consumed 445k tokens.
- Every single one of these was marked `passed` because the agent's reply
  matched `\w+`.

Switching the prompt to an absolute path resolved at run time produced
clean, comparable runs (20/20 success, ~480k mean tokens, normal stdev).

## Why this is bad

- Silent data contamination. Token-usage / cost / behaviour comparisons
  are meaningless when most runs are the agent giving up early and
  the rest are it spelunking outside the sandbox.
- No log/warning indicates the workspace is empty or that the prompt
  refers to a missing path.
- The default grader effectively makes "the model produced any text" a
  pass, which is too permissive for tool-using evals. (Related theme to
  #266, where 0.33.0 broke all tool calls but runs still passed.)

## Suggested fixes (any subset would help)

- Document `--context-dir` behaviour explicitly: it is a fixtures-file
  base, not a workspace populator. README/INTEGRATION-TESTING examples
  using `./fixtures` reinforce the misconception.
- Add a `context_dir` (or `workspace`) eval/task setting that bulk-copies
  (or bind-mounts) a directory into the sandbox.
- Log the sandbox contents at run start when `-v`, or warn if the
  sandbox is empty and the prompt contains relative path references.
- Make the default grader stricter, or warn when the only configured
  grader is a permissive regex like `\w+`.
- Consider sandboxing more aggressively: the agent shouldn't be able to
  reach arbitrary host paths via `find /`.
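
The two warning ideas above could look roughly like the following pre-flight check. This is a sketch only: every name in it (`preflight_warnings`, `is_permissive`, the relative-path heuristic, the canary replies) is hypothetical and not part of waza, and a real implementation would need to hook into the actual sandbox setup and grader config.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical heuristic: tokens that start with ./ or ../
RELATIVE_PATH_RE = re.compile(r"(?<![\w/.])\.{1,2}/[\w.\-/]+")


def find_relative_paths(prompt: str) -> list[str]:
    """Extract ./foo or ../foo style references, dropping a trailing full stop."""
    return [m.rstrip(".") for m in RELATIVE_PATH_RE.findall(prompt)]


def is_permissive(pattern: str) -> bool:
    """Heuristic: a pattern that matches both a one-character reply and a
    canned refusal cannot tell a useful run from a useless one."""
    canaries = ["x", "I cannot find the path you mentioned."]
    return all(re.search(pattern, c) for c in canaries)


def preflight_warnings(prompt: str, sandbox: Path, grader_patterns: list[str]) -> list[str]:
    """Collect warnings about an empty sandbox, missing paths, and weak graders."""
    warnings = []
    if not any(sandbox.iterdir()):
        warnings.append(f"sandbox {sandbox} is empty")
    for rel in find_relative_paths(prompt):
        if not (sandbox / rel).exists():
            warnings.append(f"prompt references {rel!r}, which is not in the sandbox")
    if grader_patterns and all(is_permissive(p) for p in grader_patterns):
        warnings.append("every configured regex grader matches trivial replies")
    return warnings


prompt = "explore the repository at `./my-repo` and describe its architecture"
with tempfile.TemporaryDirectory(prefix="waza-") as tmp:
    for warning in preflight_warnings(prompt, Path(tmp), [r"\w+"]):
        print("warning:", warning)
```

For the scenario in this issue (empty sandbox, `./my-repo` in the prompt, `\w+` as the only grader), the sketch emits all three warnings. It emits none once the directory exists in the sandbox and the grader pattern is specific.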

## Environment

- waza v0.31.0 (also reproduces with v0.33.0, modulo #266)
- copilot-sdk executor, claude-sonnet-4.5
- Linux

Happy to share a minimal repro spec if useful.

