test(evals): add behavioral eval for file creation and write_file tool selection #24806

The `evals/` directory currently has no behavioral eval covering how the agent selects `write_file` during file creation tasks. Existing evals cover `grep_search`, `read_file` frugality, `edit`/`replace` for modifications, and various agent steering behaviors, but the file creation path is untested at the behavioral level.

The `write_file` integration test in `integration-tests/write_file.test.ts` verifies only that the tool functions correctly (writes to disk). It does not test whether the **model chooses the right action** in realistic scenarios:

- Does the agent use `write_file` (not `edit`) when asked to create a new file?
- Does the agent read an existing file before overwriting it?
- Does the agent correctly scaffold multiple related files in the right directory structure?
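The first two questions come down to checks on the order of tool calls in a run. Below is a minimal sketch of those checks as pure functions. The `ToolCall` shape and both helper names are hypothetical, made up for illustration; the real eval harness may record tool calls differently.

```typescript
// Hypothetical record of one tool call made by the agent during an eval run.
// The field names are assumptions for this sketch, not the harness's real log format.
interface ToolCall {
  name: string; // e.g. "write_file", "read_file", "edit", "replace"
  filePath?: string; // target path, when the tool takes one
}

const EDIT_TOOLS = new Set(["edit", "replace"]);

// Question 1: the first mutating tool aimed at `path` is write_file,
// not an edit-style tool pointed at a file that does not exist yet.
function createdWithWriteFile(calls: ToolCall[], path: string): boolean {
  const firstMutation = calls.find(
    (c) => c.filePath === path && (c.name === "write_file" || EDIT_TOOLS.has(c.name)),
  );
  return firstMutation !== undefined && firstMutation.name === "write_file";
}

// Question 2: the first write_file on `path` is preceded by a read_file on the same path.
// Returns true when the file is never written, since nothing was overwritten blindly.
function readBeforeOverwrite(calls: ToolCall[], path: string): boolean {
  const firstWrite = calls.findIndex((c) => c.name === "write_file" && c.filePath === path);
  if (firstWrite === -1) return true;
  return calls
    .slice(0, firstWrite)
    .some((c) => c.name === "read_file" && c.filePath === path);
}

// Example log: the agent reads config.json, overwrites it, then creates src/logger.ts.
const exampleCalls: ToolCall[] = [
  { name: "read_file", filePath: "config.json" },
  { name: "write_file", filePath: "config.json" },
  { name: "write_file", filePath: "src/logger.ts" },
];

console.log(createdWithWriteFile(exampleCalls, "src/logger.ts")); // true
console.log(readBeforeOverwrite(exampleCalls, "config.json")); // true
```

Keeping the checks independent of the harness means they can be unit tested against hand-written logs before being wired into an eval.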

These are real quality gaps — I've observed the agent occasionally using `replace`/`edit` on non-existent files or overwriting existing files without reading them first.

### Proposed fix

Add `evals/file_creation_behavior.eval.ts` with three `USUALLY_PASSES` behavioral evals:

1. **`should create a new file in the correct directory when asked`** — Verifies the agent uses `write_file` to create `src/logger.ts` in an existing project, places it in the correct directory, and does not modify existing files when instructed not to.

2. **`should not overwrite existing file when creating new file with same name`** — Verifies the agent reads `config.json` before overwriting it when asked to create a new config with different settings. Tests the agent's awareness of existing file content.

3. **`should scaffold multiple related files in correct locations`** — Verifies the agent creates `src/auth/validator.ts` and `src/auth/types.ts` with correct exports, in the right directory structure, without modifying existing project files.
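The file-placement side of evals 1 and 3 could be asserted by comparing the project tree before and after the run. The sketch below assumes the harness can produce both snapshots as path-to-content maps. `Snapshot`, `CreationExpectation`, `checkCreation`, and the pre-existing files `package.json` and `src/index.ts` are hypothetical names used only for illustration.

```typescript
// Snapshot of a project tree: relative path -> file content.
type Snapshot = Record<string, string>;

// Hypothetical per-eval expectation; the field names are assumptions for this sketch.
interface CreationExpectation {
  mustCreate: string[]; // new files the agent should add
  mustContain?: Record<string, string>; // path -> substring expected in that file
}

// Returns a list of human-readable failures; an empty list means the run passed.
function checkCreation(
  before: Snapshot,
  after: Snapshot,
  expected: CreationExpectation,
): string[] {
  const failures: string[] = [];
  for (const path of expected.mustCreate) {
    if (before[path] !== undefined) failures.push(`${path} already existed before the run`);
    else if (after[path] === undefined) failures.push(`${path} was not created`);
  }
  const mustContain = expected.mustContain ?? {};
  for (const path of Object.keys(mustContain)) {
    const needle = mustContain[path];
    if (!(after[path] ?? "").includes(needle)) failures.push(`${path} is missing "${needle}"`);
  }
  // Every pre-existing file must survive with identical content.
  for (const path of Object.keys(before)) {
    if (after[path] !== before[path]) failures.push(`${path} was modified or deleted`);
  }
  return failures;
}

// Example for the scaffolding eval: two new files under src/auth/, nothing else touched.
const beforeRun: Snapshot = { "package.json": "{}", "src/index.ts": "export {};\n" };
const afterRun: Snapshot = {
  ...beforeRun,
  "src/auth/validator.ts": "export function validate(token: string) { return token.length > 0; }\n",
  "src/auth/types.ts": "export interface AuthToken { value: string }\n",
};
const scaffoldFailures = checkCreation(beforeRun, afterRun, {
  mustCreate: ["src/auth/validator.ts", "src/auth/types.ts"],
  mustContain: { "src/auth/validator.ts": "export", "src/auth/types.ts": "export" },
});
console.log(scaffoldFailures.length); // 0
```

The "must survive unchanged" rule fits evals 1 and 3 only. Eval 2 deliberately overwrites `config.json`, so it would need a tool-call ordering check instead of a snapshot diff.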
