
test(evals): add behavioral eval for file creation and write_file tool selection #24806

@surajsahani

Description

The evals/ directory currently has no behavioral eval covering the agent's write_file tool selection behavior during file creation tasks. Existing evals cover grep_search, read_file frugality, edit/replace for modifications, and various agent steering behaviors — but the file creation path is untested at the behavioral level.

The write_file integration test in integration-tests/write_file.test.ts verifies only that the tool itself works (it writes to disk). It does not test whether the model chooses the right action in realistic scenarios:

  • Does the agent use write_file (not edit) when asked to create a new file?
  • Does the agent read an existing file before overwriting it?
  • Does the agent correctly scaffold multiple related files in the right directory structure?
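To make the first two questions concrete, here is a minimal sketch of how they could be phrased as checks over a recorded list of tool calls. The `ToolCall` shape and both function names are hypothetical and are not taken from the existing eval harness; only the tool names (write_file, read_file, replace) come from the discussion above.

```typescript
// Hypothetical shape of one recorded tool call. The real eval harness
// may expose tool logs in a different form.
interface ToolCall {
  name: string; // e.g. "write_file", "read_file", "replace"
  path: string; // file the call targets
}

// Index of the first write_file call on `target`, or -1 if there is none.
function firstWriteIndex(calls: ToolCall[], target: string): number {
  return calls.findIndex((c) => c.name === 'write_file' && c.path === target);
}

// True if `target` was created via write_file and the agent did not
// attempt a replace on it before that first write.
function createdWithWriteFile(calls: ToolCall[], target: string): boolean {
  const idx = firstWriteIndex(calls, target);
  if (idx === -1) return false;
  return !calls
    .slice(0, idx)
    .some((c) => c.name === 'replace' && c.path === target);
}

// True if every overwrite of an existing `target` is preceded by a
// read_file on the same path. A run that never writes `target` passes.
function readBeforeOverwrite(calls: ToolCall[], target: string): boolean {
  const idx = firstWriteIndex(calls, target);
  if (idx === -1) return true;
  return calls
    .slice(0, idx)
    .some((c) => c.name === 'read_file' && c.path === target);
}
```

Keeping the checks as pure functions over a call list means they can be unit-tested without running the model, and the eval itself only has to supply the recorded calls.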

These are real quality gaps — I've observed the agent occasionally using replace/edit on non-existent files or overwriting existing files without reading them first.

Proposed fix

Add evals/file_creation_behavior.eval.ts with three USUALLY_PASSES behavioral evals:

  1. should create a new file in the correct directory when asked — Verifies the agent uses write_file to create src/logger.ts in an existing project, places it in the correct directory, and does not modify existing files when instructed not to.

  2. should not overwrite existing file when creating new file with same name — Verifies the agent reads config.json before overwriting it when asked to create a new config with different settings. Tests the agent's awareness of existing file content.

  3. should scaffold multiple related files in correct locations — Verifies the agent creates src/auth/validator.ts and src/auth/types.ts with correct exports, in the right directory structure, without modifying existing project files.
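The "creates the expected files without modifying existing ones" condition shared by evals 1 and 3 could be checked by comparing the project's contents before and after the run. The sketch below assumes the file tree is captured as an in-memory map from path to content; `Snapshot`, `diffSnapshots`, and `scaffoldedWithoutTouchingExisting` are illustrative names, not part of the current evals/ helpers.

```typescript
// Hypothetical in-memory snapshot of a project: path -> file content.
type Snapshot = Map<string, string>;

interface SnapshotDiff {
  created: string[];  // paths present only after the run
  modified: string[]; // paths present in both with different content
  deleted: string[];  // paths present only before the run
}

// Compare two snapshots and report which paths changed and how.
function diffSnapshots(before: Snapshot, after: Snapshot): SnapshotDiff {
  const created = [...after.keys()].filter((p) => !before.has(p)).sort();
  const deleted = [...before.keys()].filter((p) => !after.has(p)).sort();
  const modified = [...before.keys()]
    .filter((p) => after.has(p) && before.get(p) !== after.get(p))
    .sort();
  return { created, modified, deleted };
}

// True if every expected path was created and no pre-existing file was
// modified or deleted. Extra created files are tolerated in this sketch.
function scaffoldedWithoutTouchingExisting(
  diff: SnapshotDiff,
  expected: string[],
): boolean {
  const allCreated = expected.every((p) => diff.created.includes(p));
  return allCreated && diff.modified.length === 0 && diff.deleted.length === 0;
}

// Example: the scaffold scenario from eval 3, with placeholder file contents.
const before: Snapshot = new Map([['src/index.ts', 'export {};']]);
const after: Snapshot = new Map([
  ['src/index.ts', 'export {};'],
  ['src/auth/validator.ts', 'export function validate() {}'],
  ['src/auth/types.ts', 'export interface User {}'],
]);
const scaffoldOk = scaffoldedWithoutTouchingExisting(
  diffSnapshots(before, after),
  ['src/auth/validator.ts', 'src/auth/types.ts'],
);
```

Whether extra created files should fail the eval is a design choice; a stricter variant would require `diff.created` to match the expected list exactly.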

Labels: area/agent, status/need-triage
