test(evals): add behavioral eval for file creation and write_file tool selection #26292

Merged
akh64bit merged 4 commits into main from feature/file-creation-behavior-eval
May 1, 2026

@akh64bit
Contributor

Fixes #24806

Updates the prompt instructions so the agent stages only the files changed for the task, which prevents unprompted staging of untracked files.

Fixes #24628
@akh64bit akh64bit requested review from a team as code owners April 30, 2026 21:07
@github-actions

You already have 7 pull requests open. Please work on getting existing PRs merged before opening more.

@github-actions github-actions Bot closed this Apr 30, 2026
@github-actions

github-actions Bot commented Apr 30, 2026

🛑 Action Required: Evaluation Approval

Steering changes have been detected in this PR. To prevent regressions, a maintainer must approve the evaluation run before this PR can be merged.

Maintainers:

  1. Go to the Workflow Run Summary.
  2. Click the yellow 'Review deployments' button.
  3. Select the 'eval-gate' environment and click 'Approve'.

Once approved, the evaluation results will be posted here automatically.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the agent's behavioral evaluations and prompt instructions to improve its interaction with the file system and Git. The changes aim to ensure safer and more precise file operations, particularly regarding file creation and Git staging, by adding new tests and refining agent guidance.

Highlights

  • New Behavioral Evaluations: Added a new behavioral evaluation suite to test file creation and the safe use of the write_file tool, ensuring files are created correctly and existing ones are not inadvertently overwritten. Also, a new test was added to the gitRepo evaluation to prevent agents from using broad git add . or git add -A commands.
  • Prompt Snippet Updates: Modified prompt snippets in packages/core/src/prompts/snippets.ts and packages/core/src/prompts/snippets.legacy.ts to explicitly instruct agents against using git add . or git add -A, promoting more precise file staging with git add <file>....
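The staging rule described in the highlights could be checked with something like the sketch below. This is only an illustration: `isBroadGitAdd` is a hypothetical helper, not code from this PR, and it assumes the eval has access to the raw shell command string the agent ran. It splits naively on shell operators, so quoted strings containing `;` or `|` and forms such as `git -C dir add .` are not handled.

```typescript
// Hypothetical helper (not from the PR): returns true when a shell command
// stages everything instead of naming specific files.
function isBroadGitAdd(command: string): boolean {
  // Split compound commands on &&, ||, ; and | so each part is checked alone.
  const parts = command.split(/&&|\|\||;|\|/);
  return parts.some((part) => {
    const tokens = part.trim().split(/\s+/);
    if (tokens[0] !== 'git' || tokens[1] !== 'add') return false;
    // `.` stages the whole working directory; -A / --all stage all changes.
    return tokens
      .slice(2)
      .some((arg) => arg === '.' || arg === '-A' || arg === '--all');
  });
}

console.log(isBroadGitAdd('git add .'));
console.log(isBroadGitAdd('git add src/index.ts'));
```

A path such as `./src/a.ts` is not flagged, because only a bare `.` argument counts as broad staging here.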
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions

github-actions Bot commented Apr 30, 2026

Size Change: +430 B (0%)

Total Size: 33.9 MB

| Filename | Size | Change |
| --- | --- | --- |
| ./bundle/chunk-4DGQCZNX.js | 0 B | -12.5 kB (removed) 🏆 |
| ./bundle/chunk-IOFKZKEL.js | 0 B | -657 kB (removed) 🏆 |
| ./bundle/chunk-LEY75LQP.js | 0 B | -2.72 MB (removed) 🏆 |
| ./bundle/chunk-NF4DZIIK.js | 0 B | -3.43 kB (removed) 🏆 |
| ./bundle/chunk-PZEZPDMI.js | 0 B | -3.8 kB (removed) 🏆 |
| ./bundle/chunk-U4PBE6HP.js | 0 B | -19.5 kB (removed) 🏆 |
| ./bundle/chunk-VA7435CI.js | 0 B | -49.2 kB (removed) 🏆 |
| ./bundle/chunk-ZULJCKWJ.js | 0 B | -14.7 MB (removed) 🏆 |
| ./bundle/core-IUJFYSPW.js | 0 B | -48.2 kB (removed) 🏆 |
| ./bundle/devtoolsService-XSJB4FYG.js | 0 B | -28 kB (removed) 🏆 |
| ./bundle/gemini-IMCJSXQJ.js | 0 B | -582 kB (removed) 🏆 |
| ./bundle/interactiveCli-JWEBBIIH.js | 0 B | -1.32 MB (removed) 🏆 |
| ./bundle/liteRtServerManager-CYCFDJV3.js | 0 B | -2.11 kB (removed) 🏆 |
| ./bundle/oauth2-provider-NNI322CS.js | 0 B | -9.16 kB (removed) 🏆 |
| ./bundle/chunk-6TTVQ74O.js | 657 kB | +657 kB (new file) 🆕 |
| ./bundle/chunk-7G2LL3P2.js | 2.72 MB | +2.72 MB (new file) 🆕 |
| ./bundle/chunk-FAGKMQRM.js | 49.2 kB | +49.2 kB (new file) 🆕 |
| ./bundle/chunk-KRHNQDMF.js | 19.5 kB | +19.5 kB (new file) 🆕 |
| ./bundle/chunk-NH6GW4O6.js | 3.8 kB | +3.8 kB (new file) 🆕 |
| ./bundle/chunk-NI2ARPJU.js | 14.7 MB | +14.7 MB (new file) 🆕 |
| ./bundle/chunk-QTQACY5S.js | 12.5 kB | +12.5 kB (new file) 🆕 |
| ./bundle/chunk-VZP5AA7G.js | 3.43 kB | +3.43 kB (new file) 🆕 |
| ./bundle/core-XIIOQWPY.js | 48.2 kB | +48.2 kB (new file) 🆕 |
| ./bundle/devtoolsService-6MSKBSCA.js | 28 kB | +28 kB (new file) 🆕 |
| ./bundle/gemini-WB4FMV3H.js | 582 kB | +582 kB (new file) 🆕 |
| ./bundle/interactiveCli-NZQP3FUM.js | 1.32 MB | +1.32 MB (new file) 🆕 |
| ./bundle/liteRtServerManager-KMRM4TZO.js | 2.11 kB | +2.11 kB (new file) 🆕 |
| ./bundle/oauth2-provider-4LFR37R6.js | 9.16 kB | +9.16 kB (new file) 🆕 |

ℹ️ View Unchanged

| Filename | Size | Change |
| --- | --- | --- |
| ./bundle/bundled/third_party/index.js | 8 MB | 0 B |
| ./bundle/chunk-34MYV7JD.js | 2.45 kB | 0 B |
| ./bundle/chunk-533APETE.js | 1.97 MB | 0 B |
| ./bundle/chunk-5AUYMPVF.js | 858 B | 0 B |
| ./bundle/chunk-5PS3AYFU.js | 1.18 kB | 0 B |
| ./bundle/chunk-664ZODQF.js | 124 kB | 0 B |
| ./bundle/chunk-DAHVX5MI.js | 206 kB | 0 B |
| ./bundle/chunk-IUUIT4SU.js | 56.5 kB | 0 B |
| ./bundle/chunk-RJTRUG2J.js | 39.8 kB | 0 B |
| ./bundle/cleanup-SIRDETV3.js | 0 B | -932 B (removed) 🏆 |
| ./bundle/devtools-36NN55EP.js | 696 kB | 0 B |
| ./bundle/dist-T73EYRDX.js | 356 B | 0 B |
| ./bundle/events-XB7DADIJ.js | 418 B | 0 B |
| ./bundle/examples/hooks/scripts/on-start.js | 188 B | 0 B |
| ./bundle/examples/mcp-server/example.js | 1.43 kB | 0 B |
| ./bundle/gemini.js | 5.14 kB | 0 B |
| ./bundle/getMachineId-bsd-TXG52NKR.js | 1.55 kB | 0 B |
| ./bundle/getMachineId-darwin-7OE4DDZ6.js | 1.55 kB | 0 B |
| ./bundle/getMachineId-linux-SHIFKOOX.js | 1.34 kB | 0 B |
| ./bundle/getMachineId-unsupported-5U5DOEYY.js | 1.06 kB | 0 B |
| ./bundle/getMachineId-win-6KLLGOI4.js | 1.72 kB | 0 B |
| ./bundle/memoryDiscovery-LIJKMASE.js | 980 B | 0 B |
| ./bundle/multipart-parser-KPBZEGQU.js | 11.7 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/client/main.js | 222 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/_client-assets.js | 229 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/index.js | 13.4 kB | 0 B |
| ./bundle/node_modules/@google/gemini-cli-devtools/dist/src/types.js | 132 B | 0 B |
| ./bundle/sandbox-macos-permissive-open.sb | 890 B | 0 B |
| ./bundle/sandbox-macos-permissive-proxied.sb | 1.31 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-open.sb | 3.36 kB | 0 B |
| ./bundle/sandbox-macos-restrictive-proxied.sb | 3.56 kB | 0 B |
| ./bundle/sandbox-macos-strict-open.sb | 4.82 kB | 0 B |
| ./bundle/sandbox-macos-strict-proxied.sb | 5.02 kB | 0 B |
| ./bundle/src-QVCVGIUX.js | 47 kB | 0 B |
| ./bundle/start-QADGKLJO.js | 0 B | -652 B (removed) 🏆 |
| ./bundle/tree-sitter-7U6MW5PS.js | 274 kB | 0 B |
| ./bundle/tree-sitter-bash-34ZGLXVX.js | 1.84 MB | 0 B |
| ./bundle/cleanup-RQWTYYB7.js | 932 B | +932 B (new file) 🆕 |
| ./bundle/start-3IMGVZEF.js | 652 B | +652 B (new file) 🆕 |

compressed-size-action

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces new behavioral evaluations for file creation and git repository management. It also updates the system prompts to explicitly discourage the use of broad git staging commands like git add . or git add -A, favoring specific file staging instead. Feedback on the new tests suggests parsing JSON tool arguments rather than using string inclusion checks to ensure the assertions are more robust and less prone to failure.

Note: Security Review is unavailable for this PR.

Comment thread evals/file_creation_behavior.eval.ts Outdated
Comment on lines +66 to +70
    const targetReadFileIndex = logs.findIndex(
      (log) =>
        log.toolRequest?.name === 'read_file' &&
        log.toolRequest.args.includes('config.json'),
    );
Contributor

high

Using String.prototype.includes() on a JSON string to check tool arguments is brittle. It's better to parse the JSON and inspect the properties directly. This makes the test more robust against formatting changes and less prone to being flaky, which is critical for maintaining a reliable CI pipeline.

Suggested change
    - const targetReadFileIndex = logs.findIndex(
    -   (log) =>
    -     log.toolRequest?.name === 'read_file' &&
    -     log.toolRequest.args.includes('config.json'),
    - );
    + const targetReadFileIndex = logs.findIndex((log) => {
    +   try {
    +     return log.toolRequest?.name === 'read_file' && JSON.parse(log.toolRequest.args).path === 'config.json';
    +   } catch { return false; }
    + });
References
  1. The toolRequest.args property is a JSON string and must be parsed using JSON.parse() before its properties can be accessed.
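To make the brittleness concrete, here is a minimal sketch comparing the two checks. The `ToolLog` shape is inferred from the snippet under review, and `isToolCallForPath` is a hypothetical helper name; the `path` argument key follows the reviewer's suggestion and is an assumption about the tool's actual schema.

```typescript
// Assumed log entry shape, inferred from the reviewed snippet.
interface ToolLog {
  toolRequest?: { name: string; args: string };
}

// Hypothetical helper: true when the entry is a call to `toolName` whose
// parsed `path` argument equals `targetPath` exactly.
function isToolCallForPath(
  log: ToolLog,
  toolName: string,
  targetPath: string,
): boolean {
  const request = log.toolRequest;
  if (request === undefined || request.name !== toolName) return false;
  try {
    const parsed: unknown = JSON.parse(request.args);
    return (
      typeof parsed === 'object' &&
      parsed !== null &&
      (parsed as { path?: unknown }).path === targetPath
    );
  } catch {
    return false; // Malformed JSON is treated as "no match".
  }
}

// A read of a different file whose name merely contains "config.json".
const backupRead: ToolLog = {
  toolRequest: {
    name: 'read_file',
    args: JSON.stringify({ path: 'backup/config.json.bak' }),
  },
};

// The substring check matches this entry; the parsed comparison does not.
const substringMatch = backupRead.toolRequest?.args.includes('config.json') ?? false;
const parsedMatch = isToolCallForPath(backupRead, 'read_file', 'config.json');
console.log({ substringMatch, parsedMatch });
```

The substring check would also match `config.json` appearing in any other argument, such as file content, which is why the exact property comparison is the safer assertion.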

Contributor Author

Done. Updated to use JSON parsing instead of includes() string checks.

Comment thread evals/file_creation_behavior.eval.ts Outdated
Comment on lines +72 to +76
    const targetWriteFileIndex = logs.findIndex(
      (log) =>
        log.toolRequest?.name === 'write_file' &&
        log.toolRequest.args.includes('config.json'),
    );
Contributor

high

Similar to the read_file check, using String.prototype.includes() here is brittle. Parsing the JSON arguments for write_file will make this assertion more robust and reliable.

Suggested change
    - const targetWriteFileIndex = logs.findIndex(
    -   (log) =>
    -     log.toolRequest?.name === 'write_file' &&
    -     log.toolRequest.args.includes('config.json'),
    - );
    + const targetWriteFileIndex = logs.findIndex((log) => {
    +   try {
    +     return log.toolRequest?.name === 'write_file' && JSON.parse(log.toolRequest.args).path === 'config.json';
    +   } catch { return false; }
    + });
References
  1. The toolRequest.args property is a JSON string and must be parsed using JSON.parse() before its properties can be accessed.
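Because the read_file and write_file lookups differ only in the tool name, the two suggestions could share one helper. The sketch below is an assumption about how that might look: `findToolCallIndex` and `wasReadBeforeWrite` are hypothetical names, and the read-before-write ordering check is a guess at how the two indices are used, since the eval's final assertion is not shown in this thread.

```typescript
// Assumed log entry shape, inferred from the reviewed snippets.
interface ToolLogEntry {
  toolRequest?: { name: string; args: string };
}

// Hypothetical helper: index of the first call to `toolName` whose parsed
// `path` argument equals `targetPath`, or -1 when there is none.
function findToolCallIndex(
  logs: ToolLogEntry[],
  toolName: string,
  targetPath: string,
): number {
  return logs.findIndex((log) => {
    const request = log.toolRequest;
    if (request === undefined || request.name !== toolName) return false;
    try {
      const parsed: unknown = JSON.parse(request.args);
      return (
        typeof parsed === 'object' &&
        parsed !== null &&
        (parsed as { path?: unknown }).path === targetPath
      );
    } catch {
      return false; // Unparseable arguments never match.
    }
  });
}

// Assumed ordering check: the file was read at some point before it was written.
function wasReadBeforeWrite(logs: ToolLogEntry[], targetPath: string): boolean {
  const readIndex = findToolCallIndex(logs, 'read_file', targetPath);
  const writeIndex = findToolCallIndex(logs, 'write_file', targetPath);
  return readIndex !== -1 && writeIndex !== -1 && readIndex < writeIndex;
}
```

Keeping the parsing in one place means a later change to the argument key only needs to be made once.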

Contributor Author

Done. Same change applied here to parse JSON arguments safely.

@gemini-cli gemini-cli Bot added the area/agent label (Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality) Apr 30, 2026
@akh64bit akh64bit added this pull request to the merge queue May 1, 2026
Merged via the queue into main with commit b3e6c28 May 1, 2026
28 of 29 checks passed
@akh64bit akh64bit deleted the feature/file-creation-behavior-eval branch May 1, 2026 03:58
TirthNaik-99 pushed a commit to TirthNaik-99/gemini-cli that referenced this pull request May 4, 2026
kimjune01 pushed a commit to kimjune01/gemini-cli-claude that referenced this pull request May 6, 2026