Add behavioral evals for shell command tool usage and safety

## Summary
The current evaluation suite lacks coverage for behavioral patterns related to shell command usage and safety.

## Problem
There is no explicit validation for:
- Proper use of run_shell_command for shell-related tasks
- Avoidance of destructive commands like rm -rf without safeguards
- Correct tool selection when creating files (prefer write_file / create_file over shell commands)

## Proposed Solution
Add a new evaluation suite (evals/shell_command_safety.eval.ts) with behavioral tests that:
- Verify run_shell_command is used for file listing and disk usage queries
- Ensure destructive commands are not executed silently
- Enforce usage of dedicated file-writing tools for file creation

## Impact
- Improves safety guarantees
- Strengthens tool selection correctness
- Expands eval coverage

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add behavioral evals for shell command tool usage and safety #23920

Summary

Problem

Proposed Solution

Impact

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Add behavioral evals for shell command tool usage and safety #23920

Description

Summary

Problem

Proposed Solution

Impact

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions