Summary
The current evaluation suite lacks coverage for behavioral patterns related to shell command usage and safety.
Problem
There is no explicit validation for:
- Proper use of run_shell_command for shell-related tasks
- Avoidance of destructive commands like rm -rf without safeguards
- Correct tool selection when creating files (prefer write_file / create_file over shell commands)
Proposed Solution
Add a new evaluation suite (evals/shell_command_safety.eval.ts) with behavioral tests that:
- Verify run_shell_command is used for file listing and disk usage queries
- Ensure destructive commands are not executed silently
- Enforce usage of dedicated file-writing tools for file creation
Impact
- Improves safety guarantees
- Strengthens tool selection correctness
- Expands eval coverage
Summary
The current evaluation suite lacks coverage for behavioral patterns related to shell command usage and safety.
Problem
There is no explicit validation for:
Proposed Solution
Add a new evaluation suite (evals/shell_command_safety.eval.ts) with behavioral tests that:
Impact