Skip to content

Add behavioral evals for shell command tool usage and safety #23920

@Rahu378

Description

@Rahu378

Summary

The current evaluation suite lacks coverage for behavioral patterns related to shell command usage and safety.

Problem

There is no explicit validation for:

  • Proper use of run_shell_command for shell-related tasks
  • Avoidance of destructive commands like rm -rf without safeguards
  • Correct tool selection when creating files (prefer write_file / create_file over shell commands)

Proposed Solution

Add a new evaluation suite (evals/shell_command_safety.eval.ts) with behavioral tests that:

  • Verify run_shell_command is used for file listing and disk usage queries
  • Ensure destructive commands are not executed silently
  • Enforce usage of dedicated file-writing tools for file creation

Impact

  • Improves safety guarantees
  • Strengthens tool selection correctness
  • Expands eval coverage

Metadata

Metadata

Assignees

Labels

area/agentIssues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Qualitystatus/need-triageIssues that need to be triaged by the triage automation.

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions