Test tool calls, not just text output. YAML-based. Works with any LLM.
LLM test tools validate text output. But agents don't just generate text — they pick tools, handle failures, and process user data autonomously. One bad tool call → PII leak. One missed step → silent workflow failure.
AgentProbe tests what agents do, not just what they say.

    tests:
      - input: "Book a flight NYC → London, next Friday"
        expect:
          tool_called: search_flights
          tool_called_with: { origin: "NYC", dest: "LDN" }
          output_contains: "flight"
          no_pii_leak: true
          max_steps: 5

4 assertions. 1 YAML file. Zero boilerplate.
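
To make the idea of asserting on behavior concrete, here is a minimal sketch of how checks such as tool_called, tool_called_with, and max_steps could be evaluated against a recorded list of tool calls. The `ToolCall` and `Expectation` types and the `checkTrace` function are hypothetical names invented for this illustration; they are not AgentProbe's internals, and the partial-match semantics for arguments is an assumption.

```typescript
// Hypothetical shapes for illustration; AgentProbe's real internals may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface Expectation {
  tool_called?: string;
  tool_called_with?: Record<string, unknown>;
  max_steps?: number;
}

// Returns a list of failure messages; an empty list means every check passed.
function checkTrace(trace: ToolCall[], expect: Expectation): string[] {
  const failures: string[] = [];

  const name = expect.tool_called;
  if (name !== undefined && !trace.some((c) => c.name === name)) {
    failures.push(`expected tool "${name}" to be called`);
  }

  const wanted = expect.tool_called_with;
  if (wanted !== undefined) {
    const keys = Object.keys(wanted);
    // Assumed semantics: partial match, so extra arguments on the call are allowed.
    const matched = trace.some((c) =>
      keys.every((k) => JSON.stringify(c.args[k]) === JSON.stringify(wanted[k]))
    );
    if (!matched) failures.push("no tool call matched the expected arguments");
  }

  const limit = expect.max_steps;
  if (limit !== undefined && trace.length > limit) {
    failures.push(`took ${trace.length} steps, limit is ${limit}`);
  }

  return failures;
}

// A one-step trace that mirrors the YAML example above.
const trace: ToolCall[] = [
  { name: "search_flights", args: { origin: "NYC", dest: "LDN", cabin: "economy" } },
];

const failures = checkTrace(trace, {
  tool_called: "search_flights",
  tool_called_with: { origin: "NYC", dest: "LDN" },
  max_steps: 5,
});
console.log(failures.length === 0 ? "all checks passed" : failures.join("\n"));
```

Returning messages rather than a single boolean keeps each failed assertion reportable on its own line, similar in spirit to the per-check output a test runner prints.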

    npm install @neuzhou/agentprobe
    npx agentprobe init                                      # Scaffold test project
    npx agentprobe run examples/quickstart/test-mock.yaml    # Run first test

No API key needed for the mock adapter.

    import { AgentProbe } from '@neuzhou/agentprobe';

    const probe = new AgentProbe({ adapter: 'openai', model: 'gpt-4o' });

    const result = await probe.test({
      input: 'What is the capital of France?',
      expect: {
        output_contains: 'Paris',
        no_hallucination: true,
        latency_ms: { max: 3000 },
      },
    });

| | AgentProbe | Promptfoo | DeepEval |
|---|---|---|---|
| Tool call assertions | ✅ 6 types | ❌ | ❌ |
| Chaos & fault injection | ✅ | ❌ | ❌ |
| Contract testing | ✅ | ❌ | ❌ |
| Multi-agent orchestration | ✅ | ❌ | ❌ |
| Record & replay | ✅ | ❌ | ❌ |
| Security scanning | ✅ PII, injection, system leak | ✅ Red teaming | |
| LLM-as-Judge | ✅ Any model | ✅ | ✅ |
| YAML test definitions | ✅ | ✅ | ❌ Python only |
| CI/CD (JUnit, GH Actions) | ✅ | ✅ | ✅ |
Promptfoo tests prompts. DeepEval tests LLM outputs. AgentProbe tests agent behavior.

| Feature | Description |
|---|---|
| 🎯 Tool Call Assertions | tool_called, tool_called_with, no_tool_called, tool_call_order + 2 more |
| 💥 Chaos Testing | Inject tool timeouts, malformed responses, rate limits |
| 📜 Contract Testing | Enforce behavioral invariants across agent versions |
| 🤝 Multi-Agent Testing | Test handoff sequences in orchestrated pipelines |
| 🔴 Record & Replay | Record live sessions → generate tests → replay deterministically |
| 🛡️ Security Scanning | PII leak, prompt injection, system prompt exposure |
| 🧑‍⚖️ LLM-as-Judge | Use a stronger model to evaluate nuanced quality |
| 📊 HTML Reports | Self-contained dashboards with SVG charts |
| 🔄 Regression Detection | Compare against saved baselines |
| 🤖 12 Adapters | OpenAI, Anthropic, Google, Ollama, and 8 more |
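
As an illustration of the chaos testing row above, the following sketch wraps a tool so that it times out, hits a rate limit, or returns a malformed payload, and then checks that a toy agent step degrades gracefully. Everything here (`Tool`, `Fault`, `withFault`, `searchStep`) is a hypothetical name made up for this example, not AgentProbe's API, and tools are modeled as synchronous functions only to keep the sketch short; real tools would normally be async.

```typescript
// Hypothetical tool shape for this sketch; not AgentProbe's actual API.
type Tool<A, R> = (args: A) => R;
type Fault = "timeout" | "rate_limit" | "malformed";

// Wraps a tool so it fails in the chosen way instead of running normally.
function withFault<A, R>(tool: Tool<A, R>, fault: Fault | null): Tool<A, R> {
  return (args: A): R => {
    if (fault === "timeout") throw new Error("ETIMEDOUT: tool did not respond");
    if (fault === "rate_limit") throw new Error("429: rate limit exceeded");
    // Deliberately return a payload of the wrong shape.
    if (fault === "malformed") return { unexpected: true } as unknown as R;
    return tool(args);
  };
}

// A toy agent step that should fall back instead of crashing.
function searchStep(search: Tool<{ origin: string }, string[]>): string {
  try {
    const flights = search({ origin: "NYC" });
    if (!Array.isArray(flights)) return "fallback: unusable tool response";
    return "found " + flights.length + " flights";
  } catch {
    return "fallback: tool unavailable";
  }
}

// Stand-in for a real flight search tool.
const realSearch: Tool<{ origin: string }, string[]> = () => ["flight-a", "flight-b"];

const faults: Array<Fault | null> = [null, "timeout", "rate_limit", "malformed"];
for (const fault of faults) {
  console.log(fault ?? "none", "->", searchStep(withFault(realSearch, fault)));
}
```

The point of the wrapper approach is that the agent code under test stays unchanged; only the tool it receives is swapped for a faulty one.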
📖 Full Docs — 17+ assertion types, 12 adapters, 120+ CLI commands
📺 See it in action
$ agentprobe run tests/booking.yaml
🔬 Agent Booking Test
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Agent calls search_flights tool (12ms)
✅ Tool called with correct parameters (8ms)
✅ No PII leaked in response (3ms)
✅ Agent handles booking confirmation (15ms)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
4/4 passed (100%) in 38ms
4 assertions, 1 YAML file, zero boilerplate.
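
For a sense of what a check like "No PII leaked in response" might involve, here is a deliberately small sketch that scans agent output with two regular expressions. The `PiiRule` type, `findPii`, and `noPiiLeak` are hypothetical names for this illustration, and the two patterns are assumptions chosen for brevity; real PII detection needs far broader, locale-aware rules, and AgentProbe's own scanner may work differently.

```typescript
// Illustrative rules only; real PII detection needs many more patterns.
interface PiiRule {
  kind: string;
  pattern: RegExp;
}

const PII_RULES: PiiRule[] = [
  { kind: "email", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
  { kind: "us_ssn", pattern: /\b\d{3}-\d{2}-\d{4}\b/ },
];

// Returns the kinds of PII found in the output, in rule order.
function findPii(output: string): string[] {
  return PII_RULES.filter((rule) => rule.pattern.test(output)).map((rule) => rule.kind);
}

// True when none of the rules match.
function noPiiLeak(output: string): boolean {
  return findPii(output).length === 0;
}

console.log(noPiiLeak("Your flight to London is booked.")); // true
console.log(findPii("Confirmation sent to jane.doe@example.com")); // contains "email"
```

Reporting which kind of PII matched, rather than only pass or fail, makes a failing test easier to diagnose.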

    # .github/workflows/agent-tests.yml
    name: Agent Tests
    on: [push, pull_request]
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: NeuZhou/agentprobe@master
            with:
              test_dir: './tests'

- YAML behavioral testing · 17+ assertions · 12 adapters
- Tool mocking · Chaos testing · Contract testing
- Multi-agent · Record & replay · Security scanning
- HTML reports · JUnit output · GitHub Actions
- AWS Bedrock / Azure OpenAI adapters
- VS Code extension with test explorer
- Web dashboard for test results
- A/B testing for agent configurations
- Automated regression detection in CI
- Plugin marketplace for custom assertions
- OpenTelemetry trace integration
| Project | What it does |
|---|---|
| FinClaw | Self-evolving trading engine — 484 factors, genetic algorithm, walk-forward validated |
| ClawGuard | AI Agent Immune System — 480+ threat patterns, zero dependencies |
We welcome contributions! Here's how to get started:
- Pick an issue: look for "good first issue" labels
- Fork & clone

      git clone https://github.com/NeuZhou/agentprobe.git
      cd agentprobe && npm install && npm test

- Submit a PR: we review within 48 hours
CONTRIBUTING.md · Discord · Report Bug · Request Feature
