🔬 AgentProbe

Playwright for AI Agents

Test tool calls, not just text output. YAML-based. Works with any LLM.

Quick Start · Why? · Comparison · Docs · Discord

Why AgentProbe?

LLM test tools validate text output. But agents don't just generate text — they pick tools, handle failures, and process user data autonomously. One bad tool call → PII leak. One missed step → silent workflow failure.

AgentProbe tests what agents do, not just what they say.

tests:
  - input: "Book a flight NYC → London, next Friday"
    expect:
      tool_called: search_flights
      tool_called_with: { origin: "NYC", dest: "LDN" }
      output_contains: "flight"
      no_pii_leak: true
      max_steps: 5

5 assertions. 1 YAML file. Zero boilerplate.
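To make the idea concrete, here is a minimal sketch of how expectations like `tool_called`, `tool_called_with`, and `max_steps` could be evaluated against a recorded list of tool calls. This is not AgentProbe's implementation: the `ToolCall` and `Expectation` types, the `checkTrace` function, and the matching rule (expected arguments must be a subset of the actual arguments) are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; AgentProbe's real trace format may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface Expectation {
  tool_called?: string;
  tool_called_with?: Record<string, unknown>;
  max_steps?: number;
}

// Returns human-readable failures; an empty array means every check passed.
function checkTrace(trace: ToolCall[], expect: Expectation): string[] {
  const failures: string[] = [];

  // Only calls to the expected tool (if one is named) are candidates for argument matching.
  const candidates =
    expect.tool_called !== undefined
      ? trace.filter((call) => call.name === expect.tool_called)
      : trace;

  if (expect.tool_called !== undefined && candidates.length === 0) {
    failures.push(`expected tool "${expect.tool_called}" to be called`);
  }

  if (expect.tool_called_with !== undefined) {
    const wanted = Object.entries(expect.tool_called_with);
    // Assumption: extra arguments are allowed; values are compared with ===,
    // so this sketch only handles primitive argument values.
    const matched = candidates.some((call) =>
      wanted.every(([key, value]) => call.args[key] === value)
    );
    if (!matched) {
      failures.push(`no call matched arguments ${JSON.stringify(expect.tool_called_with)}`);
    }
  }

  // Assumption: each recorded tool call counts as one step.
  if (expect.max_steps !== undefined && trace.length > expect.max_steps) {
    failures.push(`expected at most ${expect.max_steps} steps, got ${trace.length}`);
  }

  return failures;
}

const trace: ToolCall[] = [
  { name: "search_flights", args: { origin: "NYC", dest: "LDN", date: "next Friday" } },
];

const failures = checkTrace(trace, {
  tool_called: "search_flights",
  tool_called_with: { origin: "NYC", dest: "LDN" },
  max_steps: 5,
});
console.log(failures.length === 0 ? "pass" : failures.join("\n")); // prints "pass"
```

Returning a list of failures rather than a single boolean lets a runner report every broken expectation in one pass.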

⚡ Quick Start

npm install @neuzhou/agentprobe
npx agentprobe init                                    # Scaffold test project
npx agentprobe run examples/quickstart/test-mock.yaml  # Run first test

No API key needed for the mock adapter.

Programmatic API

import { AgentProbe } from '@neuzhou/agentprobe';

// Top-level await: run this as an ES module (for example a .mjs file).
const probe = new AgentProbe({ adapter: 'openai', model: 'gpt-4o' });
const result = await probe.test({
  input: 'What is the capital of France?',
  expect: {
    output_contains: 'Paris',
    no_hallucination: true,
    latency_ms: { max: 3000 }, // 3000 ms = 3 seconds
  },
});

How AgentProbe Compares

Feature	AgentProbe	Promptfoo	DeepEval
Tool call assertions	✅ 6 types	❌	❌
Chaos & fault injection	✅	❌	❌
Contract testing	✅	❌	❌
Multi-agent orchestration	✅	❌	❌
Record & replay	✅	❌	❌
Security scanning	✅ PII, injection, system leak	✅ Red teaming	⚠️ Basic
LLM-as-Judge	✅ Any model	✅	✅
YAML test definitions	✅	✅	❌ Python only
CI/CD (JUnit, GH Actions)	✅	✅	✅

Promptfoo tests prompts. DeepEval tests LLM outputs. AgentProbe tests agent behavior.

Features


🎯 Tool Call Assertions	`tool_called`, `tool_called_with`, `no_tool_called`, `tool_call_order` + 2 more
💥 Chaos Testing	Inject tool timeouts, malformed responses, rate limits
📜 Contract Testing	Enforce behavioral invariants across agent versions
🤝 Multi-Agent Testing	Test handoff sequences in orchestrated pipelines
🔴 Record & Replay	Record live sessions → generate tests → replay deterministically
🛡️ Security Scanning	PII leak, prompt injection, system prompt exposure
🧑‍⚖️ LLM-as-Judge	Use a stronger model to evaluate nuanced quality
📊 HTML Reports	Self-contained dashboards with SVG charts
🔄 Regression Detection	Compare against saved baselines
🤖 12 Adapters	OpenAI, Anthropic, Google, Ollama, and 8 more
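As an illustration of the `tool_call_order` assertion named in the first row above, the sketch below checks that an expected sequence of tool names appears in order within the calls an agent actually made. The function name `calledInOrder` and the choice of subsequence semantics (other calls may be interleaved) are assumptions; AgentProbe may define ordering more strictly.

```typescript
// Returns true if `expected` occurs as an ordered subsequence of `actual`.
// Assumption: unrelated tool calls between the expected ones are allowed.
function calledInOrder(actual: string[], expected: string[]): boolean {
  let next = 0; // index of the next expected tool name we are waiting for
  for (const name of actual) {
    if (next < expected.length && name === expected[next]) {
      next++;
    }
  }
  return next === expected.length;
}

const calls = ["search_flights", "get_weather", "book_flight"];
console.log(calledInOrder(calls, ["search_flights", "book_flight"])); // true
console.log(calledInOrder(calls, ["book_flight", "search_flights"])); // false
```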

📖 Full Docs — 17+ assertion types, 12 adapters, 120+ CLI commands
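The chaos testing feature listed above injects tool timeouts and malformed responses. A rough sketch of the underlying idea, assuming a tool is just an async function: wrap it so that it misbehaves on demand, then check that the agent logic degrades gracefully. The names `withFault`, `Fault`, and `searchWithFallback` are hypothetical and are not part of AgentProbe's API.

```typescript
type Tool<A, R> = (args: A) => Promise<R>;
type Fault = "none" | "timeout" | "malformed";

// Wraps a tool so that it fails in a controlled way (hypothetical helper).
function withFault<A, R>(tool: Tool<A, R>, fault: Fault, timeoutMs = 50): Tool<A, R> {
  return async (args: A): Promise<R> => {
    switch (fault) {
      case "timeout":
        // Simulate a slow tool that eventually gives up.
        await new Promise<void>((resolve) => setTimeout(resolve, timeoutMs));
        throw new Error("tool timed out");
      case "malformed":
        // Return a payload that does not match the declared result shape.
        return "<<not json>>" as unknown as R;
      default:
        return tool(args);
    }
  };
}

interface FlightQuery { origin: string; dest: string; }
interface FlightResult { flights: string[]; }

// Stand-in for the agent step under test: it must survive a misbehaving tool.
async function searchWithFallback(search: Tool<FlightQuery, FlightResult>): Promise<string> {
  try {
    const result = await search({ origin: "NYC", dest: "LDN" });
    if (!result || !Array.isArray(result.flights)) {
      return "Sorry, the flight search returned an unreadable response.";
    }
    return `Found ${result.flights.length} flight(s).`;
  } catch {
    return "Sorry, flight search is unavailable right now.";
  }
}

const realSearch: Tool<FlightQuery, FlightResult> = async () => ({ flights: ["BA178"] });

for (const fault of ["none", "timeout", "malformed"] as const) {
  searchWithFallback(withFault(realSearch, fault)).then((msg) =>
    console.log(`${fault}: ${msg}`)
  );
}
```

Wrapping the tool rather than the agent keeps the fault local to one dependency, so a failing test points at how the agent handled that specific tool.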

📺 See it in action

$ agentprobe run tests/booking.yaml

  🔬 Agent Booking Test
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ✅ Agent calls search_flights tool (12ms)
  ✅ Tool called with correct parameters (8ms)
  ✅ No PII leaked in response (3ms)
  ✅ Agent handles booking confirmation (15ms)
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  4/4 passed (100%) in 38ms

4 assertions, 1 YAML file, zero boilerplate.

🚀 GitHub Action

# .github/workflows/agent-tests.yml
name: Agent Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: NeuZhou/agentprobe@master
        with:
          test_dir: './tests'

Roadmap

🌐 Also Check Out

Project	What it does
FinClaw	Self-evolving trading engine — 484 factors, genetic algorithm, walk-forward validated
ClawGuard	AI Agent Immune System — 480+ threat patterns, zero dependencies

Contributing

We welcome contributions! Here's how to get started:

1. Pick an issue: look for the `good first issue` label

2. Fork & clone

git clone https://github.com/NeuZhou/agentprobe.git
cd agentprobe && npm install && npm test

3. Submit a PR: we review within 48 hours

CONTRIBUTING.md · Discord · Report Bug · Request Feature

License

MIT © NeuZhou
