Source
Agent-Testing Agent (ATA): Meta-Agent for Adversarial Behavioral Testing
https://arxiv.org/abs/2508.17393 — August 2025
Summary
ATA is a meta-agent that combines static analysis, designer interrogation, and persona-driven adversarial test generation with adaptive difficulty controlled by an LLM-as-judge scoring rubric. It generates behavioral test cases for conversational agents rather than relying on hand-written scenarios.
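The adaptive-difficulty core of ATA can be sketched as a simple loop: run a scenario, have a judge score the response, and escalate scenarios the agent handles well. In this minimal sketch `run_agent`, `judge`, and `escalate` are stand-ins (a real judge would be a separate LLM applying the scoring rubric, and escalation would ask the generator LLM for a harder variant):

```python
# Minimal sketch of ATA-style adaptive difficulty. A high judge score
# means the agent handled the scenario well, so that scenario gets a
# harder variant in the next round; failures stay at the same level.

def run_agent(prompt: str) -> str:
    # Stub: a real harness would send the prompt to the agent under test.
    return f"response to: {prompt}"

def judge(prompt: str, response: str) -> float:
    # Stub rubric on a 0-1 scale: a real judge would be a separate LLM
    # scoring the response against the rubric.
    return 0.9 if "level-1" in prompt else 0.4

def escalate(prompt: str, level: int) -> str:
    # Derive a harder variant; real ATA would ask the generator LLM.
    return prompt.replace(f"level-{level}", f"level-{level + 1}")

def adaptive_round(seeds: list[str], threshold: float = 0.8) -> list[str]:
    """Run one round; scenarios the agent passes easily are escalated."""
    next_round = []
    for prompt in seeds:
        score = judge(prompt, run_agent(prompt))
        if score >= threshold:
            next_round.append(escalate(prompt, 1))  # passed: harder variant
        else:
            next_round.append(prompt)  # failed: keep probing at this level
    return next_round

print(adaptive_round(["recall an expired memory (level-1)"]))
```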
Applicability to Zeph
HIGH. Zeph's continuous improvement protocol (.claude/rules/continuous-improvement.md) explicitly requires live agent testing but currently relies entirely on manual scenario crafting. The gap between CI unit tests and real behavioral testing is the #1 bottleneck in the CI cycle.
Proposed integration
Build an ATA-style harness on top of AgentTestHarness (already in the codebase from ARCH-08):
- Catalog introspection: load Zeph's skill registry + tool definitions to seed scenario generation
- Scenario generation: use a separate LLM (e.g., summary_model) to generate adversarial prompts targeting:
  - Memory recall boundary conditions (just-expired memories, conflicting facts)
  - Tool invocation edge cases (large output → overflow, permission denial, tool chaining)
  - Skill matching precision (ambiguous queries that should/shouldn't match)
  - Security injection attempts (prompt injection in tool results, web scrape content)
- Adaptive difficulty: an LLM judge scores agent responses; scenarios that score high are escalated with harder variants
- Output: structured test cases in regressions.md format with expected behavior labels
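The generation and output steps above can be sketched as a small pipeline. The catalog shape, the templated prompt, and the regressions.md rendering here are all assumptions for illustration (the real generator would prompt an LLM such as summary_model with the catalog as context, and the actual regressions.md schema may differ):

```python
# Sketch: seed adversarial cases from a tool catalog, then render them
# as regressions.md-style entries with expected-behavior labels.

def seed_prompts(catalog: dict[str, list[str]]) -> list[dict]:
    # Stand-in for the generator LLM: template one adversarial prompt per
    # tool targeting the large-output overflow edge case.
    cases = []
    for tool in catalog.get("tools", []):
        cases.append({
            "category": "tools",
            "prompt": f"Call {tool} with output large enough to overflow the context",
            "expected": "truncates or summarizes tool output, no crash",
        })
    return cases

def to_regression_md(cases: list[dict]) -> str:
    # Hypothetical regressions.md-style rendering: one bullet per case,
    # with an indented expected-behavior label.
    lines = []
    for c in cases:
        lines.append(f"- [{c['category']}] {c['prompt']}")
        lines.append(f"  expected: {c['expected']}")
    return "\n".join(lines)

md = to_regression_md(seed_prompts({"tools": ["web_scrape"]}))
print(md)
```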
Location
- New binary or subcommand: zeph test-gen (or --test-gen)
- Stores generated scenarios in .local/testing/playbooks/generated/
- Integrates with AgentTestHarness for execution and response capture
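The execution step could look like the sketch below. AgentTestHarness's actual interface is not shown in this note, so the `run(prompt)` method and the one-JSON-file-per-case layout under .local/testing/playbooks/generated/ are assumptions:

```python
# Sketch: load generated playbooks from disk and execute them through an
# assumed AgentTestHarness.run(prompt) interface, capturing responses
# alongside the expected-behavior label for later judging.
import json
from pathlib import Path

class AgentTestHarness:  # stand-in for the ARCH-08 harness in the codebase
    def run(self, prompt: str) -> str:
        return f"captured response for: {prompt}"

def execute_generated(playbook_dir: Path, harness: AgentTestHarness) -> list[dict]:
    results = []
    for case_file in sorted(playbook_dir.glob("*.json")):
        case = json.loads(case_file.read_text())
        results.append({
            "case": case_file.name,
            "response": harness.run(case["prompt"]),
            "expected": case["expected"],
        })
    return results

# Usage, with a temp dir standing in for .local/testing/playbooks/generated/:
import tempfile
with tempfile.TemporaryDirectory() as d:
    case_path = Path(d) / "case-001.json"
    case_path.write_text(json.dumps({"prompt": "recall an expired memory",
                                     "expected": "declines gracefully"}))
    print(execute_generated(Path(d), AgentTestHarness()))
```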
Related
- AgentTestHarness (ARCH-08) — execution substrate
- regressions.md — generated adversarial prompts extend the regression catalog