Reproducible multi-agent experiments,
from hypothesis to paper-ready results
Run ablations across strategies and seeds, replay executions from checkpoints,
evaluate automatically, and export publication-ready tables —
without building custom experiment infrastructure.
Research infrastructure
shouldn't be your research
Multi-agent experiments require orchestration, evaluation, reproducibility, and statistical analysis. Most researchers build this from scratch for every paper — then throw it away.
of research time goes to infrastructure instead of science — orchestrating runs, tracking seeds, formatting tables.
major agent frameworks with built-in experiment grids, checkpoint replay, and publication export
token and cost waste when multi-agent experiments crash and must restart from scratch without checkpoints
What researchers actually need
Compare strategies
Swap reasoning patterns with one parameter and run the same task across conditions. Six built-in strategies: ReAct, plan-and-execute, critic, reflection, consensus, debate.
```python
agent = Agent(
    strategy="debate",  # swap to compare
    max_iterations=6,
)
```
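As a rough illustration of why a strategy parameter makes comparisons cheap, here is a minimal dispatch-by-name sketch (not JamJet's internals; `run_agent` and the placeholder strategies are hypothetical):

```python
# Minimal sketch: a registry maps strategy names to reasoning loops,
# so switching conditions is a one-string change.

def react(task: str) -> str:
    return f"react:{task}"    # placeholder for a ReAct loop

def debate(task: str) -> str:
    return f"debate:{task}"   # placeholder for a multi-agent debate

STRATEGIES = {"react": react, "debate": debate}

def run_agent(task: str, strategy: str = "react") -> str:
    return STRATEGIES[strategy](task)
```

The same task can then be run under every condition by iterating over `STRATEGIES`.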
Run seeds and sweeps
ExperimentGrid executes Cartesian combinations and collects results automatically. Parallel execution with durable checkpoints across every condition.
```python
grid = ExperimentGrid(
    conditions={
        "strategy": ["react", "debate"],
    },
    seeds=[42, 123, 456],
)
results = await grid.run()
```
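The "Cartesian combinations" the grid executes can be sketched with the standard library (an illustration of the expansion, not JamJet's implementation; `expand_grid` is a hypothetical helper):

```python
from itertools import product

def expand_grid(conditions: dict, seeds: list) -> list[dict]:
    """Expand conditions x seeds into one run spec per grid cell."""
    keys = list(conditions)
    runs = []
    for values in product(*(conditions[k] for k in keys)):
        for seed in seeds:
            runs.append({**dict(zip(keys, values)), "seed": seed})
    return runs

runs = expand_grid({"strategy": ["react", "debate"]}, seeds=[42, 123, 456])
# 2 strategies x 3 seeds = 6 independent runs
```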
Export paper-ready results
Generate LaTeX booktabs tables, CSV for R/pandas, or structured JSON. Mean ± std computed automatically. Significance tests built in.
```python
results.to_latex("table1.tex")
results.to_csv("results.csv")
results.compare(A, B)  # p-value
```
Replay exact executions
Reproduce a failed or interesting run from checkpoints instead of rebuilding it manually. Fork with modified inputs for ablation studies.
```shell
$ jamjet replay exec_abc
$ jamjet fork exec_abc \
    --override-input '{"model":"gemini"}'
```
Evaluate inside the workflow
Use judge-based, assertion-based, latency, and cost scoring in the same runtime. Eval nodes run inline — during execution, not after.
```yaml
# workflow.yaml
check:
  type: eval
  on_fail: retry_with_feedback
  max_retries: 2
```
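The `retry_with_feedback` semantics implied by the config can be sketched as a score-gate loop (hypothetical `generate` and `judge` callables, not JamJet's API):

```python
# Sketch: run, score, and retry with the judge's feedback until the
# score clears the threshold or retries are exhausted.

def run_with_eval(generate, judge, threshold=0.8, max_retries=2):
    feedback = None
    for attempt in range(max_retries + 1):
        output = generate(feedback)
        score = judge(output)
        if score >= threshold:
            return output, score, attempt  # passed the quality gate
        feedback = f"score {score:.2f} below {threshold}; revise"
    return output, score, attempt          # retries exhausted

# Toy generator that improves once it receives feedback
def generate(feedback):
    return "revised draft" if feedback else "first draft"

def judge(output):
    return 0.9 if output == "revised draft" else 0.5

out, score, attempts = run_with_eval(generate, judge)
```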
Start from a scaffold
One command to scaffold a complete experiment: agents, baselines, evaluation datasets, experiment runner, and results directory.
```shell
$ jamjet init my-study \
    --template research
# agents/ baselines/ experiments/
# evals/ results/ workflow.yaml
```
Start from a recipe
Compare adversarial and reactive reasoning on your dataset.
Test self-improving loops across models and iteration counts.
Fork a completed run with a different model — instant ablation.
Score, critique, and retry until quality passes a threshold.
Most agent frameworks prioritize apps
over experimental reproducibility
| Capability | JamJet | LangGraph | AutoGen | Custom scripts |
|---|---|---|---|---|
| Multi-agent orchestration | Native | Native | Native | Possible with custom setup |
| Durable replay | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| Strategy comparison | 6 native strategies | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| Experiment grid | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| LaTeX / CSV export | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| Checkpoint fork | Native | Possible with custom setup | Possible with custom setup | Possible with custom setup |
| Built-in eval harness | Native | External tooling required | External tooling required | Possible with custom setup |
| Per-node cost tracking | Native | Partial | Partial | Possible with custom setup |
| Statistical comparison | Native (Welch's t-test) | Possible with custom setup | Possible with custom setup | Possible with custom setup |
From hypothesis to Methods section
Scaffold
jamjet init --template research
Define agents
Tools, strategies, instructions
15 min
Run experiments
ExperimentGrid across conditions
Export results
LaTeX tables, CSV, statistical tests
1 command
Reproduce
jamjet replay from checkpoint
One research afternoon, end to end
Compare 6 strategies on your dataset
```python
grid = ExperimentGrid(
    conditions={"strategy": [
        "react", "plan_and_execute", "critic",
        "reflection", "consensus", "debate",
    ]},
    seeds=[42, 123, 456],
)
results = await grid.run()
```
Export a LaTeX table for your paper
```python
results.to_latex("table1.tex", caption="Strategy comparison")
# Outputs booktabs table with mean +/- std per condition
```
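What "mean ± std per condition" looks like as a booktabs row can be sketched with the standard library (illustrative only; `booktabs_row` and the scores are hypothetical, not the exporter's code):

```python
from statistics import mean, stdev

def booktabs_row(name: str, scores: list[float]) -> str:
    """Render one LaTeX table row: condition, mean +/- sample std."""
    return f"{name} & {mean(scores):.2f} $\\pm$ {stdev(scores):.2f} \\\\"

row = booktabs_row("debate", [0.87, 0.89, 0.91])
```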
Replay a failed condition — no re-running prior steps
```shell
$ jamjet replay exec_debate_seed42
# Restores from checkpoint. Saves tokens + cost.
```
Compute significance between conditions
```python
results.compare("debate", "react")
# => {p_value: 0.023, effect_size: 0.41, significant: true}
```
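The p-value comes from Welch's t-test; the statistic and Welch-Satterthwaite degrees of freedom can be sketched from summary statistics (hypothetical scores; converting t and df to a p-value additionally needs a t-distribution CDF, e.g. `scipy.stats`):

```python
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df

# Toy per-seed scores for two conditions
t, df = welch_t([0.87, 0.89, 0.91], [0.67, 0.71, 0.75])
```

Welch's variant is the appropriate default here because conditions need not share a variance.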
Fork for an ablation study
```shell
$ jamjet fork exec_debate_seed42 \
    --override-input '{"model":"gpt-4o"}'
# Same execution, different model. Instant ablation.
```
The same durability that makes agents reliable in production makes experiments reproducible in research. See the quickstart →
What a result looks like
Task: summarize a 2,000-word policy document. 6 strategies, 3 seeds each. Scored by LLM-judge (0–1). Local Ollama, Llama 3.
| Strategy | Score (mean ± std) | Tokens | Latency | Cost |
|---|---|---|---|---|
| react | 0.71 ± 0.04 | 1,240 | 2.1s | $0.002 |
| plan_and_execute | 0.78 ± 0.03 | 1,890 | 3.4s | $0.003 |
| critic | 0.82 ± 0.05 | 2,410 | 4.2s | $0.004 |
| reflection | 0.84 ± 0.02 | 3,100 | 5.8s | $0.005 |
| consensus | 0.86 ± 0.03 | 4,520 | 7.1s | $0.007 |
| debate | 0.89 ± 0.02 | 5,880 | 9.3s | $0.009 |
debate vs. react: p = 0.012 (Welch's t-test, n = 3 seeds). This table was generated by results.to_latex("table1.tex") — zero manual formatting.
Illustrative results from internal testing. Your numbers will vary by model, task, and hardware.
Why not just scripts?
Custom scripts work for one-off experiments. They break down when you need to reproduce, compare, or build on prior work.
Custom scripts
- Reproducibility depends on discipline, not tooling
- No checkpoint — a crash reruns everything from scratch
- Manual experiment matrix loops with ad-hoc seed handling
- Result formatting is copy-paste or custom code
- No built-in cost tracking — discovered after the bill
- Comparing strategies requires rewriting orchestration code
JamJet
- Every execution event-sourced — replay from any checkpoint
- Crash recovery built in — resume exactly where it stopped
- `ExperimentGrid` handles conditions × seeds automatically
- One call to `to_latex()`, `to_csv()`, or `to_json()`
- Per-node token and cost tracking, visible in real time
- Change `strategy="debate"` to `strategy="react"` — same agent, different reasoning
Patterns from published research
LLM Delegate Protocol
Identity-aware agent routing with quality scores, governed sessions, and provenance tracking. JamJet integration via ProtocolAdapter trait.
Deliberative Collective Intelligence
Structured multi-agent deliberation with four reasoning archetypes and typed epistemic acts. Patterns now available as JamJet strategies and examples.
Built for how you work
Multi-agent systems
AAMAS, NeurIPS workshops
Orchestration + evaluation + reproducibility
LLM reasoning
CoT, ToT, debate, reflection
Strategy parameter makes A/B testing trivial
Tool-augmented LLMs
ReAct, Toolformer
MCP-native tool integration
AI safety & alignment
HITL, guardrails
Human-in-the-loop + policy engine
Evaluation & benchmarks
AgentBench, GAIA
Eval harness + batch runner + CI gates
Agent communication
Negotiation, persuasion
Native A2A + LDP protocol support
Start your experiment
From pip install to running multi-agent experiments in under 5 minutes.