Reproducible multi-agent experiments,
from hypothesis to paper-ready results

Run ablations across strategies and seeds, replay executions from checkpoints, evaluate automatically, and export publication-ready tables —
without building custom experiment infrastructure.

Research infrastructure
shouldn't be your research

Multi-agent experiments require orchestration, evaluation, reproducibility, and statistical analysis. Most researchers build this from scratch for every paper — then throw it away.

Most research time goes to infrastructure instead of science — orchestrating runs, tracking seeds, formatting tables.

No major agent framework ships with built-in experiment grids, checkpoint replay, and publication export.

Real tokens and money are wasted when multi-agent experiments crash and must restart from scratch without checkpoints.

What researchers actually need

Compare strategies

Swap reasoning patterns with one parameter and run the same task across conditions. Six built-in strategies: ReAct, plan-and-execute, critic, reflection, consensus, debate.

agent = Agent(
  strategy="debate",  # swap to compare
  max_iterations=6,
)

Run seeds and sweeps

ExperimentGrid runs the Cartesian product of conditions and seeds and collects results automatically. Parallel execution with durable checkpoints across every condition.

grid = ExperimentGrid(
  conditions={
    "strategy": ["react", "debate"],
  },
  seeds=[42, 123, 456],
)
results = await grid.run()
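A grid like this expands to the Cartesian product of conditions and seeds. As a rough sketch of the run list it produces (illustrative only, not JamJet internals):

```python
from itertools import product

conditions = {"strategy": ["react", "debate"]}
seeds = [42, 123, 456]

# Every (condition-combination, seed) pair becomes one run.
keys = list(conditions)
runs = [
    dict(zip(keys, values), seed=seed)
    for values in product(*(conditions[k] for k in keys))
    for seed in seeds
]
print(len(runs))  # 2 strategies x 3 seeds = 6 runs
```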

Export paper-ready results

Generate LaTeX booktabs tables, CSV for R/pandas, or structured JSON. Mean ± std computed automatically. Significance tests built in.

results.to_latex("table1.tex")
results.to_csv("results.csv")
results.compare("debate", "react")  # p-value
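The per-condition mean ± std in those exports is just an aggregation over seeds. A minimal stdlib sketch of the same computation (the scores dict is hypothetical, one judge score per seed):

```python
from statistics import mean, stdev

# Hypothetical judge scores per strategy, one per seed.
scores = {
    "react": [0.67, 0.71, 0.75],
    "debate": [0.87, 0.89, 0.91],
}

for strategy, vals in scores.items():
    # Sample std over seeds, formatted the way the tables report it.
    print(f"{strategy}: {mean(vals):.2f} \u00b1 {stdev(vals):.2f}")
```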

Replay exact executions

Reproduce a failed or interesting run from checkpoints instead of rebuilding it manually. Fork with modified inputs for ablation studies.

$ jamjet replay exec_abc
$ jamjet fork exec_abc \
  --override-input '{"model":"gemini"}'
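Replay works because each execution is event-sourced: state is a fold over recorded events, so restoring a checkpoint means re-applying the log up to that point. A toy illustration of the idea (not JamJet's actual event schema):

```python
from functools import reduce

# Hypothetical recorded events for one execution.
events = [
    {"type": "node_started", "node": "draft"},
    {"type": "tokens_used", "count": 1200},
    {"type": "node_started", "node": "critique"},
    {"type": "tokens_used", "count": 800},
]

def apply(state, event):
    # Fold one event into the accumulated state.
    if event["type"] == "tokens_used":
        state["tokens"] += event["count"]
    elif event["type"] == "node_started":
        state["nodes"].append(event["node"])
    return state

state = reduce(apply, events, {"tokens": 0, "nodes": []})
# Replaying the same log always yields the same state: deterministic recovery.
```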

Evaluate inside the workflow

Use judge-based, assertion-based, latency, and cost scoring in the same runtime. Eval nodes run inline — during execution, not after.

# workflow.yaml
check:
  type: eval
  on_fail: retry_with_feedback
  max_retries: 2

Start from a scaffold

One command to scaffold a complete experiment: agents, baselines, evaluation datasets, experiment runner, and results directory.

$ jamjet init my-study \
  --template research
# agents/ baselines/ experiments/
# evals/ results/ workflow.yaml

Start from a recipe

Debate vs ReAct benchmark

Compare adversarial and reactive reasoning on your dataset.

Reflection ablation

Test self-improving loops across models and iteration counts.

Model swap from checkpoint

Fork a completed run with a different model — instant ablation.

Judge-loop evaluation

Score, critique, and retry until quality passes a threshold.

Most agent frameworks prioritize apps
over experimental reproducibility

Capability                | JamJet                  | LangGraph                  | AutoGen                    | Custom scripts
Multi-agent orchestration | Native                  | Native                     | Native                     | Possible with custom setup
Durable replay            | Native                  | Possible with custom setup | Possible with custom setup | Possible with custom setup
Strategy comparison       | 6 native strategies     | Possible with custom setup | Possible with custom setup | Possible with custom setup
Experiment grid           | Native                  | Possible with custom setup | Possible with custom setup | Possible with custom setup
LaTeX / CSV export        | Native                  | Possible with custom setup | Possible with custom setup | Possible with custom setup
Checkpoint fork           | Native                  | Possible with custom setup | Possible with custom setup | Possible with custom setup
Built-in eval harness     | Native                  | External tooling required  | External tooling required  | Possible with custom setup
Per-node cost tracking    | Native                  | Partial                    | Partial                    | Possible with custom setup
Statistical comparison    | Native (Welch's t-test) | Possible with custom setup | Possible with custom setup | Possible with custom setup
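Welch's t-test, listed in the last row, compares two condition means without assuming equal variances. A self-contained sketch of the statistic and the Welch-Satterthwaite degrees of freedom (standard formulas, independent of JamJet):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)   # sample variances
    se2 = va / na + vb / nb             # squared standard error of the difference
    t = (mean(a) - mean(b)) / se2 ** 0.5
    # Welch-Satterthwaite approximation for degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

Looking t up against Student's t distribution with df degrees of freedom gives the p-value; SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same test end to end.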

From hypothesis to Methods section

1

Scaffold

jamjet init --template research

2 min
2

Define agents

Tools, strategies, instructions

15 min
3

Run experiments

ExperimentGrid across conditions

automated
4

Export results

LaTeX tables, CSV, statistical tests

1 command
5

Reproduce

jamjet replay from checkpoint

exact

One research afternoon, end to end

1

Compare 6 strategies on your dataset

grid = ExperimentGrid(
  conditions={"strategy": ["react", "plan_and_execute",
    "critic", "reflection", "consensus", "debate"]},
  seeds=[42, 123, 456],
)
results = await grid.run()
2

Export a LaTeX table for your paper

results.to_latex("table1.tex", caption="Strategy comparison")
# Outputs booktabs table with mean ± std per condition
3

Replay a failed condition — no re-running prior steps

$ jamjet replay exec_debate_seed42
# Restores from checkpoint. Saves tokens + cost.
4

Compute significance between conditions

results.compare("debate", "react")
# => {"p_value": 0.023, "effect_size": 0.41, "significant": True}
5

Fork for an ablation study

$ jamjet fork exec_debate_seed42 \
  --override-input '{"model":"gpt-4o"}'
# Same execution, different model. Instant ablation.

The same durability that makes agents reliable in production makes experiments reproducible in research. See the quickstart →

What a result looks like

Task: summarize a 2,000-word policy document. 6 strategies, 3 seeds each. Scored by LLM-judge (0–1). Local Ollama, Llama 3.

Strategy         | Score (mean ± std) | Tokens | Latency | Cost
react            | 0.71 ± 0.04        | 1,240  | 2.1s    | $0.002
plan_and_execute | 0.78 ± 0.03        | 1,890  | 3.4s    | $0.003
critic           | 0.82 ± 0.05        | 2,410  | 4.2s    | $0.004
reflection       | 0.84 ± 0.02        | 3,100  | 5.8s    | $0.005
consensus        | 0.86 ± 0.03        | 4,520  | 7.1s    | $0.007
debate           | 0.89 ± 0.02        | 5,880  | 9.3s    | $0.009

debate vs. react: p = 0.012 (Welch's t-test, n = 3 seeds). This table was generated by results.to_latex("table1.tex") — zero manual formatting.

Illustrative results from internal testing. Your numbers will vary by model, task, and hardware.

Why not just scripts?

Custom scripts work for one-off experiments. They break down when you need to reproduce, compare, or build on prior work.

Custom scripts

  • Reproducibility depends on discipline, not tooling
  • No checkpoint — a crash reruns everything from scratch
  • Manual experiment matrix loops with ad-hoc seed handling
  • Result formatting is copy-paste or custom code
  • No built-in cost tracking — discovered after the bill
  • Comparing strategies requires rewriting orchestration code

JamJet

  • Every execution event-sourced — replay from any checkpoint
  • Crash recovery built in — resume exactly where it stopped
  • ExperimentGrid handles conditions × seeds automatically
  • One call to to_latex(), to_csv(), or to_json()
  • Per-node token and cost tracking, visible in real time
  • Change strategy="debate" to strategy="react" — same agent, different reasoning

Patterns from published research

arXiv 2603.08852

LLM Delegate Protocol

Identity-aware agent routing with quality scores, governed sessions, and provenance tracking. JamJet integration via ProtocolAdapter trait.

agent routing identity provenance
arXiv 2603.11781

Deliberative Collective Intelligence

Structured multi-agent deliberation with four reasoning archetypes and typed epistemic acts. Patterns now available as JamJet strategies and examples.

multi-agent deliberation archetypes

Built for how you work

Multi-agent systems

AAMAS, NeurIPS workshops

Orchestration + evaluation + reproducibility

LLM reasoning

CoT, ToT, debate, reflection

Strategy parameter makes A/B testing trivial

Tool-augmented LLMs

ReAct, Toolformer

MCP-native tool integration

AI safety & alignment

HITL, guardrails

Human-in-the-loop + policy engine

Evaluation & benchmarks

AgentBench, GAIA

Eval harness + batch runner + CI gates

Agent communication

Negotiation, persuasion

Native A2A + LDP protocol support

Start your experiment

From pip install to running multi-agent experiments in under 5 minutes.

$ pip install jamjet && jamjet init my-study --template research
Read the quickstart Browse examples