Reproducible multi-agent experiments,
from hypothesis to paper-ready results

Run ablations across strategies and seeds, replay executions from checkpoints, evaluate automatically, and export publication-ready tables —
without building custom experiment infrastructure.

Research infrastructure
shouldn't be your research

Multi-agent experiments require orchestration, evaluation, reproducibility, and statistical analysis. Most researchers build this from scratch for every paper — then throw it away.

Most research time goes to infrastructure instead of science — orchestrating runs, tracking seeds, formatting tables.

No major agent framework ships with built-in experiment grids, checkpoint replay, and publication export.

Real tokens and money are wasted when multi-agent experiments crash and must restart from scratch without checkpoints.

What researchers actually need

Compare strategies

Swap reasoning patterns with one parameter and run the same task across conditions. Six built-in strategies: ReAct, plan-and-execute, critic, reflection, consensus, debate.

agent = Agent(
  strategy="debate",  # swap to compare
  max_iterations=6,
)

Run seeds and sweeps

ExperimentGrid runs the Cartesian product of conditions and seeds and collects results automatically. Parallel execution with durable checkpoints across every condition.

grid = ExperimentGrid(
  conditions={
    "strategy": ["react", "debate"],
  },
  seeds=[42, 123, 456],
)
results = await grid.run()
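A grid like this expands to the Cartesian product of conditions and seeds. As a rough sketch of the run list it produces (illustrative only, not JamJet internals):

```python
from itertools import product

conditions = {"strategy": ["react", "debate"]}
seeds = [42, 123, 456]

# Every (condition-combination, seed) pair becomes one run.
keys = list(conditions)
runs = [
    dict(zip(keys, values), seed=seed)
    for values in product(*(conditions[k] for k in keys))
    for seed in seeds
]
print(len(runs))  # 2 strategies x 3 seeds = 6 runs
```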

Export paper-ready results

Generate LaTeX booktabs tables, CSV for R/pandas, or structured JSON. Mean ± std computed automatically. Significance tests built in.

results.to_latex("table1.tex")
results.to_csv("results.csv")
results.compare("debate", "react")  # p-value
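The per-condition mean ± std in those exports is just an aggregation over seeds. A minimal stdlib sketch of the same computation (the scores dict is hypothetical, one judge score per seed):

```python
from statistics import mean, stdev

# Hypothetical judge scores per strategy, one per seed.
scores = {
    "react": [0.67, 0.71, 0.75],
    "debate": [0.87, 0.89, 0.91],
}

for strategy, vals in scores.items():
    # Sample std over seeds, formatted the way the tables report it.
    print(f"{strategy}: {mean(vals):.2f} \u00b1 {stdev(vals):.2f}")
```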

Replay exact executions

Reproduce a failed or interesting run from checkpoints instead of rebuilding it manually. Fork with modified inputs for ablation studies.

$ jamjet replay exec_abc
$ jamjet fork exec_abc \
  --override-input '{"model":"gemini"}'
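Replay works because each execution is event-sourced: state is a fold over recorded events, so restoring a checkpoint means re-applying the log up to that point. A toy illustration of the idea (not JamJet's actual event schema):

```python
from functools import reduce

# Hypothetical recorded events for one execution.
events = [
    {"type": "node_started", "node": "draft"},
    {"type": "tokens_used", "count": 1200},
    {"type": "node_started", "node": "critique"},
    {"type": "tokens_used", "count": 800},
]

def apply(state, event):
    # Fold one event into the accumulated state.
    if event["type"] == "tokens_used":
        state["tokens"] += event["count"]
    elif event["type"] == "node_started":
        state["nodes"].append(event["node"])
    return state

state = reduce(apply, events, {"tokens": 0, "nodes": []})
# Replaying the same log always yields the same state: deterministic recovery.
```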

Evaluate inside the workflow

Use judge-based, assertion-based, latency, and cost scoring in the same runtime. Eval nodes run inline — during execution, not after.

# workflow.yaml
check:
  type: eval
  on_fail: retry_with_feedback
  max_retries: 2

Start from a scaffold

One command to scaffold a complete experiment: agents, baselines, evaluation datasets, experiment runner, and results directory.

$ jamjet init my-study \
  --template research
# agents/ baselines/ experiments/
# evals/ results/ workflow.yaml

Start from a recipe

Debate vs ReAct benchmark

Compare adversarial and reactive reasoning on your dataset.

Reflection ablation

Test self-improving loops across models and iteration counts.

Model swap from checkpoint

Fork a completed run with a different model — instant ablation.

Judge-loop evaluation

Score, critique, and retry until quality passes a threshold.

Most agent frameworks prioritize apps
over experimental reproducibility

Capability                | JamJet                  | LangGraph                  | AutoGen                    | Custom scripts
Multi-agent orchestration | Native                  | Native                     | Native                     | Possible with custom setup
Durable replay            | Native                  | Possible with custom setup | Possible with custom setup | Possible with custom setup
Strategy comparison       | 6 native strategies     | Possible with custom setup | Possible with custom setup | Possible with custom setup
Experiment grid           | Native                  | Possible with custom setup | Possible with custom setup | Possible with custom setup
LaTeX / CSV export        | Native                  | Possible with custom setup | Possible with custom setup | Possible with custom setup
Checkpoint fork           | Native                  | Possible with custom setup | Possible with custom setup | Possible with custom setup
Built-in eval harness     | Native                  | External tooling required  | External tooling required  | Possible with custom setup
Per-node cost tracking    | Native                  | Partial                    | Partial                    | Possible with custom setup
Statistical comparison    | Native (Welch's t-test) | Possible with custom setup | Possible with custom setup | Possible with custom setup
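Welch's t-test, listed in the last row, compares two condition means without assuming equal variances. A self-contained sketch of the statistic and the Welch-Satterthwaite degrees of freedom (standard formulas, independent of JamJet):

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)   # sample variances
    se2 = va / na + vb / nb             # squared standard error of the difference
    t = (mean(a) - mean(b)) / se2 ** 0.5
    # Welch-Satterthwaite approximation for degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

Looking t up against Student's t distribution with df degrees of freedom gives the p-value; SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same test end to end.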

From hypothesis to Methods section

1

Scaffold

jamjet init --template research

2 min
2

Define agents

Tools, strategies, instructions

15 min
3

Run experiments

ExperimentGrid across conditions

automated
4

Export results

LaTeX tables, CSV, statistical tests

1 command
5

Reproduce

jamjet replay from checkpoint

exact

One research afternoon, end to end

1

Compare 6 strategies on your dataset

grid = ExperimentGrid(
  conditions={"strategy": ["react", "plan_and_execute",
    "critic", "reflection", "consensus", "debate"]},
  seeds=[42, 123, 456],
)
results = await grid.run()
2

Export a LaTeX table for your paper

results.to_latex("table1.tex", caption="Strategy comparison")
# Outputs booktabs table with mean ± std per condition
3

Replay a failed condition — no re-running prior steps

$ jamjet replay exec_debate_seed42
# Restores from checkpoint. Saves tokens + cost.
4

Compute significance between conditions

results.compare("debate", "react")
# => {"p_value": 0.023, "effect_size": 0.41, "significant": True}
5

Fork for an ablation study

$ jamjet fork exec_debate_seed42 \
  --override-input '{"model":"gpt-4o"}'
# Same execution, different model. Instant ablation.

The same durability that makes agents reliable in production makes experiments reproducible in research. See the quickstart →

What a result looks like

Task: summarize a 2,000-word policy document. 6 strategies, 3 seeds each. Scored by LLM-judge (0–1). Local Ollama, Llama 3.

Strategy         | Score (mean ± std) | Tokens | Latency | Cost
react            | 0.71 ± 0.04        | 1,240  | 2.1s    | $0.002
plan_and_execute | 0.78 ± 0.03        | 1,890  | 3.4s    | $0.003
critic           | 0.82 ± 0.05        | 2,410  | 4.2s    | $0.004
reflection       | 0.84 ± 0.02        | 3,100  | 5.8s    | $0.005
consensus        | 0.86 ± 0.03        | 4,520  | 7.1s    | $0.007
debate           | 0.89 ± 0.02        | 5,880  | 9.3s    | $0.009

debate vs. react: p = 0.012 (Welch's t-test, n = 3 seeds). This table was generated by results.to_latex("table1.tex") — zero manual formatting.

Illustrative results from internal testing. Your numbers will vary by model, task, and hardware.

Why not just scripts?

Custom scripts work for one-off experiments. They break down when you need to reproduce, compare, or build on prior work.

Custom scripts

  • Reproducibility depends on discipline, not tooling
  • No checkpoint — a crash reruns everything from scratch
  • Manual experiment matrix loops with ad-hoc seed handling
  • Result formatting is copy-paste or custom code
  • No built-in cost tracking — discovered after the bill
  • Comparing strategies requires rewriting orchestration code

JamJet

  • Every execution event-sourced — replay from any checkpoint
  • Crash recovery built in — resume exactly where it stopped
  • ExperimentGrid handles conditions × seeds automatically
  • One call to to_latex(), to_csv(), or to_json()
  • Per-node token and cost tracking, visible in real time
  • Change strategy="debate" to strategy="react" — same agent, different reasoning

Patterns from published research

arXiv 2603.08852

LLM Delegate Protocol

Identity-aware agent routing with quality scores, governed sessions, and provenance tracking. JamJet integration via ProtocolAdapter trait.

agent routing identity provenance
arXiv 2603.11781

Deliberative Collective Intelligence

Structured multi-agent deliberation with four reasoning archetypes and typed epistemic acts. Patterns now available as JamJet strategies and examples.

multi-agent deliberation archetypes

Built for how you work

Multi-agent systems

AAMAS, NeurIPS workshops

Orchestration + evaluation + reproducibility

LLM reasoning

CoT, ToT, debate, reflection

Strategy parameter makes A/B testing trivial

Tool-augmented LLMs

ReAct, Toolformer

MCP-native tool integration

AI safety & alignment

HITL, guardrails

Human-in-the-loop + policy engine

Evaluation & benchmarks

AgentBench, GAIA

Eval harness + batch runner + CI gates

Agent communication

Negotiation, persuasion

Native A2A + LDP protocol support

Start your experiment

From pip install to running multi-agent experiments in under 5 minutes.

$ pip install jamjet && jamjet init my-study --template research
Read the quickstart Browse examples