Experiment code and data for:

**From Debate to Deliberation: Structured Collective Reasoning with Typed Epistemic Acts**
Sunil Prakash · arXiv:2603.11781
Multi-agent LLM systems typically interact through unstructured debate, majority voting, or rigid orchestration pipelines. None of these model deliberation — a phased process where differentiated participants exchange typed reasoning moves, preserve disagreements, and converge on an explicit outcome.
DCI treats collective reasoning as a first-class computational object:
| Component | Description |
|---|---|
| 4 Reasoning Archetypes | Framer (structures the problem), Explorer (generates alternatives), Challenger (stress-tests proposals), Integrator (synthesizes toward decision) |
| 14 Typed Epistemic Acts | propose, challenge, evidence, reframe, synthesize, concede, object, qualify, defer, escalate, poll, commit, dissent, reopen |
| Phased Sessions | Opening → Divergence → Convergence → Closure, with explicit phase transition rules |
| Shared Workspace | Tension register, option table, evidence log — all agents read/write a structured state |
| DCI-CF Algorithm | Convergent flow that always terminates, producing a decision packet with: selected option, residual objections, minority report, reopen conditions |
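A minimal sketch of how such a move grammar can be enforced, using the archetypes, acts, and phases from the table above. The names below are hypothetical; the repository's actual schema lives in `src/grammar/moves.py`:

```python
from dataclasses import dataclass

# Hypothetical vocabulary mirroring the table above; not the repository's schema.
ARCHETYPES = {"framer", "explorer", "challenger", "integrator"}
ACTS = {
    "propose", "challenge", "evidence", "reframe", "synthesize", "concede",
    "object", "qualify", "defer", "escalate", "poll", "commit", "dissent", "reopen",
}
PHASES = ("opening", "divergence", "convergence", "closure")

@dataclass
class Move:
    archetype: str  # which reasoning archetype speaks
    act: str        # which typed epistemic act is performed
    phase: str      # session phase in which the move occurs
    content: str    # free-text payload of the move

def validate(move: Move) -> bool:
    """Reject moves outside the archetype/act/phase vocabulary."""
    return (
        move.archetype in ARCHETYPES
        and move.act in ACTS
        and move.phase in PHASES
    )
```

Restricting every utterance to a typed move is what lets the shared workspace and the DCI-CF algorithm treat the conversation as structured state rather than free text.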
Evaluated on 45 tasks across 7 domains with 8 conditions (185 scored runs, 388 total JSONL-logged runs):
| Condition | n | Quality (0-10) |
|---|---|---|
| DCI (full) | 40 | 8.24 |
| Unstructured Debate | 25 | 8.43 |
| Majority Voting | 25 | 8.83 |
| Self-Consistency | 25 | 8.69 |
| Single Agent | 25 | 8.89 |
| Ablation: No Archetypes | 15 | 8.73 |
| Ablation: No Grammar | 15 | 8.29 |
| Ablation: No DCI-CF | 15 | 8.31 |
Finding: On non-routine tasks (n=40), DCI significantly outperforms unstructured debate (+0.95, 95% CI [+0.41, +1.54]). DCI excels on hidden-profile tasks that require integrating partial perspectives (9.56, the highest score of any system in any domain) but fails on routine decisions (5.39), confirming strong task-dependence. However, DCI consumes roughly 62x the tokens of a single agent.
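This README does not state how the 95% CI for the score difference was computed; a percentile bootstrap over per-task scores is one standard way to obtain such an interval. A sketch only, not the repository's procedure:

```python
import random

def bootstrap_mean_diff_ci(a, b, iters=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(a) - mean(b).

    Sketch only: resample each group with replacement, record the mean
    difference, and take the empirical (alpha/2, 1 - alpha/2) quantiles.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(iters):
        sa = [rng.choice(a) for _ in a]
        sb = [rng.choice(b) for _ in b]
        diffs.append(sum(sa) / len(sa) - sum(sb) / len(sb))
    diffs.sort()
    lo_idx = int((alpha / 2) * iters)
    hi_idx = int((1 - alpha / 2) * iters) - 1
    return diffs[lo_idx], diffs[hi_idx]
```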
| Domain | Tasks | Description |
|---|---|---|
| Architectural Decision | 10 | Software architecture tradeoff analysis |
| Policy Analysis | 10 | Organizational and technology policy decisions |
| Hidden Profile | 5 | Decisions requiring combination of distributed information |
| Late Evidence | 5 | Decisions disrupted by new contradictory evidence |
| Risk Analysis | 5 | Risk-identification-heavy decisions |
| Routine Decision | 5 | Simple decisions (negative control) |
| Disagreement Decision | 5 | Decisions with legitimate expert disagreement |
```
dci-research/
├── src/                          # DCI framework implementation
│   ├── agents/                   # Delegate agents with archetype prompts
│   │   ├── archetypes.py         # Framer, Explorer, Challenger, Integrator
│   │   ├── base.py               # Base agent interface
│   │   └── llm_client.py         # LLM provider abstraction
│   ├── workflow/                 # DCI-CF session management
│   │   ├── dci_cf.py             # Convergent flow algorithm
│   │   └── session.py            # Phased session orchestration
│   ├── workspace/                # Shared workspace state
│   │   └── state.py              # Tension register, option table, evidence log
│   ├── grammar/                  # 14 typed epistemic acts
│   │   └── moves.py              # Move schema and validation
│   ├── scoring/                  # Convergence scoring
│   │   └── convergence.py        # Termination conditions
│   └── baselines/                # 4 baseline implementations
│       ├── single_agent.py
│       ├── unstructured_debate.py
│       ├── voting.py
│       └── self_consistency.py
├── experiments/                  # Experiment infrastructure
│   ├── runners/                  # Automated experiment execution
│   ├── evaluation/               # LLM-as-judge scoring pipeline
│   ├── analysis/                 # Results analysis + LaTeX table generation
│   ├── human_eval/               # Human evaluation protocol
│   └── configs/                  # Experiment configurations
├── benchmarks/                   # Task definitions
│   └── tasks.py                  # 45 tasks across 7 domains
├── results/                      # Experiment data
│   ├── expanded_results.json     # All 185 scored experiment results
│   ├── logs/                     # 22 JSONL files (388 logged runs)
│   └── tables/                   # Summary statistics per condition/domain
├── run_all_experiments.py        # Main experiment runner
├── run_expanded_experiments.py   # Extended 5-domain experiments
├── run_cross_judge.py            # Cross-model judge validation
├── run_diverse_council.py        # Diverse council experiments
├── smoke_test.py                 # Quick validation test
├── .env.example                  # API key template
└── requirements.txt
```
```shell
git clone https://github.com/sunilp/dci-research.git
cd dci-research
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Configure API keys
cp .env.example .env
# Edit .env with your Anthropic and/or Google Gemini API keys
```

```shell
# Quick smoke test (1 task, 1 condition)
python smoke_test.py

# Full experiment suite
python run_all_experiments.py

# Extended 5-domain experiments
python run_expanded_experiments.py

# Cross-model judge validation
python run_cross_judge.py
```

```python
import json
from collections import defaultdict

# Load all results
with open("results/expanded_results.json") as f:
    results = json.load(f)

# Per-condition averages
by_cond = defaultdict(list)
for r in results:
    score = r["scores"]["overall"]
    if score is not None:
        by_cond[r["condition"]].append(float(score))

for cond, scores in sorted(by_cond.items()):
    print(f"{cond:30s} n={len(scores):3d} mean={sum(scores)/len(scores):.2f}")
```

Each entry in `expanded_results.json`:
```json
{
  "condition": "dci",
  "task_id": "hidden-03",
  "scores": {
    "overall": 9.0,
    "reasoning_depth": 8.0,
    "risk_identification": 9.0,
    "actionability": 8.0
  },
  "tokens": 45230,
  "llm_calls": 12,
  "rounds": 3,
  "latency_ms": 89450,
  "convergence_method": "consensus",
  "decision": "..."
}
```

- LDP (Lightweight Delegation Protocol): arXiv:2603.08852. DCI provides the reasoning layer; LDP provides the delegation protocol for inter-agent communication.
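Given the entry format above, per-domain averages can be grouped by task ID prefix. A sketch that assumes the domain is encoded as the prefix of `task_id` (as in `hidden-03`), which is not guaranteed by this README:

```python
from collections import defaultdict

def per_domain_means(results):
    """Average overall score per domain, grouping scored runs by the
    prefix of task_id (e.g. "hidden-03" -> "hidden"); unscored runs
    (overall is null) are skipped."""
    by_domain = defaultdict(list)
    for r in results:
        score = r["scores"]["overall"]
        if score is not None:
            by_domain[r["task_id"].rsplit("-", 1)[0]].append(float(score))
    return {domain: sum(s) / len(s) for domain, s in by_domain.items()}

# Usage with the repository's data file:
# import json
# with open("results/expanded_results.json") as f:
#     print(per_domain_means(json.load(f)))
```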
```bibtex
@article{prakash2026dci,
  title={From Debate to Deliberation: Structured Collective Reasoning
         with Typed Epistemic Acts},
  author={Prakash, Sunil},
  journal={arXiv preprint arXiv:2603.11781},
  year={2026}
}
```

MIT