Confirmed ON >> OFF across all three frontier vendors (Anthropic / OpenAI / Google)
| Model | Vendor | ON | NC | OFF | ON-OFF |
|---|---|---|---|---|---|
| Sonnet 4.6 | Anthropic | 95.6% | 97.8% | 46.7% | +48.9pp |
| GPT-4o | OpenAI | 91.1% | 28.9% | 0.0% | +91.1pp |
| Gemini 3.1 FL | Google | 88.9% | 88.9% | 4.4% | +84.5pp |
Without metabolism, GPT-4o scores 0.0% on fact recall (it retrieves not a single fact). With metabolism: 91.1%.
- Survival equation: S = μ × e^{-δ×k} (formally proven in Lean 4) quantifies δ in nats
- External cognitive sleep resolves contradictions automatically while keeping them retrievable via temporal pair preservation
- gemma3:27b (n=3, 180 turns) also shows +52.2pp (21.1% → 73.3%), Kruskal-Wallis p = 0.027
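The survival equation can be rendered numerically — a minimal Python sketch, where the scale factor `k` defaults to 1.0 purely for illustration (the project's actual value is a config constant):

```python
import math

def survival(mu: float, delta: float, k: float = 1.0) -> float:
    """Survival potential S = mu * exp(-delta * k).

    mu    -- base capability (0..1)
    delta -- cumulative contradiction load, in nats
    k     -- scale factor (illustrative default, not the project's constant)
    """
    return mu * math.exp(-delta * k)

# Reducing delta has an exponential effect on S:
print(survival(1.0, 0.0))  # 1.0 (clean context)
print(survival(1.0, 1.0))  # ≈ 0.368
print(survival(1.0, 3.0))  # ≈ 0.050
```

This is why P2 below prioritizes delta resolution: each nat of contradiction removed recovers survival potential multiplicatively, not additively.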
Notes:
- Decoupled ablation: deepseek-r1:14b dialogue + Sonnet metabolism (n=1) → T30 overall: ON 73.3% / NC 73.3% / OFF 35.6% → evidence disentangling metabolism/resolver quality from the dialogue model's ceiling
- The qwen3:32b remote rerun was confounded by execution-design failures → deferred → recurrence prevention and interpretation rules are documented in REMOTE_STAGEB_GUARDRAILS.md
- Benchmark artifact and quarantine policy: raw data retained, disclosed in separate files →
  benchmark_integrity_audit_2026-04-03.md / evaluation_transparency_note_2026-04-03.md
Apache 2.0 / Ollama-first runtime + frontier API experiment support
LLMs don't break because context is long. They break because contradictions accumulate. DeltaZero gives them a sleep cycle to fix that.
When LLMs accumulate contradictory information in their context, reasoning accuracy collapses. This happens regardless of model size or context window length:
| Model | Vendor | Context | δ=0 (clean) | δ>0 (contradictions) | Drop | Note |
|---|---|---|---|---|---|---|
| GPT-4o-mini | OpenAI | 96K | 100% | 10.4% | -89.6pp | Near-total collapse |
| Gemini 2.5 Flash | Google | 64K | 100% | 0% | -100pp | Complete collapse |
| Gemini 3.1 FL | Google | 1M | 88.6% | 40.8% | -47.8pp | 1M window doesn't help |
| Sonnet 4 (prev) | Anthropic | 8K | 100% | 74.0% | -26.0pp | |
| Sonnet 4.6 | Anthropic | 128K | 100% | 100%* | 0pp | *Recognition only — see below |
| Llama 3.1:8b | Meta | 8K | 34.0% | 2.7% | -31.3pp | Small model baseline |
*Sonnet 4.6 recognizes all contradictions (recognition accuracy = 100%), but strict output parsing captures only 82.6% due to response format variability. This is a measurement artifact, not a model failure. See Exp35 analysis for details.
Key insight: Google's 1M-token context window still drops 47.8pp with contradictions. Making the window bigger doesn't help — the contradictions are still there. Sonnet 4.6 is the only model tested that resists, but even it doesn't manage the contradictions — it just tolerates them.
DeltaZero adds a "metabolism" layer around any LLM. Like human sleep consolidates memory, DeltaZero processes knowledge during idle time:
- Classify — Extract facts and rules from conversation
- Detect — Find contradictions between old and new information
- Resolve — Integrate conflicting knowledge (temporal changes, scope differences, direct contradictions)
- Forget — Demote stale or resolved information
The LLM itself is not modified. DeltaZero is Ollama-first for local runs, and the experiment harness also supports frontier API models.
11 paired comparisons across 8 open-source models (8B–27B):
- 9 ON wins / 1 OFF win / 1 TIE
- p = 0.0107 (one-sided sign test) — statistically significant
- Largest effect: +42.2pp (mistral-nemo:12b)
- Effect grows over time: 3 pairs flipped from OFF→ON between T90 and T180
| Condition | Trial 2 | Trial 3 | Trial 4 | Mean (SD) |
|---|---|---|---|---|
| Metabolism ON | 22/30 | 20/30 | 24/30 | 73.3% (6.7) |
| No contradictions (δ=0) | 18/30 | 15/30 | 18/30 | 56.7% (5.8) |
| Metabolism OFF | 5/30 | 8/30 | 6/30 | 21.1% (5.1) |
| Comparison | Difference | Nonparametric note | Cohen's d* |
|---|---|---|---|
| ON vs OFF | +52.2pp | Mann-Whitney p = 0.05 | d = 8.80 |
| ON vs NC | +16.7pp | Mann-Whitney p = 0.05 | d = 2.67 |
| NC vs OFF | +35.6pp | Mann-Whitney p = 0.05 | d = 6.53 |
ON > NC in all 3 trials (+4, +5, +6). This is not noise.
Global three-group test: Kruskal-Wallis p = 0.027. * Cohen's d is descriptive only at n = 3.
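The headline effect size is reproducible from the per-trial scores in the table above — a quick check using the standard pooled-SD formula for Cohen's d:

```python
import statistics

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with pooled sample standard deviation (equal group sizes)."""
    sa, sb = statistics.stdev(a), statistics.stdev(b)
    pooled = ((sa**2 + sb**2) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# Per-trial accuracies from the gemma3:27b table (out of 30 questions)
on  = [22/30*100, 20/30*100, 24/30*100]  # 73.3%, 66.7%, 80.0%
off = [ 5/30*100,  8/30*100,  6/30*100]  # 16.7%, 26.7%, 20.0%

print(round(cohens_d(on, off), 2))  # 8.8 — matches the table's d = 8.80
```

As the footnote says, a d of this size at n = 3 is descriptive only; the inferential claims rest on the sign test and Kruskal-Wallis results.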
The δ=0 condition was designed as the theoretical ceiling. Instead, ON exceeded it. Why?
When contradictions are injected, the Resolver preserves both old and new claims as linked pairs. These pairs act as anchors — they keep the original facts retrievable via vector search. Without contradictions (δ=0), facts gradually become buried under 180 turns of conversation and fall out of search results.
Metabolism provides two distinct benefits:
- Contradiction resolution (ON vs OFF, +52.2pp) — prevents collapse
- Knowledge anchoring (ON vs NC, +16.7pp) — prevents forgetting
For full analysis, see docs/context_rot_analysis.md.
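The anchoring effect can be illustrated with a toy retriever. This is purely illustrative — the real system uses vector search over ChromaDB, and recency ranking here merely stands in for relevance scoring — but it shows the mechanism: recency-only retrieval buries old facts, while pair-linked claims stay retrievable:

```python
def retrieve(topic: str, claims: list[dict], top_k: int = 2) -> list[dict]:
    """Toy retriever: rank matching claims by recency, but always keep
    claims that belong to a contradiction pair (the 'anchor' effect)."""
    matches = [c for c in claims if topic in c["text"]]
    recent = sorted(matches, key=lambda c: c["turn"], reverse=True)[:top_k]
    anchored = [c for c in matches if c.get("pair_id") is not None]
    seen, result = set(), []
    for c in recent + anchored:
        if id(c) not in seen:
            seen.add(id(c))
            result.append(c)
    return result

claims = [
    {"text": "budget is 5M", "turn": 3, "pair_id": 1},    # old claim, pair-linked
    {"text": "budget is 7M", "turn": 90, "pair_id": 1},   # newer claim, same pair
    {"text": "budget review note", "turn": 170, "pair_id": None},
    {"text": "budget draft comment", "turn": 180, "pair_id": None},
]
hits = retrieve("budget", claims)
# The turn-3 claim survives the recency cutoff only because it is pair-linked.
```

Without the `anchored` pass, the turn-3 fact would fall outside `top_k` after 180 turns — the exact failure mode the δ=0 (NC) condition exhibits.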
| Condition | Code | Overall |
|---|---|---|
| ON | New (temporal integration) | 57.8% |
| ON | Old (pre-temporal integration) | 8.9% |
| OFF | New | 15.6% |
| OFF | Old | 22.2% |
OFF is roughly unchanged across code versions (15.6% vs 22.2%), while ON jumps from 8.9% to 57.8% (+48.9pp). The only variable is the code version, isolating temporal integration as the cause of the gain.
User ──→ Dialogue (P1: read-only) ──→ Response
│
│ Conversation log accumulation
▼
┌──────────────────┐
│ Metabolism Pipeline │ ← Runs during idle time ("sleep")
│ │
│ 1. Extract │ Classify as fact/rule/preference
│ 2. Detect │ Pairwise contradiction detection via LLM
│ 3. Resolve │ Preserve contradiction pairs with temporal links
│ 4. Forget │ Demote rules unreferenced for 90 days
│ 5. Monitor │ S-value health check, auto-rollback on drops
└──────────────────┘
| Layer | Name | Role | Storage |
|---|---|---|---|
| L1 | Working | Current conversation context | deque + SQLite |
| L2 | Pending | Unprocessed conversation logs | SQLite |
| L3 | Active Logic | User values, rules, contradiction pairs | ChromaDB |
| L4 | Dormant Fact | Facts and demoted rules | SQLite + ChromaDB |
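The L3→L4 transition can be sketched minimally (illustrative only — real storage is SQLite + ChromaDB, and names here are hypothetical): rules unreferenced for 90 days move from Active Logic to Dormant Fact, per P4 below:

```python
from datetime import datetime, timedelta

TTL = timedelta(days=90)  # P4: let logic decay, preserve facts

def demote_stale(active_logic: list[dict], dormant: list[dict],
                 now: datetime) -> None:
    """Move rules unreferenced for TTL from L3 (Active) to L4 (Dormant)."""
    stale = [r for r in active_logic if now - r["last_referenced"] > TTL]
    for rule in stale:
        active_logic.remove(rule)
        dormant.append(rule)

now = datetime(2026, 4, 1)
l3 = [
    {"rule": "prefers concise replies", "last_referenced": datetime(2026, 3, 20)},
    {"rule": "old project naming rule", "last_referenced": datetime(2025, 12, 1)},
]
l4: list[dict] = []
demote_stale(l3, l4, now)
# Only the rule untouched for more than 90 days is demoted to L4.
```

Demotion rather than deletion keeps the rule recoverable: L4 is still indexed, so a later reference can promote it back.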
| Principle | Description |
|---|---|
| P1 | Read-only + Append-only during dialogue. Metabolism runs only when idle |
| P2 | Prioritize delta resolution. Reducing delta has exponential effect (S = μ × e^(-δ × k)) |
| P3 | Do not integrate low-confidence items |
| P4 | Let logic decay, preserve facts (90-day TTL demotion) |
| P5 | If it breaks, roll it back (pre-metabolism snapshots + auto-rollback) |
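P5 can be sketched as a minimal monitor (hypothetical names — the real implementation lives in `src/health/`): snapshot the store before metabolism, compare S afterwards, and restore on a sharp drop:

```python
import copy

class HealthMonitor:
    """Illustrative P5 sketch: pre-metabolism snapshot + auto-rollback."""

    def __init__(self, drop_threshold: float = 0.2):
        self.drop_threshold = drop_threshold
        self._snapshot = None
        self._s_before = None

    def before_metabolism(self, store, s_value: float) -> None:
        # Deep-copy so the snapshot is immune to in-place mutation
        self._snapshot = copy.deepcopy(store)
        self._s_before = s_value

    def after_metabolism(self, store, s_value: float):
        # If S dropped past the threshold, the metabolism run did harm:
        # discard its output and restore the pre-metabolism snapshot.
        if (self._s_before is not None
                and self._s_before - s_value > self.drop_threshold):
            return self._snapshot  # rollback
        return store
```

The threshold value is an assumption for illustration; the point is the invariant — metabolism can never leave the store in a worse state than it found it.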
- Python 3.12+
- Ollama (local LLM inference)
```bash
pip install -e .
```

For API-backed experiments:

```bash
pip install -e ".[cloud-llm]"
```

Run the system:

```bash
python src/main.py
```

Run the tests:

```bash
pytest tests/ -v
```

delta-zero/
├── src/
│ ├── core/ # config, ports, logger
│ ├── adapters/ # ollama, sqlite, chroma, embedding
│ ├── memory/ # 4-layer memory (L1-L4)
│ ├── dialogue/ # dialogue agent, temporal conflict formatting
│ ├── metabolism/ # metabolism pipeline
│ │ ├── extractor # knowledge classification + fact promotion
│ │ ├── resolver # contradiction detection + pair preservation
│ │ ├── demoter # 90-day TTL demotion (L3→L4)
│ │ └── garbage # processed log deletion
│ ├── health/ # S-value monitoring, snapshots, auto-rollback
│ ├── scheduler.py # dialogue/metabolism mode switching
│ └── main.py
├── tests/ # pytest suite
├── scripts/
│ └── experiment_runner.py # controlled experiment runner
├── config/ # experiment configurations (8 models)
└── docs/ # analysis reports, pitch materials
- Survival Equation: S = μ × e^(-δ × SCALE_FACTOR) — survival potential under cumulative contradiction δ
- Paper 3: "Cognitive Sleep for LLMs" — this system's experimental validation
- Paper 1: delta-survival-papers — survival equation S = μ × e^{-δ} (Lean 4 formal proofs, `sorry = 0`, `axiom = 0`)
- OSF Project: osf.io/mdh7b — all papers, data, and code in one place
- Key insight: Context rot is caused by contradiction accumulation, not context length
- delta-prune — Lightweight middleware version: scan and clean contradictions before sending to any LLM API
- delta-survival-papers — Survival Equation paper and Lean 4 formal proofs
Apache License 2.0