Confirmed ON >> OFF across all three frontier vendors (Anthropic / OpenAI / Google)
| Model | Vendor | ON | NC | OFF | ON-OFF |
|---|---|---|---|---|---|
| Sonnet 4.6 | Anthropic | 95.6% | 97.8% | 46.7% | +48.9pp |
| GPT-4o | OpenAI | 91.1% | 28.9% | 0.0% | +91.1pp |
| Gemini 3.1 FL | Google | 88.9% | 88.9% | 4.4% | +84.5pp |
Without metabolism, GPT-4o scores 0.0% on fact recall (it retrieves not a single fact). With metabolism: 91.1%.
- Survival equation: S = μ × e^{-δ×k} (formally proven in Lean 4) quantifies δ in nats
- External cognitive sleep resolves contradictions automatically while keeping them retrievable via temporal pair preservation
- gemma3:27b (n=3, 180 turns) also shows +52.2pp (21.1% → 73.3%), Kruskal-Wallis p = 0.027
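The survival equation can be rendered numerically — a minimal Python sketch, where the scale factor `k` defaults to 1.0 purely for illustration (the project's actual value is a config constant):

```python
import math

def survival(mu: float, delta: float, k: float = 1.0) -> float:
    """Survival potential S = mu * exp(-delta * k).

    mu    -- base capability (0..1)
    delta -- cumulative contradiction load, in nats
    k     -- scale factor (illustrative default, not the project's constant)
    """
    return mu * math.exp(-delta * k)

# Reducing delta has an exponential effect on S:
print(survival(1.0, 0.0))  # 1.0 (clean context)
print(survival(1.0, 1.0))  # ≈ 0.368
print(survival(1.0, 3.0))  # ≈ 0.050
```

This is why P2 below prioritizes delta resolution: each nat of contradiction removed recovers survival potential multiplicatively, not additively.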
Notes:
- Decoupled ablation: deepseek-r1:14b dialogue + Sonnet metabolism (n=1) → T30 overall: ON 73.3% / NC 73.3% / OFF 35.6% → evidence disentangling metabolism/resolver quality from the dialogue model's ceiling
- The qwen3:32b remote rerun was confounded by execution-design failures → deferred → recurrence prevention and interpretation rules are documented in REMOTE_STAGEB_GUARDRAILS.md
- Benchmark artifact and quarantine policy: raw data retained, disclosed in separate files →
  benchmark_integrity_audit_2026-04-03.md / evaluation_transparency_note_2026-04-03.md
Apache 2.0 / Ollama-first runtime + frontier API experiment support
LLMs don't break because context is long. They break because contradictions accumulate. DeltaZero gives them a sleep cycle to fix that.
When LLMs accumulate contradictory information in their context, reasoning accuracy collapses. This happens regardless of model size or context window length:
| Model | Vendor | Context | δ=0 (clean) | δ>0 (contradictions) | Drop | Note |
|---|---|---|---|---|---|---|
| GPT-4o-mini | OpenAI | 96K | 100% | 10.4% | -89.6pp | Near-total collapse |
| Gemini 2.5 Flash | Google | 64K | 100% | 0% | -100pp | Complete collapse |
| Gemini 3.1 FL | Google | 1M | 88.6% | 40.8% | -47.8pp | 1M window doesn't help |
| Sonnet 4 (prev) | Anthropic | 8K | 100% | 74.0% | -26.0pp | |
| Sonnet 4.6 | Anthropic | 128K | 100% | 100%* | 0pp | *Recognition only — see below |
| Llama 3.1:8b | Meta | 8K | 34.0% | 2.7% | -31.3pp | Small model baseline |
*Sonnet 4.6 recognizes all contradictions (recognition accuracy = 100%), but strict output parsing captures only 82.6% due to response format variability. This is a measurement artifact, not a model failure. See Exp35 analysis for details.
Key insight: Google's 1M-token context window still drops 47.8pp with contradictions. Making the window bigger doesn't help — the contradictions are still there. Sonnet 4.6 is the only model tested that resists, but even it doesn't manage the contradictions — it just tolerates them.
DeltaZero adds a "metabolism" layer around any LLM. Like human sleep consolidates memory, DeltaZero processes knowledge during idle time:
- Classify — Extract facts and rules from conversation
- Detect — Find contradictions between old and new information
- Resolve — Integrate conflicting knowledge (temporal changes, scope differences, direct contradictions)
- Forget — Demote stale or resolved information
The LLM itself is not modified. DeltaZero is Ollama-first for local runs, and the experiment harness also supports frontier API models.
11 paired comparisons across 8 open-source models (8B–27B):
- 9 ON wins / 1 OFF win / 1 TIE
- p = 0.0107 (one-sided sign test) — statistically significant
- Largest effect: +42.2pp (mistral-nemo:12b)
- Effect grows over time: 3 pairs flipped from OFF→ON between T90 and T180
| Condition | Trial 2 | Trial 3 | Trial 4 | Mean (SD) |
|---|---|---|---|---|
| Metabolism ON | 22/30 | 20/30 | 24/30 | 73.3% (6.7) |
| No contradictions (δ=0) | 18/30 | 15/30 | 18/30 | 56.7% (5.8) |
| Metabolism OFF | 5/30 | 8/30 | 6/30 | 21.1% (5.1) |
| Comparison | Difference | Nonparametric note | Cohen's d* |
|---|---|---|---|
| ON vs OFF | +52.2pp | Mann-Whitney p = 0.05 | d = 8.80 |
| ON vs NC | +16.7pp | Mann-Whitney p = 0.05 | d = 2.67 |
| NC vs OFF | +35.6pp | Mann-Whitney p = 0.05 | d = 6.53 |
ON > NC in all 3 trials (+4, +5, +6). This is not noise.
Global three-group test: Kruskal-Wallis p = 0.027. * Cohen's d is descriptive only at n = 3.
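The headline effect size is reproducible from the per-trial scores in the table above — a quick check using the standard pooled-SD formula for Cohen's d:

```python
import statistics

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d with pooled sample standard deviation (equal group sizes)."""
    sa, sb = statistics.stdev(a), statistics.stdev(b)
    pooled = ((sa**2 + sb**2) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled

# Per-trial accuracies from the gemma3:27b table (out of 30 questions)
on  = [22/30*100, 20/30*100, 24/30*100]  # 73.3%, 66.7%, 80.0%
off = [ 5/30*100,  8/30*100,  6/30*100]  # 16.7%, 26.7%, 20.0%

print(round(cohens_d(on, off), 2))  # 8.8 — matches the table's d = 8.80
```

As the footnote says, a d of this size at n = 3 is descriptive only; the inferential claims rest on the sign test and Kruskal-Wallis results.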
The δ=0 condition was designed as the theoretical ceiling. Instead, ON exceeded it. Why?
When contradictions are injected, the Resolver preserves both old and new claims as linked pairs. These pairs act as anchors — they keep the original facts retrievable via vector search. Without contradictions (δ=0), facts gradually become buried under 180 turns of conversation and fall out of search results.
Metabolism provides two distinct benefits:
- Contradiction resolution (ON vs OFF, +52.2pp) — prevents collapse
- Knowledge anchoring (ON vs NC, +16.7pp) — prevents forgetting
For full analysis, see docs/context_rot_analysis.md.
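The anchoring effect can be illustrated with a toy retriever. This is purely illustrative — the real system uses vector search over ChromaDB, and recency ranking here merely stands in for relevance scoring — but it shows the mechanism: recency-only retrieval buries old facts, while pair-linked claims stay retrievable:

```python
def retrieve(topic: str, claims: list[dict], top_k: int = 2) -> list[dict]:
    """Toy retriever: rank matching claims by recency, but always keep
    claims that belong to a contradiction pair (the 'anchor' effect)."""
    matches = [c for c in claims if topic in c["text"]]
    recent = sorted(matches, key=lambda c: c["turn"], reverse=True)[:top_k]
    anchored = [c for c in matches if c.get("pair_id") is not None]
    seen, result = set(), []
    for c in recent + anchored:
        if id(c) not in seen:
            seen.add(id(c))
            result.append(c)
    return result

claims = [
    {"text": "budget is 5M", "turn": 3, "pair_id": 1},    # old claim, pair-linked
    {"text": "budget is 7M", "turn": 90, "pair_id": 1},   # newer claim, same pair
    {"text": "budget review note", "turn": 170, "pair_id": None},
    {"text": "budget draft comment", "turn": 180, "pair_id": None},
]
hits = retrieve("budget", claims)
# The turn-3 claim survives the recency cutoff only because it is pair-linked.
```

Without the `anchored` pass, the turn-3 fact would fall outside `top_k` after 180 turns — the exact failure mode the δ=0 (NC) condition exhibits.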
| Condition | Code | Overall |
|---|---|---|
| ON | New (temporal integration) | 57.8% |
| ON | Old (pre-temporal integration) | 8.9% |
| OFF | New | 15.6% |
| OFF | Old | 22.2% |
OFF is roughly unchanged across code versions (15.6% vs 22.2%), while ON jumps from 8.9% to 57.8% (+48.9pp). The only variable is the code version, isolating temporal integration as the cause of the gain.
User ──→ Dialogue (P1: read-only) ──→ Response
│
│ Conversation log accumulation
▼
┌──────────────────┐
│ Metabolism Pipeline │ ← Runs during idle time ("sleep")
│ │
│ 1. Extract │ Classify as fact/rule/preference
│ 2. Detect │ Pairwise contradiction detection via LLM
│ 3. Resolve │ Preserve contradiction pairs with temporal links
│ 4. Forget │ Demote rules unreferenced for 90 days
│ 5. Monitor │ S-value health check, auto-rollback on drops
└──────────────────┘
| Layer | Name | Role | Storage |
|---|---|---|---|
| L1 | Working | Current conversation context | deque + SQLite |
| L2 | Pending | Unprocessed conversation logs | SQLite |
| L3 | Active Logic | User values, rules, contradiction pairs | ChromaDB |
| L4 | Dormant Fact | Facts and demoted rules | SQLite + ChromaDB |
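The L3→L4 transition can be sketched minimally (illustrative only — real storage is SQLite + ChromaDB, and names here are hypothetical): rules unreferenced for 90 days move from Active Logic to Dormant Fact, per P4 below:

```python
from datetime import datetime, timedelta

TTL = timedelta(days=90)  # P4: let logic decay, preserve facts

def demote_stale(active_logic: list[dict], dormant: list[dict],
                 now: datetime) -> None:
    """Move rules unreferenced for TTL from L3 (Active) to L4 (Dormant)."""
    stale = [r for r in active_logic if now - r["last_referenced"] > TTL]
    for rule in stale:
        active_logic.remove(rule)
        dormant.append(rule)

now = datetime(2026, 4, 1)
l3 = [
    {"rule": "prefers concise replies", "last_referenced": datetime(2026, 3, 20)},
    {"rule": "old project naming rule", "last_referenced": datetime(2025, 12, 1)},
]
l4: list[dict] = []
demote_stale(l3, l4, now)
# Only the rule untouched for more than 90 days is demoted to L4.
```

Demotion rather than deletion keeps the rule recoverable: L4 is still indexed, so a later reference can promote it back.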
| Principle | Description |
|---|---|
| P1 | Read-only + Append-only during dialogue. Metabolism runs only when idle |
| P2 | Prioritize delta resolution. Reducing delta has exponential effect (S = μ × e^(-δ × k)) |
| P3 | Do not integrate low-confidence items |
| P4 | Let logic decay, preserve facts (90-day TTL demotion) |
| P5 | If it breaks, roll it back (pre-metabolism snapshots + auto-rollback) |
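P5 can be sketched as a minimal monitor (hypothetical names — the real implementation lives in `src/health/`): snapshot the store before metabolism, compare S afterwards, and restore on a sharp drop:

```python
import copy

class HealthMonitor:
    """Illustrative P5 sketch: pre-metabolism snapshot + auto-rollback."""

    def __init__(self, drop_threshold: float = 0.2):
        self.drop_threshold = drop_threshold
        self._snapshot = None
        self._s_before = None

    def before_metabolism(self, store, s_value: float) -> None:
        # Deep-copy so the snapshot is immune to in-place mutation
        self._snapshot = copy.deepcopy(store)
        self._s_before = s_value

    def after_metabolism(self, store, s_value: float):
        # If S dropped past the threshold, the metabolism run did harm:
        # discard its output and restore the pre-metabolism snapshot.
        if (self._s_before is not None
                and self._s_before - s_value > self.drop_threshold):
            return self._snapshot  # rollback
        return store
```

The threshold value is an assumption for illustration; the point is the invariant — metabolism can never leave the store in a worse state than it found it.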
- Python 3.12+
- Ollama (local LLM inference)
```bash
pip install -e .
```

For API-backed experiments:

```bash
pip install -e ".[cloud-llm]"
```

Run the system:

```bash
python src/main.py
```

Run the tests:

```bash
pytest tests/ -v
```

delta-zero/
├── src/
│ ├── core/ # config, ports, logger
│ ├── adapters/ # ollama, sqlite, chroma, embedding
│ ├── memory/ # 4-layer memory (L1-L4)
│ ├── dialogue/ # dialogue agent, temporal conflict formatting
│ ├── metabolism/ # metabolism pipeline
│ │ ├── extractor # knowledge classification + fact promotion
│ │ ├── resolver # contradiction detection + pair preservation
│ │ ├── demoter # 90-day TTL demotion (L3→L4)
│ │ └── garbage # processed log deletion
│ ├── health/ # S-value monitoring, snapshots, auto-rollback
│ ├── scheduler.py # dialogue/metabolism mode switching
│ └── main.py
├── tests/ # pytest suite
├── scripts/
│ └── experiment_runner.py # controlled experiment runner
├── config/ # experiment configurations (8 models)
└── docs/ # analysis reports, pitch materials
- Survival Equation: S = μ × e^(-δ × SCALE_FACTOR) — survival potential under cumulative contradiction δ
- Paper 3: "Cognitive Sleep for LLMs" — this system's experimental validation
- Paper 1: delta-survival-papers — survival equation S = μ × e^{-δ} (Lean 4 formal proofs, `sorry = 0`, `axiom = 0`)
- OSF Project: osf.io/mdh7b — all papers, data, and code in one place
- Key insight: Context rot is caused by contradiction accumulation, not context length
- delta-prune — Lightweight middleware version: scan and clean contradictions before sending to any LLM API
- delta-survival-papers — Survival Equation paper and Lean 4 formal proofs
Apache License 2.0