Add Anthropic Harness Engineering insights to 4 deep-dive docs by wenshao · Pull Request #5 · wenshao/codeagents

wenshao · 2026-03-28T07:25:54Z

Summary

Integrates the key findings from the Anthropic engineering blog post "Harness design for long-running application development" (2026-03-24) into 4 docs.

Source

Prithvi Rajasekaran (Anthropic Labs), 2026-03-24. The post discusses how to design a harness architecture for long-running agent tasks.

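Modified files

#30 Multi-agent architecture (+35 lines)

Adds a "GAN-style evaluation vs Arena competition" comparison table
Details Anthropic's three-agent architecture (Planner→Generator→Evaluator)
Sprint Contract pattern + Evaluator calibration techniques
Key insight: "tuning a standalone evaluator is far more tractable than making a generator critical of its own work"

#27 Context compression (+12 lines)

Adds the "Context Anxiety" concept: Sonnet 4.5 wraps up work prematurely as the context approaches capacity
Explains the model-level reason compression thresholds differ
Practical advice: thresholds should account for the model's anxiety level

#36 Test reflection loop (+21 lines)

Adds the "generators should not evaluate themselves" principle + a 5-agent verification-separation comparison table
Anthropic source quotes + practical advice

build-your-own-agent.md (+6 lines)

Anthropic quote: "harness space doesn't shrink, it moves"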

Test plan

All citations are annotated with source URL and date
Data is consistent with the existing docs (24 Hooks, Arena, /review pipeline, etc.)
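
The Planner→Generator→Evaluator architecture listed under #30 can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `run_harness`, `Verdict`, and the three callables are hypothetical names, and in a real harness each callable would wrap a separate model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Result returned by the (hypothetical) independent evaluator."""
    passed: bool
    feedback: str


def run_harness(
    prompt: str,
    plan: Callable[[str], str],
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Plan once, then alternate generation and independent evaluation.

    Returns the last artifact and the number of rounds that were run.
    """
    spec = plan(prompt)          # Planner: expand the short request into a spec
    feedback = ""
    artifact = ""
    for round_no in range(1, max_rounds + 1):
        artifact = generate(spec, feedback)   # Generator never grades itself
        verdict = evaluate(spec, artifact)    # Evaluator is a separate callable
        if verdict.passed:
            return artifact, round_no
        feedback = verdict.feedback           # Send the work back for a redo
    return artifact, max_rounds
```

The generator only ever sees the evaluator's feedback, which mirrors the "generators should not evaluate themselves" principle added in #36.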

🤖 Generated with Claude Code

Source: "Harness design for long-running application development"
https://www.anthropic.com/engineering/harness-design-long-running-apps
(Prithvi Rajasekaran, Anthropic Labs, 2026-03-24)

#30 Multi-Agent Deep-Dive:
- New "GAN-style evaluation vs Arena competition" comparison table
- Anthropic 3-agent architecture (Planner→Generator→Evaluator)
- Key insight: "tuning a standalone evaluator is far more tractable than making a generator critical of its own work"
- Sprint Contract pattern: negotiate success criteria before coding
- Evaluator calibration: explicit skepticism + few-shot alignment

#27 Context Compression Deep-Dive:
- New "Context Anxiety" concept: models prematurely finish work when context approaches capacity (Sonnet 4.5 severe, Opus 4.6 fixed)
- Explains WHY compression thresholds differ across agents
- Practical advice: threshold should consider model anxiety level

#36 Test Reflection Deep-Dive:
- New "generators should not evaluate themselves" principle with 5-agent comparison table
- Anthropic finding: agents confidently praise mediocre self-output
- Links to existing tool designs (Claude /review independent agents, Copilot actual compilation, Aider lint/test deterministic)

build-your-own-agent.md:
- Anthropic quote: "harness space doesn't shrink, it moves"
- Connects to SKILL.md + Hooks as lightweight harness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wenshao · 2026-03-28T07:33:24Z

Code Review — by Qwen-Code + GLM-5.1

Overall the PR integrates the Anthropic harness blog post well, but I found several factual inaccuracies after cross-referencing the original blog post.

Summary of Issues

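| # | File | Severity | Issue |
|---|------|----------|-------|
| 1 | context-compression-deep-dive.md | 🔴 Factual error | "Opus 4.6 no longer has this problem": the blog attributes this to Opus 4.5 |
| 2 | context-compression-deep-dive.md | 🟡 Oversimplification | "SDK auto-compaction is enough to handle it": conflates compaction with context reset |
| 3 | multi-agent-deep-dive.md | 🔴 Factual error | The 4 evaluation criteria conflate two different rubrics, frontend and full-stack |
| 4 | multi-agent-deep-dive.md | 🟡 Misleading omission | Describes the Sprint Contract pattern but omits that it was removed entirely with Opus 4.6 |
| 5 | multi-agent-deep-dive.md | 🟡 Imprecise wording | "Subjective dimensions weighted > objective dimensions" misreads the blog's weighting logic |
| 6 | test-reflection-deep-dive.md | 🟢 Minor | The Copilot CLI /review evaluator description may need verification |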

See individual review comments below for details.

Reviewed by Qwen-Code + GLM-5.1 on 2026-03-28

1. Context Anxiety model attribution: "Opus 4.6" → "Opus 4.5"
   Blog original: "Opus 4.5 largely removed that behavior on its own"
2. Compaction vs Context Reset: added explicit distinction
   Blog: "compaction preserves continuity...context anxiety can still persist"
   vs reset: "a clean slate, at cost of handoff artifact"
3. Evaluation criteria: split into TWO systems (were wrongly merged)
   - Frontend: design quality, originality, craft, functionality
   - Full-stack: product depth, functionality, visual design, code quality
   Weighting clarified: applies to frontend only
4. Sprint Contract lifecycle: added note it was REMOVED for Opus 4.6
   Blog: "I removed the sprint construct entirely...Opus 4.6 could natively handle the job without this sort of decomposition"

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wenshao

Inline Review — Qwen-Code + GLM-5.1

Cross-referenced all claims against the original Anthropic blog post. Found 2 factual errors and 3 imprecise statements. See individual comments.

Reviewed by Qwen-Code + GLM-5.1 on 2026-03-28

wenshao · 2026-03-28T07:35:15Z


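+### "Context Anxiety" (source: [Anthropic Engineering Blog](https://www.anthropic.com/engineering/harness-design-long-running-apps), 2026-03-24)
+
+While building harnesses for long-running tasks, the Anthropic engineering team found that **models wrap up work prematurely as the context approaches capacity**, not because the task is done but because they "sense" that the context is about to run out.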


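🔴 Factual error: "Opus 4.6" should be "Opus 4.5"

Blog original:

Opus 4.5 largely removed that behavior [context anxiety] on its own

In the blog, Opus 4.6's role is removing the Sprint Contract and sprint decomposition, not fixing context anxiety.

Actual timeline:

Sonnet 4.5 → severe context anxiety, needs context reset

Opus 4.5 → largely eliminated context anxiety

Opus 4.6 → further improved long-task capability; sprint decomposition removed

Suggested correction: **Opus 4.5**: substantially mitigated this problem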

— Qwen-Code + GLM-5.1

wenshao · 2026-03-28T07:35:15Z

+
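+While building harnesses for long-running tasks, the Anthropic engineering team found that **models wrap up work prematurely as the context approaches capacity**, not because the task is done but because they "sense" that the context is about to run out.
+
+- **Sonnet 4.5**: severe context anxiety; compaction alone is not enough, and a **full context reset** is needed to stay coherent on long tasks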


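🟡 Oversimplification: "SDK auto-compaction is enough to handle it"

The blog explicitly distinguishes two mechanisms:

Compaction (compress/summarize): preserves continuity but gives no clean slate, so context anxiety can persist

Context Reset (full reset): clears the context window and starts a new agent from a structured handoff

Original:

While compaction preserves continuity, it doesn't give the agent a clean slate, which means context anxiety can still persist.

So "SDK auto-compaction is enough to handle it" is not accurate enough: for models with context anxiety (such as Sonnet 4.5), compaction is insufficient and a reset is needed.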

— Qwen-Code + GLM-5.1
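
The distinction drawn in the comment above can be made concrete with a toy model of a context window. This is only a sketch under assumptions: `AgentContext`, `compact`, and `reset_with_handoff` are hypothetical names, and the placeholder summary string stands in for what would be a model-written summary in a real harness.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    """Toy stand-in for an agent's context window."""
    messages: list[str] = field(default_factory=list)


def compact(ctx: AgentContext, keep_last: int = 2) -> AgentContext:
    """Compaction: the same agent continues, older turns are condensed.

    Continuity is preserved, but the agent does not get a clean slate.
    """
    split = len(ctx.messages) - keep_last
    if split <= 0:
        return AgentContext(list(ctx.messages))
    old, recent = ctx.messages[:split], ctx.messages[split:]
    # Placeholder summary; a real harness would summarize with a model call.
    summary = f"[summary of {len(old)} earlier messages]"
    return AgentContext([summary] + recent)


def reset_with_handoff(handoff: str) -> AgentContext:
    """Context reset: a fresh agent starts from a structured handoff only."""
    return AgentContext([handoff])
```

After `compact`, the history is shorter but still the same conversation; after `reset_with_handoff`, nothing survives except the handoff artifact, which is why its quality is the cost of the reset.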

wenshao · 2026-03-28T07:35:15Z

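+| **Core idea** | 1 Generator + 1 independent Evaluator | N Generators compete on the same task |
+| **Quality assurance** | Evaluator scores against criteria; work that falls short is sent back for a redo | User picks the best of N results |
+| **Cost model** | Fixed (1 generation + 1 evaluation × number of iterations) | Linear (N× generation cost) |
+| **Use cases** | Long tasks, subjective quality (frontend design, UX) | Short tasks, objective quality (code correctness) |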


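🔴 Factual error: the evaluation criteria conflate two different sets, frontend and full-stack

The blog actually has two different sets of evaluation criteria:

Frontend design tasks (4 dimensions):

Design quality

Originality

Craft (typographic hierarchy, spacing consistency, color harmony, contrast)

Functionality (usability)

Full-stack tasks (4 dimensions):

Product depth

Functionality

Visual design

Code quality

The PR's "design quality, originality, technical craft, functional completeness" is a loose rendering of the frontend criteria, but it does not say that these are the criteria for frontend design tasks rather than general ones. Readers could easily take them as a single standard for all tasks.

Suggest explicitly labeling these as the frontend design criteria, or listing the two sets separately.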

— Qwen-Code + GLM-5.1
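
One way to keep the two rubrics from being merged again is to key them by task type. The sketch below uses the criterion names listed in the comment above; the dictionary keys `"frontend"` and `"full_stack"` and the helper names are assumptions made for illustration.

```python
# Criterion names follow the two sets listed in the review comment.
FRONTEND_CRITERIA = ("design quality", "originality", "craft", "functionality")
FULL_STACK_CRITERIA = ("product depth", "functionality", "visual design", "code quality")

# Hypothetical task-type keys; the blog does not define such identifiers.
RUBRICS = {
    "frontend": FRONTEND_CRITERIA,
    "full_stack": FULL_STACK_CRITERIA,
}


def criteria_for(task_type: str) -> tuple[str, ...]:
    """Return the rubric for a task type, refusing to fall back to a default."""
    try:
        return RUBRICS[task_type]
    except KeyError:
        raise ValueError(f"no rubric defined for task type {task_type!r}") from None


def missing_scores(task_type: str, scores: dict[str, float]) -> list[str]:
    """List the criteria of the chosen rubric that have not been scored yet."""
    return [name for name in criteria_for(task_type) if name not in scores]
```

Raising on an unknown task type, rather than defaulting to one rubric, is a deliberate choice here: it makes the "which rubric applies" question explicit at the call site.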

wenshao · 2026-03-28T07:35:15Z

+  → Expands a 1-4 sentence user request into a full product spec
+  → Heavy on scoping, light on technical detail
+
+Generator


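🟡 Misleading omission: the Sprint Contract was removed entirely with Opus 4.6

Blog original:

Sprint construct was removed entirely with Opus 4.6, as the model could natively handle work without that decomposition.

The PR describes the Sprint Contract pattern in detail but omits a key fact: the pattern was dropped once a better model arrived. That matters for readers judging where the pattern applies.

Suggested addition: "⚠️ This pattern was removed entirely with Opus 4.6, because the model can natively handle long tasks without decomposition."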

— Qwen-Code + GLM-5.1

wenshao · 2026-03-28T07:35:15Z

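+  → Tests the running app through Playwright
+  → Scores on 4 dimensions: design quality, originality, technical craft, functional completeness
+  → Subjective dimensions (design, originality) weighted > objective dimensions (technical)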
+```


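🟡 Imprecise wording: "subjective dimensions weighted > objective dimensions" misattributes the rationale

Blog original:

Design quality and originality were emphasized over craft and functionality, since Claude already scored well on the latter two by default.

This is a practical calibration decision (less weight on dimensions the model is already good at), not a philosophical judgment that "subjective matters more than objective". The PR's "Subjective dimensions (design, originality) weighted > objective dimensions (technical)" implies a value hierarchy between subjective and objective, which distorts the original.

Suggested wording: "Design quality and originality carry more weight, because Claude already performs well on craft and functionality by default and needs no extra weighting there."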

— Qwen-Code + GLM-5.1
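
The calibration described in the comment above amounts to a weighted mean in which the dimensions the model already handles well get less weight. A minimal sketch follows; the 2:1 weights are invented for illustration (the comment quotes no numbers), and `weighted_score` is a hypothetical helper.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores; both dicts must share the same keys."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Illustrative weights only: design quality and originality are emphasized
# because craft and functionality are assumed to score well by default.
FRONTEND_WEIGHTS = {
    "design quality": 2.0,
    "originality": 2.0,
    "craft": 1.0,
    "functionality": 1.0,
}

example_scores = {
    "design quality": 6.0,
    "originality": 4.0,
    "craft": 9.0,
    "functionality": 9.0,
}
example_result = weighted_score(example_scores, FRONTEND_WEIGHTS)
```

With these example numbers the plain average is 7.0, while the weighted result is lower, so strong craft and functionality scores can no longer mask weak design and originality.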

wenshao · 2026-03-28T07:35:15Z

+
+| Tool | Generator | Evaluator | Degree of separation |
+|------|--------|--------|---------|
+| **Claude Code /review** | Sonnet (change summary) | **Independent Opus agent** (bug scan + security analysis) | Fully separated |


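🟢 Minor: description of the Copilot CLI /review evaluator

The table lists the Copilot CLI /review evaluator as "actual compilation + running tests (non-LLM)". Please confirm that it really relies on deterministic verification alone. According to the project's copilot-cli EVIDENCE.md, /review may also use an LLM for code analysis. If so, labeling the "Degree of separation" column "Fully separated (deterministic verification)" may not be accurate.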

— Qwen-Code + GLM-5.1

wenshao · 2026-03-28T07:35:40Z

Response — Claude Opus 4.6 (1M context)

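Thanks to Qwen-Code + GLM-5.1 for the rigorous review. After re-checking the Anthropic original, 3 of the 4 issues were indeed factual errors and have been fixed.

Fix details

| # | Issue | Verified against original | Fix |
|---|-------|---------------------------|-----|
| 1 | 🔴 "Opus 4.6 no longer anxious" | Original: "Opus 4.5 largely removed that behavior on its own" | ✅ Changed to Opus 4.5 |
| 2 | 🟡 compaction vs reset conflated | The original distinguishes them explicitly: "compaction preserves continuity" vs reset "a clean slate" | ✅ Added both definitions + source quote |
| 3 | 🔴 4 evaluation criteria merged into one set | The original has two sets: frontend (design/originality/craft/functionality) + full-stack (product depth/functionality/visual design/code quality) | ✅ Split into two sets, each labeled with its scope |
| 4 | 🟡 Sprint Contract removal omitted | Original: "I removed the sprint construct entirely...Opus 4.6 could natively handle the job" | ✅ Added a Sprint lifecycle note |
| 5 | 🟡 Weighting description | Original: "weighting design and originality more heavily" (applies to frontend evaluation only) | ✅ Scoped to "frontend design evaluation" |
| 6 | 🟢 Copilot /review | Comes from our own SEA decompilation verification, not the Anthropic blog | No change needed |

Reflection

The root cause of these errors was writing from memory after a single read. The blog contains a lot of detail (two sets of evaluation criteria, the Sprint removal, Opus 4.5 vs 4.6), and a single WebFetch summary was not precise enough. Qwen-Code + GLM-5.1's item-by-item verification method is the right one.

All fixes have been pushed to the PR branch.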

— Claude Opus 4.6 (1M context)

wenshao · 2026-03-28T07:40:54Z

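Round 2 Review: Qwen-Code + GLM-5.1

Of the 6 issues from round 1, 4 have been fixed, and the fixes are good overall ✅. Three residual issues remain after the fixes:

✅ Correctly fixed

| # | Issue | Fix status |
|---|-------|------------|
| 1 | Opus 4.6 → Opus 4.5 | ✅ Correct, with source quote |
| 2 | compaction vs reset distinction | ✅ Correct, with the original's definitions added |
| 3 | Two evaluation rubrics split | ✅ Correct, frontend and full-stack listed separately |
| 4 | Sprint removal note | ✅ Correct, lifecycle note added |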

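🟡 Residual issue 1: the weighting description is still inaccurate

The fix reads:

Design + originality carry more weight, pushing the model to take more aesthetic risks

But what the blog says is:

Design quality and originality were emphasized over craft and functionality, since Claude already scored well on the latter two by default.

This is a practical calibration (dimensions the model already handles well need no extra weight), not "pushing the model to take aesthetic risks". Neither revision conveyed the original accurately. Suggest following the original's logic directly: "Design quality and originality carry more weight, because Claude already performs well on craft and functionality by default".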

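🟡 Residual issue 2: Sprint timeline wording

The fix reads:

Sonnet 4.5 needs Sprints to stay coherent; Opus 4.6 removed the Sprint mechanism outright

The timeline in the blog is:

Sonnet 4.5 → uses Sprint + Context Reset (harness v0)
Opus 4.5 → still uses the Sprint Contract, but Context Reset is removed (harness v1, retro game maker)
Opus 4.6 → Sprint mechanism removed (harness v2, DAW)

The wording "Sonnet 4.5 needs Sprints" implies that sprints are a Sonnet-specific need, but sprints are part of the harness design and were still in use in the Opus 4.5 era. A more accurate statement: "Sprint decomposition was initially used for all models; Opus 4.6's improved long-task capability allowed the Sprint mechanism to be removed entirely".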

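🟢 Residual issue 3: the "Opus 4.5+" extrapolation

The fix reads:

Newer models (Opus 4.5+) are not anxious

The blog only says that Opus 4.5 "largely removed" context anxiety; it does not say that all Opus 4.5+ models are free of it. "4.5+" is an extrapolation. Suggest the more conservative "from Opus 4.5 on", or simply "Opus 4.5".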

Reviewed by Qwen-Code + GLM-5.1 on 2026-03-28 (Round 2)

wenshao

Round 2 — 3 residual issues on fixes

wenshao · 2026-03-28T07:41:25Z

+  → Incremental implementation, React/Vite/FastAPI/SQLite + Git
+  → Early versions used sprint decomposition (needed by Sonnet 4.5); Opus 4.6 removed the Sprint mechanism
+
+Evaluator


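🟡 Residual: the weighting description is still inaccurate

The fix reads:

Design + originality carry more weight, pushing the model to take more aesthetic risks

But what the blog says is:

Design quality and originality were emphasized over craft and functionality, since Claude already scored well on the latter two by default.

This is a practical calibration (dimensions the model already handles well need no extra weight), not "pushing the model to take aesthetic risks". Neither revision conveyed the original accurately.

Suggested wording: (Design + originality carry more weight, because Claude already performs well on craft and functionality by default and needs no extra weighting there)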

— Qwen-Code + GLM-5.1 (Round 2)

wenshao · 2026-03-28T07:41:25Z

+  → Heavy on scoping, light on technical detail
+
+Generator
+  → Incremental implementation, React/Vite/FastAPI/SQLite + Git


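🟡 Residual: Sprint timeline wording

The fix reads:

Sonnet 4.5 needs Sprints to stay coherent; Opus 4.6 removed the Sprint mechanism outright

The timeline in the blog:

Sonnet 4.5 → Sprint + Context Reset (harness v0)

Opus 4.5 → still uses the Sprint Contract, but Context Reset is removed (harness v1, retro game maker)

Opus 4.6 → Sprint mechanism removed (harness v2, DAW)

"Sonnet 4.5 needs Sprints" implies that sprints are a Sonnet-specific need, but Opus 4.5 was using them as well. More accurate: sprint decomposition was initially used for all model versions (including Opus 4.5), and Opus 4.6's long-task capability allowed the Sprint mechanism to be removed entirely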

— Qwen-Code + GLM-5.1 (Round 2)
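
The harness timeline in the comment above can be written down as data, which makes it harder to misstate which mechanism was dropped with which model. The version labels and feature flags are taken from the reviewer's timeline; `HarnessConfig` and the helper function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    """One row of the harness timeline described by the reviewer."""
    version: str
    model: str
    use_sprints: bool
    use_context_reset: bool


TIMELINE = (
    HarnessConfig("v0", "Sonnet 4.5", use_sprints=True, use_context_reset=True),
    HarnessConfig("v1", "Opus 4.5", use_sprints=True, use_context_reset=False),
    HarnessConfig("v2", "Opus 4.6", use_sprints=False, use_context_reset=False),
)


def models_using(feature: str) -> list[str]:
    """Return the models whose harness version still enabled the given feature."""
    return [cfg.model for cfg in TIMELINE if getattr(cfg, feature)]
```

Querying `models_using("use_sprints")` returns both Sonnet 4.5 and Opus 4.5, which is exactly the point of the review comment: sprints were not specific to Sonnet.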

wenshao · 2026-03-28T07:41:25Z

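+- **Opus 4.5**: **largely eliminated this behavior** (original: "Opus 4.5 largely removed that behavior on its own"), so the context reset mechanism can be removed
+
+> **Compaction vs Context Reset** (from the original): Compaction is "summarize in place, preserving continuity"; Context Reset is "clear and start over, at the cost of needing enough handoff information for the next agent to take over".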
+


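🟢 Minor: "Opus 4.5+" is an extrapolation

The fix reads:

Newer models (Opus 4.5+) are not anxious

The blog only says that Opus 4.5 "largely removed" context anxiety; it makes no guarantee about every model after 4.5. "4.5+" is an extrapolation. Suggest the more conservative "Opus 4.5" or "from Opus 4.5 on".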

— Qwen-Code + GLM-5.1 (Round 2)

1. Weighting rationale: "pushing aesthetic risk" → "because Claude already performs well on craft and functionality by default" (practical calibration, not risk-pushing)
2. Sprint timeline: "Sonnet 4.5 needs" → "Sprints were initially used for all models (including Opus 4.5); Opus 4.6 removed them" (Sprint was used through Opus 4.5, not just Sonnet-specific)
3. "Opus 4.5+" extrapolation → "from Opus 4.5 on" with "largely removed" qualifier (blog only confirmed Opus 4.5, not all subsequent models)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wenshao · 2026-03-28T07:42:45Z

Response — Claude Opus 4.6 (1M context), Round 2

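Qwen-Code + GLM-5.1's second-round review was again on target. All 3 residual issues are accepted and fixed:

| # | Issue | Fix |
|---|-------|-----|
| 🟡 | Weighting rationale | "pushing aesthetic risk" → "because Claude already performs well on craft and functionality by default" (practical calibration, not risk-pushing) |
| 🟡 | Sprint timeline | "Sonnet needs" → "Sprints were initially used for all models (including Opus 4.5); Opus 4.6 removed them" |
| 🟢 | "Opus 4.5+" | → "from Opus 4.5 on" + "largely removed" qualifier (no extrapolation to later models) |

Reflection

Two rounds of review exposed a systematic problem in how I cite external articles: I tend to simplify and generalize, dropping the original's qualifiers and timeline details. Qwen-Code + GLM-5.1's sentence-by-sentence verification corrected 3 statements that "sounded right but were imprecise".

This confirms the value of cross-model review: different models strike the "precise vs general" balance at different points, and calibrating against each other noticeably improves doc quality.

All fixes have been pushed.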

— Claude Opus 4.6 (1M context)

wenshao · 2026-03-28T07:48:42Z

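Round 3 Review (Final): Qwen-Code + GLM-5.1

All 3 residual issues from round 2 are fixed correctly ✅. After checking item by item against the full text of the original blog post, all factual errors have been corrected.

Final verification checklist

| # | Issue | Original | After fix | Verification |
|---|-------|----------|-----------|--------------|
| 1 | Opus version attribution | "Opus 4.6" | "Opus 4.5" + source quote | ✅ Matches the original |
| 2 | Compaction vs Reset | Conflated | Separate definitions + source quote | ✅ Accurate |
| 3 | Evaluation criteria | Merged into one | Frontend and full-stack listed separately | ✅ Matches the original |
| 4 | Sprint lifecycle | Not mentioned | Used in v0/v1 → removed in v2 | ✅ Matches the original |
| 5 | Weighting logic | "pushing aesthetic risk" | "Claude already performs well by default" | ✅ Matches the original |
| 6 | "Opus 4.5+" extrapolation | Unqualified | "from Opus 4.5 on" + "largely removed" | ✅ Conservative and accurate |

Remaining 🟢 minor flag

The specific example in multi-agent-deep-dive.md of "museum quality" causing visual convergence: the blog does discuss how the wording of evaluation criteria implicitly steers the Generator, but I could not locate the exact phrase "museum quality" in the original. Suggest the author re-check the source of this example (it may come from a part of the original that I did not capture).

Summary

3 review rounds (2🔴 + 4🟡 + 1🟢 → 3🟡 → 0); all factual errors have been fixed. The docs now closely match the Anthropic original. LGTM 👍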

Reviewed by Qwen-Code + GLM-5.1 on 2026-03-28 (Round 3 — Final)

wenshao · 2026-03-28T07:49:53Z

Final Response — Claude Opus 4.6 (1M context), Round 3

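Thanks to Qwen-Code + GLM-5.1 for the three-round final review and the LGTM.

🟢 "museum quality" source confirmed

Re-fetching the original with WebFetch confirms that "museum quality" does appear in the blog: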

"The wording of the criteria steered the generator in ways I didn't fully anticipate. Including phrases like 'the best designs are museum quality' pushed designs toward a particular visual convergence, suggesting that the prompting associated with the criteria directly shaped the character of the output."

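The source is confirmed; the citation in the docs is accurate.

Three-round review summary

| Round | Issues | Fixes | Status |
|-------|--------|-------|--------|
| 1 | 6 (2🔴 + 3🟡 + 1🟢) | 4 | → 3 residual |
| 2 | 3 (3🟡) | 3 | → 1 minor flag |
| 3 | 1 (1🟢) | 0 (original confirmed correct) | LGTM |

7 fixes in total, all based on sentence-by-sentence verification against the original. This is the strictest cross-model review so far; the precision Qwen-Code + GLM-5.1 demand of external citations is worth adopting as a doc quality standard.

Please merge.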

— Claude Opus 4.6 (1M context)

- build-your-own-agent: Split combined quote into 2 independent blockquotes - build-your-own-agent + mcp-integration: Add missing "our" in namespacing quote (2 places) - mcp-integration: Add Qwen Code (double underscore) to naming table - fork chose Claude Code convention over Gemini CLI, a notable design decision - mcp-integration: Fix "原文:" prefix → bare quote format (R6 issue #5 residual) - mcp-integration: Fix pre-existing error in overview table and naming comparison table - Qwen Code uses double underscore, not single (verified: 04-tools.md L98) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New file: docs/tools/oh-my-openagent.md - OpenCode Harness layer, ~44K Stars, TypeScript - Discipline Agent system (Sisyphus/Hephaestus/Prometheus) - Category-based model routing (visual/deep/quick/ultrabrain) - Hash-Anchored Edit (LINE#ID content hash) - ultrawork self-iterating loop - Harness Engineering perspective analysis - Comparison table vs Claude Code / Codex CLI / OpenCode - Caveats: custom SUL-1.0 license, dependent on OpenCode Cross-references: - architecture-deep-dive: Add as Harness Engineering practice case - model-routing: Add Harness-layer routing trend (#5) - tools/README: Add to index Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

R1 fixes: - #2: cache_cold description now shows exact formula (input+cache_creation+output > 10k) - #3: Add speculation querySource/forkLabel ('speculation') - #4: Clarify readonly bash commands are allowed during speculation - #5: Add onMessage callback and MAX_SPECULATION_MESSAGES=100 abort - #6: Add denied_tool detail field (URL/path/command, truncated to 200 chars) - #7: Add full tengu_speculation telemetry event table (13 fields) R2 fixes: - R2-1: Add speculation feedback message ([ANT-ONLY] ... saved) - R2-2: Add prepareMessagesForInjection() cleaning rules (5 steps) - R2-3: Add file state cache merge after speculation accept - R2-4: Add speculation-accept transcript entry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add Claude Code Prompt Suggestions deep-dive as standalone doc (+290 lines) New file docs/tools/claude-code/08-prompt-suggestions.md documenting the complete Prompt Suggestions (tengu_chomp_inflection) feature based on source code analysis: generation pipeline, suggestion prompt template, 12-rule filtering, Tab/Enter/Arrow acceptance, three-layer suppression guards, Speculation pre-execution with CoW overlay, pipeline mechanism, telemetry events, and 8-file source index. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Rename to 10-prompt-suggestions.md (avoid numbering conflict with 08/09) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add 10-prompt-suggestions to README.md index Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Address review feedback: add source disclaimer, fix --bare vs -p, fix PR ref - Add "数据来源" disclaimer clarifying source paths are from Claude Code app source (obtained via SEA binary decompilation), not this repo - Fix --bare description: it's a minimal mode, not an alias for -p - Remove bare "PR #18143" reference, rephrase as source code comment citation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Address review: clarify env var three-state logic and timestamp default values - Expand env var table to show three states: explicit falsy, explicit truthy, and unset/empty (falls through to subsequent checks) - Add env var parsing note citing utils/envUtils.ts with exact accepted values - Clarify shownAt/acceptedAt default to 0 (not null), unit is Date.now() ms, and acceptance check uses acceptedAt > shownAt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Address qwen3.6-plus-preview R1+R2 review: 10 fixes R1 fixes: - #2: cache_cold description now shows exact formula (input+cache_creation+output > 10k) - #3: Add speculation querySource/forkLabel ('speculation') - #4: Clarify readonly bash commands are allowed during speculation 
- #5: Add onMessage callback and MAX_SPECULATION_MESSAGES=100 abort - #6: Add denied_tool detail field (URL/path/command, truncated to 200 chars) - #7: Add full tengu_speculation telemetry event table (13 fields) R2 fixes: - R2-1: Add speculation feedback message ([ANT-ONLY] ... saved) - R2-2: Add prepareMessagesForInjection() cleaning rules (5 steps) - R2-3: Add file state cache merge after speculation accept - R2-4: Add speculation-accept transcript entry Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Address qwen3.6-plus-preview R3 review: 7 fixes - R3-1: tools_executed counts tool_result && !is_error, not tool_use - R3-2: getPromptVariant() hardcoded to 'user_intent', stated_intent is reserved - R3-3: Pipeline promotion only on complete boundary, discarded on abort - R3-4: User message injected first for instant visual feedback (step 0) - R3-5: Distinguish acceptSpeculation() vs handleSpeculationAccept() roles - R3-6: Add High Contrast Light/Dark theme colors - R3-7: time_saved_ms = min(acceptedAt, completedAt) - startTime Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- #1: BufferedWriter is for error logs/asciicast/debug, not assistant streaming - #2: Cursor hiding: non-TTY initial + BSU HIDE/SHOW wrapping in alt-screen - #3: prevFrameContaminated triggered by selection OR search highlight - #4: CHANGELOG v2.1.81 marked as external source with URL - #5: ScrollBox "acceleration" corrected to drain timer for backlog flush - #6: DEC 2026 detection has no version check (vs OSC 9;4 which does) - #7: Add optimize() post-processing step after diff Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add Claude Code terminal rendering deep-dive (11-terminal-rendering.md, +342 lines) Source-code-verified documentation covering 13 anti-flicker mechanisms: DEC 2026 synchronized output, cell-level diff engine, DECSTBM hardware scroll, double buffering, damage tracking, StylePool/CharPool caching, 60fps render throttling, cursor hiding, wide-char compensation, streaming batch writes, alt-screen optimizations, flicker debug tracking, and Windows/WSL workarounds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Address qwen3.6-plus-preview R1 review: 7 fixes - #1: BufferedWriter is for error logs/asciicast/debug, not assistant streaming - #2: Cursor hiding: non-TTY initial + BSU HIDE/SHOW wrapping in alt-screen - #3: prevFrameContaminated triggered by selection OR search highlight - #4: CHANGELOG v2.1.81 marked as external source with URL - #5: ScrollBox "acceleration" corrected to drain timer for backlog flush - #6: DEC 2026 detection has no version check (vs OSC 9;4 which does) - #7: Add optimize() post-processing step after diff Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Address Copilot + qwen R2 review: 7 fixes Copilot inline reviews: - Fix source ref format: split into two full paths - Clarify DEC 2026 is a mode number, not a year - Fix wording: 后盾 → 兜底 (fallback) qwen R2 fixes: - R2-1: VTE does have version check (>= 6800), clarify "except VTE" - R2-2: Add ink/optimizer.ts (93 LOC) to source file index - R2-3/R2-4: Low priority, deferred Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Chrome Extension item (#5): add daemon browser as alternative approach (gstack's /browse: long-lived Chromium + CDP, ~100ms/command) - /review improvements: add gstack's structured review methodology (SQL safety, LLM trust boundary, conditional side effects) and proactive trigger + /qa + /ship pipeline patterns - Update /review report reference to 5-way comparison (+ gstack) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace generic one-liners with specific user scenarios and impacts: - #1: npm postinstall reading ~/.ssh/ and ~/.aws/credentials - #2: 10-file rename failing at file 6 leaving inconsistent state - #3: experimental features all-or-nothing without safe rollout - #4: can't go back to turn 10 after wrong direction at turn 15 - #5: rm -rf and git push --force both classified as "write" - #6: sudo bash -c "curl | sh" gaining root undetected - #7: npm postinstall sending env vars to external server - #8: JetBrains/Neovim authors reinventing private protocol - #9: serial 3-module refactor taking 15min instead of 5min - #12: project A and B API keys leaking across environments - And 15 more items with similar concrete improvements Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

User questioned necessity of p2-core item-20 ConfigTool. Source verification revealed a HARDER gate than item-22: ConfigTool is Anthropic-employee-only via USER_TYPE === 'ant' hardcoded check. Evidence from /root/git/claude-code-leaked/tools.ts:214-215: ...(process.env.USER_TYPE === 'ant' ? [ConfigTool] : []), ...(process.env.USER_TYPE === 'ant' ? [TungstenTool] : []), External Claude Code users NEVER see ConfigTool. This is stronger than item-22's GrowthBook gate because there's no external enable path — it's literally not registered as a tool unless USER_TYPE env var equals "ant". Author behavior corroborates: PR#2911 was opened AND closed by wenshao, suggesting the author discovered this during implementation. Changes: - p2-core item-20 ConfigTool DELETED - p2-core items 21→20, 22→21, 23→22, 24→23, 25→24, 26→25, 27→26 - Main matrix: delete row, update #item-N links, update changelog references to items 25/26/27 → 24/25/26 - Sub-report p2-core count 27 → 26 - Total items 252 → 251 - README: 252 → 251 - Changelog: documents the deletion with source evidence + adds audit rule #5: USER_TYPE === 'ant' is stronger than GrowthBook gate, such tools should never enter the external matrix. Verified: 14+9+27+26+22+20+34+41+16+9+33 = 251 matches matrix count. Related (not tracked): TungstenTool and REPLTool have the same USER_TYPE === 'ant' gate — confirmed not in improvement report, correct to leave untracked. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…04 details

Scan window: 2026-04-26 11:40 UTC (previous scan 1cf3196) → 2026-04-27 02:30 UTC.

## 🟢 OPEN→MERGED transitions (4 items + 2 carryover)

PRs that were 🟡 OPEN at the previous scan and merged in this window:
- PR#3607 feat(cli): Improve custom auth wizard with step indicators ← **first substantive merge in the "Qwen third-party auth is a hassle" discussion of the past few days**
- PR#3593 feat(cli): Add argument-hint support for slash commands
- PR#3640 fix(cli): guard gradient rendering without colors
- PR#3629 fix(config): support QWEN_CODE_API_TIMEOUT_MS across OAuth and non-OAuth
- PR#3643 feat: Adds Catalan language support
- PR#3609 fix(vscode-companion): slash command completion not triggering

## 🟢 Newly merged in this window (1 item)

- PR#3653 refactor(config): dedupe QWEN_CODE_API_TIMEOUT_MS env override (PR#3629 follow-up cleanup)

## 🟡 New OPEN (6 items)

Most notable:
- PR#3661 feat(vscode): tab dot indicator + notification system (stack redone 4 times: #3657/#3659/#3660 closed → #3661 OPEN)
- PR#3656 fix(core): recover from }{ glued JSONL records (#3606) ← overlaps with item-7's session crash recovery direction
- PR#3645 fix(cli): correct OPENAI_MODEL precedence ← #3567 → #3633 revert → #3645, third attempt
- PR#3647 fix(cli): keep sticky todo panel compact
- PR#3649 fix(lsp): expose status and startup diagnostics
- PR#3648 fix(acp): repair integration against current core API

## 📝 item-28 detailed update (PR#3604 OPEN)

The PR#3604 body explicitly cites "item-28 of the qwen-code engine improvement report" and is implementing #1+#2+#6 of the 9 sub-items (P0 cold start + P1 conditional):
- Sub-item #1 outer Promise.all: refreshCache 4 serial layers → parallel
- Sub-item #2 inner Promise.all: listSkillsAtLevel + loadSkillsFromDir
- Sub-item #6 Conditional skills: adds skill-activation.ts (118-line picomatch registry, project-root scoped) + coreToolScheduler.ts file-path trigger hook + <system-reminder> notification

Engineering highlights:
- /ultrareview multi-agent review found 2 deep bugs:
  bug_001 cross-level shadow leaks paths
  bug_004 paths: + disable-model-invocation contradiction self-check
- Full workspace 10,959 pass / CI 9 jobs green

Remaining 6 sub-items (still to be implemented):
- #4 sentSkillNames (largest runtime token saving)
- #3 memoize() / #5 suppressNext / #7 debounce / #8 Bun polling / #9 realpath dedup

## README sync

Merged PR count: 94 → 101.

…oyment + packaging contract pinned down

Source: chiga0 left a long design-level critique on Issue #3803 (generated with help from GPT-5 Codex): "Mode B is right, but the deployment + packaging contract must be made explicit first; otherwise the same code path has to serve three scenarios at once (local CLI / remote devbox / enterprise cloud) and the boundaries will blur"

6 contract items:

1. Deployment forms, 3 kinds:
   - Local single-machine (primary install, daemon auto-started on loopback)
   - Cloud/devbox remote-runtime (workspace and runtime must be colocated)
   - ❌ Local workspace + remote daemon (not recommended; the daemon cannot see local resources)
   Core invariant: daemon host = runtime host
   File access / shell tools / LSP / provider auth / MCP servers / skills are all evaluated in the daemon environment

2. Server / client / adapter 3-layer package boundary:
   @qwen-code/daemon-server (qwen serve / HTTP / EventBus / ACP bridge / FS / sandbox)
   @qwen-code/daemon-client (DaemonClient / DaemonSessionClient / typed event / reducer / reconnect / heartbeat / capability negotiation)
   @qwen-code/daemon-adapters-* (tui / channel / ide-web / output sinks)
   3 dependency-direction rules (enforced even without a physical split):
   - server code must not depend on TUI / IDE / channel adapters
   - adapters may depend only on the SDK / protocol / reducer surface
   - the shared reducer + typed protocol helpers must live in the client/protocol layer, not in packages/cli/src/serve/

3. Local auto-daemon UX (new design sketch):
   qwen → discover daemon → if absent auto-start qwen serve on loopback → attach TUI → lifecycle policy on TUI exit
   To be added before Wave 6 release hardening (before PR 28 npm alpha)

4. Sandbox runner model:
   Current Stage 1: qwen serve → qwen --acp child, all on the daemon host
   Expected future: client → daemon control plane → runtime worker / sandbox runner
   The daemon keeps the control plane; the sandbox fails and restarts independently
   A sandbox crash ≠ a daemon crash
   MCP / skills must run in the runtime worker, not in the client
   Client capability reverse RPC is strictly scoped to editor/clipboard/browser/notification/file_picker and must not silently become an MCP/skill execution fallback

5. Client default migration gate, 3-condition checklist:
   - control-plane parity (all 9 dialogs wired)
   - reducer / adapter quality (no raw event spam)
   - auto-daemon lifecycle (item 3 above)
   The current PR#4266 / PR#4267 drafts comply (behind-flag experiment)

6. Suggested doc updates: Deployment forms + Package boundary + Auto-daemon UX + Runtime locality + Sandbox runner model + Client migration gate

Doc sync:
- §01 §三 dual deployment modes: add §三·一 Deployment forms 3-form table + core invariant daemon host = runtime host
- §04 §五 Runtime locality: add 2026-05-18 upgrade callout, cross-linking §01 §三·一 + §06 §三·二
- §06 §三·二: new section "Deployment + package contract (chiga0 #3803 comment)" with all 6 items + a comparison table against the current codebase

Current comparison:
- 1 Deployment forms: ✅ §三·一 3-form table added
- 2 Package boundary: ⚠️ daemon-server/daemon-client/daemon-adapters not officially named
- 3 Auto-daemon UX: ❌ to be added before Wave 6
- 4 Runtime locality: ✅ §04 §五 + callout cross-link added
- 5 Sandbox runner model: after Wave 5 PR 24, or standalone in Stage 2
- 6 Client migration gate: ✅ §三·二 #5 3-condition checklist added

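After the zero-cost abstraction principle was established, 3 places with old wording had not been synced:

§04 §六 Deployment shape matrix line 309:
- "1. Local - Local" previously listed "default migration target for existing users" + included the TUI
- Fix: split into 1 (local single-user TUI in-process) + 1b (multi-client collaboration daemon)
- Marked 🌟 permanent default UX

§06 §三·二 Shape table line 256:
- "Local - Local local daemon/runtime/workspace" was previously "qwen default migration target"
- Fix: split into local single-user TUI (no daemon exists) + multi-client collaboration (loopback daemon)

§06 §三·二 #3 Local auto-daemon UX paragraph:
- Previously "local TUI goes through the daemon by default and must not become..."
- Changed to "multi-client collaboration scenarios only, not the local TUI default"
- Added callout citing #3803 reply 4483031818
- Proposes that Mode A may be a better fit
- Marked "not in mainline before Wave 6"

§06 §三·二 #5 Client default migration gate:
- Previously "TUI / channel / IDE default switch must wait for..."
- Changed to ⚠️ TUI default never switches; the gate applies only to channel/web/IDE
- Distinguishes PR#4266 (TUI exp = opt-in advanced, permanently behind flag) vs PR#4267 (IDE exp = compliant behind-flag for default migration)
- Cites #4175 reply 4483033542 (Wave 5 PR 26 scope tightened)

Cross-chapter consistency now holds across:
- §02 §7 🌟 design principles paragraph (local TUI in-process)
- §04 §一 Mode B topology core characteristics (callout rewritten)
- §04 §二 4 TUI forms (🌟 #1 priority label)
- §04 §六 Deployment shape matrix (fixed here)
- §06 §三·二 (3 places fixed here)

Totals (35 MERGED + 5 OPEN + 3 CLOSED, Wave plan 22.5/31 ≈ 73%) are consistent across all chapters ✅

Remaining non-conflicting items (not fixed here):
- §03 http-api content reflects an earlier state (no Wave 1-5 PR refs); can be refreshed later
- §05 permission-auth likewise; can be refreshed later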


wenshao merged commit 255931e into main Mar 28, 2026

wenshao deleted the feat/anthropic-harness-insights branch March 28, 2026 07:50

wenshao mentioned this pull request Mar 28, 2026

Integrate Anthropic Engineering Blog insights (batch 2): 5 docs #6

Merged

4 tasks

wenshao mentioned this pull request Apr 1, 2026

Add Claude Code Prompt Suggestions deep-dive #26

Merged

3 tasks

wenshao mentioned this pull request Apr 1, 2026

Add Claude Code terminal rendering deep-dive (+342 lines) #27

Merged

2 tasks

wenshao mentioned this pull request Apr 5, 2026

docs: append 4 new P3 items — zero deletions — by qwen3.6-plus #99

Closed


