Skip to content

🐛 fix: recover hetero persistence state across Vercel replicas#14539

Merged
ONLY-yours merged 3 commits into
canaryfrom
fix/hetero-persistence-multi-replica
May 9, 2026
Merged

🐛 fix: recover hetero persistence state across Vercel replicas#14539
ONLY-yours merged 3 commits into
canaryfrom
fix/hetero-persistence-multi-replica

Conversation

@ONLY-yours

Copy link
Copy Markdown
Member

💻 Change Type

  • 🐛 fix

🔗 Related Issue

Related to LOBE-8516 (cloud hetero agent pipeline)

🔀 Description of Change

Fixes message fragmentation and orphaned tool calls caused by Vercel serverless routing the same operation's heteroIngest batches to different Node.js instances.

Root cause: HeterogeneousPersistenceHandler keeps operationStates in process memory. When consecutive batches land on different replicas, the second replica starts with a blank state — wrong currentAssistantMessageId, empty toolMsgIdByCallId, and lost accumulated text.

Three-part fix (no Redis, no new infra):

  1. Flush content after every batchingest() now writes accumulatedContent/accumulatedReasoning to DB at the end of each call, so a replica switch mid-stream doesn't discard buffered text (fixes isolated single-character messages like "我").

  2. Persist heteroCurrentMsgId on step boundarieshandleStepStart now writes the new assistant message id to topic.metadata.heteroCurrentMsgId. A fresh replica reads this field instead of falling back to the stale initial placeholder, so new content lands on the correct message.

  3. Restore toolMsgIdByCallId from DB on state creationloadOrCreateState now queries listMessagePluginsByTopic to rebuild the tool call → tool message mapping, so tool_result events on a new replica can still find and update their counterpart rows.

Desktop unaffected: HeterogeneousPersistenceHandler is only invoked via heteroIngest/heteroFinish tRPC. Desktop hetero agent uses the renderer-side IPC path and never touches this handler.

🧪 How to Test

  • Deploy to Vercel (multi-instance) and run a multi-step Claude Code task
  • Verify no isolated text fragments or orphaned tool call warnings in the conversation

📝 Additional Information

The in-memory singleton remains for same-replica fast-path. The DB reads in loadOrCreateState are only triggered on a cold state (first batch per replica per operation), so the overhead is bounded to one listMessagePluginsByTopic query per replica cold-start.

🤖 Generated with Claude Code

Three-part fix for multi-replica split-brain on Vercel serverless:

1. Flush accumulated content to DB after every ingest batch so a
   replica switch mid-accumulation doesn't lose text chunks.
2. Persist `heteroCurrentMsgId` to topic.metadata on every step
   boundary so new replicas restore the correct currentAssistantMessageId.
3. Restore toolMsgIdByCallId from DB on state creation so tool_results
   landing on a different replica than their tool_use are still matched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vercel

vercel Bot commented May 8, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
lobehub Ready Ready Preview, Comment May 9, 2026 5:09am

Request Review

@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 8, 2026

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've reviewed this pull request using the Sourcery rules engine

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 023750e072

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/server/services/heterogeneousAgent/HeterogeneousPersistenceHandler.ts Outdated
Comment thread src/server/services/heterogeneousAgent/HeterogeneousPersistenceHandler.ts Outdated
@codecov

codecov Bot commented May 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.77419% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 68.64%. Comparing base (f934e2f) to head (e66c871).
⚠️ Report is 9 commits behind head on canary.

Additional details and impacted files
@@           Coverage Diff            @@
##           canary   #14539    +/-   ##
========================================
  Coverage   68.64%   68.64%            
========================================
  Files        2623     2623            
  Lines      230163   230190    +27     
  Branches    29087    28252   -835     
========================================
+ Hits       157996   158022    +26     
- Misses      72018    72019     +1     
  Partials      149      149            
Flag Coverage Δ
app 63.20% <96.77%> (+<0.01%) ⬆️
database 92.40% <ø> (ø)
packages/agent-runtime 80.50% <ø> (ø)
packages/builtin-tool-lobe-agent 83.41% <ø> (ø)
packages/context-engine 84.06% <ø> (ø)
packages/conversation-flow 92.43% <ø> (ø)
packages/file-loaders 87.60% <ø> (ø)
packages/memory-user-memory 74.74% <ø> (ø)
packages/model-bank 99.94% <ø> (ø)
packages/model-runtime 83.65% <ø> (ø)
packages/prompts 70.09% <ø> (ø)
packages/python-interpreter 92.90% <ø> (ø)
packages/ssrf-safe-fetch 0.00% <ø> (ø)
packages/types 5.02% <ø> (ø)
packages/utils 88.02% <ø> (ø)
packages/web-crawler 88.29% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Store 66.92% <ø> (ø)
Services 54.07% <ø> (ø)
Server 70.45% <96.77%> (+<0.01%) ⬆️
Libs 54.03% <ø> (ø)
Utils 79.95% <ø> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.


// Restore toolMsgIdByCallId from the DB so tool_results that arrive on a
// different replica than their tool_use can still be matched and persisted.
const toolPlugins = await this.deps.messageModel.listMessagePluginsByTopic(topicId);

@ONLY-yours ONLY-yours May 9, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude Can you evaluate which approach has more overhead: using a DB query like this.deps.messageModel.listMessagePluginsByTopic to ensure tool ordering, versus using an external Redis to guarantee write ordering? Which solution would you prefer?


This comment was translated by Claude.

Original Content

@claude 你可以评估下,使用 this.deps.messageModel.listMessagePluginsByTopic 这种 db 查询来保证 tools 顺序的方式,和 外置一个 redis 来保证写入的方式,两者哪个开销更大呢?你倾向于选择什么方案

@claude claude Bot May 9, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude finished @ONLY-yours's task in 1m 29s —— View job


DB查询 vs Redis 方案开销对比分析

  • 读取 HeterogeneousPersistenceHandler.ts 理解当前实现
  • 评估两种方案的开销与适用性
  • 给出建议

两种方案的开销分析

方案A:当前实现 — listMessagePluginsByTopic DB查询

关键代码路径在 loadOrCreateState(第 308–369 行):

ingest() 调用
  → loadOrCreateState()
      ├─ operationStates.get(operationId)  ← 热路径:直接返回,零开销
      └─ 冷路径(同一replica首次处理该操作):
          ├─ topicModel.findById()            ← 1次 DB 读
          └─ messageModel.listMessagePluginsByTopic()  ← 1次 DB 读(恢复 toolMap)

特点:

  • 冷启动代价:每个 replica 每个 operation 只有 1次额外 DB 查询,之后全内存操作
  • 热路径代价operationStates 是模块级单例 Map,热路径 O(1) 内存查找,零网络开销
  • 数据一致性:直接读 DB(真正的 source of truth),无 cache 失效问题
  • 适合的并发假设:BatchIngester 设计上是顺序投递——batch N ACK 后才发 batch N+1,所以跨 replica 并发写基本不会发生

方案B:外置 Redis

如果用 Redis 来维护 toolMsgIdByCallId

每次 tool_use 持久化时:
  → messageModel.create()     ← 1次 DB 写
  → redis.hset(key, callId, msgId)  ← 1次 Redis 写(每个 tool_use 都有)

冷启动时:
  → redis.hgetall(key)        ← 1次 Redis 读(替换 DB 查询)

特点:

  • 热路径额外代价:每个 tool_use 事件都需要写 Redis(第 715–716 行写 in-memory map 的地方)
  • 新的故障点:Redis 连接失败、TTL 过期、内存淘汰都需要 fallback 处理
  • 运维复杂度:需要 TTL 管理(避免 key 永久堆积)、连接池配置、故障降级
  • 数据一致性风险:Redis 与 DB 可能短暂不一致(Redis 写成功但 DB 写失败的情况需要回滚或补偿)

关键结论:哪个开销更大?

Redis 方案整体开销更大,原因:

维度 DB查询方案 Redis方案
冷启动代价 1次 DB 查询(SELECT by topic) 1次 Redis 读
热路径代价 0(纯内存) 每个 tool_use 1次 Redis 写
基础设施依赖 已有 DB 新增 Redis
一致性保证 DB 即 source of truth,无需同步 需要双写 + 失败处理
实现复杂度 高(TTL / fallback / 原子操作)

DB 查询的冷启动代价是有界的——listMessagePluginsByTopic 是一次针对单个 topic 的索引查询,结果集通常只有几十到几百条记录。Redis 方案把开销从"冷启动时的一次批量读"变成了"每个 tool_use 的一次写",反而更频繁。


我的建议:保留 DB 查询方案

当前 PR 的实现是正确的权衡:

  1. BatchIngester 顺序投递决定了跨 replica 并发写概率极低,Redis 的分布式锁优势在这里基本用不上
  2. 冷启动有界性已经在 PR 描述中指出,实际影响很小
  3. 无新依赖降低运维负担和故障面

不过有一个值得关注的问题(Codex 也提到了),与 Redis 无关:

heteroCurrentMsgId(第 335–336 行)没有 scope 到 operationId——如果上一次操作的步骤边界写入了这个字段,新操作在 replica 冷启动时会误用旧操作的 assistant message id,直到第一个 stream_start { newStep } 修正它。建议考虑改成 heteroCurrentMsgId_<operationId> 或把它放进 runningOperation 对象内,从而保证 scope 正确。

@ONLY-yours ONLY-yours May 9, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude Can you check whether the two Codex issues have been fixed with the latest changes?


This comment was translated by Claude.

Original Content

@claude 你看看那两个 codex 的问题,在最新的改动下是否修复了?

@claude claude Bot May 9, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude finished @ONLY-yours's task in 1m 48s —— View job


两个 Codex 问题分析

  • 读取最新的 HeterogeneousPersistenceHandler.ts
  • 读取 packages/types/src/topic/topic.ts 确认类型变更
  • 检查 P1:heteroCurrentMsgId 是否已 scope 到 operationId
  • 检查 P2:步骤边界的 metadata 写入是否已具备幂等性

P1:已完整修复 ✅

原始问题heteroCurrentMsgId 只存了一个裸 message ID 字符串,没有 operationId 绑定。新操作冷启动时会误用上一次操作的 step msgId,直到下一个 stream_start { newStep } 才纠正,期间内容写入错误的 assistant row。

修复内容

  1. 类型层面packages/types/src/topic/topic.ts 第 82 行):

    // 旧: heteroCurrentMsgId?: string
    // 新:
    heteroCurrentMsgId?: { msgId: string; operationId: string };
  2. 写入层面handleStepStart 第 475–477 行):

    await this.deps.topicModel.updateMetadata(state.topicId, {
      heteroCurrentMsgId: { msgId: newMsg.id, operationId: state.operationId },
    });
  3. 读取层面loadOrCreateState 第 338–342 行):

    const stored = topic.metadata?.heteroCurrentMsgId;
    const currentAssistantMessageId =
      stored?.operationId === running.operationId
        ? (stored.msgId ?? running.assistantMessageId)
        : running.assistantMessageId;  // 前次操作留下的值,直接忽略

operationId 校验确保跨操作不会复用旧指针,P1 完整解决。


P2:实质性修复 ✅(有意识的权衡)

原始问题handleStepStartstate.currentAssistantMessageId = newMsg.id(内存推进),再 updateMetadata(DB 写)。若 updateMetadata 短暂失败被重试,re-enter 时内存状态已指向新 msg,再创建一个 assistant 就会挂在错误的父节点上,破坏链路。

修复内容handleStepStart 第 444–483 行,注释已说明意图):

// 1. 创建新 assistant message(可能产生孤儿,但不破坏链路)
const newMsg = await this.deps.messageModel.create({...});

// 2. DB 写先于内存推进 (P2 fix)
await this.deps.topicModel.updateMetadata(state.topicId, {
  heteroCurrentMsgId: { msgId: newMsg.id, operationId: state.operationId },
});

// 3. 仅在 DB 写成功后才推进内存状态
state.currentAssistantMessageId = newMsg.id;
state.accumulatedContent = '';
...

效果:若 updateMetadata 失败,内存状态仍停留在上一个 step,重试时 handleStepStart 从正确的父节点重新创建 assistant——第一次的空 message 变成孤儿行,但 不会破坏 turn 链路。代码注释(第 472–474 行)也明确承认了这个权衡。

剩余的细微边界:每次重试都会新建一个孤儿空行。但孤儿行是空的,只影响存储,不影响展示和逻辑。这是 Codex P2 原始关注点("duplicate turns / shift parent chain")已被解决后,剩下的合理代价。


总结

问题 状态 说明
P1: heteroCurrentMsgId 无 operationId scope ✅ 完整修复 改为 { msgId, operationId } 结构,读取时做 operationId 校验
P2: 步骤边界写入非幂等 ✅ 实质修复 内存推进移到 DB 写之后,链路不再被破坏;孤儿空行是已知权衡

两个 Codex 问题在当前分支的改动下均已得到有效处理。

@ONLY-yours ONLY-yours merged commit d0b938a into canary May 9, 2026
47 checks passed
@ONLY-yours ONLY-yours deleted the fix/hetero-persistence-multi-replica branch May 9, 2026 06:36
@Innei Innei mentioned this pull request May 9, 2026
Innei added a commit that referenced this pull request May 9, 2026
# 🚀 LobeHub Release (20260509)

**Release Date:** May 9, 2026  
**Since v2.1.56:** 236 merged PRs · 19 contributors

> Agent Task System reaches general availability, the Agent Signal
pipeline runs nightly self-review with skill-aware policies, the
heterogeneous-agent runtime crosses replica boundaries, inline documents
become a first-class context source, and bot platforms expand across
Messager, Line, and Telegram.

---

## ✨ Highlights

- **Agent Task System (GA)** — End-to-end task execution platform:
templates, tracking, comment tools, parent reassignment, scheduled cron,
and dependency-ordered batch runs. (#14540, #14515, #14517, #14272,
#14246, #14418, #14403, #14488)
- **Agent Signal nightly self-review** — Wired self-review loop with
prompt + DB support, exponential-backoff retry on receipt listing,
skill-aware policy, and improved skill-intent detection. (#14543,
#14542, #14281, #14409, #14526, #14437)
- **Inline documents in KB tool** — BM25 search and `docs_*` read for
inline document grounding; agent documents usable as VFS. (#14494,
#14222)
- **Inline agent cards in chat** — `lobeAgents` markdown tag renders
agent profile cards inline; clickable card after `createAgent`. (#14495,
#14493)
- **Heterogeneous agent runtime** — Cloud hetero exec pipeline steps 3+4
land, persistence recovers across Vercel replicas, server-side
ingest/finish handler, and `lh hetero exec` CLI. (#14486, #14539,
#14444, #14431)
- **Bot platforms expand** — Messager, Line, DM pair policy, and
messenger DB tables; Telegram API path restored. (#14442, #14207,
#14211, #14496, #14519)
- **Visual analysis tool** — New visual understanding tool, with trigger
tracking and flattened schema. (#14378, #14399, #14550)
- **DeepSeek V4 Pro as OSS default** — OSS deployments ship with
DeepSeek V4 Pro by default; DeepSeek Anthropic runtime supported.
(#14555, #14312)

---

## 🏗️ Core Agent & Architecture

### Agent Task System

- **Task System GA** — End-to-end execution platform now available.
(#14540)
- **Templates, comments, reparenting** — Template tracking, comment
tools, and parent reassignment. (#14515, #14517, #14488)
- **Cron + dependency-ordered runs** — Scheduled status with cron editor
and dependency-ordered subtask batches. (#14246, #14418, #14272)
- **Inspector + chip UI + batch tasks** — Task Inspector/Render
registry, batch `createTasks`/`runTasks`, and chip-based agent-documents
inspector. (#14403, #14404)
- **Recommend templates regardless of brief count** — Recommendations no
longer suppressed when briefs are sparse. (#14508)
- **Scheduling resilience** — Manual run no longer eats next scheduled
tick; recurring tasks survive brief resolution. (#14304, #14348)
- **Brief synthesis** — Auto-synthesize topic briefs; brief actions
revamp; mute resolved-brief icon on home. (#14324, #14228, #14452)
- **Task list & detail polish** — Topic operation ID exposed; task
drawer Gateway reconnect. (#14282)

### Agent Signal pipeline

- **Nightly self-review wired** — Prompt + DB support for the
self-review loop. (#14543)
- **Self-review activities push to briefs** — Activities during nightly
self-reflection now create briefs. (#14437)
- **Skill management policy** — New policy for Skill management running
inside Agent Signal. (#14281)
- **Skill intent detection & routing** — Improved detection plus direct
intent handling when `hintIsSkill`. (#14409, #14526)
- **Document tool outcome rendering** — Decision view restores missing
document tool outcomes. (#14534)
- **Exponential backoff retry** — Listing signal receipts retries with
jittered backoff. (#14542)
- **Easier-to-use signals** — Structural simplification +
recent-activities surface for receipts. (#14290, #14326, #14407)

### Heterogeneous agent runtime

- **Cloud hetero exec pipeline (steps 3 + 4)** — Refactor lands the next
two stages of the cloud hetero agent execution pipeline. (#14486)
- **Persistence recovery on Vercel** — Hetero state recovered across
replica boundaries. (#14539)
- **Server-side ingest/finish + persistence** — `aiAgent.heteroIngest` /
`heteroFinish` handlers. (#14444)
- **`lh hetero exec` CLI** — Standalone heterogeneous agent runs from
CLI. (#14431)
- **Gateway round-trip loading** — `execAgentTask` keeps the input box
in loading state through the full round-trip. (#14503)
- **Provider SDK type routing** — Provider routing now respects SDK
type. (#14520)
- **DeepSeek reasoning preserved** — `reasoning_content` preserved in
OpenAI-compatible runtime for DeepSeek models. (#14546)

### Knowledge & inline docs

- **KB tool BM25 + docs read** — BM25 search and `docs_*` read
integrated for inline documents. (#14494)
- **Agent documents as VFS** — FS-compatible output for agent documents.
(#14222)
- **`lobeAgents` markdown tag** — Inline agent cards rendered from a
markdown tag. (#14495)
- **Clickable agent card after `createAgent`** — Mentions and
recommendations become clickable. (#14493)
- **ExplorerTree** — Generic tree component built on `@pierre/trees` for
reusable explorer surfaces. (#14094)
- **Local file mention snapshots** — Mentions can now snapshot local
files. (#14278)

### Architecture

- **Agent Hono routes** — New agent routes added on Hono. (#14535)
- **`/api/agent` migrated to Hono** — Remaining `/api/agent` routes
finish their migration. (#14478)
- **Agent marketplace merged into web-onboarding** — Reduces package
fragmentation. (#14514)
- **Producer pipeline extracted** — Shared package for the producer
pipeline. (#14425)
- **`agentDispatcher.selectRuntimeType`** — New runtime selection
abstraction. (#14428)
- **pnpm v11 migration** — Workspace consolidated. (#14316)
- **Browser-compatible frontmatter parser** — Replaces `gray-matter`.
(#14435)

---

## 📱 Platforms & Integrations

- **Messager support** — New messager package wired into the chat
surface. (#14442)
- **Messenger DB tables** — IM bot integration gains its persistence
layer. (#14496)
- **Line bot** — Initial Line support and downstream optimization.
(#14207, #14448)
- **DM pair policy** — Group/DM pair-based delivery. (#14211)
- **Telegram API restored** — Missing Telegram API path reconnected.
(#14519)
- **xAI Responses tools stabilized** — Plus unsupported parameter
handling. (#14462, #14445)
- **Volcengine websearch via ResponseAPI** — Built-in websearch for
Volcengine. (#14216)

---

## 🤖 Models & Providers

- **DeepSeek V4 Pro default for OSS** — OSS distribution defaults to
DeepSeek V4 Pro. (#14555)
- **DeepSeek Anthropic runtime** — Anthropic-shape runtime support for
DeepSeek. (#14312)
- **GPT-5.5 / GPT-5.5 Pro** — New OpenAI tier. (#14142)
- **Grok 4.20 / Grok 4.3 / LobeHub-hosted Grok 4.3** — (#14253, #14382,
#14446)
- **Gemma 4 + provider settings normalization** — (#13313)
- **gpt-image-2 + step-image-edit-2** — (#14253, #14329)
- **Model bank refresh + original-pricing display** — Batch model
updates and pricing surfaces. (#14070, #14391)
- **Hunyuan migrated to TokenHub for Hy3 Preview** — (#14108)
- **Reject lobehub model ids no longer in the bank** — (#14261)
- **Hide runtime-only aliases** — Runtime-only model aliases no longer
leak into the model picker. (#14552)

---

## 🖥️ User Experience

### Onboarding

- **Shared prefix steps** — Language and privacy extracted as shared
prefix steps. (#14538)
- **Identity intervention card simplified** — Plus tool result renders
cleanup. (#14505, #14506)
- **Welcome polish + web-onboarding tool UI** — (#14475)
- **Templates fetched from market API** — (#14286)
- **Virtual model id for default onboarding model** — (#14311)
- **Skip / mode-switch footer behind feature flag** — Footer guarded for
desktop and web initialization. (#14560)

### Home & navigation

- **Home recents performance** — Recents refresh periodically and inline
task status; brief and task-template fetch overhead trimmed. (#14518,
#14516)
- **Home refactor + skill-connect recommendations** — Restructured home
with skill-connect recommendation system. (#14266, #14214)
- **Tasks in agent sidebar** — Tasks moved from welcome card into the
sidebar list. (#14500)
- **Sidebar collapse persists** — Home sidebar collapse state stored.
(#14473)
- **Agent-specific topic grouping** — Plus improved empty state and
agent identity in topic search. (#14225)
- **MentionMenu scroll fix** — Mention menu no longer clips inside chat
input. (#14533)

### Conversation & chat

- **Follow-up chips fill input** — Clicking a follow-up chip now fills
the input instead of sending immediately. (#14536)
- **Quick-reply chips below assistant messages** — (#14350)
- **Inline single-tool assistant group + leading sentence promotion** —
(#14244)
- **Assistant-group rendering** — Per-segment content overrides flow
into MessageContent. (#14504)
- **Tool call timer fix** — Timer no longer resets when tool calls
collapse or expand. (#14513)
- **Streaming re-render reduction** — Reference stabilization and
self-subscribing components. (#14470)
- **Topic chat drawer feedback input** — (#14392)

### Skills, agents, devtools

- **Managed skill folders** — Agent view displays managed skill folders
and aligns delete confirmations. (#14553)
- **Review tab + bulk git diffs** — New Review tab with bulk diffs;
gating uses effective working directory. (#14334, #14512)
- **Devtools gallery rebuild** — Plus Review polish, queue-tray images.
(#14423)
- **Agent mock devtools** — Playback & fixture viewer. (#14436)

### Desktop & CLI

- **App tray visibility setting** — (#14463)
- **Notification settings in desktop** — (#14491)
- **Multimodal input across CLI / shared spawn / desktop** — (#14433)
- **CLI bot + userId guide** — (#14258)

---

## 🔧 Tooling

- **Visual analysis tool** — New visual understanding tool with
flattened schema. (#14378, #14550)
- **GitHub marketplace tool UI** — (#14420)
- **Drop "Local" prefix and `____builtin` suffix from tool names** —
(#14364, #14289)
- **Sanitize provider tool names** — Avoids invalid characters from
external providers. (#14510)
- **Generation moderation context** — Moderation context passed through
the generation pipeline. (#14541)
- **Visual analysis trigger tracking** — (#14399)
- **Claude thinking signature sanitization** — History signatures
sanitized when replaying Claude conversations. (#14499)
- **Responses input media sanitization** — Assistant media sanitized in
Responses input. (#14497)

---

## 🔒 Security & Reliability

- **Security:** Removed the `/webapi/proxy` route and dead URL-manifest
plugin code to shrink the SSRF surface. (#14549)
- **Security:** Sessions revoked after password reset. (#14424)
- **Reliability:** Added `prompt_cache_key` to OpenAI chat requests for
stable cache hits. (#14349)
- **Reliability:** `onFinish` now fires even when the browser tab is
backgrounded mid-SSE stream. (#14461)
- **Reliability:** Better-auth session refetch preserves user fields
rather than overwriting them. (#14531)
- **Reliability:** User-memory queries sanitize backticks; user-memory
errors now explicitly injected so failures stay visible. (#14524,
#14525)
- **Reliability:** Auth captcha retries handled; input loading unsticks
on `auth_failed` and recoverable `auth_expired`. (#14346, #14419)
- **Reliability:** Trace snapshot finalized on error path. (#14440)
- **Reliability:** Drop `switchTopic` race under rapid sidebar clicks.
(#14115)
- **Reliability:** PDF chunking logic fixed to prevent vectorization
failure. (#14327)
- **Performance:** Marketplace fork uses a batched API for parallel
installs. (#14537)
- **Performance:** Review tab open latency cut ~9× on large dirty trees.
(#14338)

---

## 👥 Contributors

Huge thanks to **18 contributors** who shipped **236 merged PRs** this
cycle.

@hezhijie0327 · @sxjeru · @yueyinqiu · @octo-patch · @hardy-one ·
@Coooolfan · @CanYuanA · @BillionClaw · @arvinxx · @tjx666 · @Innei ·
@neko · @AmAzing129 · @rdmclin2 · @lijian · @sudongyuer · @rivertwilight
· @cy948

Plus @lobehubbot for i18n and translation maintenance.

---

**Full Changelog**:
v2.1.56...release/weekly-20260509
ONLY-yours added a commit that referenced this pull request May 9, 2026
Vercel serverless functions are stateless per-request, so `operationStates`
is empty on every `heteroIngest` call. loadOrCreateState always cold-creates.

#14539 fixed `toolMsgIdByCallId` restoration but left `accumulatedContent`,
`toolState.payloads`, and `toolState.persistedIds` empty on cold load,
causing two bugs:

- Content truncation: cold instance starts with `accumulatedContent=''`,
  accumulates only the current batch's text, then writes that shorter string
  on the next step boundary or terminal — overwriting the longer content the
  previous write had already stored in DB.

- Tool duplication / tools[] overwrite: `persistedIds={}` on cold load
  means every `tools_calling` event re-creates already-persisted tool
  messages, and `payloads=[]` means phase 1/3 writes only the current
  batch's tools, wiping previous tools from `assistant.tools[]`.

Fix: in `loadOrCreateState`, fetch the current assistant message and restore
`accumulatedContent`, `accumulatedReasoning`, `toolState.payloads`, and
`toolState.persistedIds` from it. Cold load is now equivalent to warm load.

Also adds two regression tests covering the cold-replica scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ONLY-yours added a commit that referenced this pull request May 9, 2026
…ingStatus (#14592)

* 🐛 fix: skip reconnect when gateway action already established a connection

Race condition on new-topic first message:
1. switchTopic loads runningOperation → useGatewayReconnect fires
2. executeGatewayAgent calls connectToGateway (status: connecting)
3. reconnectToGatewayOperation overwrites with resumeOnConnect:true
4. Gateway sees resume on a brand-new session → no events → stuck

Second message works because the client store's runningOperation is
stale (from the first op), so SWR deduplications and no reconnect fires.

Fix: bail out of reconnectToGatewayOperation if gatewayConnections
already shows connecting/connected for that operationId.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: always pass --cwd /workspace for cloud CC to ensure session resume

CC stores session files at ~/.claude/projects/<encoded-cwd>/.
Without an explicit --cwd the actual working directory can differ
between sandbox invocations, so --resume <heteroSessionId> fails
to locate the previous session files even though the container is
persistent and the ID is correctly stored in topic.metadata.

Default cwd to /workspace for cloud runs (desktop keeps its own
explicit path), guaranteeing a stable session-file location across
page reloads within the same sandbox lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: extend reconnect guard to cover all in-flight connection statuses

The previous guard only skipped reconnect for 'connecting'/'connected'
but the connection can already be in 'authenticating' or 'reconnecting'
by the time useGatewayReconnect fires, leaving the race window open.

Flip the condition: skip for any status that is not 'disconnected'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: restore cold replica state in HeterogeneousPersistenceHandler

Vercel serverless functions are stateless per-request, so `operationStates`
is empty on every `heteroIngest` call. loadOrCreateState always cold-creates.

#14539 fixed `toolMsgIdByCallId` restoration but left `accumulatedContent`,
`toolState.payloads`, and `toolState.persistedIds` empty on cold load,
causing two bugs:

- Content truncation: cold instance starts with `accumulatedContent=''`,
  accumulates only the current batch's text, then writes that shorter string
  on the next step boundary or terminal — overwriting the longer content the
  previous write had already stored in DB.

- Tool duplication / tools[] overwrite: `persistedIds={}` on cold load
  means every `tools_calling` event re-creates already-persisted tool
  messages, and `payloads=[]` means phase 1/3 writes only the current
  batch's tools, wiping previous tools from `assistant.tools[]`.

Fix: in `loadOrCreateState`, fetch the current assistant message and restore
`accumulatedContent`, `accumulatedReasoning`, `toolState.payloads`, and
`toolState.persistedIds` from it. Cold load is now equivalent to warm load.

Also adds two regression tests covering the cold-replica scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Innei pushed a commit to Innei/lobehub that referenced this pull request May 12, 2026
…ingStatus (lobehub#14592)

* 🐛 fix: skip reconnect when gateway action already established a connection

Race condition on new-topic first message:
1. switchTopic loads runningOperation → useGatewayReconnect fires
2. executeGatewayAgent calls connectToGateway (status: connecting)
3. reconnectToGatewayOperation overwrites with resumeOnConnect:true
4. Gateway sees resume on a brand-new session → no events → stuck

Second message works because the client store's runningOperation is
stale (from the first op), so SWR deduplications and no reconnect fires.

Fix: bail out of reconnectToGatewayOperation if gatewayConnections
already shows connecting/connected for that operationId.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: always pass --cwd /workspace for cloud CC to ensure session resume

CC stores session files at ~/.claude/projects/<encoded-cwd>/.
Without an explicit --cwd the actual working directory can differ
between sandbox invocations, so --resume <heteroSessionId> fails
to locate the previous session files even though the container is
persistent and the ID is correctly stored in topic.metadata.

Default cwd to /workspace for cloud runs (desktop keeps its own
explicit path), guaranteeing a stable session-file location across
page reloads within the same sandbox lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: extend reconnect guard to cover all in-flight connection statuses

The previous guard only skipped reconnect for 'connecting'/'connected'
but the connection can already be in 'authenticating' or 'reconnecting'
by the time useGatewayReconnect fires, leaving the race window open.

Flip the condition: skip for any status that is not 'disconnected'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: restore cold replica state in HeterogeneousPersistenceHandler

Vercel serverless functions are stateless per-request, so `operationStates`
is empty on every `heteroIngest` call. loadOrCreateState always cold-creates.

lobehub#14539 fixed `toolMsgIdByCallId` restoration but left `accumulatedContent`,
`toolState.payloads`, and `toolState.persistedIds` empty on cold load,
causing two bugs:

- Content truncation: cold instance starts with `accumulatedContent=''`,
  accumulates only the current batch's text, then writes that shorter string
  on the next step boundary or terminal — overwriting the longer content the
  previous write had already stored in DB.

- Tool duplication / tools[] overwrite: `persistedIds={}` on cold load
  means every `tools_calling` event re-creates already-persisted tool
  messages, and `payloads=[]` means phase 1/3 writes only the current
  batch's tools, wiping previous tools from `assistant.tools[]`.

Fix: in `loadOrCreateState`, fetch the current assistant message and restore
`accumulatedContent`, `accumulatedReasoning`, `toolState.payloads`, and
`toolState.persistedIds` from it. Cold load is now equivalent to warm load.

Also adds two regression tests covering the cold-replica scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
arvinxx pushed a commit that referenced this pull request May 12, 2026
…ingStatus (#14592)

* 🐛 fix: skip reconnect when gateway action already established a connection

Race condition on new-topic first message:
1. switchTopic loads runningOperation → useGatewayReconnect fires
2. executeGatewayAgent calls connectToGateway (status: connecting)
3. reconnectToGatewayOperation overwrites with resumeOnConnect:true
4. Gateway sees resume on a brand-new session → no events → stuck

Second message works because the client store's runningOperation is
stale (from the first op), so SWR deduplications and no reconnect fires.

Fix: bail out of reconnectToGatewayOperation if gatewayConnections
already shows connecting/connected for that operationId.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: always pass --cwd /workspace for cloud CC to ensure session resume

CC stores session files at ~/.claude/projects/<encoded-cwd>/.
Without an explicit --cwd the actual working directory can differ
between sandbox invocations, so --resume <heteroSessionId> fails
to locate the previous session files even though the container is
persistent and the ID is correctly stored in topic.metadata.

Default cwd to /workspace for cloud runs (desktop keeps its own
explicit path), guaranteeing a stable session-file location across
page reloads within the same sandbox lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: extend reconnect guard to cover all in-flight connection statuses

The previous guard only skipped reconnect for 'connecting'/'connected'
but the connection can already be in 'authenticating' or 'reconnecting'
by the time useGatewayReconnect fires, leaving the race window open.

Flip the condition: skip for any status that is not 'disconnected'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: restore cold replica state in HeterogeneousPersistenceHandler

Vercel serverless functions are stateless per-request, so `operationStates`
is empty on every `heteroIngest` call. loadOrCreateState always cold-creates.

#14539 fixed `toolMsgIdByCallId` restoration but left `accumulatedContent`,
`toolState.payloads`, and `toolState.persistedIds` empty on cold load,
causing two bugs:

- Content truncation: cold instance starts with `accumulatedContent=''`,
  accumulates only the current batch's text, then writes that shorter string
  on the next step boundary or terminal — overwriting the longer content the
  previous write had already stored in DB.

- Tool duplication / tools[] overwrite: `persistedIds={}` on cold load
  means every `tools_calling` event re-creates already-persisted tool
  messages, and `payloads=[]` means phase 1/3 writes only the current
  batch's tools, wiping previous tools from `assistant.tools[]`.

Fix: in `loadOrCreateState`, fetch the current assistant message and restore
`accumulatedContent`, `accumulatedReasoning`, `toolState.payloads`, and
`toolState.persistedIds` from it. Cold load is now equivalent to warm load.

Also adds two regression tests covering the cold-replica scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
lezi-fun pushed a commit to lezi-fun/lobehub that referenced this pull request May 13, 2026
…ingStatus (lobehub#14592)

* 🐛 fix: skip reconnect when gateway action already established a connection

Race condition on new-topic first message:
1. switchTopic loads runningOperation → useGatewayReconnect fires
2. executeGatewayAgent calls connectToGateway (status: connecting)
3. reconnectToGatewayOperation overwrites with resumeOnConnect:true
4. Gateway sees resume on a brand-new session → no events → stuck

Second message works because the client store's runningOperation is
stale (from the first op), so SWR deduplications and no reconnect fires.

Fix: bail out of reconnectToGatewayOperation if gatewayConnections
already shows connecting/connected for that operationId.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: always pass --cwd /workspace for cloud CC to ensure session resume

CC stores session files at ~/.claude/projects/<encoded-cwd>/.
Without an explicit --cwd the actual working directory can differ
between sandbox invocations, so --resume <heteroSessionId> fails
to locate the previous session files even though the container is
persistent and the ID is correctly stored in topic.metadata.

Default cwd to /workspace for cloud runs (desktop keeps its own
explicit path), guaranteeing a stable session-file location across
page reloads within the same sandbox lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: extend reconnect guard to cover all in-flight connection statuses

The previous guard only skipped reconnect for 'connecting'/'connected'
but the connection can already be in 'authenticating' or 'reconnecting'
by the time useGatewayReconnect fires, leaving the race window open.

Flip the condition: skip for any status that is not 'disconnected'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: restore cold replica state in HeterogeneousPersistenceHandler

Vercel serverless functions are stateless per-request, so `operationStates`
is empty on every `heteroIngest` call. loadOrCreateState always cold-creates.

lobehub#14539 fixed `toolMsgIdByCallId` restoration but left `accumulatedContent`,
`toolState.payloads`, and `toolState.persistedIds` empty on cold load,
causing two bugs:

- Content truncation: cold instance starts with `accumulatedContent=''`,
  accumulates only the current batch's text, then writes that shorter string
  on the next step boundary or terminal — overwriting the longer content the
  previous write had already stored in DB.

- Tool duplication / tools[] overwrite: `persistedIds={}` on cold load
  means every `tools_calling` event re-creates already-persisted tool
  messages, and `payloads=[]` means phase 1/3 writes only the current
  batch's tools, wiping previous tools from `assistant.tools[]`.

Fix: in `loadOrCreateState`, fetch the current assistant message and restore
`accumulatedContent`, `accumulatedReasoning`, `toolState.payloads`, and
`toolState.persistedIds` from it. Cold load is now equivalent to warm load.

Also adds two regression tests covering the cold-replica scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant