🐛 fix: recover hetero persistence state across Vercel replicas by ONLY-yours · Pull Request #14539 · lobehub/lobehub

ONLY-yours · 2026-05-08T17:16:52Z

💻 Change Type

🐛 fix

🔗 Related Issue

Related to LOBE-8516 (cloud hetero agent pipeline)

🔀 Description of Change

Fixes message fragmentation and orphaned tool calls caused by Vercel serverless routing the same operation's heteroIngest batches to different Node.js instances.

Root cause: HeterogeneousPersistenceHandler keeps operationStates in process memory. When consecutive batches land on different replicas, the second replica starts with a blank state — wrong currentAssistantMessageId, empty toolMsgIdByCallId, and lost accumulated text.

Three-part fix (no Redis, no new infra):

Flush content after every batch — ingest() now writes accumulatedContent/accumulatedReasoning to DB at the end of each call, so a replica switch mid-stream doesn't discard buffered text (fixes isolated single-character messages like "我").
Persist heteroCurrentMsgId on step boundaries — handleStepStart now writes the new assistant message id to topic.metadata.heteroCurrentMsgId. A fresh replica reads this field instead of falling back to the stale initial placeholder, so new content lands on the correct message.
Restore toolMsgIdByCallId from DB on state creation — loadOrCreateState now queries listMessagePluginsByTopic to rebuild the tool call → tool message mapping, so tool_result events on a new replica can still find and update their counterpart rows.
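
The first part of the fix (flush after every batch) can be sketched as follows. This is a minimal illustration, not the handler's actual code: `StreamEvent`, `OperationState`, `FakeMessageStore`, and `ingestBatch` are hypothetical names, and the store is a synchronous in-memory stand-in for the real (async) message model.

```typescript
// Hypothetical, simplified shapes; the real handler's types differ.
interface StreamEvent {
  type: 'text' | 'reasoning';
  delta: string;
}

interface OperationState {
  messageId: string;
  accumulatedContent: string;
  accumulatedReasoning: string;
}

// Synchronous in-memory stand-in for the messages table (assumption).
class FakeMessageStore {
  rows = new Map<string, { content: string; reasoning: string }>();
  writes = 0;

  update(id: string, value: { content: string; reasoning: string }): void {
    this.rows.set(id, { ...value });
    this.writes += 1;
  }
}

// Apply one batch to the in-memory buffers, then flush the running totals so
// the stored row always reflects everything ingested so far.
function ingestBatch(
  state: OperationState,
  batch: StreamEvent[],
  store: FakeMessageStore,
): void {
  for (const event of batch) {
    if (event.type === 'text') state.accumulatedContent += event.delta;
    else state.accumulatedReasoning += event.delta;
  }
  store.update(state.messageId, {
    content: state.accumulatedContent,
    reasoning: state.accumulatedReasoning,
  });
}

const store = new FakeMessageStore();
const state: OperationState = {
  messageId: 'msg_1',
  accumulatedContent: '',
  accumulatedReasoning: '',
};

ingestBatch(state, [{ type: 'text', delta: 'Hello' }], store);
ingestBatch(
  state,
  [
    { type: 'reasoning', delta: 'thinking' },
    { type: 'text', delta: ', world' },
  ],
  store,
);
```

Because the row is rewritten at the end of each call, a replica that takes over after the first batch would find "Hello" already stored rather than buffered only in another process's memory.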

Desktop unaffected: HeterogeneousPersistenceHandler is only invoked via heteroIngest/heteroFinish tRPC. Desktop hetero agent uses the renderer-side IPC path and never touches this handler.

🧪 How to Test

Deploy to Vercel (multi-instance) and run a multi-step Claude Code task
Verify no isolated text fragments or orphaned tool call warnings in the conversation

📝 Additional Information

The in-memory singleton remains for same-replica fast-path. The DB reads in loadOrCreateState are only triggered on a cold state (first batch per replica per operation), so the overhead is bounded to one listMessagePluginsByTopic query per replica cold-start.
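
The hot-path / cold-path split described above might look roughly like this. The sketch assumes a hypothetical `ToolPluginRow` shape and a synchronous `listPluginsByTopic` callback standing in for `listMessagePluginsByTopic`; the real query is async and its row shape may differ.

```typescript
// Hypothetical row shape (assumption, not the actual query result type).
interface ToolPluginRow {
  toolCallId: string;
  messageId: string;
}

interface ReplicaState {
  toolMsgIdByCallId: Map<string, string>;
}

class Replica {
  // Per-process cache, analogous to the in-memory operationStates singleton.
  private operationStates = new Map<string, ReplicaState>();
  private listPluginsByTopic: (topicId: string) => ToolPluginRow[];
  dbReads = 0;

  constructor(listPluginsByTopic: (topicId: string) => ToolPluginRow[]) {
    this.listPluginsByTopic = listPluginsByTopic;
  }

  loadOrCreateState(operationId: string, topicId: string): ReplicaState {
    const cached = this.operationStates.get(operationId);
    if (cached) return cached; // hot path: no DB access

    // Cold path: rebuild the tool call -> tool message mapping from the DB.
    this.dbReads += 1;
    const toolMsgIdByCallId = new Map<string, string>();
    for (const row of this.listPluginsByTopic(topicId)) {
      toolMsgIdByCallId.set(row.toolCallId, row.messageId);
    }

    const created: ReplicaState = { toolMsgIdByCallId };
    this.operationStates.set(operationId, created);
    return created;
  }
}

// The tool_use row was persisted by another replica; the tool_result lands here.
const persistedRows: ToolPluginRow[] = [
  { toolCallId: 'call_1', messageId: 'tool_msg_1' },
];
const replicaB = new Replica(() => persistedRows);

const first = replicaB.loadOrCreateState('op_1', 'topic_1');
const second = replicaB.loadOrCreateState('op_1', 'topic_1');
```

The second call returns the cached state without touching the fake DB, which is the "one query per replica cold start" bound in miniature.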

🤖 Generated with Claude Code

Three-part fix for multi-replica split-brain on Vercel serverless:

1. Flush accumulated content to DB after every ingest batch so a replica switch mid-accumulation doesn't lose text chunks.
2. Persist `heteroCurrentMsgId` to topic.metadata on every step boundary so new replicas restore the correct currentAssistantMessageId.
3. Restore toolMsgIdByCallId from DB on state creation so tool_results landing on a different replica than their tool_use are still matched.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 023750e072

codecov · 2026-05-08T17:26:53Z

Codecov Report

❌ Patch coverage is 96.77419% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 68.64%. Comparing base (f934e2f) to head (e66c871).
⚠️ Report is 9 commits behind head on canary.

Additional details and impacted files

@@           Coverage Diff            @@
##           canary   #14539    +/-   ##
========================================
  Coverage   68.64%   68.64%            
========================================
  Files        2623     2623            
  Lines      230163   230190    +27     
  Branches    29087    28252   -835     
========================================
+ Hits       157996   158022    +26     
- Misses      72018    72019     +1     
  Partials      149      149

Flag	Coverage Δ
app	`63.20% <96.77%> (+<0.01%)`	⬆️
database	`92.40% <ø> (ø)`
packages/agent-runtime	`80.50% <ø> (ø)`
packages/builtin-tool-lobe-agent	`83.41% <ø> (ø)`
packages/context-engine	`84.06% <ø> (ø)`
packages/conversation-flow	`92.43% <ø> (ø)`
packages/file-loaders	`87.60% <ø> (ø)`
packages/memory-user-memory	`74.74% <ø> (ø)`
packages/model-bank	`99.94% <ø> (ø)`
packages/model-runtime	`83.65% <ø> (ø)`
packages/prompts	`70.09% <ø> (ø)`
packages/python-interpreter	`92.90% <ø> (ø)`
packages/ssrf-safe-fetch	`0.00% <ø> (ø)`
packages/types	`5.02% <ø> (ø)`
packages/utils	`88.02% <ø> (ø)`
packages/web-crawler	`88.29% <ø> (ø)`

Flags with carried forward coverage won't be shown.

Components	Coverage Δ
Store	`66.92% <ø> (ø)`
Services	`54.07% <ø> (ø)`
Server	`70.45% <96.77%> (+<0.01%)`	⬆️
Libs	`54.03% <ø> (ø)`
Utils	`79.95% <ø> (ø)`


ONLY-yours · 2026-05-09T04:56:34Z

+
+    // Restore toolMsgIdByCallId from the DB so tool_results that arrive on a
+    // different replica than their tool_use can still be matched and persisted.
+    const toolPlugins = await this.deps.messageModel.listMessagePluginsByTopic(topicId);


@claude Can you evaluate which approach has more overhead: using a DB query like this.deps.messageModel.listMessagePluginsByTopic to ensure tool ordering, versus using an external Redis to guarantee write ordering? Which solution would you prefer?

This comment was translated by Claude.

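Claude finished @ONLY-yours's task in 1m 29s

DB query vs Redis: overhead comparison

Read HeterogeneousPersistenceHandler.ts to understand the current implementation

Evaluate the overhead and suitability of both approaches

Give a recommendation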

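Overhead analysis of the two approaches

Option A: current implementation, the listMessagePluginsByTopic DB query

The key code path is in loadOrCreateState (lines 308–369):

ingest() call → loadOrCreateState()
├─ operationStates.get(operationId)  ← hot path: returns immediately, no overhead
└─ cold path (first time this replica handles the operation):
   ├─ topicModel.findById()  ← 1 DB read
   └─ messageModel.listMessagePluginsByTopic()  ← 1 DB read (restores toolMap)

Characteristics:

Cold-start cost: each replica pays only 1 extra DB query per operation; everything after that runs in memory

Hot-path cost: operationStates is a module-level singleton Map, so the hot path is an O(1) in-memory lookup with no network overhead

Data consistency: reads go straight to the DB (the actual source of truth), so there is no cache invalidation problem

Concurrency assumption this fits: BatchIngester delivers sequentially by design (batch N+1 is sent only after batch N is ACKed), so concurrent writes across replicas essentially do not happen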

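Option B: external Redis

If Redis were used to maintain toolMsgIdByCallId:

On every tool_use persistence:
  → messageModel.create()  ← 1 DB write
  → redis.hset(key, callId, msgId)  ← 1 Redis write (for every tool_use)
On cold start:
  → redis.hgetall(key)  ← 1 Redis read (replaces the DB query)

Characteristics:

Extra hot-path cost: every tool_use event needs a Redis write (at lines 715–716, where the in-memory map is written)

New failure points: Redis connection failures, TTL expiry, and memory eviction all need fallback handling

Operational complexity: requires TTL management (so keys do not accumulate forever), connection pool configuration, and failure degradation

Consistency risk: Redis and the DB can be briefly out of sync (a Redis write that succeeds while the DB write fails needs rollback or compensation)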

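Key conclusion: which has more overhead?

The Redis approach has more overhead overall, for these reasons:

| Dimension | DB query approach | Redis approach |
| --- | --- | --- |
| Cold-start cost | 1 DB query (SELECT by topic) | 1 Redis read |
| Hot-path cost | 0 (pure in-memory) | 1 Redis write per tool_use |
| Infrastructure dependency | Existing DB | New Redis |
| Consistency guarantee | DB is the source of truth, no sync needed | Requires dual writes plus failure handling |
| Implementation complexity | Low | High (TTL / fallback / atomic operations) |

The DB query's cold-start cost is bounded: listMessagePluginsByTopic is a single indexed query against one topic, and the result set is typically a few dozen to a few hundred rows. The Redis approach turns "one bulk read at cold start" into "one write per tool_use", which is actually more frequent.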
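
The comparison above can be restated as simple round-trip arithmetic. The function below is only an illustrative cost model under the comment's own assumptions (one read per cold start, one Redis write per tool_use); `Workload` and `extraRoundTrips` are hypothetical names, and the model counts round-trips, not latency.

```typescript
interface Workload {
  coldStarts: number; // replicas that see the operation for the first time
  toolUseEvents: number; // tool_use events emitted during the operation
}

// Extra round-trips needed only to keep toolMsgIdByCallId recoverable.
function extraRoundTrips(w: Workload): { dbQuery: number; redis: number } {
  return {
    // One listMessagePluginsByTopic read per cold start.
    dbQuery: w.coldStarts,
    // One hgetall per cold start plus one hset per tool_use.
    redis: w.coldStarts + w.toolUseEvents,
  };
}

// Example workload (made-up numbers): 3 replicas touched, 40 tool calls.
const cost = extraRoundTrips({ coldStarts: 3, toolUseEvents: 40 });
```

Under this model the Redis variant never does fewer round-trips than the DB-query variant; whether that matters in practice depends on the relative latency of the two stores, which the model does not capture.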

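My recommendation: keep the DB query approach

The current PR implementation is the right trade-off:

BatchIngester's sequential delivery makes concurrent cross-replica writes very unlikely, so Redis's distributed-lock advantage is basically unused here

The bounded cold start is already noted in the PR description, and the real-world impact is small

No new dependency means less operational burden and a smaller failure surface

There is, however, one issue worth attention (Codex raised it too) that is unrelated to Redis:

heteroCurrentMsgId (lines 335–336) is not scoped to operationId. If a step boundary in the previous operation wrote this field, a new operation cold-starting on a replica will wrongly use the old operation's assistant message id until the first stream_start { newStep } corrects it. Consider changing it to heteroCurrentMsgId_<operationId>, or moving it inside the runningOperation object, so that the scope is correct.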

@claude Can you check whether the two Codex issues have been fixed with the latest changes?

This comment was translated by Claude.

Claude finished @ONLY-yours's task in 1m 48s

Analysis of the two Codex issues

Read the latest HeterogeneousPersistenceHandler.ts

Read packages/types/src/topic/topic.ts to confirm the type change

Check P1: whether heteroCurrentMsgId is now scoped to operationId

Check P2: whether the metadata write at step boundaries is now idempotent

P1: fully fixed ✅

Original issue: heteroCurrentMsgId stored only a bare message ID string with no operationId binding. On cold start, a new operation would wrongly use the previous operation's step msgId until the next stream_start { newStep } corrected it, and in the meantime content was written to the wrong assistant row.

Fix:

Type level (packages/types/src/topic/topic.ts, line 82):

// before: heteroCurrentMsgId?: string
// after:
heteroCurrentMsgId?: { msgId: string; operationId: string };

Write path (handleStepStart, lines 475–477):

await this.deps.topicModel.updateMetadata(state.topicId, {
  heteroCurrentMsgId: { msgId: newMsg.id, operationId: state.operationId },
});

Read path (loadOrCreateState, lines 338–342):

const stored = topic.metadata?.heteroCurrentMsgId;
const currentAssistantMessageId =
  stored?.operationId === running.operationId
    ? (stored.msgId ?? running.assistantMessageId)
    : running.assistantMessageId; // value left by a previous operation: ignore it

The operationId check ensures a stale pointer is never reused across operations, so P1 is fully resolved.
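
The read-side check can be isolated into a small pure function for illustration. The helper name `resolveCurrentAssistantMessageId` and the two interfaces are hypothetical; only the comparison itself mirrors the snippet quoted above.

```typescript
// Hypothetical, trimmed-down shapes for illustration.
interface HeteroCurrentMsg {
  msgId: string;
  operationId: string;
}

interface RunningOperation {
  assistantMessageId: string;
  operationId: string;
}

// Use the stored pointer only when it belongs to the operation that is
// currently running; otherwise fall back to the initial placeholder id.
function resolveCurrentAssistantMessageId(
  stored: HeteroCurrentMsg | undefined,
  running: RunningOperation,
): string {
  return stored?.operationId === running.operationId
    ? stored.msgId
    : running.assistantMessageId;
}

const running: RunningOperation = {
  assistantMessageId: 'msg_initial',
  operationId: 'op_new',
};

// Pointer left behind by an earlier operation: must be ignored.
const fromStale = resolveCurrentAssistantMessageId(
  { msgId: 'msg_old_step', operationId: 'op_old' },
  running,
);

// Pointer written by the current operation on another replica: must be used.
const fromCurrent = resolveCurrentAssistantMessageId(
  { msgId: 'msg_step_2', operationId: 'op_new' },
  running,
);

// No pointer stored yet: fall back to the placeholder.
const fromMissing = resolveCurrentAssistantMessageId(undefined, running);
```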

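P2: substantially fixed ✅ (a deliberate trade-off)

Original issue: handleStepStart first set state.currentAssistantMessageId = newMsg.id (advancing memory) and then called updateMetadata (the DB write). If updateMetadata failed transiently and was retried, the in-memory state already pointed at the new msg on re-entry, so the next assistant created would hang off the wrong parent and break the chain.

Fix (handleStepStart, lines 444–483; the comments explain the intent):

// 1. Create the new assistant message (may leave an orphan, but does not break the chain)
const newMsg = await this.deps.messageModel.create({...});
// 2. DB write before advancing memory (P2 fix)
await this.deps.topicModel.updateMetadata(state.topicId, {
  heteroCurrentMsgId: { msgId: newMsg.id, operationId: state.operationId },
});
// 3. Advance in-memory state only after the DB write succeeds
state.currentAssistantMessageId = newMsg.id;
state.accumulatedContent = '';
...

Effect: if updateMetadata fails, the in-memory state stays on the previous step, and on retry handleStepStart re-creates the assistant from the correct parent. The empty message from the first attempt becomes an orphan row, but the turn chain is not broken. The code comments (lines 472–474) explicitly acknowledge this trade-off.

Remaining edge case: every retry creates another empty orphan row. These rows are empty, so they affect only storage, not display or logic. This is a reasonable residual cost once Codex's original P2 concern ("duplicate turns / shift parent chain") is resolved.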
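
The effect of the reordering can be reproduced with a small simulation. This is a sketch under stated assumptions: `StepState` and `StepDeps` are hypothetical, the dependencies are synchronous fakes (the real calls are awaited), and the metadata write is made to fail exactly once.

```typescript
interface StepState {
  currentAssistantMessageId: string;
  accumulatedContent: string;
}

// Hypothetical dependency surface; the real model methods are async.
interface StepDeps {
  createMessage(parentId: string): string; // returns the new message id
  updateMetadata(pointer: { msgId: string }): void; // may throw
}

function handleStepStart(state: StepState, deps: StepDeps): void {
  // 1. Create the new assistant message under the current parent.
  const newId = deps.createMessage(state.currentAssistantMessageId);
  // 2. Persist the pointer first; if this throws, state is left untouched.
  deps.updateMetadata({ msgId: newId });
  // 3. Advance in-memory state only after the write succeeded.
  state.currentAssistantMessageId = newId;
  state.accumulatedContent = '';
}

const parents: string[] = []; // parent id recorded for every created message
let failNext = true;

const deps: StepDeps = {
  createMessage(parentId) {
    parents.push(parentId);
    return `msg_${parents.length}`;
  },
  updateMetadata() {
    if (failNext) {
      failNext = false;
      throw new Error('transient metadata write failure');
    }
  },
};

const stepState: StepState = {
  currentAssistantMessageId: 'msg_0',
  accumulatedContent: 'partial',
};

let firstAttemptFailed = false;
try {
  handleStepStart(stepState, deps);
} catch {
  firstAttemptFailed = true;
}
const idAfterFailure = stepState.currentAssistantMessageId;

handleStepStart(stepState, deps); // retry
```

Both created messages record `msg_0` as their parent: the first (`msg_1`) is the empty orphan, and the retry (`msg_2`) becomes the current step, so the parent chain is not shifted.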

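Summary

| Issue | Status | Notes |
| --- | --- | --- |
| P1: heteroCurrentMsgId has no operationId scope | ✅ Fully fixed | Changed to a { msgId, operationId } structure; operationId is checked on read |
| P2: step-boundary write is not idempotent | ✅ Substantially fixed | In-memory advance moved after the DB write, so the chain is no longer broken; empty orphan rows are a known trade-off |

Both Codex issues are effectively addressed by the changes on the current branch.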

@hezhijie0327

# 🚀 LobeHub Release (20260509)

**Release Date:** May 9, 2026
**Since v2.1.56:** 236 merged PRs · 19 contributors

> Agent Task System reaches general availability, the Agent Signal pipeline runs nightly self-review with skill-aware policies, the heterogeneous-agent runtime crosses replica boundaries, inline documents become a first-class context source, and bot platforms expand across Messager, Line, and Telegram.

---

## ✨ Highlights

- **Agent Task System (GA)**: End-to-end task execution platform: templates, tracking, comment tools, parent reassignment, scheduled cron, and dependency-ordered batch runs. (#14540, #14515, #14517, #14272, #14246, #14418, #14403, #14488)
- **Agent Signal nightly self-review**: Wired self-review loop with prompt + DB support, exponential-backoff retry on receipt listing, skill-aware policy, and improved skill-intent detection. (#14543, #14542, #14281, #14409, #14526, #14437)
- **Inline documents in KB tool**: BM25 search and `docs_*` read for inline document grounding; agent documents usable as VFS. (#14494, #14222)
- **Inline agent cards in chat**: `lobeAgents` markdown tag renders agent profile cards inline; clickable card after `createAgent`. (#14495, #14493)
- **Heterogeneous agent runtime**: Cloud hetero exec pipeline steps 3+4 land, persistence recovers across Vercel replicas, server-side ingest/finish handler, and `lh hetero exec` CLI. (#14486, #14539, #14444, #14431)
- **Bot platforms expand**: Messager, Line, DM pair policy, and messenger DB tables; Telegram API path restored. (#14442, #14207, #14211, #14496, #14519)
- **Visual analysis tool**: New visual understanding tool, with trigger tracking and flattened schema. (#14378, #14399, #14550)
- **DeepSeek V4 Pro as OSS default**: OSS deployments ship with DeepSeek V4 Pro by default; DeepSeek Anthropic runtime supported. (#14555, #14312)

---

## 🏗️ Core Agent & Architecture

### Agent Task System

- **Task System GA**: End-to-end execution platform now available. (#14540)
- **Templates, comments, reparenting**: Template tracking, comment tools, and parent reassignment. (#14515, #14517, #14488)
- **Cron + dependency-ordered runs**: Scheduled status with cron editor and dependency-ordered subtask batches. (#14246, #14418, #14272)
- **Inspector + chip UI + batch tasks**: Task Inspector/Render registry, batch `createTasks`/`runTasks`, and chip-based agent-documents inspector. (#14403, #14404)
- **Recommend templates regardless of brief count**: Recommendations no longer suppressed when briefs are sparse. (#14508)
- **Scheduling resilience**: Manual run no longer eats next scheduled tick; recurring tasks survive brief resolution. (#14304, #14348)
- **Brief synthesis**: Auto-synthesize topic briefs; brief actions revamp; mute resolved-brief icon on home. (#14324, #14228, #14452)
- **Task list & detail polish**: Topic operation ID exposed; task drawer Gateway reconnect. (#14282)

### Agent Signal pipeline

- **Nightly self-review wired**: Prompt + DB support for the self-review loop. (#14543)
- **Self-review activities push to briefs**: Activities during nightly self-reflection now create briefs. (#14437)
- **Skill management policy**: New policy for Skill management running inside Agent Signal. (#14281)
- **Skill intent detection & routing**: Improved detection plus direct intent handling when `hintIsSkill`. (#14409, #14526)
- **Document tool outcome rendering**: Decision view restores missing document tool outcomes. (#14534)
- **Exponential backoff retry**: Listing signal receipts retries with jittered backoff. (#14542)
- **Easier-to-use signals**: Structural simplification + recent-activities surface for receipts. (#14290, #14326, #14407)

### Heterogeneous agent runtime

- **Cloud hetero exec pipeline (steps 3 + 4)**: Refactor lands the next two stages of the cloud hetero agent execution pipeline. (#14486)
- **Persistence recovery on Vercel**: Hetero state recovered across replica boundaries. (#14539)
- **Server-side ingest/finish + persistence**: `aiAgent.heteroIngest` / `heteroFinish` handlers. (#14444)
- **`lh hetero exec` CLI**: Standalone heterogeneous agent runs from CLI. (#14431)
- **Gateway round-trip loading**: `execAgentTask` keeps the input box in loading state through the full round-trip. (#14503)
- **Provider SDK type routing**: Provider routing now respects SDK type. (#14520)
- **DeepSeek reasoning preserved**: `reasoning_content` preserved in OpenAI-compatible runtime for DeepSeek models. (#14546)

### Knowledge & inline docs

- **KB tool BM25 + docs read**: BM25 search and `docs_*` read integrated for inline documents. (#14494)
- **Agent documents as VFS**: FS-compatible output for agent documents. (#14222)
- **`lobeAgents` markdown tag**: Inline agent cards rendered from a markdown tag. (#14495)
- **Clickable agent card after `createAgent`**: Mentions and recommendations become clickable. (#14493)
- **ExplorerTree**: Generic tree component built on `@pierre/trees` for reusable explorer surfaces. (#14094)
- **Local file mention snapshots**: Mentions can now snapshot local files. (#14278)

### Architecture

- **Agent Hono routes**: New agent routes added on Hono. (#14535)
- **`/api/agent` migrated to Hono**: Remaining `/api/agent` routes finish their migration. (#14478)
- **Agent marketplace merged into web-onboarding**: Reduces package fragmentation. (#14514)
- **Producer pipeline extracted**: Shared package for the producer pipeline. (#14425)
- **`agentDispatcher.selectRuntimeType`**: New runtime selection abstraction. (#14428)
- **pnpm v11 migration**: Workspace consolidated. (#14316)
- **Browser-compatible frontmatter parser**: Replaces `gray-matter`. (#14435)

---

## 📱 Platforms & Integrations

- **Messager support**: New messager package wired into the chat surface. (#14442)
- **Messenger DB tables**: IM bot integration gains its persistence layer. (#14496)
- **Line bot**: Initial Line support and downstream optimization. (#14207, #14448)
- **DM pair policy**: Group/DM pair-based delivery. (#14211)
- **Telegram API restored**: Missing Telegram API path reconnected. (#14519)
- **xAI Responses tools stabilized**: Plus unsupported parameter handling. (#14462, #14445)
- **Volcengine websearch via ResponseAPI**: Built-in websearch for Volcengine. (#14216)

---

## 🤖 Models & Providers

- **DeepSeek V4 Pro default for OSS**: OSS distribution defaults to DeepSeek V4 Pro. (#14555)
- **DeepSeek Anthropic runtime**: Anthropic-shape runtime support for DeepSeek. (#14312)
- **GPT-5.5 / GPT-5.5 Pro**: New OpenAI tier. (#14142)
- **Grok 4.20 / Grok 4.3 / LobeHub-hosted Grok 4.3** (#14253, #14382, #14446)
- **Gemma 4 + provider settings normalization** (#13313)
- **gpt-image-2 + step-image-edit-2** (#14253, #14329)
- **Model bank refresh + original-pricing display**: Batch model updates and pricing surfaces. (#14070, #14391)
- **Hunyuan migrated to TokenHub for Hy3 Preview** (#14108)
- **Reject lobehub model ids no longer in the bank** (#14261)
- **Hide runtime-only aliases**: Runtime-only model aliases no longer leak into the model picker. (#14552)

---

## 🖥️ User Experience

### Onboarding

- **Shared prefix steps**: Language and privacy extracted as shared prefix steps. (#14538)
- **Identity intervention card simplified**: Plus tool result renders cleanup. (#14505, #14506)
- **Welcome polish + web-onboarding tool UI** (#14475)
- **Templates fetched from market API** (#14286)
- **Virtual model id for default onboarding model** (#14311)
- **Skip / mode-switch footer behind feature flag**: Footer guarded for desktop and web initialization. (#14560)

### Home & navigation

- **Home recents performance**: Recents refresh periodically and inline task status; brief and task-template fetch overhead trimmed. (#14518, #14516)
- **Home refactor + skill-connect recommendations**: Restructured home with skill-connect recommendation system. (#14266, #14214)
- **Tasks in agent sidebar**: Tasks moved from welcome card into the sidebar list. (#14500)
- **Sidebar collapse persists**: Home sidebar collapse state stored. (#14473)
- **Agent-specific topic grouping**: Plus improved empty state and agent identity in topic search. (#14225)
- **MentionMenu scroll fix**: Mention menu no longer clips inside chat input. (#14533)

### Conversation & chat

- **Follow-up chips fill input**: Clicking a follow-up chip now fills the input instead of sending immediately. (#14536)
- **Quick-reply chips below assistant messages** (#14350)
- **Inline single-tool assistant group + leading sentence promotion** (#14244)
- **Assistant-group rendering**: Per-segment content overrides flow into MessageContent. (#14504)
- **Tool call timer fix**: Timer no longer resets when tool calls collapse or expand. (#14513)
- **Streaming re-render reduction**: Reference stabilization and self-subscribing components. (#14470)
- **Topic chat drawer feedback input** (#14392)

### Skills, agents, devtools

- **Managed skill folders**: Agent view displays managed skill folders and aligns delete confirmations. (#14553)
- **Review tab + bulk git diffs**: New Review tab with bulk diffs; gating uses effective working directory. (#14334, #14512)
- **Devtools gallery rebuild**: Plus Review polish, queue-tray images. (#14423)
- **Agent mock devtools**: Playback & fixture viewer. (#14436)

### Desktop & CLI

- **App tray visibility setting** (#14463)
- **Notification settings in desktop** (#14491)
- **Multimodal input across CLI / shared spawn / desktop** (#14433)
- **CLI bot + userId guide** (#14258)

---

## 🔧 Tooling

- **Visual analysis tool**: New visual understanding tool with flattened schema. (#14378, #14550)
- **GitHub marketplace tool UI** (#14420)
- **Drop "Local" prefix and `____builtin` suffix from tool names** (#14364, #14289)
- **Sanitize provider tool names**: Avoids invalid characters from external providers. (#14510)
- **Generation moderation context**: Moderation context passed through the generation pipeline. (#14541)
- **Visual analysis trigger tracking** (#14399)
- **Claude thinking signature sanitization**: History signatures sanitized when replaying Claude conversations. (#14499)
- **Responses input media sanitization**: Assistant media sanitized in Responses input. (#14497)

---

## 🔒 Security & Reliability

- **Security:** Removed the `/webapi/proxy` route and dead URL-manifest plugin code to shrink the SSRF surface. (#14549)
- **Security:** Sessions revoked after password reset. (#14424)
- **Reliability:** Added `prompt_cache_key` to OpenAI chat requests for stable cache hits. (#14349)
- **Reliability:** `onFinish` now fires even when the browser tab is backgrounded mid-SSE stream. (#14461)
- **Reliability:** Better-auth session refetch preserves user fields rather than overwriting them. (#14531)
- **Reliability:** User-memory queries sanitize backticks; user-memory errors now explicitly injected so failures stay visible. (#14524, #14525)
- **Reliability:** Auth captcha retries handled; input loading unsticks on `auth_failed` and recoverable `auth_expired`. (#14346, #14419)
- **Reliability:** Trace snapshot finalized on error path. (#14440)
- **Reliability:** Drop `switchTopic` race under rapid sidebar clicks. (#14115)
- **Reliability:** PDF chunking logic fixed to prevent vectorization failure. (#14327)
- **Performance:** Marketplace fork uses a batched API for parallel installs. (#14537)
- **Performance:** Review tab open latency cut ~9× on large dirty trees. (#14338)

---

## 👥 Contributors

Huge thanks to **18 contributors** who shipped **236 merged PRs** this cycle.

@hezhijie0327 · @sxjeru · @yueyinqiu · @octo-patch · @hardy-one · @Coooolfan · @CanYuanA · @BillionClaw · @arvinxx · @tjx666 · @Innei · @neko · @AmAzing129 · @rdmclin2 · @lijian · @sudongyuer · @rivertwilight · @cy948

Plus @lobehubbot for i18n and translation maintenance.

---

**Full Changelog**: v2.1.56...release/weekly-20260509

Vercel serverless functions are stateless per-request, so `operationStates` is empty on every `heteroIngest` call. loadOrCreateState always cold-creates. #14539 fixed `toolMsgIdByCallId` restoration but left `accumulatedContent`, `toolState.payloads`, and `toolState.persistedIds` empty on cold load, causing two bugs:

- Content truncation: cold instance starts with `accumulatedContent=''`, accumulates only the current batch's text, then writes that shorter string on the next step boundary or terminal, overwriting the longer content the previous write had already stored in DB.
- Tool duplication / tools[] overwrite: `persistedIds={}` on cold load means every `tools_calling` event re-creates already-persisted tool messages, and `payloads=[]` means phase 1/3 writes only the current batch's tools, wiping previous tools from `assistant.tools[]`.

Fix: in `loadOrCreateState`, fetch the current assistant message and restore `accumulatedContent`, `accumulatedReasoning`, `toolState.payloads`, and `toolState.persistedIds` from it. Cold load is now equivalent to warm load.

Also adds two regression tests covering the cold-replica scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
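
A minimal sketch of the cold-load restore described in this commit message, assuming a hypothetical `AssistantRow` shape and a hypothetical `restoreFromRow` helper. How the real handler derives `persistedIds` from the assistant message is not shown in this thread, so deriving it from `tools[]` ids here is an assumption.

```typescript
// Hypothetical shapes; the real message row and toolState differ.
interface ToolPayload {
  id: string;
}

interface AssistantRow {
  content: string;
  reasoning: string;
  tools: ToolPayload[];
}

interface ColdState {
  accumulatedContent: string;
  accumulatedReasoning: string;
  payloads: ToolPayload[];
  persistedIds: Set<string>;
}

// Seed a cold state from what is already persisted instead of starting empty.
function restoreFromRow(row: AssistantRow | undefined): ColdState {
  const tools = row?.tools ?? [];
  return {
    accumulatedContent: row?.content ?? '',
    accumulatedReasoning: row?.reasoning ?? '',
    payloads: [...tools],
    // Assumption: a tool listed on the assistant row has already been persisted.
    persistedIds: new Set(tools.map((tool) => tool.id)),
  };
}

// Row as written by a previous replica after its last batch.
const persistedRow: AssistantRow = {
  content: 'Hello',
  reasoning: '',
  tools: [{ id: 'call_1' }],
};

const cold = restoreFromRow(persistedRow);
cold.accumulatedContent += ', world'; // current batch's text delta

// What an unrestored cold state would have written back instead.
const truncated = '' + ', world';
const wouldRecreateTool = !cold.persistedIds.has('call_1');
```

With the restore in place, the next write carries the full string rather than only the latest delta, and an already-persisted tool call is recognised instead of being created a second time.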

…ingStatus (#14592)

* 🐛 fix: skip reconnect when gateway action already established a connection

Race condition on new-topic first message:

1. switchTopic loads runningOperation → useGatewayReconnect fires
2. executeGatewayAgent calls connectToGateway (status: connecting)
3. reconnectToGatewayOperation overwrites with resumeOnConnect:true
4. Gateway sees resume on a brand-new session → no events → stuck

Second message works because the client store's runningOperation is stale (from the first op), so SWR deduplicates and no reconnect fires.

Fix: bail out of reconnectToGatewayOperation if gatewayConnections already shows connecting/connected for that operationId.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: always pass --cwd /workspace for cloud CC to ensure session resume

CC stores session files at ~/.claude/projects/<encoded-cwd>/. Without an explicit --cwd the actual working directory can differ between sandbox invocations, so --resume <heteroSessionId> fails to locate the previous session files even though the container is persistent and the ID is correctly stored in topic.metadata.

Default cwd to /workspace for cloud runs (desktop keeps its own explicit path), guaranteeing a stable session-file location across page reloads within the same sandbox lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: extend reconnect guard to cover all in-flight connection statuses

The previous guard only skipped reconnect for 'connecting'/'connected' but the connection can already be in 'authenticating' or 'reconnecting' by the time useGatewayReconnect fires, leaving the race window open. Flip the condition: skip for any status that is not 'disconnected'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* 🐛 fix: restore cold replica state in HeterogeneousPersistenceHandler

Vercel serverless functions are stateless per-request, so `operationStates` is empty on every `heteroIngest` call. loadOrCreateState always cold-creates. #14539 fixed `toolMsgIdByCallId` restoration but left `accumulatedContent`, `toolState.payloads`, and `toolState.persistedIds` empty on cold load, causing two bugs:

- Content truncation: cold instance starts with `accumulatedContent=''`, accumulates only the current batch's text, then writes that shorter string on the next step boundary or terminal, overwriting the longer content the previous write had already stored in DB.
- Tool duplication / tools[] overwrite: `persistedIds={}` on cold load means every `tools_calling` event re-creates already-persisted tool messages, and `payloads=[]` means phase 1/3 writes only the current batch's tools, wiping previous tools from `assistant.tools[]`.

Fix: in `loadOrCreateState`, fetch the current assistant message and restore `accumulatedContent`, `accumulatedReasoning`, `toolState.payloads`, and `toolState.persistedIds` from it. Cold load is now equivalent to warm load.

Also adds two regression tests covering the cold-replica scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
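
The flipped reconnect guard from the third commit can be sketched as a small predicate. The status names come from the commit message; the `shouldReconnect` helper, the `Map`-based connection registry, and treating a missing entry as disconnected are assumptions made for illustration.

```typescript
type ConnectionStatus =
  | 'disconnected'
  | 'connecting'
  | 'authenticating'
  | 'connected'
  | 'reconnecting';

// Reconnect only when nothing is in flight for this operation. Any status
// other than 'disconnected' means another code path already owns the socket.
function shouldReconnect(
  connections: Map<string, ConnectionStatus>,
  operationId: string,
): boolean {
  // Assumption: no entry at all is equivalent to 'disconnected'.
  const status = connections.get(operationId) ?? 'disconnected';
  return status === 'disconnected';
}

const gatewayConnections = new Map<string, ConnectionStatus>([
  ['op_connecting', 'connecting'],
  ['op_authenticating', 'authenticating'],
  ['op_reconnecting', 'reconnecting'],
  ['op_connected', 'connected'],
  ['op_dropped', 'disconnected'],
]);

// Statuses the earlier guard ('connecting' / 'connected' only) let through.
const skipsAuthenticating = !shouldReconnect(gatewayConnections, 'op_authenticating');
const skipsReconnecting = !shouldReconnect(gatewayConnections, 'op_reconnecting');
```

Expressing the guard as "not disconnected" rather than as a list of in-flight statuses means a newly added intermediate status is skipped by default.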

…ingStatus (lobehub#14592) * 🐛 fix: skip reconnect when gateway action already established a connection Race condition on new-topic first message: 1. switchTopic loads runningOperation → useGatewayReconnect fires 2. executeGatewayAgent calls connectToGateway (status: connecting) 3. reconnectToGatewayOperation overwrites with resumeOnConnect:true 4. Gateway sees resume on a brand-new session → no events → stuck Second message works because the client store's runningOperation is stale (from the first op), so SWR deduplications and no reconnect fires. Fix: bail out of reconnectToGatewayOperation if gatewayConnections already shows connecting/connected for that operationId. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * 🐛 fix: always pass --cwd /workspace for cloud CC to ensure session resume CC stores session files at ~/.claude/projects/<encoded-cwd>/. Without an explicit --cwd the actual working directory can differ between sandbox invocations, so --resume <heteroSessionId> fails to locate the previous session files even though the container is persistent and the ID is correctly stored in topic.metadata. Default cwd to /workspace for cloud runs (desktop keeps its own explicit path), guaranteeing a stable session-file location across page reloads within the same sandbox lifecycle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * 🐛 fix: extend reconnect guard to cover all in-flight connection statuses The previous guard only skipped reconnect for 'connecting'/'connected' but the connection can already be in 'authenticating' or 'reconnecting' by the time useGatewayReconnect fires, leaving the race window open. Flip the condition: skip for any status that is not 'disconnected'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * 🐛 fix: restore cold replica state in HeterogeneousPersistenceHandler Vercel serverless functions are stateless per-request, so `operationStates` is empty on every `heteroIngest` call. 
loadOrCreateState always cold-creates. lobehub#14539 fixed `toolMsgIdByCallId` restoration but left `accumulatedContent`, `toolState.payloads`, and `toolState.persistedIds` empty on cold load, causing two bugs: - Content truncation: cold instance starts with `accumulatedContent=''`, accumulates only the current batch's text, then writes that shorter string on the next step boundary or terminal — overwriting the longer content the previous write had already stored in DB. - Tool duplication / tools[] overwrite: `persistedIds={}` on cold load means every `tools_calling` event re-creates already-persisted tool messages, and `payloads=[]` means phase 1/3 writes only the current batch's tools, wiping previous tools from `assistant.tools[]`. Fix: in `loadOrCreateState`, fetch the current assistant message and restore `accumulatedContent`, `accumulatedReasoning`, `toolState.payloads`, and `toolState.persistedIds` from it. Cold load is now equivalent to warm load. Also adds two regression tests covering the cold-replica scenarios. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ingStatus (#14592) * 🐛 fix: skip reconnect when gateway action already established a connection Race condition on new-topic first message: 1. switchTopic loads runningOperation → useGatewayReconnect fires 2. executeGatewayAgent calls connectToGateway (status: connecting) 3. reconnectToGatewayOperation overwrites with resumeOnConnect:true 4. Gateway sees resume on a brand-new session → no events → stuck Second message works because the client store's runningOperation is stale (from the first op), so SWR deduplications and no reconnect fires. Fix: bail out of reconnectToGatewayOperation if gatewayConnections already shows connecting/connected for that operationId. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * 🐛 fix: always pass --cwd /workspace for cloud CC to ensure session resume CC stores session files at ~/.claude/projects/<encoded-cwd>/. Without an explicit --cwd the actual working directory can differ between sandbox invocations, so --resume <heteroSessionId> fails to locate the previous session files even though the container is persistent and the ID is correctly stored in topic.metadata. Default cwd to /workspace for cloud runs (desktop keeps its own explicit path), guaranteeing a stable session-file location across page reloads within the same sandbox lifecycle. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * 🐛 fix: extend reconnect guard to cover all in-flight connection statuses The previous guard only skipped reconnect for 'connecting'/'connected' but the connection can already be in 'authenticating' or 'reconnecting' by the time useGatewayReconnect fires, leaving the race window open. Flip the condition: skip for any status that is not 'disconnected'. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * 🐛 fix: restore cold replica state in HeterogeneousPersistenceHandler Vercel serverless functions are stateless per-request, so `operationStates` is empty on every `heteroIngest` call. 
loadOrCreateState always cold-creates. #14539 fixed `toolMsgIdByCallId` restoration but left `accumulatedContent`, `toolState.payloads`, and `toolState.persistedIds` empty on cold load, causing two bugs: - Content truncation: cold instance starts with `accumulatedContent=''`, accumulates only the current batch's text, then writes that shorter string on the next step boundary or terminal — overwriting the longer content the previous write had already stored in DB. - Tool duplication / tools[] overwrite: `persistedIds={}` on cold load means every `tools_calling` event re-creates already-persisted tool messages, and `payloads=[]` means phase 1/3 writes only the current batch's tools, wiping previous tools from `assistant.tools[]`. Fix: in `loadOrCreateState`, fetch the current assistant message and restore `accumulatedContent`, `accumulatedReasoning`, `toolState.payloads`, and `toolState.persistedIds` from it. Cold load is now equivalent to warm load. Also adds two regression tests covering the cold-replica scenarios. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

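The reconnect guard from the commit message (skip for any status that is not 'disconnected') can be sketched as below. The status names come from the commit text; the `ConnectionMap` shape and the `shouldReconnect` helper are hypothetical and only stand in for whatever the store actually keeps in `gatewayConnections`.

```typescript
// Status names are taken from the commit message; the map shape is an assumption.
type GatewayStatus =
  | 'disconnected'
  | 'connecting'
  | 'authenticating'
  | 'connected'
  | 'reconnecting';

type ConnectionMap = Record<string, { status: GatewayStatus } | undefined>;

// Reconnect only when nothing is in flight for this operation. An operation
// with no entry at all is treated as disconnected.
function shouldReconnect(connections: ConnectionMap, operationId: string): boolean {
  const status = connections[operationId]?.status ?? 'disconnected';
  return status === 'disconnected';
}
```

Checking for "not disconnected" instead of listing the in-flight statuses means a status added later is guarded by default.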

dosubot Bot added the size:M label (this PR changes 30-99 lines, ignoring generated files) May 8, 2026

sourcery-ai Bot reviewed May 8, 2026

chatgpt-codex-connector Bot reviewed May 8, 2026

Comment thread src/server/services/heterogeneousAgent/HeterogeneousPersistenceHandler.ts Outdated

Comment thread src/server/services/heterogeneousAgent/HeterogeneousPersistenceHandler.ts Outdated

vercel Bot deployed to Preview May 8, 2026 17:29 View deployment

fix: add the test fixed

6d5a96a

vercel Bot deployed to Preview May 9, 2026 04:42 View deployment

ONLY-yours commented May 9, 2026

fix: slove the some topic problem

e66c871

vercel Bot deployed to Preview May 9, 2026 05:09 View deployment

ONLY-yours merged commit d0b938a into canary May 9, 2026
47 checks passed

ONLY-yours deleted the fix/hetero-persistence-multi-replica branch May 9, 2026 06:36

Innei mentioned this pull request May 9, 2026

🚀 release: 20260509 #14563

Merged

ONLY-yours merged 3 commits into canary from fix/hetero-persistence-multi-replica

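| Dimension | DB-query approach | Redis approach |
| --- | --- | --- |
| Cold-start cost | 1 DB query (`SELECT` by topic) | 1 Redis read |
| Hot-path cost | 0 (pure in-memory) | 1 Redis write per tool_use |
| Infrastructure dependency | Existing DB | Adds Redis |
| Consistency guarantee | DB is the source of truth; no synchronization needed | Requires dual writes plus failure handling |
| Implementation complexity | Low | High (TTL / fallback / atomic operations) |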
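The cost profile in the DB-query column (one query on cold start, none on the hot path) follows from keeping the in-memory map as a fast path and querying only on a miss. The sketch below shows that pattern under stated assumptions: `ToolMessageIndex`, `ToolMessageRow`, and the injected `listToolRows` function are hypothetical stand-ins, not the signature of the real `listMessagePluginsByTopic`.

```typescript
// Hypothetical row shape and query function, injected so the sketch runs
// without a database; the real listMessagePluginsByTopic may differ.
interface ToolMessageRow {
  toolCallId: string;
  messageId: string;
}

type ListToolRows = (topicId: string) => Promise<ToolMessageRow[]>;

class ToolMessageIndex {
  // Per-replica memory: operationId -> (toolCallId -> tool message id).
  private readonly states = new Map<string, Map<string, string>>();
  private readonly listToolRows: ListToolRows;

  constructor(listToolRows: ListToolRows) {
    this.listToolRows = listToolRows;
  }

  async load(operationId: string, topicId: string): Promise<Map<string, string>> {
    // Hot path: this replica has already seen the operation, so no I/O.
    const warm = this.states.get(operationId);
    if (warm) return warm;

    // Cold path: rebuild the mapping from the DB once, then cache it.
    const rows = await this.listToolRows(topicId);
    const cold = new Map<string, string>();
    for (const row of rows) cold.set(row.toolCallId, row.messageId);
    this.states.set(operationId, cold);
    return cold;
  }
}
```

Each new `ToolMessageIndex` instance plays the role of a fresh replica: its first `load` for an operation issues one query, and later calls on the same instance issue none. Two concurrent cold calls for the same operation could both query in this sketch; it does not try to deduplicate them.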

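| Issue | Status | Notes |
| --- | --- | --- |
| P1: heteroCurrentMsgId has no operationId scope | ✅ Fully fixed | Changed to a `{ msgId, operationId }` structure; the operationId is validated on read |
| P2: Step-boundary writes are not idempotent | ✅ Substantially fixed | The in-memory advance now happens after the DB write, so the message chain is no longer broken; orphaned empty rows are a known trade-off |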
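Both fixes in the table can be sketched in a few lines. This is an illustration under assumptions, not the merged code: `TopicMetadata`, `resolveCurrentMsgId`, `advanceStep`, and the injected `persistPointer` callback are hypothetical names, and only the `{ msgId, operationId }` structure is taken from the table.

```typescript
// The { msgId, operationId } structure is from the review table; all other
// names here are hypothetical.
interface HeteroCurrentMsg {
  msgId: string;
  operationId: string;
}

interface TopicMetadata {
  heteroCurrentMsgId?: HeteroCurrentMsg;
}

// P1: trust the persisted pointer only if it was written by the same operation;
// otherwise fall back to the placeholder message of the current operation.
function resolveCurrentMsgId(
  metadata: TopicMetadata,
  operationId: string,
  fallbackMsgId: string,
): string {
  const pointer = metadata.heteroCurrentMsgId;
  return pointer !== undefined && pointer.operationId === operationId
    ? pointer.msgId
    : fallbackMsgId;
}

// P2: write to the DB first and advance the in-memory pointer only after the
// write succeeds, so a failed write cannot leave memory ahead of the DB.
async function advanceStep(
  state: { currentAssistantMessageId: string },
  operationId: string,
  nextMsgId: string,
  persistPointer: (pointer: HeteroCurrentMsg) => Promise<void>,
): Promise<void> {
  await persistPointer({ msgId: nextMsgId, operationId });
  state.currentAssistantMessageId = nextMsgId;
}
```

If `persistPointer` rejects, `state.currentAssistantMessageId` keeps its previous value and the error propagates to the caller. The sketch does not model the orphaned empty rows that the table lists as a known trade-off.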

claude Bot commented May 9, 2026 (edited)

DB query vs. Redis: overhead comparison

Overhead analysis of the two approaches

Approach A: current implementation, `listMessagePluginsByTopic` DB query

Approach B: external Redis

Key conclusion: which has the higher overhead?

My recommendation: keep the DB-query approach

claude Bot commented May 9, 2026 (edited)

Analysis of the two Codex issues

P1: fully fixed ✅

P2: substantially fixed ✅ (a deliberate trade-off)

Summary
