⚡️ perf(agent-tracing): zstd-compress S3 snapshots #14807
Compress operation snapshots with zstd (level 3) before uploading to S3 and write them under a `.json.zst` key. Measured on 76839 production snapshots: 217 GB → 25.8 GB (8.4× average ratio, p99 47×). New uploads only; old `.json` objects are left as-is.

The `.zst` suffix is the format indicator; Content-Encoding is intentionally omitted so the object is served as opaque bytes and readers decompress explicitly (avoids surprise behavior from HTTP clients that negotiate zstd).

Uses Node's built-in zstd (node:zlib, available since Node 22.15) so no new runtime dependency is added.

Reader updates:
- RemoteSnapshotStore.fetch decompresses the downloaded payload; local cache stays as plain `.json` for easy inspection.
- buildRemoteUrl now points at `.json.zst`.
- S3SnapshotStore.loadPartial falls back to the legacy `.json` key so in-flight QStash operations spanning the deploy keep working; the fallback dies off naturally once partials finalize.
- removePartial deletes both keys for a clean transition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
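The write path described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: `encodeSnapshot`, `decodeSnapshot`, and the `SnapshotUpload` shape are hypothetical names, the synchronous `node:zlib` variants are used only for brevity, and the `application/zstd` content type is taken from the PR description. It assumes a runtime with zstd in `node:zlib` (Node 22.15 or later).

```typescript
import * as zlib from 'node:zlib';

// Hypothetical upload shape; the real store's types are not shown in this PR.
interface SnapshotUpload {
  body: Buffer;
  contentType: string;
  key: string;
}

// Compression level named in the commit message.
const ZSTD_LEVEL = 3;

// Serialize a snapshot, compress it with zstd, and derive the `.json.zst` key.
export function encodeSnapshot(baseKey: string, snapshot: unknown): SnapshotUpload {
  const json = Buffer.from(JSON.stringify(snapshot), 'utf8');
  const body = zlib.zstdCompressSync(json, {
    params: { [zlib.constants.ZSTD_c_compressionLevel]: ZSTD_LEVEL },
  });
  // No Content-Encoding is attached: the suffix alone signals the format.
  return { body, contentType: 'application/zstd', key: `${baseKey}.json.zst` };
}

// Reader side: decompress explicitly, then parse the JSON payload.
export function decodeSnapshot(body: Uint8Array): unknown {
  return JSON.parse(zlib.zstdDecompressSync(body).toString('utf8'));
}
```

A caller would pass `key`, `body`, and `contentType` to whatever S3 client the store wraps; that client is deliberately left out of the sketch.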
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c78fbcc9c1
import fs from 'node:fs/promises';
import path from 'node:path';
import { promisify } from 'node:util';
import { zstdDecompress } from 'node:zlib';
Keep the Bun CLI from importing Node-only zstd
When the package bin is run as intended through Bun (packages/agent-tracing/src/cli/index.ts has #!/usr/bin/env bun), this named node:zlib import is evaluated during CLI startup via the inspect command registration. Bun's node:zlib shim does not export zstdDecompress, so the CLI fails before any command can run, including local-only commands that never fetch a remote snapshot. Use a Bun-compatible decompression path or defer the Node-only import to a Node runtime/subprocess so existing agent-tracing commands still launch.
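One way to act on this suggestion, sketched under assumptions: a namespace import of `node:zlib` should not fail at module load when a member is missing, so the zstd check can move from import time to the call site. `MaybeZstd`, `hasZstd`, and `decompressZstd` are hypothetical names, and this thread does not say whether the fix eventually adopted looks like this.

```typescript
import * as zlib from 'node:zlib';

// Only the member this sketch needs; optional because some runtimes lack it.
interface MaybeZstd {
  zstdDecompressSync?: (input: Uint8Array) => Buffer;
}

const runtimeZlib = zlib as unknown as MaybeZstd;

// Feature-detect zstd without touching it at import time.
export function hasZstd(mod: MaybeZstd = runtimeZlib): boolean {
  return typeof mod.zstdDecompressSync === 'function';
}

// Fail only when a remote snapshot actually needs decoding,
// so local-only commands can still start on runtimes without zstd.
export function decompressZstd(input: Uint8Array, mod: MaybeZstd = runtimeZlib): Buffer {
  const fn = mod.zstdDecompressSync;
  if (typeof fn !== 'function') {
    throw new Error('zstd is not available in this runtime; remote snapshots cannot be decoded');
  }
  return fn(input);
}
```

The `mod` parameter exists only so the behaviour can be exercised with a stub; production callers would rely on the default.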
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## canary #14807 +/- ##
=========================================
Coverage 65.86% 65.87%
=========================================
Files 2953 2954 +1
Lines 260675 260748 +73
Branches 30762 25455 -5307
=========================================
+ Hits 171706 171779 +73
Misses 88812 88812
Partials 157 157
Local dev (including ENABLE_AGENT_S3_TRACING=1 for S3 testing) keeps writing plain `.json` so devs can inspect bucket payloads directly. Only production deployments (NODE_ENV=production) compress and use the `.json.zst` suffix.

Readers no longer assume the URL suffix matches the body format: they sniff the zstd frame magic (0x28b52ffd) and decode accordingly. This way prod-written `.json.zst` and dev-written `.json` round-trip through the same code path regardless of which environment reads.

S3SnapshotStore.loadPartial tries the active suffix first, then the sibling format; removePartial cleans up both. RemoteSnapshotStore.fetch falls back from `.json.zst` to plain `.json` on 404 so dev-uploaded snapshots stay inspectable from another machine via the CLI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…duction" This reverts commit 70d0b3d.
… fallback

9 vitest cases mocking FileS3:
- save() → key ends in .json.zst, body starts with zstd magic, decompresses to original snapshot
- save() → falls back to "unknown" for missing agentId / topicId
- savePartial() → writes to _partial/ with zstd body
- loadPartial() → decodes .json.zst happy path
- loadPartial() → falls back to legacy .json on miss
- loadPartial() → returns null when neither key exists
- removePartial() → deletes both .json.zst and .json
- removePartial() → swallows individual delete failures (allSettled)
- get/getLatest/list/listPartials → return null/[] (OTEL owns querying)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ote snapshots (#14826)

🐛 fix(agent-tracing): restore legacy .json fallback in RemoteSnapshotStore.fetch

After #14807, `buildRemoteUrl` always targets `.json.zst` and `RemoteSnapshotStore.fetch` throws on any non-OK response. Because the S3 rollout only compresses new uploads, pre-rollout final snapshots remain at the legacy `.json` key, so every pre-rollout operation ID would 404 through the CLI/viewer.

Mirror the fallback that `S3SnapshotStore.loadPartial` already uses: try `.json.zst` first, fall back to the sibling `.json` on non-OK, and sniff the zstd frame magic (0x28b52ffd) on the body so decoding is content-driven rather than suffix-driven.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
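The fallback and content-driven decoding described in this commit can be sketched as below. The names `isZstdFrame`, `fetchWithLegacyFallback`, `decodeBody`, and the `Fetcher` type are hypothetical, and the transport and decompressor are injected rather than tied to any specific HTTP client or zstd binding; only the suffix order and the magic bytes come from the commit message.

```typescript
// zstd frame magic as it appears on the wire (bytes 28 b5 2f fd).
const ZSTD_MAGIC = [0x28, 0xb5, 0x2f, 0xfd];

// True when the body starts with a zstd frame header.
export function isZstdFrame(body: Uint8Array): boolean {
  return body.length >= 4 && ZSTD_MAGIC.every((byte, i) => body[i] === byte);
}

// Hypothetical transport: resolves to the bytes, or null on a non-OK response.
type Fetcher = (url: string) => Promise<Uint8Array | null>;

// Try the compressed key first, then the legacy sibling.
export async function fetchWithLegacyFallback(baseUrl: string, fetcher: Fetcher): Promise<Uint8Array> {
  for (const suffix of ['.json.zst', '.json']) {
    const body = await fetcher(`${baseUrl}${suffix}`);
    if (body) return body;
  }
  throw new Error(`snapshot not found: ${baseUrl}`);
}

// Decode by inspecting the bytes, not the URL suffix.
export function decodeBody(body: Uint8Array, decompress: (input: Uint8Array) => Uint8Array): unknown {
  const bytes = isZstdFrame(body) ? decompress(body) : body;
  return JSON.parse(new TextDecoder().decode(bytes));
}
```

Because decoding keys off the frame header, a plain `.json` body served from either URL parses unchanged, which is the property the commit relies on.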
…4860)

🐛 fix(agent-tracing): align DB trace_s3_key with `.json.zst` suffix

PR #14807 switched the S3 object key written by `S3SnapshotStore.save()` to `.json.zst` but the DB-persistence path in `CompletionLifecycle.ts` still hardcoded `.json`. Result: every row inserted into `agent_operations.trace_s3_key` points at a key that does not exist; the actual object is the `.json.zst` sibling. Any consumer that GETs by the DB-recorded key (dc tracing UI, agent-tracing inspect via record lookup) hits 404.

Verified in prod: 87012/87159 populated rows still end in `.json`, 0 end in `.json.zst`, including rows inserted hours after the PR #14807 deploy.

Fix factors out a single `buildFinalSnapshotKey(agentId, topicId, opId)` helper exported from `@/server/modules/AgentTracing` so both the S3 writer and the DB writer construct the key from the same source, making this class of drift impossible going forward.

Existing rows need a one-off backfill (run from dc):

UPDATE agent_operations
SET trace_s3_key = trace_s3_key || '.zst'
WHERE trace_s3_key LIKE '%.json';

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
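A shared key builder of the kind this commit describes might look like the sketch below. The function name and argument order come from the commit message; the exact key layout is an assumption pieced together from the `agent-traces/{agentId}/{topicId}/` prefix and the "unknown" fallback mentioned elsewhere in this PR. `backfillLegacyKey` is a hypothetical helper that only mirrors the SQL backfill for illustration.

```typescript
const SNAPSHOT_SUFFIX = '.json.zst';

// Single source of truth for the final snapshot key, used by both writers.
// Assumed layout: agent-traces/{agentId}/{topicId}/{opId}.json.zst
export function buildFinalSnapshotKey(
  agentId: string | undefined,
  topicId: string | undefined,
  opId: string,
): string {
  return `agent-traces/${agentId || 'unknown'}/${topicId || 'unknown'}/${opId}${SNAPSHOT_SUFFIX}`;
}

// Mirrors the backfill: append `.zst` only to keys that still end in `.json`.
export function backfillLegacyKey(key: string): string {
  return key.endsWith('.json') ? `${key}.zst` : key;
}
```

With one builder, a suffix change touches a single constant instead of two call sites that can drift apart.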
# 🚀 LobeHub Release (20260518) **Release Date:** May 18, 2026 **Since v2.1.58:** 208 merged PRs · 209 commits · 16 contributors > v2.2.0 introduces the **Chief Agent Operator** — an agent that runs itself end-to-end. It self-iterates against its own output, assembles sub-agent teams on demand through the heterogeneous runtime, and drives a unified task system that knows when to pause for a human. Self-review, AssistantGroup, and tasks/scheduling all converge into one operator surface. --- ## ✨ Highlights ### 🎩 Chief Agent Operator - **Self-iteration exits Lab** — Agent Signal's self-review pipeline ships proposal actions straight into briefs and auto-executes the approved follow-ups, with prompts hardened against eval. The operator now critiques and re-runs its own work without a human in the loop. (#14769, #14583, #14647, #14882) - **Auto-formed agent teams** — Heterogeneous AssistantGroup gains Monitor-style signal callbacks, read-only SubAgent threads with breadcrumb headers, and a thread switcher. The operator dispatches sub-agents and you can step into any branch to see what the team is doing. (#14859, #14658, #14845, #14715) - **Task system as the operator's runway** — Claude Code surfaces task tools, AskUserQuestion freeform notes, and a dedicated `waitingForHuman` topic status; `lobe-task` exposes `setTaskSchedule`; the scheduler is hardened (maxExecutions cap, sub-10min heartbeat block, race-free SchedulerForm). Long-running operator runs no longer go silent and stop themselves when human input is needed. (#14870, #14639, #14713, #14865, #14853) ### 🚀 Cloud & runtime - **Cloud Claude Code V3** — Repo picker, GitHub token flow, and sandbox-aware context bring cloud-hosted Claude Code to feature parity with local; cloud sandbox completion now triggers the task lifecycle end-to-end. 
(#14568, #14822, #14681) - **Heterogeneous agent multi-replica safety** — Subagent threads, ingest refresh, and parallel-tool counts now survive replica swaps without losing parent_id or rolling back tool state. (#14897, #14631, #14806, #14838) - **Built-in tool lifecycle hooks** — `onBeforeCall` / `onAfterCall` land on the built-in tool runtime; sub-agent dispatch moves to `lobe-agent`; self-iteration aligns with the shared inspector pattern. (#14719, #14715, #14827) - **Knowledge base RAG unified** — Client and server share one `KnowledgeBaseSearchService`; KB files preserved on `NoSuchKey` instead of silently lost. (#14673, #14501) ### 💬 Workspace experience - **Home daily brief + recommendations** — The home screen opens with a linkable welcome, paired input hint, and a recommendations module sourced from the operator's hetero action library. (#14589, #14645, #14770) - **Chat mode + redesigned action bar** — The chat input gains a Chat/Agent mode toggle and a re-pitched action bar with icon-and-color action tag chips. (#14774, #14903, #14846) - **Documents tree, optimistic** — Document tree creates, deletes, and inline renames now apply optimistically; the agent-documents index hides web crawls and switches to a table layout. (#14714, #14292) - **Branded MCP inspectors** — Linear MCP tool calls render with the same branded inspector as the built-in Linear skill; CC MCP and built-in skills now share inspector code. (#14864, #14884) - **Bot identity gating** — Device tools are gated by sender identity, the activator bypass is closed, and Slack mpim plus Discord DM regressions are fixed. (#14634, #14664, #14733) --- ## 🏗️ Core Agent & Signal Pipeline ### Self-iteration & Agent Signal - Self-iteration graduates out of Lab, with service, tool, name, and concept structure unified across `agent-signal`, `prompts`, `database`, and `builtin-tool-self-iteration`. 
(#14699, #14769) - Self-review now proposes actions to briefs and auto-executes the approved set, with eval-verified prompt hardening. (#14583, #14657, #14647) - Self-iteration built-in tool aligns with the shared runtime + inspector patterns. (#14827) - Agent Signal prompts adapt their response language and avoid blocking agent execution. (#14890, #14775, #14882) - Receipt descriptions now carry an Agent Signal marker, and self-review hinted skill documents route correctly. (#14764, #14895) ### Heterogeneous agent runtime - Subagent threads render read-only with a breadcrumb header and thread switcher; SUBAGENT badge dropped, indentation tightened. (#14658, #14845, #14783) - Multi-replica safety: ingest refresh restores tools/model from DB to fix parent_id breaks; new-step assistants sync across replicas; subagent-tagged events no longer leak into the main gateway handler. (#14897, #14631, #14838) - Fetch-triggering events are deferred to keep parallel tool counts from rolling back. (#14806) - AskUserQuestion is wired for Claude Code, with auto-decline disabled and a freeform note input on the cloud side; `waitingForHuman` is a first-class topic status. (#14639, #14629, #14870) - AssistantGroup gains Monitor-style signal callbacks; project skills surface in the working sidebar and markdown preview. (#14859, #14896) - Cloud Claude Code V3 — repo picker, GitHub token, sandbox context; credentials alert and disabled input when not configured. (#14568, #14822) - Cloud sandbox completion now triggers the task lifecycle end-to-end. (#14681) ### Agent runtime & context engine - Built-in tool runtime gets `onBeforeCall` / `onAfterCall` lifecycle hooks. (#14719) - `CompletionLifecycle`, `HumanInterventionHandler`, and `stepPresentation` are extracted from the runtime monolith. (#14441) - Per-tool timeout is honored end-to-end for client tool dispatch. 
(#14817) - Compression budget accounts for `tool_calls`, reasoning content, and tool defs; `call_llm` forwards tools into the budget. (#14813, #14837) - Pre-flight context check now fails fast for OpenAI-compatible providers. (#14824) - Malformed `tool_call` names are recovered instead of finishing the step silently. (#14577) - Sub-agent dispatch moves from `lobe-gtd` to `lobe-agent`. (#14715) - Hidden built-in tools now appear in the system prompt @-mention list. (#14823) ### Agent tracing & operations - New `agent_operations` table and runtime persistence for every hetero-agent operation. (#14416, #14736) - `signOperationJwt` issues 4-hour signed operation tokens. (#14586) - S3 trace snapshots are zstd-compressed; DB `trace_s3_key` aligns with the `.json.zst` suffix; legacy `.json` fallback preserved on fetch. (#14807, #14860, #14826) --- ## 📱 Platform & Integrations ### Bot / Channels - Device tools are gated by sender identity. (#14634) - Activator bypass closed and device-access checks converged. (#14664) - Slack mpim supported; Discord DM regression fixed; Slack connect + slash commands repaired. (#14733, #14591) - Bot channels, bot watch, bot callback service, and system bot reliability fixes. (#14847, #14796, #14570, #14784, #14649) - Online Messager scaffolding. (#14755) ### Onboarding - Home daily brief with linkable welcome and paired input hint. (#14589) - Recommendations module sourced from the hetero agent action library. (#14645) - Chat onboarding passes request triggers via metadata and preserves the resume request. (#14770, #14798) - Discovery turn progress gated by phase, with a reminder on stalled discovery. (#14842, #14833) - FullNameStep back button rejoins the shared prefix; ModeSwitch hidden in production. (#14898, #14760) - Agent marketplace folds into the web onboarding tool. (#14578, #14672) - Onboarding interests stored as keys instead of free text; early-exit skips marketplace and drops CJK prompts. 
(#14624, #14598) ### Model providers - Gemini 3.1 Flash-Lite cards; Gemini schema sanitizer drops non-compliant `enum` / `required`; zero `cachedContentTokenCount` handled in usage conversion. (#14604, #14740, #14567) - DeepSeek-V4 model cards and pricing restored to official rates. (#14110, #14911) - ernie-5.1 and spark-x2-flash support; Grok 4.3 `reasoning_effort` support. (#14643, #14731, #14642) - SiliconCloud catalog synced with API; duplicates removed; reasoning params adjusted. (#14464) - Minimax derives `max_tokens` from context window to avoid `ExceededContextWindow`. (#14814) - aihubmix uses the full models endpoint for a complete list; stale empty-apiKey test dropped. (#14511, #14669) - Stream parse errors are enriched with provider + model context. (#14636) - Visual content parts are consumed in the server runtime; video image references move to a JSON object. (#14637, #14900) - Google function call magic `thoughtSignature` now attaches to every part, not just the last turn. (#14904) - Service model assignments settings added; model extend-param options removed. (#14712, #14607) ### Built-in tools & knowledge base - `lobe-task` exposes `setTaskSchedule`; task scheduler hardened (maxExecutions cap, sub-10min heartbeat blocked, SchedulerForm race fix, rapid automation-mode toggle stabilized). (#14713, #14865, #14853, #14801) - KnowledgeBaseSearchService shares RAG runtime across client and server. (#14673) - KB files preserved on `NoSuchKey` and orphan documents/tasks cleaned. (#14501) - Document tree gets optimistic create/delete + inline rename. (#14714) - agent-documents index hides web crawls and switches to a table layout. (#14292) - `lobe-clarify` and SKILL.md frontmatter parsing/edit validation are unified. (#14566) - AnalyzeVisualMedia inspector + Portal HTML preview refactor; HTML preview restored for AssistantGroup messages. (#14777, #14811) - Branded inspector shared between CC MCP and built-in Linear skill. 
(#14884, #14864) --- ## 🖥️ CLI & User Experience ### Chat & Conversation - Chat mode toggle and redesigned chat input action bar. (#14774) - Action tag chips switch to icon + colored label; ActionDropdown closes on sibling-open and focus-out; submenu uses native header/footer slots. (#14903, #14802, #14901) - Action bar padding equalized around the send button; skeleton shows in action bar while config loads. (#14846, #14656) - `useCmdEnterToSend` is respected in thread & task inputs; send button enables after pasting into thread/comment input. (#14850, #14816) - TopicChatDrawer state preserved during close animation. (#14803) - Only the last assistant block animates during markdown streaming. (#14906) - Right working panel no longer auto-collapses on chat mount; home agent config fetched so knowledge toggles reflect in UI. (#14883, #14834) ### Tasks - Task scheduler, hotkey, comment, and TodoList polish. (#14707) - Add Subtask button & card baseline aligned; activity card stop run; task agent manager polish. (#14848, #14559, #14569) - Task template skeleton CLS reduced; task page placeholder copy refreshed. (#14788, #14704) - Task agent model snapshotted into `task.config` at create time. (#14670) - User-feedback card, task card polish, and Run-now context menu in markdown. (#14727) - Inline skill auth in recommended task templates. (#14676) ### Navigation & Layout - Tab bar gains a Chrome-style divider between inactive tabs. (#14892) - SideBarDrawer & header layout polish; nav ActionIcon sizing unified; TodoList encapsulation improved. (#14762, #14692) - Desktop header icons, sidebar density, and task menus polished. (#14724) - Standardized header action icon sizes. (#14717) - Chat topic title length increased; copy session ID added to topic dropdown menu. (#14659, #14595) - Heterogeneous agent topic rows regain indentation. (#14783) ### Other polish - Usage token details shortened; tool execution time formatted as `Xmin Ys`. 
(#14849, #14641) - Tool arguments display gets word-wrap toggle; long tool-call params wrap instead of truncate. (#14706, #14640) - Editor stops showing per-line placeholder once content is present. (#14852) - Visible divider between queued messages; intervention confirmation bar polished. (#14593, #14587) - Credit top-up copy refreshed; auth captcha retry copy refreshed; brief recommendations layout polished. (#14821, #14561, #14871) --- ## 🔧 Tooling & Developer Experience - Dev-only feature flag override panel. (#14565) - `__DEV__` define replaces `process.env.NODE_ENV` in the SPA. (#14696) - Agent-settings drops Meta/Documents tabs and restores `inputTemplate`. (#14874) - `local-system` forwards all `grepContent` params and moves the executor to `/client`. (#14888) - `lobe-task` and `setTaskSchedule` exposed. (#14713) - Memory user-memory benchmark agent config and source-id extraction schemas. (#14779, #14778) - CLI man page drops stale cron entry; `clearMessages` hotkey removed. (#14709, #14906) - Skill docs simplified; cloud heteroContext gains sandbox TTL + public-repo fork push guide. (#14785, #14761) --- ## 🔒 Security & Reliability - **Security:** Sensitive comments and examples sanitized from the production JS bundle. (#14557) - **Security:** Inactive OIDC access rejected. (#14674) - **Security:** CASC `new Function()` template replaced with safe string builders. (#14751) - **Security:** Sign-in captcha flow removed in favor of safer flow. (#14573) - **Security:** Desktop local file previews restricted to safe roots. (#14789) - **Security:** Image binary capped at 3.75 MB so base64 payload stays under the Anthropic 5 MB limit. (#14711) - **Reliability:** Neon/Node pools get error listeners to prevent Lambda crashes. (#14606) - **Reliability:** `paradedb.match(...)` replaces hardcoded normalizer in memory search. (#14590) - **Reliability:** `PlaceholderVariablesProcessor` errors carry diagnostic context. 
(#14741) - **Reliability:** File storage upload checks are serialized; multiple account link bug fixed. (#14829, #14562) - **Reliability:** `ScrollShadow` replaced with `ScrollArea` to fix a React infinite render loop (error code 185). (#14689) - **Reliability:** Embedding token cap enforced — long memory queries are limited and truncated before search. (#14757) - **Reliability:** Embed binary blob guard + oversized output cap in `local-system.readFile`. (#14602) - **Reliability:** Windows npm CLI shims resolved before spawning agents. (#14772, #14720) - **Reliability:** Vite pinned to 8.0.12 to avoid the rolldown 1.0.1 preload regression; desktop runtime externals split from native deps. (#14804, #14776) - **Reliability:** Old lobehub cron job removed; WeChat URL rules dropped from web crawler. (#14630, #14633) --- ## 👥 Contributors Huge thanks to **16 contributors** who shipped **208 merged PRs** this cycle. @hezhijie0327 · @sxjeru · @hardy-one · @Bianzinan · @brone1323 · @YuSaZh · @Wxh16144 · @arvinxx · @Innei · @tjx666 · @neko · @lijian · @rdmclin2 · @sudongyuer · @AmAzing129 · @rivertwilight Plus @lobehubbot for maintenance translations. --- **Full Changelog**: v2.1.58...v2.2.0
💻 Change Type
🔗 Related Issue
n/a (internal optimization)
🔀 Description of Change
Compress agent operation snapshots with zstd (level 3) before uploading to S3 and write them under a `.json.zst` key. Final snapshots and partial snapshots both go through this path.

The S3 store is already gated by `ENABLE_AGENT_S3_TRACING=1` upstream: local dev defaults to `FileSnapshotStore` (writes plain `.json` to local disk, untouched here). So whenever this code runs, we are by definition writing to a real bucket, and compression is always the right call.

Measured on 76,839 production snapshots: 217 GB → 25.8 GB (8.4× average ratio, p99 47×).

Storage cost drops by ~88%. New uploads only; existing `.json` objects are left as-is (we don't read them via this path anymore, per product decision).

Why these specific knobs
- … `node:zlib`, so no new runtime dependency. Level 3 is the sweet spot (~200 MB/s, only marginal gain at higher levels for this data shape).
- `.json.zst` suffix as the format indicator: the suffix tells readers to decompress. `Content-Encoding: zstd` is intentionally not set on the S3 object so the bytes are served as opaque binary; this avoids surprise behavior from HTTP clients that negotiate zstd (e.g. undici).
- `application/zstd` content type: registered MIME (RFC 8478).
- `loadPartial` tries `.json.zst` first and falls back to `.json` for the duration of the deploy. Partials are deleted at finalization, so the legacy branch dies off as soon as the longest-running in-flight operation completes. `removePartial` cleans up both keys so nothing leaks.

Files touched
- `src/server/modules/AgentTracing/S3SnapshotStore.ts`: compress on `save`/`savePartial`, decompress on `loadPartial`, key suffix change, legacy fallback for partials.
- `packages/agent-tracing/src/store/remote-store.ts`: `buildRemoteUrl` now points at `.json.zst`; `RemoteSnapshotStore.fetch` decompresses before parsing. Local CLI cache stays as plain `.json` for easy inspection.

🧪 How to Test
- … `tmp-compress` benchmark in develop-center).
- … `bun run type-check`).
- `S3SnapshotStore` has no existing unit tests; behavior is exercised by the agent-tracing integration path.

To validate post-deploy:
- `.json.zst` keys appear under `agent-traces/{agentId}/{topicId}/`.
- `agent-tracing inspect <op_id>` against a new operation → CLI should download, decompress, and render normally.

📝 Additional Information
- `.json` snapshots remain readable directly from S3 if needed; they're just not surfaced through `buildRemoteUrl` anymore.
- `agent_operations.trace_s3_key` column was never written; the key is always re-derived from operationId. Nothing to backfill.
- … `node:zlib` zstd (Node 22.15+ / 24 in production).
- … `OperationTraceRecorder` that strips redundant `events` payloads from snapshots; together those two changes are expected to push average snapshot size down by >99% vs. the historical baseline.
- `70d0b3d8` added `NODE_ENV=production` gating + magic-byte sniffing, but since `ENABLE_AGENT_S3_TRACING` already gates this code path entirely, the extra gating was redundant. Reverted in `e9fc5a47`.

🤖 Generated with Claude Code