🐛 fix(web-crawler): cap response body size to prevent serverless OOM #14660
Production saw repeated SIGABRT crashes on `/trpc/tools/search.webSearch` where Node aborted with V8 "allocation failed". The naive crawler buffered entire response bodies into heap before the 1 MB downstream truncation could apply, so a single large page (or a batch of three under default concurrency=3) could push rss past the lambda memory ceiling.

- ssrfSafeFetch: add opt-in `maxContentLength` that streams the response body via `for await` and stops at the cap (soft truncation: still a successful response). Breaking the iterator destroys the underlying stream and releases the connection. Default behaviour (full `arrayBuffer()` read) unchanged when the option is absent.
- naive crawler: pass `maxContentLength: MAX_HTML_SIZE` so any body beyond 1 MB is dropped at the network layer instead of being materialised in heap.
- htmlToMarkdown: explicitly call `window.happyDOM.close()` in a finally block so the parsed DOM tree is released as soon as parsing finishes, rather than waiting for the function scope to drop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
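The capped read described in the first bullet can be sketched as follows. This is a minimal illustration, not the actual `ssrfSafeFetch` code: `readCapped` is a hypothetical helper name, and the body is modelled as a plain `AsyncIterable<Uint8Array>` so the sketch runs without any HTTP library.

```typescript
// Hypothetical helper (not the real ssrfSafeFetch implementation):
// read an async-iterable body and stop as soon as `maxContentLength`
// bytes have been collected.
async function readCapped(
  body: AsyncIterable<Uint8Array>,
  maxContentLength: number,
): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let received = 0;

  for await (const chunk of body) {
    const remaining = maxContentLength - received;
    if (chunk.byteLength >= remaining) {
      // Keep only the bytes that still fit under the cap, then stop.
      chunks.push(chunk.subarray(0, remaining));
      received += remaining;
      // Leaving a `for await` loop early calls the iterator's return(),
      // which lets the source run its cleanup.
      break;
    }
    chunks.push(chunk);
    received += chunk.byteLength;
  }

  // Copy into one right-sized buffer so no oversized chunk stays referenced.
  const out = new Uint8Array(received);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.byteLength;
  }
  return out;
}
```

The result is a soft truncation: the caller receives at most `maxContentLength` bytes and no error, which matches the "still a successful response" behaviour described above.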
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5c0dfc64ba
    import type { CrawlImpl, CrawlSuccessResult } from '../type';
    import { PageNotFoundError, toFetchError } from '../utils/errorType';
    - import { htmlToMarkdown } from '../utils/htmlToMarkdown';
    + import { htmlToMarkdown, MAX_HTML_SIZE } from '../utils/htmlToMarkdown';
Export the mocked size constant in crawler tests
Importing MAX_HTML_SIZE here breaks the existing packages/web-crawler/src/crawImpl/__tests__/naive.test.ts mock for ../../utils/htmlToMarkdown, which only returns htmlToMarkdown. Under Vitest's factory mocks, the newly imported export is missing, so the naive crawler tests fail before exercising this code unless the mock also provides MAX_HTML_SIZE (or uses importOriginal).
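One way to address this is a mock factory that spreads the original module and overrides only `htmlToMarkdown`. The sketch below is an assumption about how the test could be fixed, not the change that was actually made: `htmlToMarkdownMockFactory` and the `HtmlToMarkdownModule` interface are hypothetical names, and the factory is written as a plain function so it runs without Vitest.

```typescript
// Assumed shape of the real module, limited to what this example needs.
interface HtmlToMarkdownModule {
  MAX_HTML_SIZE: number;
  htmlToMarkdown: (html: string) => unknown;
}

// Hypothetical factory: keep every real export (including MAX_HTML_SIZE)
// and replace only htmlToMarkdown with a stub.
async function htmlToMarkdownMockFactory(
  importOriginal: () => Promise<HtmlToMarkdownModule>,
  stub: HtmlToMarkdownModule['htmlToMarkdown'],
): Promise<HtmlToMarkdownModule> {
  const actual = await importOriginal();
  return { ...actual, htmlToMarkdown: stub };
}

// In naive.test.ts the same idea would be wired up roughly like this:
//
// vi.mock('../../utils/htmlToMarkdown', async (importOriginal) => ({
//   ...(await importOriginal<typeof import('../../utils/htmlToMarkdown')>()),
//   htmlToMarkdown: vi.fn(),
// }));
```

Spreading the original module means any export added later keeps working in the mocked tests without another change to the mock.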
Codecov Report ❌

Coverage diff, canary vs. #14660:

| Metric | canary | #14660 | +/- |
|---|---|---|---|
| Coverage | 66.18% | 66.18% | |
| Files | 2897 | 2897 | |
| Lines | 253594 | 253602 | +8 |
| Branches | 24739 | 29868 | +5129 |
| Hits | 167831 | 167837 | +6 |
| Misses | 85611 | 85613 | +2 |
| Partials | 152 | 152 | |
✅ test(ssrf-safe-fetch): add OOM regression tests for response body cap

Verify that the maxContentLength cap actually prevents the production SIGABRT scenario, not just produces a truncated body.

- Source-pull bound: a body source with 200 MB available, capped at 1 MB, must not be drained beyond ~1 MB. Asserts on bytes pulled from the generator, which is the property that prevents OOM.
- Concurrency bound: matches production CRAWL_CONCURRENCY=3. Three concurrent oversized fetches should pull at most ~3 MB total, not 300 MB.
- Heap-delta bound (gated on --expose-gc): under real GC pressure, fetching a 50 MB body with a 1 MB cap should grow heapUsed by < 10 MB. Run with `NODE_OPTIONS=--expose-gc bunx vitest run` to exercise; skipped by default so CI doesn't false-fail on GC timing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
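The source-pull and concurrency bounds can be illustrated with a counting generator. This is a sketch under stated assumptions rather than the tests in `index.test.ts`: `countingSource`, `consumeWithCap` and `pulledUnderConcurrency` are hypothetical names, and `consumeWithCap` stands in for the capped read inside `ssrfSafeFetch`.

```typescript
const MB = 1024 * 1024;
// One shared 64 KB chunk, so the fake source itself allocates almost nothing.
const CHUNK = new Uint8Array(64 * 1024);

// Hypothetical body source that records how many bytes were pulled from it.
function countingSource(
  totalBytes: number,
  counter: { pulled: number },
): AsyncIterable<Uint8Array> {
  return (async function* () {
    for (let sent = 0; sent < totalBytes; sent += CHUNK.byteLength) {
      counter.pulled += CHUNK.byteLength;
      yield CHUNK;
    }
  })();
}

// Stand-in for the capped read: stop pulling once `cap` bytes have arrived.
async function consumeWithCap(
  body: AsyncIterable<Uint8Array>,
  cap: number,
): Promise<number> {
  let received = 0;
  for await (const chunk of body) {
    received += chunk.byteLength;
    if (received >= cap) break; // the generator is not advanced again
  }
  return Math.min(received, cap);
}

// Run `n` capped reads concurrently and report the total bytes pulled.
async function pulledUnderConcurrency(
  n: number,
  totalBytes: number,
  cap: number,
): Promise<number> {
  const counter = { pulled: 0 };
  await Promise.all(
    Array.from({ length: n }, () =>
      consumeWithCap(countingSource(totalBytes, counter), cap),
    ),
  );
  return counter.pulled;
}
```

Asserting on `counter.pulled` rather than on the length of the returned body is what separates "the output was truncated" from "the source was never drained", and only the second property bounds memory.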
# 🚀 LobeHub Release (20260513)

**Hotfix Scope:** Ship the canary backlog (111 PRs) onto main as a fast-tracked patch: operator-focused, no weekly-style write-up.

> Brings the accumulated canary work into main: agent/task improvements, hetero-agent fixes, desktop & onboarding polish, and several reliability caps.

## ✨ What's Included

- **Agent & tasks:** Self-review proposal-to-action automation, sub-agent dispatch consolidated to `lobe-agent`, AskUserQuestion wiring for Claude Code, scheduler/hotkey/TodoList polish. (#14583, #14657, #14715, #14639, #14732, #14707, #14713)
- **Home & onboarding:** Daily brief with linkable welcome + paired input hint, inline skill auth in recommended task templates, cleanup of captcha-on-signin and marketplace early-exit. (#14589, #14676, #14573, #14598)
- **Bots & integrations:** Slack MPIM support, Discord DM fix, slash-command + connect-error fixes, gateway client-tool plugin state. (#14733, #14591, #14596)
- **Desktop & CLI:** Windows `.cmd` shim detection for `claude` / `codex` CLIs, auth focus & pending-login reset fixes. (#14720, #14694, #14695)
- **Reliability:** Cap web-crawler body size and image binary at safe limits, attach error listeners to Neon/Node pools, reject inactive OIDC access. (#14660, #14711, #14606, #14674)
- **Database:** `agent_operations` table + persist agent operations from the runtime; switch user memory search to `paradedb.match(...)`. (#14416, #14736, #14590)

## ⚙️ Upgrade

- **Self-hosted:** pull the latest image and restart. Drizzle migrations (including the new `agent_operations` table) run automatically on boot.
Summary

Production saw repeated `SIGABRT (signal 6, core dumped)` crashes on `/trpc/tools/search.webSearch`: Node aborted with V8 `FATAL ERROR: ... allocation failed`. Memory snapshots logged from `lobe-oom:web-browsing:search-service` showed `rss=2889.3MB heap=2473.1MB` right before the crash, well above Vercel's 1769 MB lambda ceiling.

Root cause: `ssrfSafeFetch` calls `await response.arrayBuffer()`, materialising the entire HTTP body into heap before returning. The naive crawler's downstream 1 MB truncation in `htmlToMarkdown` runs after the body is already in memory, so a single oversized page (or a batch of three under default `CRAWL_CONCURRENCY=3`) can blow past the lambda memory limit before any guard kicks in.

Changes

- `@lobechat/ssrf-safe-fetch`: add opt-in `maxContentLength` option. When set, the response body is consumed via `for await (const chunk of response.body)` and reading stops the moment the cap is hit. Breaking out of the iterator closes it, which destroys the underlying stream and releases the HTTP connection. The returned `Response` contains only the bytes received up to the cap (soft truncation, still treated as success). Default behaviour (full `arrayBuffer()` read) is unchanged when the option is absent, so other callers (`imageToBase64`, `videoToBase64`, replicate provider) are unaffected.
- `@lobechat/web-crawler/naive.ts`: pass `maxContentLength: MAX_HTML_SIZE` (1 MB) so any body beyond that size is dropped at the network layer instead of being materialised in heap. Pages larger than the cap still crawl successfully, just truncated: the same end state as before, but with bounded memory.
- `htmlToMarkdown`: wrap the parse in `try / finally` and call `window.happyDOM.close()` so the parsed DOM tree is released as soon as parsing finishes, rather than waiting for the function scope to drop. JS evaluation is disabled in our `Window` config, so the returned promise resolves synchronously in practice and the call is fire-and-forget.

Why not lower `CRAWL_CONCURRENCY`?

It's already env-configurable (`CRAWL_CONCURRENCY`, default 3) and is the right knob for ops to dial down per-deployment. The fix here removes the underlying memory bomb so concurrency=3 is actually safe.

Regression tests

Three new tests in `packages/ssrf-safe-fetch/index.test.ts` under "OOM regression: bounded memory under oversized response bodies". They verify the cap actually prevents the production SIGABRT scenario, not just produces a truncated body:

- Source-pull bound: a body source with 200 MB available, capped at 1 MB, must not be drained beyond ~1 MB. Asserts on bytes pulled from the generator, which is the property that prevents OOM.
- Concurrency bound: three concurrent oversized fetches, matching production `CRAWL_CONCURRENCY=3`, should pull at most ~3 MB total, not 300 MB.
- Heap-delta bound (gated on `--expose-gc`): under real GC pressure, fetching a 50 MB body with a 1 MB cap should grow `heapUsed` by < 10 MB. Run with `NODE_OPTIONS=--expose-gc bunx vitest run` to exercise; skipped by default so CI doesn't false-fail on GC timing.

Test plan

- `bunx vitest run packages/ssrf-safe-fetch`: 32 passed, 1 skipped (heap-delta, requires `--expose-gc`)
- `NODE_OPTIONS='--expose-gc' bunx vitest run packages/ssrf-safe-fetch`: 33 passed including heap-delta
- `bunx vitest run packages/web-crawler`: 157 passed, htmlToMarkdown snapshots unchanged
- `bunx vitest run src/server/services/search/index.test.ts`: 26 passed
- `bun run type-check`: clean
- `vc logs -p lobehub-cloud-next --query 'Process exited'` for 24h after deploy and confirm SIGABRT rate on `/trpc/tools/search.webSearch` drops to zero

🤖 Generated with Claude Code
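The `try / finally` release pattern described for `htmlToMarkdown` can be sketched in isolation. The code below is an illustration under assumptions, not the actual `htmlToMarkdown` source: `withWindow` and `ClosableWindow` are hypothetical names, and the interface only models the `happyDOM.close()` call mentioned above so the sketch runs without the happy-dom package.

```typescript
// Minimal stand-in for the part of a happy-dom Window used here (assumed shape).
interface ClosableWindow {
  happyDOM: { close(): Promise<void> };
}

// Hypothetical wrapper: create a window, run the parse step, and always
// ask the window to close, whether the parse returned or threw.
function withWindow<W extends ClosableWindow, T>(
  create: () => W,
  use: (win: W) => T,
): T {
  const win = create();
  try {
    return use(win);
  } finally {
    // Fire-and-forget: the close promise is not awaited, and a rejection
    // is swallowed so cleanup can never mask the parse result or error.
    void win.happyDOM.close().catch(() => {});
  }
}
```

Keeping the close call in `finally` means the DOM tree is released on the error path too, which matters when a malformed page makes the parse throw.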