
🐛 fix(web-crawler): cap response body size to prevent serverless OOM #14660

Merged: arvinxx merged 2 commits into canary from fix/web-crawler-body-size-cap on May 12, 2026

@arvinxx arvinxx commented May 11, 2026


Summary

Production saw repeated SIGABRT (signal 6, core dumped) crashes on /trpc/tools/search.webSearch — Node aborted with V8 FATAL ERROR: ... allocation failed. Memory snapshots logged from lobe-oom:web-browsing:search-service showed rss=2889.3MB heap=2473.1MB right before the crash, well above Vercel's 1769 MB lambda ceiling.

Root cause: ssrfSafeFetch calls await response.arrayBuffer(), materialising the entire HTTP body into heap before returning. The naive crawler's downstream 1 MB truncation in htmlToMarkdown runs after the body is already in memory, so a single oversized page (or a batch of three under default CRAWL_CONCURRENCY=3) can blow past the lambda memory limit before any guard kicks in.

Changes

  • @lobechat/ssrf-safe-fetch: add opt-in maxContentLength option. When set, the response body is consumed via for await (const chunk of response.body) and reading stops the moment the cap is hit. Breaking out of the iterator closes it, which destroys the underlying stream and releases the HTTP connection. The returned Response contains only the bytes received up to the cap (soft truncation — still treated as success). Default behaviour (full arrayBuffer() read) is unchanged when the option is absent, so other callers (imageToBase64, videoToBase64, replicate provider) are unaffected.
  • @lobechat/web-crawler / naive.ts: pass maxContentLength: MAX_HTML_SIZE (1 MB) so any body beyond that size is dropped at the network layer instead of being materialised in heap. Pages larger than the cap still crawl successfully, just truncated — same end-state as before, just with bounded memory.
  • htmlToMarkdown: wrap the parse in try / finally and call window.happyDOM.close() so the parsed DOM tree is released as soon as parsing finishes, rather than waiting for the function scope to drop. JS evaluation is disabled in our Window config, so close() has no pending async work to wait on in practice and its returned promise is deliberately left un-awaited (fire-and-forget).
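
The capped read described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the actual @lobechat/ssrf-safe-fetch code: `readCapped` is a hypothetical helper name, and it takes any `AsyncIterable<Uint8Array>` so the sketch does not depend on a particular fetch implementation.

```typescript
// Hypothetical helper, not the real ssrfSafeFetch internals.
// Pulls chunks from an async byte source and stops as soon as maxBytes is reached.
async function readCapped(
  source: AsyncIterable<Uint8Array>,
  maxBytes: number,
): Promise<Uint8Array> {
  const chunks: Uint8Array[] = [];
  let received = 0;

  for await (const chunk of source) {
    const remaining = maxBytes - received;
    // Keep only the part of the chunk that still fits under the cap.
    const piece = chunk.length > remaining ? chunk.subarray(0, remaining) : chunk;
    chunks.push(piece);
    received += piece.length;
    // Breaking calls the iterator's return(), which lets the source clean up.
    if (received >= maxBytes) break;
  }

  // Concatenate the kept pieces into one buffer of exactly `received` bytes.
  const out = new Uint8Array(received);
  let offset = 0;
  for (const piece of chunks) {
    out.set(piece, offset);
    offset += piece.length;
  }
  return out;
}
```

The design point is that the cap is enforced while pulling, so the number of bytes ever requested from the source stays close to `maxBytes` regardless of how large the body is.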

Why not lower CRAWL_CONCURRENCY?

It's already env-configurable (CRAWL_CONCURRENCY, default 3) and is the right knob for ops to dial down per-deployment. The fix here removes the underlying memory bomb so concurrency=3 is actually safe.
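
As a rough back-of-the-envelope check of why the cap makes the default concurrency safe, the raw body bytes buffered by one batch can be bounded like this. The 200 MB page size is an assumed figure for illustration, and the estimate ignores decoded strings, DOM trees, and other per-page overhead.

```typescript
const MB = 1024 * 1024;

// Rough upper bound on raw response-body bytes buffered at once by one crawl batch.
function peakBodyBytes(concurrency: number, perBodyBytes: number): number {
  return concurrency * perBodyBytes;
}

// With the 1 MB cap and the default CRAWL_CONCURRENCY of 3: 3 MB.
const cappedPeak = peakBodyBytes(3, 1 * MB);

// Without a cap, assuming three hypothetical 200 MB pages in one batch: 600 MB.
const uncappedPeak = peakBodyBytes(3, 200 * MB);
```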

Regression tests

Three new tests in packages/ssrf-safe-fetch/index.test.ts under "OOM regression: bounded memory under oversized response bodies" verify that the cap actually prevents the production SIGABRT scenario, rather than merely producing a truncated body:

  1. Source-pull bound: a body source with 200 MB available, capped at 1 MB, must not be drained beyond ~1 MB. Asserts on bytes pulled from the generator — the property that prevents OOM at the source.
  2. Concurrency bound (production scenario): three concurrent oversized fetches matching CRAWL_CONCURRENCY=3 should pull at most ~3 MB total, not 300 MB.
  3. Heap-delta bound (gated on --expose-gc): under real GC pressure, fetching a 50 MB body with a 1 MB cap should grow heapUsed by < 10 MB. Run with NODE_OPTIONS=--expose-gc bunx vitest run to exercise; skipped by default so CI doesn't false-fail on GC timing.
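
The first two checks could look something like the sketch below. It is not the test code from the PR: `countingBody`, `drainCapped`, and `pulledUnderConcurrency` are hypothetical names, the 64 KB chunk size is an assumption, and `drainCapped` is a simplified stand-in for the capped read.

```typescript
const MB = 1024 * 1024;
const CHUNK = 64 * 1024; // assumed chunk size for this sketch

// Fake body: offers `total` bytes in CHUNK-sized pieces and records how many were pulled.
async function* countingBody(
  total: number,
  counter: { pulled: number },
): AsyncGenerator<Uint8Array> {
  for (let sent = 0; sent < total; sent += CHUNK) {
    counter.pulled += CHUNK;
    yield new Uint8Array(CHUNK);
  }
}

// Simplified stand-in for the capped read; returns the number of bytes kept.
async function drainCapped(
  source: AsyncIterable<Uint8Array>,
  cap: number,
): Promise<number> {
  let kept = 0;
  for await (const chunk of source) {
    kept += Math.min(chunk.length, cap - kept);
    if (kept >= cap) break; // stop pulling once the cap is reached
  }
  return kept;
}

// Three concurrent oversized reads, mirroring CRAWL_CONCURRENCY=3.
// Resolves to the total number of bytes pulled from all three sources.
async function pulledUnderConcurrency(): Promise<number> {
  const counter = { pulled: 0 };
  await Promise.all(
    [1, 2, 3].map(() => drainCapped(countingBody(200 * MB, counter), 1 * MB)),
  );
  return counter.pulled;
}
```

Asserting on `counter.pulled` rather than on the length of the result is what distinguishes a bounded read from a full read followed by truncation.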

Test plan

  • bunx vitest run packages/ssrf-safe-fetch — 32 passed, 1 skipped (heap-delta, requires --expose-gc)
  • NODE_OPTIONS='--expose-gc' bunx vitest run packages/ssrf-safe-fetch — 33 passed including heap-delta
  • bunx vitest run packages/web-crawler — 157 passed, htmlToMarkdown snapshots unchanged
  • bunx vitest run src/server/services/search/index.test.ts — 26 passed
  • bun run type-check — clean
  • Production verification: monitor vc logs -p lobehub-cloud-next --query 'Process exited' for 24h after deploy and confirm SIGABRT rate on /trpc/tools/search.webSearch drops to zero

🤖 Generated with Claude Code

Production saw repeated SIGABRT crashes on `/trpc/tools/search.webSearch`
where Node aborted with V8 "allocation failed" — the naive crawler buffered
entire response bodies into heap before the 1 MB downstream truncation could
apply, so a single large page (or a batch of three under default
concurrency=3) could push rss past the lambda memory ceiling.

- ssrfSafeFetch: add opt-in `maxContentLength` that streams the response
  body via `for await` and stops at the cap (soft truncation — still a
  successful response). Breaking the iterator destroys the underlying
  stream and releases the connection. Default behaviour (full
  `arrayBuffer()` read) unchanged when the option is absent.
- naive crawler: pass `maxContentLength: MAX_HTML_SIZE` so any body beyond
  1 MB is dropped at the network layer instead of being materialised in heap.
- htmlToMarkdown: explicitly call `window.happyDOM.close()` in a finally
  block so the parsed DOM tree is released as soon as parsing finishes,
  rather than waiting for the function scope to drop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 11, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| lobehub | Ready | Preview, Comment | May 11, 2026 6:03am |


dosubot Bot added the size:L label (this PR changes 100-499 lines, ignoring generated files) May 11, 2026

@sourcery-ai sourcery-ai Bot left a comment


Sorry @arvinxx, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5c0dfc64ba


  import type { CrawlImpl, CrawlSuccessResult } from '../type';
  import { PageNotFoundError, toFetchError } from '../utils/errorType';
- import { htmlToMarkdown } from '../utils/htmlToMarkdown';
+ import { htmlToMarkdown, MAX_HTML_SIZE } from '../utils/htmlToMarkdown';


[P2] Export the mocked size constant in crawler tests

Importing MAX_HTML_SIZE here breaks the existing packages/web-crawler/src/crawImpl/__tests__/naive.test.ts mock for ../../utils/htmlToMarkdown, which only returns htmlToMarkdown. Under Vitest's factory mocks, the newly imported export is missing, so the naive crawler tests fail before exercising this code unless the mock also provides MAX_HTML_SIZE (or uses importOriginal).



codecov Bot commented May 11, 2026


Codecov Report

❌ Patch coverage is 80.00000% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.18%. Comparing base (db22573) to head (d7c057d).

Additional details and impacted files
@@            Coverage Diff            @@
##           canary   #14660     +/-   ##
=========================================
  Coverage   66.18%   66.18%             
=========================================
  Files        2897     2897             
  Lines      253594   253602      +8     
  Branches    24739    29868   +5129     
=========================================
+ Hits       167831   167837      +6     
- Misses      85611    85613      +2     
  Partials      152      152             
Flag Coverage Δ
app 60.55% <ø> (+<0.01%) ⬆️
database 91.82% <ø> (ø)
packages/agent-runtime 80.48% <ø> (ø)
packages/builtin-tool-lobe-agent 83.41% <ø> (ø)
packages/context-engine 84.00% <ø> (ø)
packages/conversation-flow 92.43% <ø> (ø)
packages/file-loaders 87.60% <ø> (ø)
packages/memory-user-memory 74.74% <ø> (ø)
packages/model-bank 99.94% <ø> (ø)
packages/model-runtime 83.67% <ø> (ø)
packages/prompts 70.39% <ø> (ø)
packages/python-interpreter 92.90% <ø> (ø)
packages/ssrf-safe-fetch 0.00% <ø> (ø)
packages/types 5.44% <ø> (ø)
packages/utils 88.02% <ø> (ø)
packages/web-crawler 87.74% <80.00%> (-0.43%) ⬇️


Components Coverage Δ
Store 66.85% <ø> (ø)
Services 54.16% <ø> (ø)
Server 71.61% <ø> (+<0.01%) ⬆️
Libs 55.22% <ø> (ø)
Utils 82.51% <ø> (ø)

Verify that the maxContentLength cap actually prevents the production SIGABRT
scenario, not just produces a truncated body.

- Source-pull bound: a body source with 200 MB available, capped at 1 MB,
  must not be drained beyond ~1 MB. Asserts on bytes pulled from the
  generator, which is the property that prevents OOM.
- Concurrency bound: matches production CRAWL_CONCURRENCY=3 — three
  concurrent oversized fetches should pull at most ~3 MB total, not 300 MB.
- Heap-delta bound (gated on --expose-gc): under real GC pressure,
  fetching a 50 MB body with a 1 MB cap should grow heapUsed by < 10 MB.
  Run with `NODE_OPTIONS=--expose-gc bunx vitest run` to exercise; skipped
  by default so CI doesn't false-fail on GC timing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@arvinxx arvinxx merged commit ca873e3 into canary May 12, 2026
34 of 35 checks passed
@arvinxx arvinxx deleted the fix/web-crawler-body-size-cap branch May 12, 2026 15:21
emaxlele pushed a commit to emaxlele/lobehub that referenced this pull request May 12, 2026
…obehub#14660)

arvinxx added a commit that referenced this pull request May 12, 2026
…14660)

This was referenced May 12, 2026
arvinxx added a commit that referenced this pull request May 13, 2026
# 🚀 LobeHub Release (20260513)

**Hotfix Scope:** Ship the canary backlog (111 PRs) onto main as a
fast-tracked patch — operator-focused, no weekly-style write-up.

> Brings the accumulated canary work into main: agent/task improvements,
hetero-agent fixes, desktop & onboarding polish, and several reliability
caps.

## ✨ What's Included

- **Agent & tasks** — Self-review proposal-to-action automation,
sub-agent dispatch consolidated to `lobe-agent`, AskUserQuestion wiring
for Claude Code, scheduler/hotkey/TodoList polish. (#14583, #14657,
#14715, #14639, #14732, #14707, #14713)
- **Home & onboarding** — Daily brief with linkable welcome + paired
input hint, inline skill auth in recommended task templates, cleanup of
captcha-on-signin and marketplace early-exit. (#14589, #14676, #14573,
#14598)
- **Bots & integrations** — Slack MPIM support, Discord DM fix,
slash-command + connect-error fixes, gateway client-tool plugin state.
(#14733, #14591, #14596)
- **Desktop & CLI** — Windows `.cmd` shim detection for `claude` /
`codex` CLIs, auth focus & pending-login reset fixes. (#14720, #14694,
#14695)
- **Reliability** — Cap web-crawler body size and image binary at safe
limits, attach error listeners to Neon/Node pools, reject inactive OIDC
access. (#14660, #14711, #14606, #14674)
- **Database** — `agent_operations` table + persist agent operations
from the runtime; switch user memory search to `paradedb.match(...)`.
(#14416, #14736, #14590)

## ⚙️ Upgrade

- **Self-hosted:** pull the latest image and restart. Drizzle migrations
(including the new `agent_operations` table) run automatically on boot.
lezi-fun pushed a commit to lezi-fun/lobehub that referenced this pull request May 13, 2026
…obehub#14660)
