Skip to content

security: size-guard importFromContent to stop cost amplification#46

Closed
garagon wants to merge 1 commit into
garrytan:masterfrom
garagon:security/f002-content-size-cap
Closed

security: size-guard importFromContent to stop cost amplification#46
garagon wants to merge 1 commit into
garrytan:masterfrom
garagon:security/f002-content-size-cap

Conversation

@garagon

@garagon garagon commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

Summary

importFromFile caps disk reads at MAX_FILE_SIZE = 5_000_000 via statSync before delegating to importFromContent. importFromContent itself has no guard, and the remote MCP put_page operation passes caller-supplied content straight into it — so a bearer-token holder can hand the function a multi-megabyte string, force the recursive chunker (src/core/chunkers/recursive.ts, 300-word chunks with 50-word overlap) to produce thousands of chunks, and make the owner pay OpenAI to embed all of them. Nothing in the path caps input size, chunk count, or token spend. Sustained traffic is a denial-of-wallet primitive against the victim's OpenAI key.

This PR closes the asymmetry: the in-memory path now enforces the same MAX_FILE_SIZE ceiling the on-disk path has always enforced.

Changes

src/core/import-file.ts

  • Added a Buffer.byteLength(content, 'utf-8') > MAX_FILE_SIZE check as the first statement in importFromContent. Rejection returns { status: 'skipped', chunks: 0, error: \Content too large (${byteLength} bytes, max ${MAX_FILE_SIZE})` }— the same shapeimportFromFile` already uses for the oversized-file case, so nothing downstream needs to branch on a new exception.
  • Added a docblock paragraph calling out why the guard has to live on this function (the remote MCP put_page path) instead of on importFromFile (which already has its own statSync guard).

test/import-file.test.ts

Three new tests pinned to the guard:

  • rejects in-memory content larger than MAX_FILE_SIZE — feeds a 5.1 MB string directly to importFromContent, asserts status === 'skipped', error contains 'too large', and — critically — the engine mock sees zero calls. The last assertion is what proves the guard short-circuits before parseMarkdown, chunkText, or any allocation that scales with content size runs.
  • uses UTF-8 byte length, not JS string length, for the size check — feeds 2.6 million 4-byte emoji codepoints (~10.4 MB UTF-8 but only 2.6M UTF-16 code units). A naive content.length > MAX_FILE_SIZE would let this through; Buffer.byteLength catches it. This is the exact bypass an attacker would use if they knew the guard existed but not how it was implemented.
  • accepts in-memory content just under MAX_FILE_SIZE — 4.9 MB content still imports. Pins the limit so an off-by-one regression breaks the test instead of legitimate large-import usage.

Why this fix and not the others in the PoC's "fix suggestions"

The audit PoC (report/evidence/poc-f002-content-cost-amplification.ts) lists four fix directions: (a) Buffer.byteLength guard at the top of importFromContent, (b) chunk count ceiling, (c) Hono bodyLimit middleware, (d) per-token rate limiting. Only (a) is in this PR:

  • (a) is the root-cause fix. Once input byte length is bounded, everything downstream is bounded too: the recursive chunker is linear in input, and chunks are linear in bytes. A chunk count ceiling (b) becomes redundant the moment the byte count is capped.
  • (c) is a defense-in-depth layer that rejects at the transport boundary before Deno parses the JSON. It's the right thing to add, but it changes the Hono middleware stack and interacts with CORS preflight — a separate concern from this one-function import-pipeline fix. Keeping it out keeps this PR small enough to review in one sitting.
  • (d) requires schema work on access_tokens and a request-counter in mcp_request_log. Also a separate PR.

Validation

The PoC in report/evidence/poc-f002-content-cost-amplification.ts (from the upstream audit, not part of this diff) runs 8 static + runtime checks. Its pass-all verdict marks the vuln confirmed.

Before this PR (same file stashed, same PoC run):

[PASS] C1  importFromContent has NO size check
       No content.length, MAX_FILE_SIZE, or byteLength guards found
       inside importFromContent
[PASS] C2  importFromFile DOES have MAX_FILE_SIZE guard (asymmetry confirmed)
[PASS] C2b MAX_FILE_SIZE=5_000_000 constant exists at module scope
[PASS] C3  put_page handler calls importFromContent(ctx.engine, p.slug, p.content)
[PASS] C4  put_page is NOT in REMOTE_EXCLUDED (reachable via remote MCP)
[PASS] C5  No bodyLimit() middleware on Hono app
[PASS] C6  1 MB payload produces >300 chunks (no upper bound in code path)
[PASS] C7  10 MB payload (2x importFromFile limit) exceeds $0.50/request
RESULT: VULNERABILITY CONFIRMED  (exit 1)

After this PR:

[FAIL] C1  importFromContent has NO size check
       Found guard: contentLen=false, MAX_FILE_SIZE=true, byteLen=true
[PASS] C2  importFromFile DOES have MAX_FILE_SIZE guard (asymmetry confirmed)
[PASS] C2b MAX_FILE_SIZE=5_000_000 constant exists at module scope
[PASS] C3  put_page handler calls importFromContent(ctx.engine, p.slug, p.content)
[PASS] C4  put_page is NOT in REMOTE_EXCLUDED (reachable via remote MCP)
[PASS] C5  No bodyLimit() middleware on Hono app
[PASS] C6  1 MB payload produces >300 chunks (no upper bound in code path)
[PASS] C7  10 MB payload (2x importFromFile limit) exceeds $0.50/request
RESULT: NOT CONFIRMED  (exit 0)

C1 — the check that pins the actual bug ("importFromContent has NO size check") — is the only one that flips. The other checks (C4, C5, C6, C7) are design facts, not bug conditions: put_page is still reachable, Hono still has no bodyLimit, the chunker still produces 300 chunks for 1 MB, 10 MB still costs >$0.50 if it ever reached the chunker. None of that matters now because the guard at the top of importFromContent makes that path unreachable — the function returns 'skipped' before any of those downstream stages run. The PoC's "all must pass" verdict correctly flips because the only failure-critical check is C1.

Full unit suite:

bun test
 340 pass
 122 skip
 0 fail
 1163 expect() calls
Ran 462 tests across 29 files.

Test plan

  • bun test unit tests pass (pasted above — 12/12 in import-file.test.ts including the 3 new ones)
  • bun run test:e2e E2E tests pass against real Postgres + pgvector
  • Deploy the edge function, issue a write-scoped token (see also security: enforce scopes on remote MCP + log real op names #45), and verify:
    • 1 MB put_page payload returns { status: 'skipped', error: 'Content too large...' } and produces zero entries in mcp_request_log with operation = 'put_page' that resulted in embedding cost
    • 100 KB put_page payload still imports normally
    • CLI gbrain import on a large markdown tree is unaffected (disk path's statSync guard triggers first, so nothing over 5 MB ever gets handed to importFromContent)

Not in this PR

  • Hono bodyLimit middleware — defense-in-depth, separate PR.
  • Per-token rate limit on access_tokens.id — needs schema work.
  • A chunk count ceiling in the chunker — unnecessary once input bytes are bounded (chunks are linear in input).

Note: this PR stacks cleanly on top of #45 (scope enforcement + audit log fix). Both are scoped to the remote MCP attack surface but touch different files, so either can land first.

importFromFile caps disk reads at MAX_FILE_SIZE (5 MB) via statSync
before delegating to importFromContent. importFromContent itself has
no guard, and the remote MCP put_page operation passes caller-supplied
content straight into it — so a bearer-token holder can hand the
function a multi-megabyte string, force the recursive chunker to
produce thousands of chunks, and make the owner pay OpenAI to embed
all of them. Nothing in the path caps input size, chunk count, or
token spend. Sustained traffic is a denial-of-wallet primitive against
the victim's OpenAI key.

This PR adds the missing guard at the top of importFromContent so the
in-memory path matches the on-disk path:

- Reject content whose UTF-8 byte length exceeds MAX_FILE_SIZE before
  any parsing, chunking, or embedding allocation happens.
- Use Buffer.byteLength(content, 'utf-8') so the byte count matches
  exactly how disk size is measured in importFromFile — no JS string
  length mismatch that would let a multi-byte codepoint slip through.
- Return the same { status: 'skipped', error } shape that importFromFile
  returns for the "file too large" case, so callers (put_page op, MCP
  handler, CLI import) don't need to branch on a new exception type.

Tests added in test/import-file.test.ts:

- 5.1 MB ASCII content → skipped, error contains "too large",
  engine sees zero calls (verifies the guard short-circuits before
  any allocation).
- 2.6M × 4-byte emoji (~10.4 MB UTF-8, 2.6M UTF-16 code units) →
  skipped. Guards against the exact bypass a naive `content.length`
  check would allow.
- 4.9 MB content (just under the limit) → imported. Pins the limit
  so a future off-by-one regression breaks the test instead of the
  legitimate large-import path.

Backwards compatibility: importFromFile's existing guard already
blocks anything over 5 MB from ever reaching importFromContent via
the on-disk path, so no in-tree caller observes a behavior change.
The only callers that see the new rejection are those that pass
content directly — today that's put_page, which is the intended
target of the fix.

Not in this PR (defense in depth, out of scope):

- Hono bodyLimit middleware on the Edge Function — would reject at
  the transport layer before Deno parses the JSON, but touching the
  middleware stack is a separate concern from the import-file guard.
- Per-token rate limiting on access_tokens.id.
- A chunks.length ceiling inside the chunker — not needed once the
  input byte count is bounded (chunks are linear in input).

Refs: report/evidence/poc-f002-content-cost-amplification.ts
@garrytan

Copy link
Copy Markdown
Owner

Included in fix wave PR #65 (v0.9.1). Content size guard landed with an improved error message that includes an actionable fix suggestion. Thanks garagon! 🙏

@garrytan garrytan closed this Apr 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants