security: size-guard importFromContent to stop cost amplification#46
Closed
garagon wants to merge 1 commit into
Closed
security: size-guard importFromContent to stop cost amplification#46garagon wants to merge 1 commit into
garagon wants to merge 1 commit into
Conversation
importFromFile caps disk reads at MAX_FILE_SIZE (5 MB) via statSync
before delegating to importFromContent. importFromContent itself has
no guard, and the remote MCP put_page operation passes caller-supplied
content straight into it — so a bearer-token holder can hand the
function a multi-megabyte string, force the recursive chunker to
produce thousands of chunks, and make the owner pay OpenAI to embed
all of them. Nothing in the path caps input size, chunk count, or
token spend. Sustained traffic is a denial-of-wallet primitive against
the victim's OpenAI key.
This PR adds the missing guard at the top of importFromContent so the
in-memory path matches the on-disk path:
- Reject content whose UTF-8 byte length exceeds MAX_FILE_SIZE before
any parsing, chunking, or embedding allocation happens.
- Use Buffer.byteLength(content, 'utf-8') so the byte count matches
exactly how disk size is measured in importFromFile — no JS string
length mismatch that would let a multi-byte codepoint slip through.
- Return the same { status: 'skipped', error } shape that importFromFile
returns for the "file too large" case, so callers (put_page op, MCP
handler, CLI import) don't need to branch on a new exception type.
Tests added in test/import-file.test.ts:
- 5.1 MB ASCII content → skipped, error contains "too large",
engine sees zero calls (verifies the guard short-circuits before
any allocation).
- 2.6M × 4-byte emoji (~10.4 MB UTF-8, 2.6M UTF-16 code units) →
skipped. Guards against the exact bypass a naive `content.length`
check would allow.
- 4.9 MB content (just under the limit) → imported. Pins the limit
so a future off-by-one regression breaks the test instead of the
legitimate large-import path.
Backwards compatibility: importFromFile's existing guard already
blocks anything over 5 MB from ever reaching importFromContent via
the on-disk path, so no in-tree caller observes a behavior change.
The only callers that see the new rejection are those that pass
content directly — today that's put_page, which is the intended
target of the fix.
Not in this PR (defense in depth, out of scope):
- Hono bodyLimit middleware on the Edge Function — would reject at
the transport layer before Deno parses the JSON, but touching the
middleware stack is a separate concern from the import-file guard.
- Per-token rate limiting on access_tokens.id.
- A chunks.length ceiling inside the chunker — not needed once the
input byte count is bounded (chunks are linear in input).
Refs: report/evidence/poc-f002-content-cost-amplification.ts
This was referenced Apr 11, 2026
Owner
|
Included in fix wave PR #65 (v0.9.1). Content size guard landed with an improved error message that includes an actionable fix suggestion. Thanks garagon! 🙏 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
importFromFilecaps disk reads atMAX_FILE_SIZE = 5_000_000viastatSyncbefore delegating toimportFromContent.importFromContentitself has no guard, and the remote MCPput_pageoperation passes caller-suppliedcontentstraight into it — so a bearer-token holder can hand the function a multi-megabyte string, force the recursive chunker (src/core/chunkers/recursive.ts, 300-word chunks with 50-word overlap) to produce thousands of chunks, and make the owner pay OpenAI to embed all of them. Nothing in the path caps input size, chunk count, or token spend. Sustained traffic is a denial-of-wallet primitive against the victim's OpenAI key.This PR closes the asymmetry: the in-memory path now enforces the same
MAX_FILE_SIZEceiling the on-disk path has always enforced.Changes
src/core/import-file.tsBuffer.byteLength(content, 'utf-8') > MAX_FILE_SIZEcheck as the first statement inimportFromContent. Rejection returns{ status: 'skipped', chunks: 0, error: \Content too large (${byteLength} bytes, max ${MAX_FILE_SIZE})` }— the same shapeimportFromFile` already uses for the oversized-file case, so nothing downstream needs to branch on a new exception.put_pagepath) instead of onimportFromFile(which already has its ownstatSyncguard).test/import-file.test.tsThree new tests pinned to the guard:
rejects in-memory content larger than MAX_FILE_SIZE— feeds a 5.1 MB string directly toimportFromContent, assertsstatus === 'skipped',errorcontains'too large', and — critically — the engine mock sees zero calls. The last assertion is what proves the guard short-circuits beforeparseMarkdown,chunkText, or any allocation that scales with content size runs.uses UTF-8 byte length, not JS string length, for the size check— feeds 2.6 million 4-byte emoji codepoints (~10.4 MB UTF-8 but only 2.6M UTF-16 code units). A naivecontent.length > MAX_FILE_SIZEwould let this through;Buffer.byteLengthcatches it. This is the exact bypass an attacker would use if they knew the guard existed but not how it was implemented.accepts in-memory content just under MAX_FILE_SIZE— 4.9 MB content still imports. Pins the limit so an off-by-one regression breaks the test instead of legitimate large-import usage.Why this fix and not the others in the PoC's "fix suggestions"
The audit PoC (
report/evidence/poc-f002-content-cost-amplification.ts) lists four fix directions: (a)Buffer.byteLengthguard at the top ofimportFromContent, (b) chunk count ceiling, (c) HonobodyLimitmiddleware, (d) per-token rate limiting. Only (a) is in this PR:access_tokensand a request-counter inmcp_request_log. Also a separate PR.Validation
The PoC in
report/evidence/poc-f002-content-cost-amplification.ts(from the upstream audit, not part of this diff) runs 8 static + runtime checks. Its pass-all verdict marks the vuln confirmed.Before this PR (same file stashed, same PoC run):
After this PR:
C1 — the check that pins the actual bug ("importFromContent has NO size check") — is the only one that flips. The other checks (C4, C5, C6, C7) are design facts, not bug conditions:
put_pageis still reachable, Hono still has nobodyLimit, the chunker still produces 300 chunks for 1 MB, 10 MB still costs >$0.50 if it ever reached the chunker. None of that matters now because the guard at the top ofimportFromContentmakes that path unreachable — the function returns'skipped'before any of those downstream stages run. The PoC's "all must pass" verdict correctly flips because the only failure-critical check is C1.Full unit suite:
Test plan
bun testunit tests pass (pasted above — 12/12 inimport-file.test.tsincluding the 3 new ones)bun run test:e2eE2E tests pass against real Postgres + pgvectorput_pagepayload returns{ status: 'skipped', error: 'Content too large...' }and produces zero entries inmcp_request_logwithoperation = 'put_page'that resulted in embedding costput_pagepayload still imports normallygbrain importon a large markdown tree is unaffected (disk path'sstatSyncguard triggers first, so nothing over 5 MB ever gets handed toimportFromContent)Not in this PR
bodyLimitmiddleware — defense-in-depth, separate PR.access_tokens.id— needs schema work.Note: this PR stacks cleanly on top of #45 (scope enforcement + audit log fix). Both are scoped to the remote MCP attack surface but touch different files, so either can land first.