fix(hooks): self-heal partial plugin cache install from marketplace clone#699
Conversation
NgoQuocViet2001
left a comment
There was a problem hiding this comment.
Manual Windows + Codex review notes:
- Targeted heal tests passed locally.
pnpm run typecheckandpnpm run buildpassed locally.- I also reproduced a partial Claude plugin cache layout manually and confirmed the heal restores
start.mjs,cli.bundle.mjs,server.bundle.mjs, clears the partial probe, and rewrites staleargs[0]to the current version.
One path-containment guard concern inline before merge.
| const out = new Set(); | ||
| for (const entry of files) { | ||
| if (typeof entry !== "string" || !entry) continue; | ||
| const abs = join(rootDir, entry); |
There was a problem hiding this comment.
The heal flow works in my manual partial-install repro, but I think we need one more containment guard around files[] entries before merge. I tested a malformed marketplace package.json with files: ["../outside.txt", ...]; expandFilesArray() accepts it here, and the later copy loop writes outside pluginRoot. Could we reject absolute / .. entries, or resolve both from and to and ensure they stay under marketplaceClonePath and pluginRoot respectively before copying? That keeps a corrupted marketplace clone from turning the self-heal into an out-of-root write.
There was a problem hiding this comment.
Claude (Anthropic, Opus 4.7) authoring this reply on Sultan's behalf.
Thanks for the catch! ..-escape entries are now rejected in expandFilesArray via lexical resolve(abs).startsWith(rootDir + sep), and both resolve(from) and resolve(to) are re-checked against their roots at the copy site as defense in depth.
Follow-up adversarial review surfaced a few adjacent vectors that landed in the same PR. On the source side we use lstat(from).isSymbolicLink() to drop leaf symlinks and realpathSync(from).startsWith(realpathSync(marketplaceClonePath) + sep) to collapse ancestor symlinks (e.g. a marketplace scripts/ dir swapped for a symlink to outside). On the destination side we lstat(to) and unlinkSync any pre-existing symlink, and the write itself uses writeFileSync(..., {flag: "wx"}) so a re-planted symlink at to after the unlink can't be followed via the open. plugin.json reads go through openSync(O_RDONLY | O_NOFOLLOW) + readFileSync(fd) so a symlink-redirected read can't feed attacker JSON into the args rewrite, and the rewrite itself writes to a 64-bit-random tmp suffix under flag: "wx" plus renameSync for atomicity.
We also added an explicit not-claude-code skip for any pluginRoot that doesn't match the CC cache layout, so non-CC clients (Codex, Cursor, OpenCode, Kiro, ...) bail before any FS work runs. The Contract block at the top of hooks/heal-partial-install.mjs enumerates the layered guards in full.
Test count went from 191 to 198 with regressions for each closed vector.
620d666 to
9d39497
Compare
NgoQuocViet2001
left a comment
There was a problem hiding this comment.
@kerneltoast Re-tested after your containment fix. The original files[] traversal concern looks addressed to me.
Checked on Windows:
- Normal partial cache repro heals
start.mjs,cli.bundle.mjs,server.bundle.mjs, andpackage.json, clears the partial probe, and rewrites staleargs[0]to the current version. - Malformed marketplace
package.jsonwithfiles: ["../OUTSIDE.txt", ...]is rejected without writing outsidepluginRoot; legitimate entries still heal. - Non-Claude-Code plugin roots skip with
not-claude-code.
Ran:
node node_modules/typescript/bin/tsc --noEmitnpm run buildnode node_modules/vitest/vitest.mjs run tests/core/cli.test.ts
One non-blocking Windows-local note: on my machine without symlink privilege / Developer Mode, 3 symlink-specific tests fail at symlinkSync(..., ...) with EPERM. The PR's Windows CI is green, so I don't see it as a merge blocker; just something to keep in mind for local Windows dev boxes.
The /ctx-upgrade swap loop in cli.ts and the inline-fallback variant in server.ts both iterate pkg.files[] read straight out of the freshly cloned upstream package.json and hand each entry to rmSync plus cpSync without checking that the resolved destination stays inside pluginRoot. A compromised upstream tag at kerneltoast/context-mode (cli.ts) or mksglu/context-mode (server.ts inline) shipping "files": ["../../.ssh/authorized_keys", ...] would, on the next /ctx-upgrade, delete and overwrite arbitrary files under the user's UID. cli.ts is additionally vulnerable to absolute-path entries since path.resolve(P, "/abs") discards P; server.ts uses path.join, which defuses that variant but still follows relative "..". Fix it by mirroring the lexical containment pattern that hooks/heal-partial-install.mjs (mksglu#699) already uses for its own files[] expansion: resolve(rootDir) + sep, then drop items whose resolved path doesn't startsWith it. Both ends are checked; the "from" side blocks symlink-anchor variants where files[] is benign but a symlink inside the clone redirects. server.ts's inline script also switches from join() to resolve() so absolute-path entries normalize to themselves and then fail the pluginRoot prefix check. Tests in tests/core/cli.test.ts verify the guard appears at both sites and exercise the algorithm against a sandbox tmpdir tree where the malicious items target a planted victim file outside pluginRoot. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @kerneltoast. Conflicting on The conflict is benign: #716 added Verdict was MERGE_AS_IS — re-ping after rebase. The 712-LoC heal module + 6-layer path-traversal defense in |
…on start What was observed on the reproducer machine, directly from the filesystem and session logs: The per-version cache dir at ~/.claude/plugins/cache/context-mode/context-mode/1.0.150/ held 73 files of the marketplace's tracked tree. hooks/, .claude-plugin/, .cursor-plugin/, .pi/, web/ and a few docs were present; cli.bundle.mjs, server.bundle.mjs, start.mjs, package.json, src/, bin/, scripts/, skills/, .codex-plugin/ and .openclaw-plugin/ were absent. The cache's .claude-plugin/plugin.json was a verbatim carry-forward from a previous 1.0.146 install, with mcpServers.command = "/usr/bin/bun" and mcpServers.args[0] referencing the 1.0.146 cache dir, which had been age-gate-cleaned by hooks/sessionstart.mjs (mksglu#181) and no longer existed on disk. None of the existing defenses repair this: - scripts/plugin-cache-integrity.mjs (mksglu#550) exits 2 from start.mjs, but start.mjs is one of the missing files. - The mksglu#604 stale-cache-version ratchet in hooks/normalize-hooks.mjs runs from start.mjs's normalize-hooks call, same boot-time dependency. - /ctx-upgrade needs the cli.bundle.mjs the partial install lost, so the user-facing escape hatch is gone too. So the cache cannot self-recover, and the only available workaround from the user side is `rm -rf <version-dir>` plus a session restart. The user reported hitting this several times across upgrade cycles. What I could verify directly about the producing code path: Claude Code runs a plugin sync regardless of the marketplace's autoUpdate setting. With `"autoUpdate": false` in ~/.claude/settings.json for the mksglu/context-mode marketplace, a read-only `claude plugin list` invocation still created a new per-version cache dir matching the marketplace's current version. So autoUpdate=false does not opt out of CC's "ensure cache matches marketplace version" sync. What I cannot verify: which CC code path produced THIS particular fingerprint (partial copy plus carry-forward plugin.json). The reproducer's /ctx-upgrade transcripts show the cli.ts upgrade() rm/cp loop took its "Already on latest" early-return path and did not run, so that's not the producer here. A clean-marketplace `claude plugin list` repro on my machine produced a *complete* cache install, not a partial one, so the specific failure mode needs additional conditions I could not pin down. Three adjacent upstream bug reports describe overlapping failure modes in CC's plugin sync: anthropics/claude-code#48912 (Update fails to materialize newly-added top-level directories), #57645 (installPath in installed_plugins.json lags cache extraction non-atomically), and #57646 (gitCommitSha in the registry never refreshed on /plugin update). I'm NOT asserting any one of those bugs is the exact mechanism in this instance; I'm citing them as documented adjacent failure modes in the same subsystem. Fix it by adding hooks/heal-partial-install.mjs. The module detects partial install on every session start (cheap 4-existsSync probe keeps the healthy case effectively free) and re-copies missing files from the marketplace clone CC maintains at ~/.claude/plugins/marketplaces/<owner>/. The clone is CC's canonical source for the per-version cache dir when its sync works correctly, so trusting it sidesteps whichever sync path produced the partial copy. We also rewrite the carry-forward stale args[0] in plugin.json (mirroring the mksglu#604 ratchet) so the next MCP launch targets the current pluginRoot. Both halves are mechanism-agnostic; they don't depend on knowing which CC code path created the broken state. Beyond the core re-copy, the module's destructive paths are hardened against path traversal and symlink attacks. files[] entries that resolve outside rootDir are dropped at expansion. Directory walks use lstat and skip symlinks, so a planted symlink in the marketplace tree can't be harvested as a regular file. Just before each copy, we lstat plus realpath on the source to catch both leaf and ancestor symlinks, lstat the destination to unlink any pre-existing symlink, and use writeFileSync with `flag: "wx"` so a re-planted symlink at the leaf can't be followed via the open. plugin.json reads go through openSync with O_NOFOLLOW so a redirected read can't feed attacker JSON into the args rewrite, and the rewrite itself writes to an unguessable random-suffixed tmp file plus renameSync for atomicity. The Contract block in the module header enumerates the layered guards in full. The heal is also scoped to CC's per-version cache layout: any pluginRoot that doesn't match ~/.claude/plugins/cache/<owner>/<plugin>/<version>/ bails immediately with `skipped: "not-claude-code"`, before any FS work runs. Other clients (Codex, Cursor, OpenCode, Kiro, ...) ship their own SessionStart wrappers under hooks/<client>/ and don't go through this module. The heal lives in hooks/ rather than scripts/ because the failure mode itself can strip scripts/ from the cache dir; hooks/ has been intact in every observed instance. Wired from hooks/sessionstart.mjs (primary trigger; fires even when the MCP server can't boot) and from start.mjs (belt-and-braces, runs before the Algo-D4 integrity gate so a fixable partial install is repaired rather than just reported). Note that this lands in vN+1 and heals broken installs from that point forward. Users currently hitting the failure need to manually rm -rf their cache dir once to clear the old state. From then on the heal keeps the install healthy across all future CC sync cycles. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cli.ts's upgrade() rm/cp loop swallows cpSync failures via a catch that was meant to absorb "some files may not exist in source". The catch fires after rmSync has already run, though, so any cp failure (or a `files[]` entry that doesn't exist in the cloned source tree) leaves the cache dir with a deleted file and no replacement. That's a partial-install vector on the /ctx-upgrade path, separate from the broader CC-side partial install that the new heal in hooks/heal-partial-install.mjs backstops. server.ts's inline-fallback upgrade (the rescue path used when neither cli.bundle.mjs nor build/cli.js exists at install time) already guards with `if (existsSync(from))` before any rm/cp. Back-port that same check to cli.ts so the safe pattern covers both entrypoints. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a63d96c to
ed2d089
Compare
What / Why / How
What: A new
hooks/heal-partial-install.mjsthat detects a partial plugin cache install on every session start and re-copies missing files from the marketplace clone Claude Code maintains, plus a back-portedexistsSyncguard oncli.ts's/ctx-upgradeswap loop that closes a related latent vector.Why (direct observations from the reproducer machine; everything in this section is fact, not speculation):
The per-version plugin cache dir at
~/.claude/plugins/cache/context-mode/context-mode/1.0.150/held 73 files of the marketplace's tracked tree. Present:hooks/,.claude-plugin/,.cursor-plugin/,.pi/,web/, plus a few docs. Absent:cli.bundle.mjs,server.bundle.mjs,start.mjs,package.json,src/,bin/,scripts/,skills/,.codex-plugin/,.openclaw-plugin/.The cache's
.claude-plugin/plugin.jsonwas a verbatim carry-forward from the user's previous 1.0.146 install, withmcpServers.command = "/usr/bin/bun"andmcpServers.args[0]referencing the 1.0.146 cache dir (which had been age-gate-cleaned byhooks/sessionstart.mjs(#181) and no longer existed). MCP launch ENOENTs every spawn.None of the existing defenses repairs this:
scripts/plugin-cache-integrity.mjs(1.0.123 npm tarball missing server.bundle.mjs and start.mjs — MCP server silently fails on CC plugin installs #550) exits 2 fromstart.mjs, butstart.mjsis one of the missing files.start.mjs's normalize-hooks call, same boot-time dependency./ctx-upgradeneeds thecli.bundle.mjsthe partial install lost, so the user-facing escape hatch is gone too.So the cache cannot self-recover, and the only available workaround from the user side is
rm -rf <version-dir>plus a session restart. The user reported hitting this several times across upgrade cycles.What I could verify directly about the producing code path (also fact): Claude Code runs a plugin sync regardless of the marketplace's
autoUpdatesetting. With"autoUpdate": falsein~/.claude/settings.jsonfor themksglu/context-modemarketplace, a read-onlyclaude plugin listinvocation still created a new per-version cache dir matching the marketplace's current version. SoautoUpdate=falsedoes not opt out of CC's "ensure cache matches marketplace version" sync.What I cannot verify (speculation, flagged explicitly): which CC code path produced this particular fingerprint (partial copy plus carry-forward
plugin.json). The reproducer's/ctx-upgradetranscripts show this project's owncli.ts upgrade()took its "Already on latest" early-return path and the rm/cp loop did not run, so that's ruled out as the producer here. A clean-marketplaceclaude plugin listrepro on my machine produced a complete cache install, not a partial one, so the specific failure mode needs additional conditions I could not pin down. Three adjacent upstream bug reports describe overlapping failure modes in CC's plugin sync: anthropics/claude-code#48912 (Update fails to materialize newly-added top-level directories), #57645 (installPathininstalled_plugins.jsonlags cache extraction non-atomically), and #57646 (gitCommitShain the registry never refreshed). I'm not asserting any one of those bugs is the exact mechanism in this instance; they are cited as documented adjacent failure modes in the same subsystem.How: The heal module ships in
hooks/rather thanscripts/because the failure mode itself can stripscripts/from the cache dir;hooks/has been intact in every observed instance. Wired fromhooks/sessionstart.mjs(primary trigger; fires even when MCP can't boot) and fromstart.mjs(belt-and-braces, runs before the Algo-D4 integrity gate so a fixable partial install is repaired rather than just reported).Detection is a cheap 4-existsSync probe, so the healthy case is effectively free. On a trip, the module reads
package.json files[](preferspluginRoot's, falls back to the marketplace's whenpluginRoot/package.jsonis one of the missing files), expands to a concrete file list against the marketplace clone, and copies each entry that's missing in pluginRoot. After files land, it rewrites any carry-forward stalemcpServers.args[0]inplugin.jsonto the current pluginRoot, mirroring the #604 ratchet. The module is pure JS (Node built-ins only), never throws, idempotent, and path-traversal-guarded.The secondary commit closes a separate partial-install vector on the
/ctx-upgradepath:cli.ts:1207-1212'stry { rmSync; cpSync } catch {}would swallowcpSyncfailures (orfiles[]entries that don't exist in the cloned source) and leave files deleted without replacement. The safe pattern (if (!existsSync(from)) continue) is already used inserver.ts:3820's inline-fallback path; this back-ports it.Caveat (vN+1 only): This lands in
vN+1and heals broken installs from that point forward. Users currently hitting the failure need to manuallyrm -rftheir cache dir once to clear the old state. From then on the heal keeps the install healthy across all future CC sync cycles.Fixes: this user's recurring partial-install issue. Related upstream (in CC's plugin sync, not this project): anthropics/claude-code#48912, anthropics/claude-code#57645, anthropics/claude-code#57646.
Affected platforms
The heal probe and copy logic are CC-cache-layout-specific (
<config>/plugins/cache/<owner>/<plugin>/<version>/).deriveMarketplaceClonePathreturnsnullfor any other layout (npm-global, dev checkout, opencode cache, ...) and the heal becomes a no-op. The secondarycli.tsguard is platform-agnostic since the upgrade flow is shared.Test plan
26 new tests added to
tests/core/cli.test.ts(folded into existing partial-install describe block per CONTRIBUTING L282; no new test files). Coverage:isPartialInstallcheap probe across the launch-critical file matrix (healthy, missingstart.mjs, missingpackage.json, missing bothcli.bundle.mjs+build/cli.js, acceptingbuild/cli.jsas fallback, falsy input).deriveMarketplaceClonePathlayout helper across POSIX paths, trailing slashes, Windows backslashes, npm-global, dev checkout, and falsy input.healPartialInstallFromMarketplacecovering: early-return on healthy install, skip when marketplace clone is missing, path-traversal guard when pluginRoot equals marketplace clone path, full restore of missing files, fallback to marketplace'spackage.jsonwhenpluginRoot/package.jsonis missing, args[0] rewrite for carry-forward stale paths, args[0] rewrite for${CLAUDE_PLUGIN_ROOT}placeholder, idempotent rewrite on healthyplugin.json, idempotent across re-runs, skip for non-CC layouts, no-overwrite of pre-existing user files, partial coverage when the marketplace itself is incomplete, and a full reproduction of the v1.0.150 partial-install + carry-forward fingerprint with end-to-end heal verification.start.mjsandhooks/sessionstart.mjsthat confirm the heal call shape and ordering relative to other operations.cli.tsthat confirms theexistsSync(from)check precedes thermSyncin the swap loop.Regression checks executed manually: each fix intentionally disabled produced informative test failures with the corresponding assertion. The wiring tests are anchored on literal import/call shapes (not substring
indexOfmatches) so a renamed file or commented-out call still fails the test cleanly.npm test's pretest builds tsc + bundles; the heal-related tests pass against the in-tree source without requiring the bundles.npm run typecheckis clean.Checklist
npm run typecheckpassesnpm testpasses (3708 / 3742 green; 34 pre-existing skips, none related)deriveMarketplaceClonePathtested against backslash-form input; all file operations usepath.join/path.resolve)nextbranch (unless hotfix) — PR opened against this branch; happy to retarget ifnextis preferredRepro state of the original incident, for archive
~/.claude/plugins/cache/context-mode/context-mode/1.0.150/.claude-plugin/plugin.jsonbefore heal:{ "name": "context-mode", "version": "1.0.150", ... "mcpServers": { "context-mode": { "command": "/usr/bin/bun", "args": [ "/home/sultan/.claude/plugins/cache/context-mode/context-mode/1.0.146/start.mjs" ] } } }1.0.146/did not exist on disk (cleaned bysessionstart.mjs#181 age gate).~/.claude/settings.json:Sync trigger reproduction:
claude plugin listwithautoUpdate: falsecreated~/.claude/plugins/cache/context-mode/context-mode/1.0.151/(this was a complete install, so the specific partial-copy failure mode was not reproduced; only the unconditional-sync behavior was).