Skip to content

fix(hooks): self-heal partial plugin cache install from marketplace clone#699

Merged
mksglu merged 2 commits into
mksglu:nextfrom
kerneltoast:fix/heal-partial-install-from-marketplace
May 31, 2026
Merged

fix(hooks): self-heal partial plugin cache install from marketplace clone#699
mksglu merged 2 commits into
mksglu:nextfrom
kerneltoast:fix/heal-partial-install-from-marketplace

Conversation

@kerneltoast

@kerneltoast kerneltoast commented May 24, 2026

Copy link
Copy Markdown
Contributor

What / Why / How

What: A new hooks/heal-partial-install.mjs that detects a partial plugin cache install on every session start and re-copies missing files from the marketplace clone Claude Code maintains, plus a back-ported existsSync guard on cli.ts's /ctx-upgrade swap loop that closes a related latent vector.

Why (direct observations from the reproducer machine; everything in this section is fact, not speculation):

The per-version plugin cache dir at ~/.claude/plugins/cache/context-mode/context-mode/1.0.150/ held 73 files of the marketplace's tracked tree. Present: hooks/, .claude-plugin/, .cursor-plugin/, .pi/, web/, plus a few docs. Absent: cli.bundle.mjs, server.bundle.mjs, start.mjs, package.json, src/, bin/, scripts/, skills/, .codex-plugin/, .openclaw-plugin/.

The cache's .claude-plugin/plugin.json was a verbatim carry-forward from the user's previous 1.0.146 install, with mcpServers.command = "/usr/bin/bun" and mcpServers.args[0] referencing the 1.0.146 cache dir (which had been age-gate-cleaned by hooks/sessionstart.mjs (#181) and no longer existed). MCP launch ENOENTs every spawn.

None of the existing defenses repairs this:

So the cache cannot self-recover, and the only available workaround from the user side is rm -rf <version-dir> plus a session restart. The user reported hitting this several times across upgrade cycles.

What I could verify directly about the producing code path (also fact): Claude Code runs a plugin sync regardless of the marketplace's autoUpdate setting. With "autoUpdate": false in ~/.claude/settings.json for the mksglu/context-mode marketplace, a read-only claude plugin list invocation still created a new per-version cache dir matching the marketplace's current version. So autoUpdate=false does not opt out of CC's "ensure cache matches marketplace version" sync.

What I cannot verify (speculation, flagged explicitly): which CC code path produced this particular fingerprint (partial copy plus carry-forward plugin.json). The reproducer's /ctx-upgrade transcripts show this project's own cli.ts upgrade() took its "Already on latest" early-return path and the rm/cp loop did not run, so that's ruled out as the producer here. A clean-marketplace claude plugin list repro on my machine produced a complete cache install, not a partial one, so the specific failure mode needs additional conditions I could not pin down. Three adjacent upstream bug reports describe overlapping failure modes in CC's plugin sync: anthropics/claude-code#48912 (Update fails to materialize newly-added top-level directories), #57645 (installPath in installed_plugins.json lags cache extraction non-atomically), and #57646 (gitCommitSha in the registry never refreshed). I'm not asserting any one of those bugs is the exact mechanism in this instance; they are cited as documented adjacent failure modes in the same subsystem.

How: The heal module ships in hooks/ rather than scripts/ because the failure mode itself can strip scripts/ from the cache dir; hooks/ has been intact in every observed instance. Wired from hooks/sessionstart.mjs (primary trigger; fires even when MCP can't boot) and from start.mjs (belt-and-braces, runs before the Algo-D4 integrity gate so a fixable partial install is repaired rather than just reported).

Detection is a cheap 4-existsSync probe, so the healthy case is effectively free. On a trip, the module reads package.json files[] (prefers pluginRoot's, falls back to the marketplace's when pluginRoot/package.json is one of the missing files), expands to a concrete file list against the marketplace clone, and copies each entry that's missing in pluginRoot. After files land, it rewrites any carry-forward stale mcpServers.args[0] in plugin.json to the current pluginRoot, mirroring the #604 ratchet. The module is pure JS (Node built-ins only), never throws, idempotent, and path-traversal-guarded.

The secondary commit closes a separate partial-install vector on the /ctx-upgrade path: cli.ts:1207-1212's try { rmSync; cpSync } catch {} would swallow cpSync failures (or files[] entries that don't exist in the cloned source) and leave files deleted without replacement. The safe pattern (if (!existsSync(from)) continue) is already used in server.ts:3820's inline-fallback path; this back-ports it.

Caveat (vN+1 only): This lands in vN+1 and heals broken installs from that point forward. Users currently hitting the failure need to manually rm -rf their cache dir once to clear the old state. From then on the heal keeps the install healthy across all future CC sync cycles.

Fixes: this user's recurring partial-install issue. Related upstream (in CC's plugin sync, not this project): anthropics/claude-code#48912, anthropics/claude-code#57645, anthropics/claude-code#57646.

Affected platforms

  • Claude Code
  • Cursor
  • VS Code Copilot (GitHub Copilot)
  • JetBrains Copilot
  • Gemini CLI
  • Qwen Code
  • OpenCode
  • KiloCode
  • Codex CLI
  • OpenClaw (Pi Agent)
  • Pi
  • Kiro
  • Antigravity
  • Zed
  • All platforms

The heal probe and copy logic are CC-cache-layout-specific (<config>/plugins/cache/<owner>/<plugin>/<version>/). deriveMarketplaceClonePath returns null for any other layout (npm-global, dev checkout, opencode cache, ...) and the heal becomes a no-op. The secondary cli.ts guard is platform-agnostic since the upgrade flow is shared.

Test plan

26 new tests added to tests/core/cli.test.ts (folded into existing partial-install describe block per CONTRIBUTING L282; no new test files). Coverage:

  • isPartialInstall cheap probe across the launch-critical file matrix (healthy, missing start.mjs, missing package.json, missing both cli.bundle.mjs + build/cli.js, accepting build/cli.js as fallback, falsy input).
  • deriveMarketplaceClonePath layout helper across POSIX paths, trailing slashes, Windows backslashes, npm-global, dev checkout, and falsy input.
  • healPartialInstallFromMarketplace covering: early-return on healthy install, skip when marketplace clone is missing, path-traversal guard when pluginRoot equals marketplace clone path, full restore of missing files, fallback to marketplace's package.json when pluginRoot/package.json is missing, args[0] rewrite for carry-forward stale paths, args[0] rewrite for ${CLAUDE_PLUGIN_ROOT} placeholder, idempotent rewrite on healthy plugin.json, idempotent across re-runs, skip for non-CC layouts, no-overwrite of pre-existing user files, partial coverage when the marketplace itself is incomplete, and a full reproduction of the v1.0.150 partial-install + carry-forward fingerprint with end-to-end heal verification.
  • Wiring assertions on start.mjs and hooks/sessionstart.mjs that confirm the heal call shape and ordering relative to other operations.
  • A guard assertion on cli.ts that confirms the existsSync(from) check precedes the rmSync in the swap loop.

Regression checks executed manually: each fix intentionally disabled produced informative test failures with the corresponding assertion. The wiring tests are anchored on literal import/call shapes (not substring indexOf matches) so a renamed file or commented-out call still fails the test cleanly.

npm test's pretest builds tsc + bundles; the heal-related tests pass against the in-tree source without requiring the bundles. npm run typecheck is clean.

Checklist

  • Tests added/updated (TDD: red → green; intentional regression breaks verified)
  • npm run typecheck passes
  • npm test passes (3708 / 3742 green; 34 pre-existing skips, none related)
  • Docs updated if needed (README, platform-support.md) — N/A, no user-facing surface area changes
  • No Windows path regressions (deriveMarketplaceClonePath tested against backslash-form input; all file operations use path.join / path.resolve)
  • Targets next branch (unless hotfix) — PR opened against this branch; happy to retarget if next is preferred
Repro state of the original incident, for archive

~/.claude/plugins/cache/context-mode/context-mode/1.0.150/.claude-plugin/plugin.json before heal:

{
  "name": "context-mode",
  "version": "1.0.150",
  ...
  "mcpServers": {
    "context-mode": {
      "command": "/usr/bin/bun",
      "args": [
        "/home/sultan/.claude/plugins/cache/context-mode/context-mode/1.0.146/start.mjs"
      ]
    }
  }
}

1.0.146/ did not exist on disk (cleaned by sessionstart.mjs #181 age gate).

~/.claude/settings.json:

"marketplaces": {
  "context-mode": {
    "url": "https://github.com/mksglu/context-mode.git",
    "autoUpdate": false
  },
  ...
}

Sync trigger reproduction: claude plugin list with autoUpdate: false created ~/.claude/plugins/cache/context-mode/context-mode/1.0.151/ (this was a complete install, so the specific partial-copy failure mode was not reproduced; only the unconditional-sync behavior was).

@NgoQuocViet2001 NgoQuocViet2001 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manual Windows + Codex review notes:

  • Targeted heal tests passed locally.
  • pnpm run typecheck and pnpm run build passed locally.
  • I also reproduced a partial Claude plugin cache layout manually and confirmed the heal restores start.mjs, cli.bundle.mjs, server.bundle.mjs, clears the partial probe, and rewrites stale args[0] to the current version.

One path-containment guard concern inline before merge.

const out = new Set();
for (const entry of files) {
if (typeof entry !== "string" || !entry) continue;
const abs = join(rootDir, entry);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The heal flow works in my manual partial-install repro, but I think we need one more containment guard around files[] entries before merge. I tested a malformed marketplace package.json with files: ["../outside.txt", ...]; expandFilesArray() accepts it here, and the later copy loop writes outside pluginRoot. Could we reject absolute / .. entries, or resolve both from and to and ensure they stay under marketplaceClonePath and pluginRoot respectively before copying? That keeps a corrupted marketplace clone from turning the self-heal into an out-of-root write.

@kerneltoast kerneltoast May 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude (Anthropic, Opus 4.7) authoring this reply on Sultan's behalf.

Thanks for the catch! ..-escape entries are now rejected in expandFilesArray via lexical resolve(abs).startsWith(rootDir + sep), and both resolve(from) and resolve(to) are re-checked against their roots at the copy site as defense in depth.

Follow-up adversarial review surfaced a few adjacent vectors that landed in the same PR. On the source side we use lstat(from).isSymbolicLink() to drop leaf symlinks and realpathSync(from).startsWith(realpathSync(marketplaceClonePath) + sep) to collapse ancestor symlinks (e.g. a marketplace scripts/ dir swapped for a symlink to outside). On the destination side we lstat(to) and unlinkSync any pre-existing symlink, and the write itself uses writeFileSync(..., {flag: "wx"}) so a re-planted symlink at to after the unlink can't be followed via the open. plugin.json reads go through openSync(O_RDONLY | O_NOFOLLOW) + readFileSync(fd) so a symlink-redirected read can't feed attacker JSON into the args rewrite, and the rewrite itself writes to a 64-bit-random tmp suffix under flag: "wx" plus renameSync for atomicity.

We also added an explicit not-claude-code skip for any pluginRoot that doesn't match the CC cache layout, so non-CC clients (Codex, Cursor, OpenCode, Kiro, ...) bail before any FS work runs. The Contract block at the top of hooks/heal-partial-install.mjs enumerates the layered guards in full.

Test count went from 191 to 198 with regressions for each closed vector.

@kerneltoast kerneltoast force-pushed the fix/heal-partial-install-from-marketplace branch 2 times, most recently from 620d666 to 9d39497 Compare May 25, 2026 21:46

@NgoQuocViet2001 NgoQuocViet2001 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kerneltoast Re-tested after your containment fix. The original files[] traversal concern looks addressed to me.

Checked on Windows:

  • Normal partial cache repro heals start.mjs, cli.bundle.mjs, server.bundle.mjs, and package.json, clears the partial probe, and rewrites stale args[0] to the current version.
  • Malformed marketplace package.json with files: ["../OUTSIDE.txt", ...] is rejected without writing outside pluginRoot; legitimate entries still heal.
  • Non-Claude-Code plugin roots skip with not-claude-code.

Ran:

  • node node_modules/typescript/bin/tsc --noEmit
  • npm run build
  • node node_modules/vitest/vitest.mjs run tests/core/cli.test.ts

One non-blocking Windows-local note: on my machine without symlink privilege / Developer Mode, 3 symlink-specific tests fail at symlinkSync(..., ...) with EPERM. The PR's Windows CI is green, so I don't see it as a merge blocker; just something to keep in mind for local Windows dev boxes.

kerneltoast added a commit to kerneltoast/context-mode that referenced this pull request May 26, 2026
The /ctx-upgrade swap loop in cli.ts and the inline-fallback variant
in server.ts both iterate pkg.files[] read straight out of the
freshly cloned upstream package.json and hand each entry to rmSync
plus cpSync without checking that the resolved destination stays
inside pluginRoot. A compromised upstream tag at
kerneltoast/context-mode (cli.ts) or mksglu/context-mode (server.ts
inline) shipping
  "files": ["../../.ssh/authorized_keys", ...]
would, on the next /ctx-upgrade, delete and overwrite arbitrary
files under the user's UID. cli.ts is additionally vulnerable to
absolute-path entries since path.resolve(P, "/abs") discards P;
server.ts uses path.join, which defuses that variant but still
follows relative "..".

Fix it by mirroring the lexical containment pattern that
hooks/heal-partial-install.mjs (mksglu#699) already uses for its own
files[] expansion: resolve(rootDir) + sep, then drop items whose
resolved path doesn't startsWith it. Both ends are checked; the
"from" side blocks symlink-anchor variants where files[] is benign
but a symlink inside the clone redirects. server.ts's inline script
also switches from join() to resolve() so absolute-path entries
normalize to themselves and then fail the pluginRoot prefix check.

Tests in tests/core/cli.test.ts verify the guard appears at both
sites and exercise the algorithm against a sandbox tmpdir tree
where the malicious items target a planted victim file outside
pluginRoot.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mksglu

mksglu commented May 31, 2026

Copy link
Copy Markdown
Owner

Thanks @kerneltoast. Conflicting on next now after your own security PR #716 landed — both touched src/cli.ts swap loop.

The conflict is benign: #716 added refuseSymlinks filter, #699 adds existsSync(from) guard. They're compatible, just need textual reorder. Please rebase on next.

Verdict was MERGE_AS_IS — re-ping after rebase. The 712-LoC heal module + 6-layer path-traversal defense in hooks/heal-partial-install.mjs is ready as-is; only src/cli.ts needs the replay.

kerneltoast and others added 2 commits May 31, 2026 16:38
…on start

What was observed on the reproducer machine, directly from the
filesystem and session logs:

The per-version cache dir at
~/.claude/plugins/cache/context-mode/context-mode/1.0.150/ held 73
files of the marketplace's tracked tree. hooks/, .claude-plugin/,
.cursor-plugin/, .pi/, web/ and a few docs were present;
cli.bundle.mjs, server.bundle.mjs, start.mjs, package.json, src/,
bin/, scripts/, skills/, .codex-plugin/ and .openclaw-plugin/ were
absent. The cache's .claude-plugin/plugin.json was a verbatim
carry-forward from a previous 1.0.146 install, with
mcpServers.command = "/usr/bin/bun" and mcpServers.args[0]
referencing the 1.0.146 cache dir, which had been age-gate-cleaned
by hooks/sessionstart.mjs (mksglu#181) and no longer existed on disk.

None of the existing defenses repair this:

  - scripts/plugin-cache-integrity.mjs (mksglu#550) exits 2 from start.mjs,
    but start.mjs is one of the missing files.
  - The mksglu#604 stale-cache-version ratchet in hooks/normalize-hooks.mjs
    runs from start.mjs's normalize-hooks call, same boot-time
    dependency.
  - /ctx-upgrade needs the cli.bundle.mjs the partial install lost,
    so the user-facing escape hatch is gone too.

So the cache cannot self-recover, and the only available workaround
from the user side is `rm -rf <version-dir>` plus a session restart.
The user reported hitting this several times across upgrade cycles.

What I could verify directly about the producing code path: Claude
Code runs a plugin sync regardless of the marketplace's autoUpdate
setting. With `"autoUpdate": false` in ~/.claude/settings.json for the
mksglu/context-mode marketplace, a read-only `claude plugin list`
invocation still created a new per-version cache dir matching the
marketplace's current version. So autoUpdate=false does not opt out
of CC's "ensure cache matches marketplace version" sync.

What I cannot verify: which CC code path produced THIS particular
fingerprint (partial copy plus carry-forward plugin.json). The
reproducer's /ctx-upgrade transcripts show the cli.ts upgrade()
rm/cp loop took its "Already on latest" early-return path and did
not run, so that's not the producer here. A clean-marketplace
`claude plugin list` repro on my machine produced a *complete*
cache install, not a partial one, so the specific failure mode
needs additional conditions I could not pin down. Three adjacent
upstream bug reports describe overlapping failure modes in CC's
plugin sync: anthropics/claude-code#48912 (Update fails to
materialize newly-added top-level directories), #57645 (installPath
in installed_plugins.json lags cache extraction non-atomically),
and #57646 (gitCommitSha in the registry never refreshed on
/plugin update). I'm NOT asserting any one of those bugs is the
exact mechanism in this instance; I'm citing them as documented
adjacent failure modes in the same subsystem.

Fix it by adding hooks/heal-partial-install.mjs. The module detects
partial install on every session start (cheap 4-existsSync probe
keeps the healthy case effectively free) and re-copies missing
files from the marketplace clone CC maintains at
~/.claude/plugins/marketplaces/<owner>/. The clone is CC's canonical
source for the per-version cache dir when its sync works correctly,
so trusting it sidesteps whichever sync path produced the partial
copy. We also rewrite the carry-forward stale args[0] in plugin.json
(mirroring the mksglu#604 ratchet) so the next MCP launch targets the
current pluginRoot. Both halves are mechanism-agnostic; they don't
depend on knowing which CC code path created the broken state.

Beyond the core re-copy, the module's destructive paths are hardened
against path traversal and symlink attacks. files[] entries that
resolve outside rootDir are dropped at expansion. Directory walks
use lstat and skip symlinks, so a planted symlink in the marketplace
tree can't be harvested as a regular file. Just before each copy, we
lstat plus realpath on the source to catch both leaf and ancestor
symlinks, lstat the destination to unlink any pre-existing symlink,
and use writeFileSync with `flag: "wx"` so a re-planted symlink at
the leaf can't be followed via the open. plugin.json reads go
through openSync with O_NOFOLLOW so a redirected read can't feed
attacker JSON into the args rewrite, and the rewrite itself writes
to an unguessable random-suffixed tmp file plus renameSync for
atomicity. The Contract block in the module header enumerates the
layered guards in full.

The heal is also scoped to CC's per-version cache layout: any
pluginRoot that doesn't match
~/.claude/plugins/cache/<owner>/<plugin>/<version>/ bails
immediately with `skipped: "not-claude-code"`, before any FS work
runs. Other clients (Codex, Cursor, OpenCode, Kiro, ...) ship their
own SessionStart wrappers under hooks/<client>/ and don't go through
this module.

The heal lives in hooks/ rather than scripts/ because the failure
mode itself can strip scripts/ from the cache dir; hooks/ has been
intact in every observed instance. Wired from hooks/sessionstart.mjs
(primary trigger; fires even when the MCP server can't boot) and
from start.mjs (belt-and-braces, runs before the Algo-D4 integrity
gate so a fixable partial install is repaired rather than just
reported).

Note that this lands in vN+1 and heals broken installs from that
point forward. Users currently hitting the failure need to manually
rm -rf their cache dir once to clear the old state. From then on the
heal keeps the install healthy across all future CC sync cycles.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cli.ts's upgrade() rm/cp loop swallows cpSync failures via a catch
that was meant to absorb "some files may not exist in source". The
catch fires after rmSync has already run, though, so any cp failure
(or a `files[]` entry that doesn't exist in the cloned source tree)
leaves the cache dir with a deleted file and no replacement. That's
a partial-install vector on the /ctx-upgrade path, separate from
the broader CC-side partial install that the new heal in
hooks/heal-partial-install.mjs backstops.

server.ts's inline-fallback upgrade (the rescue path used when
neither cli.bundle.mjs nor build/cli.js exists at install time)
already guards with `if (existsSync(from))` before any rm/cp.
Back-port that same check to cli.ts so the safe pattern covers
both entrypoints.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mksglu mksglu force-pushed the fix/heal-partial-install-from-marketplace branch from a63d96c to ed2d089 Compare May 31, 2026 13:39
@mksglu mksglu merged commit 606383e into mksglu:next May 31, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants