Skip to content

fix(plugins): recover managed npm root after corrupt install failure (#85113)#86122

Closed
p3nchan wants to merge 2 commits into
openclaw:mainfrom
p3nchan:fix/85113-npm-root-recovery
Closed

fix(plugins): recover managed npm root after corrupt install failure (#85113)#86122
p3nchan wants to merge 2 commits into
openclaw:mainfrom
p3nchan:fix/85113-npm-root-recovery

Conversation

@p3nchan

@p3nchan p3nchan commented May 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #85113. On a managed npm root (~/.openclaw/npm), npm's Arborist can crash with ERR_INVALID_ARG_TYPE and leave the root half-broken. After that, post-update official plugin sync fails (and disableOnFailure disables the plugin), and subsequent plugins install --force keeps hitting the same corrupted root.

This adds a generic, defensive recovery for the managed npm root, and removes a misleading message on the failure path.

What changed

  • src/infra/npm-managed-root.ts
    • shouldRebuildManagedNpmRootAfterInstallFailure() — gates recovery on the specific corrupt-root npm signature only (requires ERR_INVALID_ARG_TYPE and "from" argument and Received undefined to co-occur, so registry/network/auth failures don't trigger it).
    • quarantineManagedNpmRootForRebuild() — moves only the managed root's own node_modules / package-lock.json / npm-shrinkwrap.json into _openclaw-quarantined-npm-roots/corrupt-* (all destinations inside the managed root; uses fs.rename, which does not follow symlinks; no recursive delete). Root metadata (package.json, .npmrc, pack archives) is preserved.
  • src/plugins/install.ts — on the corrupt signature: quarantine → rebuild once → retry install once. If rebuild or retry still fails, it fails closed, preserving the quarantine and surfacing both the quarantine path and the original npm error.
  • src/cli/plugins-command-helpers.ts / plugins-install-command.ts — when the primary npm install (or the recovery rebuild) fails, no longer attempt the hook-pack fallback or print the misleading Also not a valid hook pack. Manifest-type failures still keep the hook-pack fallback.

Scope / honesty

  • Recovery lives in the install path, so update / doctor / manual install all benefit, without baking post-update or any per-plugin policy into core. It is not specialized for Codex.
  • The exact npm Arborist ERR_INVALID_ARG_TYPE crash is live/package/root-state dependent — I did not reproduce it from source. This is framed as defensive recovery for a corrupt managed root; proof is via mocked failures exercising both the success (quarantine → rebuild → retry → success) and fail-closed (retry-after-recovery fails) paths.
  • Recovery is strictly bounded to rebuild-once + retry-once (no loop).

Tests

All run via node scripts/run-vitest.mjs <file> -- --reporter=verbose:

  • src/infra/npm-managed-root.test.ts — 20 passed
  • src/plugins/install.npm-spec.test.ts — 39 passed (covers quarantine→rebuild→retry success and fail-closed)
  • src/cli/plugins-cli.install.test.ts — 87 passed (incl. new regression: no hook-pack fallback / no misleading message on rebuild-failure)
  • src/cli/update-cli.test.ts — 124 passed
  • pnpm format:check on touched files — clean

Reviewed independently before submission (cross-model review), which caught and fixed a gap where the rebuild-failure error didn't match the hook-pack suppression predicate.

pingu added 2 commits May 24, 2026 23:54
…penclaw#85113)

On a managed npm root (~/.openclaw/npm), npm Arborist can crash with
ERR_INVALID_ARG_TYPE and leave the root half-broken, so post-update
official plugin sync fails and disables the plugin; subsequent installs
keep hitting the same corrupted root.

Add a generic, defensive recovery: on the observed corrupt-root npm error
signature, quarantine only the managed root's own node_modules + lockfiles
into _openclaw-quarantined-npm-roots/, rebuild once, and retry the install
once. Fail closed (preserving the quarantine for inspection) otherwise.
Not specialized for any plugin. Also stop printing the misleading
"Also not a valid hook pack" message when the primary npm install failed.
…very failure

Address review: the managed-npm-root recovery rebuild-failure error
("managed npm root recovery failed after quarantining…") did not match the
hook-pack-fallback skip predicate, so the misleading "Also not a valid hook
pack" message could still print on that path. Add the recovery-failure prefix
to shouldSkipHookPackFallbackAfterPluginInstallError, add a CLI regression
test for the rebuild-failure path, and document the managed-root no-lock
quarantine assumption.
@openclaw-barnacle openclaw-barnacle Bot added cli CLI command changes size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 24, 2026
@clawsweeper

clawsweeper Bot commented May 24, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge. Reviewed May 24, 2026, 12:33 PM ET / 16:33 UTC.

Summary
The PR adds managed npm-root corruption detection/recovery, quarantines managed install artifacts before one rebuild/retry, and suppresses hook-pack fallback messaging for npm install failures.

PR surface: Source +172, Tests +303. Total +475 across 7 files.

Reproducibility: no. high-confidence live reproduction was established in this review. The linked issue has concrete logs and current-main source shows no corrupt-root rebuild path, while this PR proves the new behavior only with mocked npm failures.

Merge readiness
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🦐 gold shrimp
Result: blocked until real behavior proof from a real setup is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Preserve the npm hook-pack fallback while suppressing misleading secondary text only after fallback failure.
  • Add focused coverage for a valid npm hook pack when the plugin-first managed-root install fails.
  • Add redacted real behavior proof for quarantine, rebuild, retry, and restored plugin availability.

Proof guidance:
Needs real behavior proof before merge: The PR body says the exact Arborist crash was not reproduced and proof is mocked tests only; please add redacted terminal output, logs, screenshots, recording, or linked artifact from a real managed-root recovery, then update the PR body to trigger re-review or ask a maintainer to comment @clawsweeper re-review.

Risk before merge

  • The PR currently changes the unified npm install behavior so a valid documented npm hook pack can be blocked by a plugin-root npm failure before the hook-pack archive path is tried.
  • The managed-root recovery quarantines OpenClaw-managed node_modules and lockfiles during install/update; without real packaged or command proof, upgrade behavior is still only mocked.

Maintainer options:

  1. Preserve Hook-Pack Fallback (recommended)
    Keep trying the npm hook-pack fallback after primary plugin npm failures, and suppress the secondary hook-pack error only when that fallback also fails.
  2. Accept Plugin-First Fail-Fast
    If maintainers intentionally want primary npm install failures to block hook-pack fallback, document that compatibility change and require real recovery proof before landing.
  3. Pause For Real Recovery Proof
    Hold the PR until a redacted real managed-root recovery run demonstrates the quarantine, rebuild, retry, and post-update plugin availability path.

Next step before merge
A contributor needs to provide real behavior proof and address the hook-pack compatibility finding before this PR is merge-ready; this is not a safe repair-lane handoff while the proof must come from the contributor setup.

Security
Cleared: The diff touches install-state recovery but does not add new dependencies, workflows, secret handling, downloaded executable sources, or broader permissions beyond the existing npm install path.

Review findings

  • [P2] Preserve npm hook-pack fallback on primary install failures — src/cli/plugins-install-command.ts:352-354
Review details

Best possible solution:

Keep the bounded managed-root recovery, preserve documented hook-pack fallback semantics, and require redacted real install/update proof before merge.

Do we have a high-confidence way to reproduce the issue?

No high-confidence live reproduction was established in this review. The linked issue has concrete logs and current-main source shows no corrupt-root rebuild path, while this PR proves the new behavior only with mocked npm failures.

Is this the best way to solve the issue?

No, not yet. The recovery direction is maintainable, but returning before the hook-pack fallback is tried is not the safest compatible implementation for the documented unified installer.

Full review comments:

  • [P2] Preserve npm hook-pack fallback on primary install failures — src/cli/plugins-install-command.ts:352-354
    This return runs before tryInstallHookPackFromNpmSpec, so a documented npm hook pack can fail when the plugin-first managed npm install hits a corrupt-root or npm error even though the hook-pack path downloads the package archive separately. Keep the fallback attempt and only suppress Also not a valid hook pack when that fallback also fails.
    Confidence: 0.86

Overall correctness: patch is incorrect
Overall confidence: 0.83

Codex review notes: model gpt-5.5, reasoning high; reviewed against 79ee70c8ad8a.

Label changes

Label changes:

  • add P1: The linked bug can leave a configured Codex harness unavailable after update, affecting real agent/channel workflows.
  • add merge-risk: 🚨 compatibility: The diff changes the documented unified npm hook-pack install fallback behavior for primary plugin npm failures.
  • add merge-risk: 🚨 availability: The diff mutates managed plugin install state during recovery, and a failed or misrouted recovery can keep plugins unavailable until operator repair.
  • add rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦐 gold shrimp.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body says the exact Arborist crash was not reproduced and proof is mocked tests only; please add redacted terminal output, logs, screenshots, recording, or linked artifact from a real managed-root recovery, then update the PR body to trigger re-review or ask a maintainer to comment @clawsweeper re-review.

Label justifications:

  • P1: The linked bug can leave a configured Codex harness unavailable after update, affecting real agent/channel workflows.
  • merge-risk: 🚨 compatibility: The diff changes the documented unified npm hook-pack install fallback behavior for primary plugin npm failures.
  • merge-risk: 🚨 availability: The diff mutates managed plugin install state during recovery, and a failed or misrouted recovery can keep plugins unavailable until operator repair.
  • rating: 🧂 unranked krab: Overall readiness is 🧂 unranked krab; proof is 🧂 unranked krab and patch quality is 🦐 gold shrimp.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body says the exact Arborist crash was not reproduced and proof is mocked tests only; please add redacted terminal output, logs, screenshots, recording, or linked artifact from a real managed-root recovery, then update the PR body to trigger re-review or ask a maintainer to comment @clawsweeper re-review.
Evidence reviewed

PR surface:

Source +172, Tests +303. Total +475 across 7 files.

View PR surface stats
Area Files Added Removed Net
Source 4 173 1 +172
Tests 3 304 1 +303
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 7 477 2 +475

What I checked:

  • PR recovery path: The PR detects the specific npm error signature, runs managed-root recovery, then retries the npm install once before failing with a quarantine path. (src/plugins/install.ts:814, 93d00f311ec9)
  • PR hook fallback skip: The PR returns immediately for primary npm install errors before calling tryInstallHookPackFromNpmSpec. (src/cli/plugins-install-command.ts:352, 93d00f311ec9)
  • Documented hook-pack install surface: Current docs say plugins install is also the install surface for npm hook packs that expose openclaw.hooks. Public docs: docs/cli/plugins.md. (docs/cli/plugins.md:169, 79ee70c8ad8a)
  • Hook npm fallback uses separate archive path: installHooksFromNpmSpec downloads and installs from a validated npm package archive, so it can be independent of the managed plugin npm root failure that the plugin-first probe hit. (src/hooks/install.ts:406, 79ee70c8ad8a)
  • Current main fallback order: Current main always attempts the npm hook-pack fallback after a non-terminal plugin npm failure, then formats both errors only if that fallback also fails. (src/cli/plugins-install-command.ts:350, 79ee70c8ad8a)
  • Linked issue evidence: The linked issue comments report repeated managed-root npm failures, manual managed-root rebuild recovery, and successful plugin doctor/status afterward.

Likely related people:

  • vincentkoc: Current local blame for the managed npm install path and npm-managed-root helpers resolves to a recent commit by Vincent Koc, and public history shows related plugin install hardening work. (role: recent area contributor; confidence: medium; commits: 75ac11aca267, d2e0a8231f88, 89a9b4e75acd; files: src/plugins/install.ts, src/infra/npm-managed-root.ts, src/cli/plugins-command-helpers.ts)
  • steipete: Public path history shows multiple recent plugin install, npm pack, and managed-root commits by steipete touching the affected install and hook-pack surfaces. (role: recent managed-install contributor; confidence: medium; commits: 97d1f5fd153c, 86faf654db65, 2eaf8ad7126b; files: src/plugins/install.ts, src/infra/npm-managed-root.ts, src/cli/plugins-command-helpers.ts)
  • jalehman: Public path history shows jalehman authored the recent managed npm freshness bypass work that changed the same managed install surface. (role: managed npm install contributor; confidence: medium; commits: 85a3d5312f7d; files: src/plugins/install.ts, src/infra/npm-managed-root.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P1 High-priority user-facing bug, regression, or broken workflow. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. labels May 24, 2026
@clawsweeper

clawsweeper Bot commented May 24, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@omarshahine

Copy link
Copy Markdown
Contributor

Thanks for the PR. Before this can move forward, please add live proof from the affected surface, not just unit tests, mocked tests, or source inspection.

A useful proof update should include:

  • the exact build/SHA tested
  • the real environment used
  • the command, UI flow, channel flow, provider request, or other live path exercised
  • before/after symptom evidence where applicable
  • the observed result after the patch
  • any remaining proof gaps

Please redact secrets, tokens, phone numbers, and private message content from logs or screenshots.

@BingqingLyu

This comment was marked as spam.

@steipete steipete self-assigned this Jun 2, 2026
@steipete

steipete commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Thanks @p3nchan. I landed a reworked fix on main in a326faa10c.

This PR was the closest report/fix shape for the npm managed-root corruption, but current main has since moved managed npm installs to per-plugin project roots, so the flat-root rebuild approach here was no longer the right owner boundary. The landed fix detects npm's ERR_INVALID_ARG_TYPE/"from" argument corruption signature in the per-plugin project, quarantines node_modules and lockfiles, retries once, and keeps rebuilt dependency scanning intact. It also separately covers the Windows native-module EPERM swap cleanup path by keeping the successful new install while surfacing preserved old backups for delayed cleanup.

Proof:

  • local focused Vitest for plugin install, package update, and release-check fallback tests;
  • Testbox-through-Crabbox changed gate: tbx_01kt43arnm9kg7h284vfpshc0k, Actions run https://github.com/openclaw/openclaw/actions/runs/26818339001;
  • native Windows AWS Crabbox: cbx_68a3fd5bc935 / run_05f5965d70da, EPERM package-swap regression passed and Windows release-check fallback regressions passed.

Closing this PR as superseded by the landed main fix.

@steipete steipete closed this Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes merge-risk: 🚨 availability 🚨 May cause crashes, hangs, restart loops, stalls, or process outages. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. P1 High-priority user-facing bug, regression, or broken workflow. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: M status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

v2026.5.20 post-update Codex plugin sync can wedge managed npm root with npm ERR_INVALID_ARG_TYPE

4 participants