Skip to content

fix(rebuild): handle partial tar backup when files are root-owned#2734

Merged
ericksoa merged 6 commits into
NVIDIA:mainfrom
kagura-agent:fix/rebuild-partial-tar-2727
May 4, 2026
Merged

fix(rebuild): handle partial tar backup when files are root-owned#2734
ericksoa merged 6 commits into
NVIDIA:mainfrom
kagura-agent:fix/rebuild-partial-tar-2727

Conversation

@kagura-agent

@kagura-agent kagura-agent commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #2727.

nemoclaw rebuild aborts when tar encounters root-owned files inside .openclaw-data. GNU tar exits with code 2 (errors encountered) but still writes all accessible files to stdout. The backup code previously treated any non-zero tar exit as a complete failure, marking all state dirs as failed and aborting the rebuild — even though the archive contained most of the data.

Changes

src/lib/sandbox-state.ts

  • Accept tar exit codes 0, 1, and 2 when stdout contains data (previously only exit 0)
  • After extraction, check which dirs actually landed on disk (existsSync) to distinguish per-dir success from failure — instead of marking all dirs as failed on any tar error

src/nemoclaw.ts

  • Only abort rebuild on total backup failure (zero dirs backed up)
  • When some dirs succeeded (partial backup), proceed with a warning so a handful of root-owned files don't block the entire rebuild

Behavior change

Scenario Before After
tar exit 0 (all files OK) ✅ Rebuild proceeds ✅ Rebuild proceeds (no change)
tar exit 2, some dirs backed up ❌ Rebuild aborts ⚠️ Warning + rebuild proceeds
tar exit 2, zero dirs backed up ❌ Rebuild aborts ❌ Rebuild aborts (no change)
tar exit >2 or no stdout ❌ Rebuild aborts ❌ Rebuild aborts (no change)

How to test

  1. Exec into a sandbox as root and create a permission-denied file:
    kubectl exec <sandbox> -- sh -c 'echo "{}" > /sandbox/.openclaw-data/agents/main/sessions/sessions.json && chmod 600 /sandbox/.openclaw-data/agents/main/sessions/sessions.json'
  2. Run nemoclaw <name> rebuild --yes
  3. Before: rebuild aborts with "Failed to back up sandbox state"
  4. After: rebuild proceeds with a partial-backup warning

Summary by CodeRabbit

  • Bug Fixes

    • Salvage archives even when the backup tool exits non‑zero but produced data; attempts partial extraction.
    • Extraction now verifies actual extracted dirs: only verified dirs are recorded as backed up; read‑error or missing dirs are excluded and reported.
    • Rebuild no longer aborts on partial backup failure — continues with warnings listing counts and names of failed dirs/files.
    • Restores prefer manifest-recorded backed-up directories to restore only verified data.
  • Tests

    • Added tests covering partial/failed backup and restore semantics and unsafe/invalid manifest cases.

Signed-off-by: Aaron Erickson aerickson@nvidia.com

@copy-pr-bot

copy-pr-bot Bot commented Apr 30, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

SSH+tar backups now accept GNU tar exit codes 0/1/2 when archive bytes are produced, attempt partial extraction, attribute read errors to specific state directories, and persist verified backedUpDirs in the manifest. Restore prefers backedUpDirs. Rebuild only aborts if nothing was backed up; partial backups warn and continue.

Changes

Backup extraction, manifest, and rebuild flow

Layer / File(s) Summary
Data shape / Manifest
src/lib/sandbox-state.ts
RebuildManifest gains optional backedUpDirs?: string[]; manifest.backedUpDirs is written after backup.
Validation / Utilities
src/lib/sandbox-state.ts
Adds isSafeStateDirPath, isStateDirArray, isBackedUpDirArray, and existingBackupDirs to validate paths and list existing dirs under a backup.
SSH+tar handling
src/lib/sandbox-state.ts
Treats tar exit codes 0,1,2 with non-empty stdout as eligible for extraction (tarExitedWithData) and logs an “attempting partial extraction” message for non-zero exits with data.
Extraction / Attribution
src/lib/sandbox-state.ts
Runs safeTarExtract on received archive bytes; computes extractedDirs from disk; uses failedDirsFromTarStderr(stderr, existingDirs) to attribute read failures to specific dirs; for non-attributed partial results, missing dirs are marked failed (or all failed if no attribution).
Restore wiring
src/lib/sandbox-state.ts
restoreSandboxState selects restorable dirs from manifest.backedUpDirs ?? manifest.stateDirs and filters by existingBackupDirs(backupPath, ...) before restoring.
Rebuild orchestration / logging
src/nemoclaw.ts
Abort-on-backup-failure changed: only abort if backup failed and no backed-up dirs/files exist; otherwise log warnings listing backed/failed dirs/files and continue; unchanged full-success path preserved.
Tests / Fixtures
test/snapshot.test.ts
New helpers writeOpenClawRegistry, writeFakeOpenshell; expanded listBackups tests for unsafe or non-subset backedUpDirs; new sandbox directory backup semantics tests simulate tar success, tar-per-dir failures, and non-attributed partial extractions to verify manifest.backedUpDirs, failedDirs attribution, and restore behavior.

Sequence Diagram

sequenceDiagram
  participant CLI
  participant Sandbox as "Sandbox (SSH+tar)"
  participant Extractor as "Local extractor\n(safeTarExtract + FS)"
  participant Manifest as "Manifest store\n(write/read)"

  CLI->>Sandbox: run tar -cf - -C /sandbox/.openclaw-data <stateDirs>
  Sandbox-->>CLI: stdout=archive bytes, stderr=tar errors, exit=0|1|2
  CLI->>Extractor: pass archive bytes to safeTarExtract
  Extractor->>Extractor: extract files -> compute extractedDirs (FS)
  Extractor->>Extractor: parse tar stderr -> attributedFailedDirs
  Extractor->>Manifest: compute backedUpDirs and failedDirs
  Manifest-->>Extractor: persist manifest.backedUpDirs
  CLI->>Manifest: on restore, read manifest.backedUpDirs ?? stateDirs
  CLI->>Extractor: restore only existing backedUpDirs from backupPath
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I munched through tar crumbs, whiskers full of bits,

Some burrows were blocked, some treasures still fit.
I flagged the dark tunnels, kept the bright in store,
Hopped on with a warning, and backed up a few more.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix(rebuild): handle partial tar backup when files are root-owned' accurately describes the main change: handling partial tar backups when encountering root-owned files.
Linked Issues check ✅ Passed The PR comprehensively addresses all objectives from issue #2727: treats tar exits 1/2 as partial success when stdout has data, marks per-directory results via readdirSync verification, allows rebuild to continue with warnings on partial backups, and only aborts on total backup failure.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the partial tar backup handling: modifications to sandbox-state.ts backup logic, nemoclaw.ts rebuild abort conditions, RebuildManifest interface, and corresponding test coverage for the new behavior.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/sandbox-state.ts`:
- Around line 766-776: The check that treats a directory as successfully backed
up if existsSync(path.join(backupPath, d)) is too weak because tar can create an
empty directory header; update the logic in the loop that populates
backedUpDirs/failedDirs (references: existingDirs, backedUpDirs, failedDirs) to
verify actual archived content under that directory instead of just directory
existence—e.g., inspect the extracted directory tree at path.join(backupPath, d)
(using readdir/traverse) and only mark as backed up if it contains at least one
regular file or non-empty subtree; otherwise log and push into failedDirs so
sandboxRebuild() behaves correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4349f3fa-12ab-48d2-96f1-f471fce921d6

📥 Commits

Reviewing files that changed from the base of the PR and between 16794d5 and 9afa0f2.

📒 Files selected for processing (2)
  • src/lib/sandbox-state.ts
  • src/nemoclaw.ts

Comment thread src/lib/sandbox-state.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/lib/sandbox-state.ts (1)

766-775: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

readdirSync(...).length > 0 is still the wrong success signal here.

Line 774 only proves the extracted top-level directory has an entry, not that any restorable file data survived. tar can still leave you with nested empty directory headers and make this path look successful, while a legitimately empty state dir now gets marked failed. That means backedUpDirs.length can still be wrong in both directions, which is risky because src/nemoclaw.ts:3239-3260 uses it to decide whether rebuild aborts or proceeds.

Please switch this to a stronger check: recursively confirm at least one extracted file exists, and separately treat source dirs that were known-empty before backup as successful.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 72879221-de48-465e-a1d5-dd23b0fd231d

📥 Commits

Reviewing files that changed from the base of the PR and between 9afa0f2 and 5f15266.

📒 Files selected for processing (1)
  • src/lib/sandbox-state.ts

@wscurran wscurran added bug Something fails against expected or documented behavior NemoClaw CLI dependencies Pull requests that update a dependency file labels Apr 30, 2026
@wscurran

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this PR that fixes a bug where nemoclaw rebuild aborts when encountering root-owned files. This proposes a change to the backup code to handle partial tar backups and proceed with a warning when some dirs succeed.


Related open issues:

@wscurran wscurran removed the dependencies Pull requests that update a dependency file label Apr 30, 2026
@kagura-agent

Copy link
Copy Markdown
Contributor Author

Re: CodeRabbit's concern about checking actual archived content vs directory existence —

The current implementation already handles this. Lines 773-778 use readdirSync(dirPath).length > 0 rather than existsSync() to verify that the directory contains actual extracted files, not just an empty tar header:

let hasContent = false;
try {
  hasContent = readdirSync(dirPath).length > 0;
} catch {
  /* dir doesn't exist at all */
}
if (hasContent) {
  backedUpDirs.push(d);
} else {
  _log(`Dir ${d} empty or missing from backup after extraction — marking failed`);
  failedDirs.push(d);
}

This correctly distinguishes between an empty directory header (tar exit 2 with permission-denied children) and a directory with actual backed-up content.

…IDIA#2727)

GNU tar exits with code 2 when it encounters permission-denied errors,
but still writes all accessible files to stdout.  The backup code
previously treated any non-zero tar exit as a complete failure, marking
all state dirs as failed and aborting the rebuild.

Changes:
- sandbox-state.ts: Accept tar exit codes 0, 1, and 2 when stdout
  contains data.  After extraction, check which dirs actually landed
  on disk to distinguish per-dir success from failure.
- nemoclaw.ts: Only abort rebuild on total backup failure (zero dirs
  backed up).  When some dirs succeeded, proceed with a warning so a
  handful of root-owned files don't block the entire rebuild.
Address CodeRabbit feedback: tar can create empty directory headers
even when all files inside were permission-denied.  Use readdirSync
to confirm at least one entry exists before marking a dir as backed up.
@kagura-agent kagura-agent force-pushed the fix/rebuild-partial-tar-2727 branch from 5f15266 to a333331 Compare May 4, 2026 12:26

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib/sandbox-state.ts (1)

1032-1048: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Verify extracted directories before marking them as backed up.

This branch marks every non-attributed directory as successful, but nothing checks that the directory actually landed under backupPath after extraction. If tar aborts before later args are written, or stderr attribution misses a path, manifest.backedUpDirs can contain false positives and rebuild will proceed on an incomplete backup.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/sandbox-state.ts` around lines 1032 - 1048, The current branch marks
any directory in existingDirs that isn't listed in tarFailedDirs as backed up
without confirming it actually exists under backupPath; change the logic in the
block that handles tarFailedDirs (around failedDirsFromTarStderr, existingDirs,
backedUpDirs, and backupPath) to verify each candidate directory was actually
extracted to backupPath (e.g., check filesystem presence or expected marker)
before pushing it into backedUpDirs/manifest.backedUpDirs, and if the directory
is missing treat it as failed (push to failedDirs and log an error), preserving
the existing tar-failure handling and logs.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/sandbox-state.ts`:
- Around line 167-168: The manifest validation currently only checks
backedUpDirs with isStringArray and then later uses it in the tar restore, which
allows path traversal; update validation around backedUpDirs to mirror
stateFiles checks: normalize each entry (using path.normalize/posix), reject
absolute paths and any entry that includes ".." or resolves outside the expected
base, and then intersect the validated backedUpDirs with stateDirs (or the
normalized list used for stateDirs) so only known state directories are allowed
as restore sources; apply the same stricter validation at both locations
referenced (the isStringArray check at backedUpDirs and the separate check
around lines 1148-1149) and ensure any failure causes manifest rejection rather
than trusting the input for the tar command.

---

Duplicate comments:
In `@src/lib/sandbox-state.ts`:
- Around line 1032-1048: The current branch marks any directory in existingDirs
that isn't listed in tarFailedDirs as backed up without confirming it actually
exists under backupPath; change the logic in the block that handles
tarFailedDirs (around failedDirsFromTarStderr, existingDirs, backedUpDirs, and
backupPath) to verify each candidate directory was actually extracted to
backupPath (e.g., check filesystem presence or expected marker) before pushing
it into backedUpDirs/manifest.backedUpDirs, and if the directory is missing
treat it as failed (push to failedDirs and log an error), preserving the
existing tar-failure handling and logs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 54f3c417-e608-4379-84bb-77f341d1bc4a

📥 Commits

Reviewing files that changed from the base of the PR and between a333331 and 9cb76e7.

📒 Files selected for processing (3)
  • src/lib/sandbox-state.ts
  • src/nemoclaw.ts
  • test/snapshot.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/nemoclaw.ts

Comment thread src/lib/sandbox-state.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/sandbox-state.ts`:
- Around line 601-608: The helper existingBackupDirs currently uses existsSync
and can return files or broken symlinks; change it to call lstatSync on
path.join(backupPath, dirName) and only push dirName if stats.isDirectory() is
true, wrapping the lstatSync call in a try/catch to ignore non-existent/broken
entries or errors. Update function existingBackupDirs to perform this
directory-only check so restoreSandboxState and downstream localDirs only see
real directories.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a348755c-a89f-490a-84eb-cf1961b75427

📥 Commits

Reviewing files that changed from the base of the PR and between 9cb76e7 and d97d986.

📒 Files selected for processing (2)
  • src/lib/sandbox-state.ts
  • test/snapshot.test.ts

Comment thread src/lib/sandbox-state.ts

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. I reviewed the current head ea9f322 and the focused fix now directly addresses #2727: partial tar output is preserved without treating clean empty dirs as failures, failed dirs are excluded from restore, manifest-backed restorable dirs are validated, and restore only uses real extracted directories. Local build/typecheck/focused tests pass, and the targeted nightly-e2e run for snapshot-commands-e2e and rebuild-openclaw-e2e is green.

@ericksoa ericksoa enabled auto-merge (squash) May 4, 2026 23:04
@ericksoa ericksoa merged commit 75ee317 into NVIDIA:main May 4, 2026
45 of 48 checks passed
ericksoa added a commit to Saibabu7770/NemoClaw that referenced this pull request May 8, 2026
Replays blueprint policy additions application and strict Ollama chat-completions tool-call validation on current main. Drops the rebuild backup portion because NVIDIA#2734 already covers NVIDIA#2727.
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression and removed NemoClaw CLI bug Something fails against expected or documented behavior labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nemoclaw rebuild aborts when files in .openclaw-data are root-owned

3 participants