Skip to content

fix(backup): retry on tar EOF race and skip known volatile files#72251

Merged
steipete merged 10 commits intoopenclaw:mainfrom
abnershang:fix/backup-create-tar-eof-race
May 9, 2026
Merged

fix(backup): retry on tar EOF race and skip known volatile files#72251
steipete merged 10 commits intoopenclaw:mainfrom
abnershang:fix/backup-create-tar-eof-race

Conversation

@abnershang
Copy link
Copy Markdown
Contributor

@abnershang abnershang commented Apr 26, 2026

Problem

openclaw backup create fails with Error: did not encounter expected EOF on any live installation. Root cause: node-tar's WriteEntry records size from the initial lstat(); if the file grows between lstat() and the end of fs.read(), the header no longer matches the consumed bytes, the stream errors, and tar.c() aborts the whole archive. Common culprits are append-only JSONL transcripts and .log files under the state directory.

Details and generic reproducer: #72249. Related but distinct races: #67417 (ENOENT when a session file is deleted mid-backup). Broader exclude-rule proposal: #67990.

Impact

Affects any live install where backup source files are written concurrently — session transcripts, cron run logs, gateway logs. Today the workaround is to stop the gateway before backing up, which is not viable for scheduled backups.

Fix

  1. Default volatile-path filter in the backup archiver — no user config required:

    • {stateDir}/sessions/**/*.{jsonl,log}
    • {stateDir}/cron/runs/**/*.log
    • {stateDir}/logs/**/*.{jsonl,log}
    • *.{sock,pid,tmp,lock} anywhere

    Rationale: these files are either actively appended to (logs), transient (sockets/pid/lock), or have no restoration value from a partial tail. Snapshotting them racily has no upside.

  2. Retry on EOF-class errors (up to 3 attempts, 10s / 20s backoff) for residual races on non-filtered files. The partial temp archive is removed between attempts, so no half-written artifact escapes.

  3. Error enrichment: on final failure, surface err.path (the offending file) and attempt count. Turns an opaque error into an actionable one.

  4. skippedVolatileCount observability: extend BackupCreateResult with the skip count; surface it in both the stdout summary and the --json output.

Backwards compatibility

  • Filter is additive; no previously-included path stops being included unless it matches a volatile rule. The volatile rules cover files that already produced errors or had no snapshot value.
  • --json output gains a new optional field skippedVolatileCount. Existing consumers that ignore unknown fields are unaffected.
  • Error message format changes only on failure, and only to add diagnostic context.

Testing

Added:

  • src/infra/backup-volatile-filter.test.ts — predicate unit tests for each rule and the anywhere-patterns, plus negative cases.
  • src/infra/backup-create.test.ts — retry tests using a mocked tar runner:
    • Throws EOF twice then succeeds → 3 calls, 10s + 20s backoff, logs emitted.
    • Throws EOF on every attempt → final error includes offending path + attempt count.
    • Throws non-EOF → no retry, no sleep.
  • formatBackupCreateSummary updated to include the skip-count line when > 0.

Verified locally: a previously-failing live-install backup (large state tree with actively written session and cron logs) now completes cleanly; the skipped-volatile count is reported and the archive is usable.

Future work (not in this PR)

A --quiesce flag that pauses cron and drains in-flight work before archival, for users who want stronger consistency guarantees than "retry-plus-filter".

Fixes #72249
See also #67417, #67990

Real behavior proof

Behavior or issue addressed: openclaw backup create on a live install with actively mutating session/log files. The fix should let the backup complete by skipping known volatile files instead of aborting the whole archive on tar EOF races.

Real environment tested: Remote Linux VM (Blacksmith sandbox), exact PR commit 053c118900e6afddd05b899fba2b03f21f237131, fresh pnpm install --frozen-lockfile, Node v20.20.0, pnpm 10.33.2. No local gateway on this machine was restarted or used for the proof.

Exact steps or command run after this patch:

  • Sync the PR checkout to the remote VM and run corepack prepare pnpm@10.33.0 --activate && pnpm install --frozen-lockfile.
  • Run node --import tsx /home/runner/proof-run/openclaw-backup-proof.mjs from /home/runner/crabbox-proof/openclaw.
  • That script creates an isolated OpenClaw state dir, continuously appends to state/sessions/s-live/transcript.jsonl and state/logs/gateway.jsonl, keeps state/gateway.pid and state/pending.tmp present, runs the real backupCreateCommand(..., { output: outDir, includeWorkspace: false, verify: true }) code path, and then inspects the resulting tarball entries.

Evidence after fix:

Backup skipped 4 volatile files (live session/cron logs, sockets, pid/lock/tmp).
Backup archive: /home/runner/proof-run/live-backup/out
Included 1 path:
- state: ~/state
Created /home/runner/proof-run/live-backup/out
Skipped 4 volatile files (live session/cron logs, sockets, pid/lock/tmp).
Archive verification: passed
---PROOF_SUMMARY---
{
  "archivePath": "/home/runner/proof-run/live-backup/out",
  "verified": true,
  "skippedVolatileCount": 4,
  "writesDuringBackup": 75,
  "elapsedMs": 627,
  "transcriptBytesBefore": 28,
  "transcriptBytesAfter": 3244,
  "gatewayLogBytesBefore": 26,
  "gatewayLogBytesAfter": 3242,
  "archiveHasTranscript": false,
  "archiveHasGatewayLog": false,
  "archiveHasPid": false,
  "archiveHasTmp": false,
  "archiveHasMeta": true,
  "archiveHasConfig": true
}

Observed result after fix: The backup completed successfully and backup verify passed while the transcript and gateway log kept growing during the run (writesDuringBackup: 75, both byte counts increased). The resulting archive excluded the live-mutation targets (transcript.jsonl, gateway.jsonl, gateway.pid, pending.tmp) and still kept stable state (meta.json, openclaw.json).

What was not tested: I did not run the packaged pnpm openclaw backup create wrapper on this sandbox; this proof used the real backup command path from the PR checkout via node --import tsx after a fresh install. I also did not run a Windows or macOS live proof in this pass.

@openclaw-barnacle openclaw-barnacle Bot added commands Command implementations size: M labels Apr 26, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

This PR fixes backup create failures on live installations by adding a volatile-path filter (skipping actively-written session/cron/log files and transient socket/pid/lock/tmp files), a 3-attempt retry loop with 10s/20s backoff for residual EOF races, and skippedVolatileCount observability in both the result object and CLI output. The previously flagged skippedVolatileCount accumulation across retries and the overly-broad /EOF/i regex have both been corrected in the current code.

Confidence Score: 5/5

This PR is safe to merge; implementation is correct and well-tested with no P0 or P1 issues remaining.

All logic is sound: backoff indexing is correct (2-element array covers exactly the 2 inter-attempt sleeps needed for 3 attempts), the EOF regex is appropriately narrowed, the closure counter reset inside runTar() prevents cumulative overcounting on retries, and the finally block ensures temp file cleanup even on final failure. Test coverage addresses the key behavioral contracts. No new security surface or data-loss path was introduced.

No files require special attention.

Reviews (2): Last reviewed commit: "fix(backup): narrow EOF retry classifier..." | Re-trigger Greptile

Comment on lines +436 to +471
let skippedVolatileCount = 0;
const tarFilter = (entryPath: string): boolean => {
// The manifest is staged in a tmp dir outside any state directory and
// is always safe to include.
if (path.resolve(entryPath) === manifestPath) {
return true;
}
if (isVolatileBackupPath(entryPath, volatilePlan)) {
skippedVolatileCount += 1;
return false;
}
return true;
};
await writeTarArchiveWithRetry({
tempArchivePath,
log: opts.log,
runTar: () =>
tar.c(
{
file: tempArchivePath,
gzip: true,
portable: true,
preservePaths: true,
filter: tarFilter,
onWriteEntry: (entry) => {
entry.path = remapArchiveEntryPath({
entryPath: entry.path,
manifestPath,
archiveRoot,
});
},
},
[manifestPath, ...result.assets.map((asset) => asset.sourcePath)],
),
});
result.skippedVolatileCount = skippedVolatileCount;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 skippedVolatileCount accumulates across retry attempts

skippedVolatileCount is a closure variable that is never reset between runTar() invocations. On each retry tar.c() re-traverses every file and calls tarFilter again, incrementing the counter for the same volatile files a second (and third) time. With 5 volatile files and 3 attempts the reported count would be 15, not 5.

Fix: reset the counter at the top of runTar before each tar.c() call:

runTar: () => {
  skippedVolatileCount = 0;
  return tar.c(
    {
      file: tempArchivePath,
      gzip: true,
      portable: true,
      preservePaths: true,
      filter: tarFilter,
      onWriteEntry: (entry) => {
        entry.path = remapArchiveEntryPath({
          entryPath: entry.path,
          manifestPath,
          archiveRoot,
        });
      },
    },
    [manifestPath, ...result.assets.map((asset) => asset.sourcePath)],
  );
},
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/backup-create.ts
Line: 436-471

Comment:
**`skippedVolatileCount` accumulates across retry attempts**

`skippedVolatileCount` is a closure variable that is never reset between `runTar()` invocations. On each retry `tar.c()` re-traverses every file and calls `tarFilter` again, incrementing the counter for the same volatile files a second (and third) time. With 5 volatile files and 3 attempts the reported count would be 15, not 5.

Fix: reset the counter at the top of `runTar` before each `tar.c()` call:

```
runTar: () => {
  skippedVolatileCount = 0;
  return tar.c(
    {
      file: tempArchivePath,
      gzip: true,
      portable: true,
      preservePaths: true,
      filter: tarFilter,
      onWriteEntry: (entry) => {
        entry.path = remapArchiveEntryPath({
          entryPath: entry.path,
          manifestPath,
          archiveRoot,
        });
      },
    },
    [manifestPath, ...result.assets.map((asset) => asset.sourcePath)],
  );
},
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 96243ba0b2. Reset skippedVolatileCount = 0 at the top of runTar() before tar.c(), plus a unit test asserting the counter reflects the final attempt rather than the cumulative sum across retries. Thanks for the catch.

Comment thread src/infra/backup-create.ts Outdated
return true;
}
const message = (err as Error).message ?? "";
return /did not encounter expected|unexpected end of|EOF/i.test(message);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Broad EOF regex alternative may match unrelated errors

The bare /EOF/i alternative will match any error message containing the substring "EOF" (e.g., SSL EOF occurred in violation of protocol, OpenSSL unexpected eof while reading). In a tar context this only causes an extra unnecessary retry rather than data loss, but narrowing the pattern reduces the risk of masking non-tar errors.

Suggested change
return /did not encounter expected|unexpected end of|EOF/i.test(message);
return /did not encounter expected|unexpected end of file|TAR_BAD_ARCHIVE/i.test(message);
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/backup-create.ts
Line: 112

Comment:
**Broad `EOF` regex alternative may match unrelated errors**

The bare `/EOF/i` alternative will match any error message containing the substring "EOF" (e.g., SSL `EOF occurred in violation of protocol`, OpenSSL `unexpected eof while reading`). In a tar context this only causes an extra unnecessary retry rather than data loss, but narrowing the pattern reduces the risk of masking non-tar errors.

```suggestion
  return /did not encounter expected|unexpected end of file|TAR_BAD_ARCHIVE/i.test(message);
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in dded2d10ab. Narrowed the regex to /(did not encounter expected|encountered unexpected) EOF|TAR_BAD_ARCHIVE/i — your suggested alternation didn't cover the shrink case ("encountered unexpected EOF"), so the shipped version handles both node-tar WriteEntry#onread messages plus the parser's TAR_BAD_ARCHIVE code. Added a comment citing the node-tar source and a unit test table pinning down match/no-match behavior (including SSL/OpenSSL strings we explicitly don't retry on). Thanks.

@abnershang
Copy link
Copy Markdown
Contributor Author

Addressed both Greptile findings:

  • P1 (96243ba0b2)fix(backup): reset skippedVolatileCount on each retry attempt. Resets the counter at the top of runTar() so the reported count reflects only the final attempt instead of summing across retries. Unit test added via the existing retry-test harness.

  • P2 (dded2d10ab)fix(backup): narrow EOF retry classifier to tar-specific errors. Greptile's proposed alternation missed the shrink-case message ("encountered unexpected EOF"); shipped a version that matches both node-tar WriteEntry#onread EOF messages plus TAR_BAD_ARCHIVE, and rejects SSL/OpenSSL EOF strings. Classifier unit test table added.

Branch pushed, lint clean, backup tests green (56/56 across backup-create, backup-volatile-filter, archive-path).

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 26, 2026

Codex review: needs real behavior proof before merge.

Summary
The PR adds backup tar EOF retries, volatile-file filtering, enriched backup errors, skipped-volatile reporting, and focused backup tests.

Reproducibility: yes. Source inspection is enough here: current main writes sessions under stateDir/agents//sessions/*.jsonl and cron logs under cron/runs/.jsonl, while the PR filter does not match those paths and globally skips meaningful workspace *.lock files.

Real behavior proof
Needs stronger real behavior proof before merge: Terminal proof is supplied, but it exercises synthetic paths at an older PR commit and does not prove the latest head against current agent-scoped session, cron JSONL, delivery-queue, or workspace-skip behavior. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
Contributor action is needed to repair the branch and provide updated real behavior proof; this should not enter the automated repair lane while the external proof gate is still insufficient.

Security
Cleared: The diff touches backup command/infra code and tests only; I found no workflow, dependency, package-script, permission, credential, or downloaded-code supply-chain change.

Review findings

  • [P1] Don't skip workspace lock files globally — src/infra/backup-volatile-filter.ts:60-62
  • [P2] Match current live session and cron paths — src/infra/backup-volatile-filter.ts:70-76
  • [P3] Keep retry notices out of JSON output — src/commands/backup.ts:27
Review details

Best possible solution:

Revise the PR so volatile skips are scoped to known state/runtime paths, current agent session and cron JSONL paths are covered, JSON mode stays machine-readable, docs describe the new backup exclusions, and latest-head proof shows the real backup path passing.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection is enough here: current main writes sessions under stateDir/agents//sessions/*.jsonl and cron logs under cron/runs/.jsonl, while the PR filter does not match those paths and globally skips meaningful workspace *.lock files.

Is this the best way to solve the issue?

No, not yet. Retry-plus-filter is a maintainable direction for the node-tar EOF race, but the filter must be scoped to state/runtime volatility and aligned with current storage contracts before this is the best fix.

Full review comments:

  • [P1] Don't skip workspace lock files globally — src/infra/backup-volatile-filter.ts:60-62
    isVolatileBackupPath returns true for .sock, .pid, .tmp, and .lock before checking any stateDir anchor, and createBackupArchive applies this filter to every asset. Since workspace directories are included by default, this silently drops valid workspace files such as Cargo.lock or poetry.lock from backup archives. Scope these suffix skips to known state/runtime locations instead of all backup sources.
    Confidence: 0.92
  • [P2] Match current live session and cron paths — src/infra/backup-volatile-filter.ts:70-76
    Current session transcripts live under stateDir/agents/<agentId>/sessions/*.jsonl, and cron run logs are cron/runs/<jobId>.jsonl. This filter only covers stateDir/sessions/**/*.{jsonl,log} and cron *.log, so the main files likely to trigger the EOF race remain archived.
    Confidence: 0.95
  • [P3] Keep retry notices out of JSON output — src/commands/backup.ts:27
    Defaulting the new logger to runtime.log means retry and skip notices can be printed before writeRuntimeJson when opts.json is true, especially for plain RuntimeEnv callers that use the same log sink for JSON. Suppress the default logger in JSON mode unless a caller explicitly passes one.
    Confidence: 0.8

Overall correctness: patch is incorrect
Overall confidence: 0.93

Acceptance criteria:

  • pnpm test src/infra/backup-create.test.ts src/infra/backup-volatile-filter.test.ts src/commands/backup.test.ts
  • pnpm exec oxfmt --check --threads=1 src/commands/backup.ts src/infra/backup-create.ts src/infra/backup-volatile-filter.ts src/infra/backup-create.test.ts src/infra/backup-volatile-filter.test.ts docs/cli/backup.md CHANGELOG.md
  • Real latest-head backup proof with current agent-scoped session JSONL, cron run JSONL, delivery-queue JSON, and a workspace file such as Cargo.lock that must remain archived

What I checked:

  • PR globally skips transient extensions: The PR returns true for .sock/.pid/.tmp/.lock before checking stateDir anchors, so the volatile filter applies to every archived asset, including workspaces. (src/infra/backup-volatile-filter.ts:60, 9d2812f7bdc5)
  • Workspaces are backed up by default: Current docs say workspace directories discovered from config are included unless --no-include-workspace is passed, making global .lock/.tmp filtering a backup-content regression for real workspace files. Public docs: docs/cli/backup.md. (docs/cli/backup.md:36, 678c41a222cf)
  • Current session transcript path: Current main resolves session transcripts under stateDir/agents//sessions, not stateDir/sessions. (src/config/sessions/paths.ts:15, 678c41a222cf)
  • Current cron run log path: Current main writes cron run logs as cron/runs/.jsonl, while the PR filter only matches cron run .log files. (src/cron/run-log.ts:82, 678c41a222cf)
  • JSON output contamination path: The PR defaults backup retry/skip notices to runtime.log, and writeRuntimeJson falls back to runtime.log for plain RuntimeEnv callers, so notices can precede JSON output. (src/runtime.ts:106, 678c41a222cf)
  • Proof gap: The PR body's terminal proof uses synthetic state/sessions and state/logs paths at commit 053c118, not the latest head's delivery-queue change or current agent-scoped sessions and cron JSONL paths. (9d2812f7bdc5)

Likely related people:

  • shichangs: The local backup CLI and backup archive implementation appear to originate in feat: add local backup CLI (feat: add local backup CLI #40163). (role: introduced behavior; confidence: high; commits: 0ecfd37b4465; files: src/infra/backup-create.ts, src/commands/backup.ts, src/commands/backup-shared.ts)
  • Peter Steinberger: Recent main history and blame for the backup, session, cron, and delivery-queue areas in this checkout point to ongoing maintenance by Peter Steinberger. (role: recent maintainer; confidence: medium; commits: 95b936fd46a6, 86c0e3d3eb17, eeef4864494f; files: src/infra/backup-create.ts, src/commands/backup.ts, src/config/sessions/paths.ts)

Remaining risk / open question:

  • The latest PR head still lacks real behavior proof for current agent-scoped sessions, cron JSONL logs, and the delivery-queue addition.
  • The backup content change is user-facing and still needs docs coverage for skipped volatile files and the new JSON field.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 678c41a222cf.

Re-review progress:

@abnershang
Copy link
Copy Markdown
Contributor Author

Rebased onto latest main, composed with the new buildExtensionsNodeModulesFilter from upstream (extensions node_modules exclusion), and all tests green (38/38 — backup-create + backup-volatile-filter). The only typecheck failure in tsgo:all is a pre-existing missing-types error in extensions/acpx unrelated to this branch.

@abnershang
Copy link
Copy Markdown
Contributor Author

abnershang commented Apr 29, 2026

Follow-up from cross-review:

  • Path canonicalization in isVolatileBackupPathnormalizePosix now runs inputs through path.posix.normalize, so traversal segments like sessions/../config.jsonl can no longer be misclassified as volatile through a bare startsWith ancestry check. Added direct predicate tests for .. escapes and Windows-style backslash inputs.
  • Retry cleanup hygiene — temp-archive removal between attempts now only silently swallows ENOENT; any other fs.rm error (EPERM, EACCES, …) is logged before the retry continues, so a real hygiene problem is no longer hidden.

Tests: 19 passed (volatile-filter) + 21 passed (backup-create) = 40 green.

New tip: b777111e.

@abnershang abnershang force-pushed the fix/backup-create-tar-eof-race branch from b369e67 to 07b9899 Compare May 6, 2026 08:10
@openclaw-barnacle openclaw-barnacle Bot added the triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. label May 6, 2026
@abnershang abnershang force-pushed the fix/backup-create-tar-eof-race branch from 07b9899 to 053c118 Compare May 6, 2026 08:16
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. size: L and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. size: M labels May 6, 2026
@abnershang
Copy link
Copy Markdown
Contributor Author

Status update — re-verified against v2026.5.7

Ran a sanity check against openclaw v2026.5.7-3-gabffed82d8 after upstream's e11eb03182 fix: exclude plugin dependencies from backups (v2026.4.29) landed. Two openclaw backup create runs both failed, with two distinct errors that are not addressed by the plugin-deps exclusion:

Run 1

Error: ENOENT: no such file or directory, lstat
  '$STATE_DIR/delivery-queue/3fac5e46-42dc-4230-a725-51c203830b4f.json'

Run 2

Error: did not encounter expected EOF

Run 2 is exactly the tar EOF race this PR's retry path already handles. Run 1 is a new volatile-file class the filter didn't cover: {stateDir}/delivery-queue/*.json entries are transient IPC state created and consumed by the messaging delivery queue, and they routinely vanish between tar's directory read and lstat().

What changed in this push (9d2812f7bd)

Extended the volatile filter to skip {stateDir}/delivery-queue/**/*.json. Same isUnder + hasExtension helpers, same rule style, JSDoc updated. Added 4 focused tests (match, nested, wrong extension, outside stateDir). pnpm vitest run src/infra/backup-volatile-filter.test.ts src/infra/backup-create.test.ts44/44 green.

Upstream's plugin-deps fix is complementary, not a substitute. Keeping this PR open as the tar-EOF-race hardening.

@steipete steipete force-pushed the fix/backup-create-tar-eof-race branch from 9d2812f to c7ae86a Compare May 9, 2026 08:56
steipete added a commit to abnershang/openclaw that referenced this pull request May 9, 2026
@openclaw-barnacle openclaw-barnacle Bot added the docs Improvements or additions to documentation label May 9, 2026
steipete added a commit to abnershang/openclaw that referenced this pull request May 9, 2026
steipete added a commit to abnershang/openclaw that referenced this pull request May 9, 2026
tar.c() rejects with 'did not encounter expected EOF' when a source file
grows between lstat() and stream consumption. On a running install this
is nearly deterministic against live session transcripts, cron run logs,
and gateway .log/.jsonl files — the same files that have no meaningful
backup value mid-write.

- Add a built-in filter that skips known volatile paths:
    {stateDir}/sessions/**/*.{jsonl,log}
    {stateDir}/cron/runs/**/*.log
    {stateDir}/logs/**/*.{jsonl,log}
    *.{sock,pid,tmp,lock} anywhere
- Retry tar.c() up to 3 times on EOF-class errors with 10s/20s backoff,
  cleaning the partial tmp archive between attempts.
- Enrich the final error message with err.path and attempt count.
- Track and surface skippedVolatileCount in stdout and --json output.

No breaking changes: filter is additive, --json adds a new field, and
the error-message format change is only visible on failure.

Fixes openclaw#72249
See also openclaw#67417, openclaw#67990
abnershang and others added 9 commits May 9, 2026 10:06
tar.c re-walks the staged tree on every invocation, which re-invokes the
filter closure. Previously skippedVolatileCount was initialized once
outside runTar(), so on retry it accumulated across attempts -- e.g.
5 volatile files over 3 attempts reported 15 in both the --json output
and the stderr summary line.

Reset the counter at the top of runTar() so the reported count reflects
only the final (successful or final-failed) attempt, matching what ended
up in the archive.

Adds a unit test that exercises the retry path and asserts the counter
equals the final attempt's value, not the running sum.
The previous regex included a bare /EOF/i alternative that also matched
SSL/OpenSSL strings such as "EOF occurred in violation of protocol"
and "unexpected eof while reading", causing pointless retries on
unrelated failures.

Replace it with an alternation that matches only the two tar-specific
EOF-class messages thrown by node-tar's WriteEntry#onread (grow and
shrink races, citing write-entry.ts) plus the TAR_BAD_ARCHIVE code
surfaced by the parser on truncated input. The code === 'EOF' fast path
stays as-is for errors that propagate with structured fields intact.

Adds a unit test table that pins down the classifier's match and
no-match behavior, including SSL/OpenSSL strings we explicitly do not
want to treat as tar EOF races.
…cleanup

Per Argus review:

- isVolatileBackupPath() now runs inputs through path.posix.normalize() so
  traversal segments like sessions/../config.jsonl cannot be misclassified
  as volatile via a bare startsWith() ancestry check.
- Add tests for '..' escape and Windows-style backslash inputs on the
  predicate itself.
- Retry cleanup between tar attempts now only silently ignores ENOENT; any
  other fs.rm error (EPERM/EACCES/etc.) is logged before the retry continues
  so temp-file hygiene issues are not hidden.

Built-by: Vulcan 🔨
Real-world run of `openclaw backup create` on v2026.5.7 reproduced an

ENOENT race on a delivery-queue entry:

  Error: ENOENT: no such file or directory, lstat

    '$STATE_DIR/delivery-queue/3fac5e46-...json'

Files under `{stateDir}/delivery-queue/**/*.json` are transient IPC

state for the messaging delivery queue: they are created and consumed

continuously and routinely vanish between tar's directory read and

`lstat()`. They have no restoration value — capturing a partial tail

only risks tar corruption without benefit.

Extends the volatile filter with the same `isUnder` + `hasExtension`

rule style as the existing sessions / cron-runs / logs rules, and

keeps the JSDoc rule list in sync.

Tests: 44/44 green across backup-volatile-filter + backup-create.

Built-by: Vulcan 🔨
@steipete steipete force-pushed the fix/backup-create-tar-eof-race branch from 78f2b61 to 3766457 Compare May 9, 2026 09:06
@steipete steipete merged commit 9eaca28 into openclaw:main May 9, 2026
110 checks passed
@steipete
Copy link
Copy Markdown
Contributor

steipete commented May 9, 2026

Landed via squash merge.

  • Source head: 3766457
  • Merge commit: 9eaca28
  • Gate: CI, CodeQL, CodeQL Critical Quality, Workflow Sanity, Real behavior proof all green on the source head.
  • Local/Crabbox proof: pre-fix repro tbx_01kr5xt9vf5pas5ee4aefrp3am; post-fix proof tbx_01kr5y3e1kbtt6chbypfdydbgs; pnpm check:test-types; pnpm lint:core; focused backup tests.

Thanks @abnershang!

@abnershang abnershang deleted the fix/backup-create-tar-eof-race branch May 9, 2026 09:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

commands Command implementations docs Improvements or additions to documentation proof: supplied External PR includes structured after-fix real behavior proof. size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

openclaw backup create fails with "did not encounter expected EOF" on live installations

2 participants