fix(backup): retry on tar EOF race and skip known volatile files#72251
steipete merged 10 commits into openclaw:main
Conversation
Greptile Summary

This PR is safe to merge; the implementation is correct and well-tested, with no P0 or P1 issues remaining. All logic is sound: the backoff indexing is correct (a 2-element array covers exactly the 2 inter-attempt sleeps needed for 3 attempts), the EOF regex is appropriately narrowed, the closure counter reset inside runTar() prevents cumulative overcounting on retries, and the finally block ensures temp-file cleanup even on final failure. Test coverage addresses the key behavioral contracts. No new security surface or data-loss path was introduced. No files require special attention.

Confidence Score: 5/5. Reviews: 2. Last reviewed commit: "fix(backup): narrow EOF retry classifier..."
```ts
let skippedVolatileCount = 0;
const tarFilter = (entryPath: string): boolean => {
  // The manifest is staged in a tmp dir outside any state directory and
  // is always safe to include.
  if (path.resolve(entryPath) === manifestPath) {
    return true;
  }
  if (isVolatileBackupPath(entryPath, volatilePlan)) {
    skippedVolatileCount += 1;
    return false;
  }
  return true;
};
await writeTarArchiveWithRetry({
  tempArchivePath,
  log: opts.log,
  runTar: () =>
    tar.c(
      {
        file: tempArchivePath,
        gzip: true,
        portable: true,
        preservePaths: true,
        filter: tarFilter,
        onWriteEntry: (entry) => {
          entry.path = remapArchiveEntryPath({
            entryPath: entry.path,
            manifestPath,
            archiveRoot,
          });
        },
      },
      [manifestPath, ...result.assets.map((asset) => asset.sourcePath)],
    ),
});
result.skippedVolatileCount = skippedVolatileCount;
```
**`skippedVolatileCount` accumulates across retry attempts**

`skippedVolatileCount` is a closure variable that is never reset between `runTar()` invocations. On each retry `tar.c()` re-traverses every file and calls `tarFilter` again, incrementing the counter for the same volatile files a second (and third) time. With 5 volatile files and 3 attempts the reported count would be 15, not 5.
Fix: reset the counter at the top of `runTar` before each `tar.c()` call:

```ts
runTar: () => {
  skippedVolatileCount = 0;
  return tar.c(
    {
      file: tempArchivePath,
      gzip: true,
      portable: true,
      preservePaths: true,
      filter: tarFilter,
      onWriteEntry: (entry) => {
        entry.path = remapArchiveEntryPath({
          entryPath: entry.path,
          manifestPath,
          archiveRoot,
        });
      },
    },
    [manifestPath, ...result.assets.map((asset) => asset.sourcePath)],
  );
},
```
Fixed in 96243ba0b2. Reset skippedVolatileCount = 0 at the top of runTar() before tar.c(), plus a unit test asserting the counter reflects the final attempt rather than the cumulative sum across retries. Thanks for the catch.
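The contract that unit test pins down can be sketched as a minimal synchronous harness. Names here (`withRetrySync`, the fake `runTar`) are illustrative, not the real test's API, and the shipped retry loop is async:

```typescript
// Sketch: a retrying harness whose tar step "skips" the same 5 volatile
// files on every traversal. Resetting the closure counter inside each
// attempt keeps the reported count at the final attempt's value.
function withRetrySync(fn: () => void, attempts = 3): void {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      fn();
      return;
    } catch (err) {
      lastErr = err; // retry on the next loop iteration
    }
  }
  throw lastErr;
}

let skippedVolatileCount = 0;
let attempt = 0;

const runTar = (): void => {
  skippedVolatileCount = 0; // the fix: reset per attempt
  attempt += 1;
  for (let i = 0; i < 5; i++) skippedVolatileCount += 1; // 5 volatile skips
  if (attempt < 3) throw new Error("did not encounter expected EOF");
};

withRetrySync(runTar);
console.log(skippedVolatileCount); // 5 (would be 15 without the reset)
```

Without the reset line, each of the three traversals adds 5 to the running total, reproducing the 15-vs-5 discrepancy from the review comment.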
```ts
  return true;
}
const message = (err as Error).message ?? "";
return /did not encounter expected|unexpected end of|EOF/i.test(message);
```
**Broad `EOF` regex alternative may match unrelated errors**

The bare `/EOF/i` alternative will match any error message containing the substring "EOF" (e.g., SSL `EOF occurred in violation of protocol`, OpenSSL `unexpected eof while reading`). In a tar context this only causes an extra unnecessary retry rather than data loss, but narrowing the pattern reduces the risk of masking non-tar errors.
Suggested change:

```ts
- return /did not encounter expected|unexpected end of|EOF/i.test(message);
+ return /did not encounter expected|unexpected end of file|TAR_BAD_ARCHIVE/i.test(message);
```
Fixed in dded2d10ab. Narrowed the regex to /(did not encounter expected|encountered unexpected) EOF|TAR_BAD_ARCHIVE/i — your suggested alternation didn't cover the shrink case ("encountered unexpected EOF"), so the shipped version handles both node-tar WriteEntry#onread messages plus the parser's TAR_BAD_ARCHIVE code. Added a comment citing the node-tar source and a unit test table pinning down match/no-match behavior (including SSL/OpenSSL strings we explicitly don't retry on). Thanks.
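The shipped classifier's match/no-match behavior can be sketched as follows (`isTarEofRace` is an illustrative name; the real helper lives in `src/infra/backup-create.ts`):

```typescript
// Sketch of the narrowed classifier: match only node-tar's two EOF-race
// messages (grow and shrink) plus the parser's TAR_BAD_ARCHIVE code, and
// never generic "EOF" substrings from SSL/OpenSSL errors.
const TAR_EOF_RE =
  /(did not encounter expected|encountered unexpected) EOF|TAR_BAD_ARCHIVE/i;

function isTarEofRace(err: unknown): boolean {
  // Structured fast path for errors whose code survived propagation.
  if (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: string }).code === "EOF"
  ) {
    return true;
  }
  const message = err instanceof Error ? err.message : String(err ?? "");
  return TAR_EOF_RE.test(message);
}

console.log(isTarEofRace(new Error("did not encounter expected EOF"))); // true (grow race)
console.log(isTarEofRace(new Error("encountered unexpected EOF"))); // true (shrink race)
console.log(isTarEofRace(new Error("EOF occurred in violation of protocol"))); // false (SSL)
console.log(isTarEofRace(new Error("unexpected eof while reading"))); // false (OpenSSL)
```

Note the alternation groups the two tar message prefixes before the literal ` EOF`, so the bare substring "EOF" alone can no longer trigger a retry.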
Addressed both Greptile findings:
Branch pushed, lint clean, backup tests green (56/56).
Codex review: needs real behavior proof before merge.

Overall correctness: patch is incorrect.

Do we have a high-confidence way to reproduce the issue? Yes. Source inspection is enough here: current main writes sessions under `stateDir/agents//sessions/*.jsonl` and cron logs under `cron/runs/.jsonl`, while the PR filter does not match those paths and globally skips meaningful workspace `*.lock` files.

Is this the best way to solve the issue? No, not yet. Retry-plus-filter is a maintainable direction for the node-tar EOF race, but the filter must be scoped to state/runtime volatility and aligned with current storage contracts before this is the best fix.

Best possible solution: revise the PR so volatile skips are scoped to known state/runtime paths, current agent session and cron JSONL paths are covered, JSON mode stays machine-readable, docs describe the new backup exclusions, and latest-head proof shows the real backup path passing.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 678c41a222cf.
Force-pushed dded2d1 to 1c5217a.
Rebased onto latest.
Follow-up from cross-review:
Tests: 19 passed (volatile-filter) + 21 passed (backup-create) = 40 green.
Force-pushed b369e67 to 07b9899.
Force-pushed 07b9899 to 053c118.
Status update — re-verified against v2026.5.7

Ran a sanity check against openclaw v2026.5.7 (two runs). Run 2 is exactly the tar EOF race this PR's retry path already handles. Run 1 is a new volatile-file class the filter didn't cover. What changed in this push:
Force-pushed 9d2812f to c7ae86a.
tar.c() rejects with 'did not encounter expected EOF' when a source file
grows between lstat() and stream consumption. On a running install this
is nearly deterministic against live session transcripts, cron run logs,
and gateway .log/.jsonl files — the same files that have no meaningful
backup value mid-write.
- Add a built-in filter that skips known volatile paths:
{stateDir}/sessions/**/*.{jsonl,log}
{stateDir}/cron/runs/**/*.log
{stateDir}/logs/**/*.{jsonl,log}
*.{sock,pid,tmp,lock} anywhere
- Retry tar.c() up to 3 times on EOF-class errors with 10s/20s backoff,
cleaning the partial tmp archive between attempts.
- Enrich the final error message with err.path and attempt count.
- Track and surface skippedVolatileCount in stdout and --json output.
No breaking changes: filter is additive, --json adds a new field, and
the error-message format change is only visible on failure.
Fixes openclaw#72249
See also openclaw#67417, openclaw#67990
tar.c re-walks the staged tree on every invocation, which re-invokes the filter closure. Previously skippedVolatileCount was initialized once outside runTar(), so on retry it accumulated across attempts -- e.g. 5 volatile files over 3 attempts reported 15 in both the --json output and the stderr summary line. Reset the counter at the top of runTar() so the reported count reflects only the final (successful or final-failed) attempt, matching what ended up in the archive. Adds a unit test that exercises the retry path and asserts the counter equals the final attempt's value, not the running sum.
The previous regex included a bare /EOF/i alternative that also matched SSL/OpenSSL strings such as "EOF occurred in violation of protocol" and "unexpected eof while reading", causing pointless retries on unrelated failures. Replace it with an alternation that matches only the two tar-specific EOF-class messages thrown by node-tar's WriteEntry#onread (grow and shrink races, citing write-entry.ts) plus the TAR_BAD_ARCHIVE code surfaced by the parser on truncated input. The code === 'EOF' fast path stays as-is for errors that propagate with structured fields intact. Adds a unit test table that pins down the classifier's match and no-match behavior, including SSL/OpenSSL strings we explicitly do not want to treat as tar EOF races.
…cleanup

Per Argus review:

- isVolatileBackupPath() now runs inputs through path.posix.normalize() so traversal segments like sessions/../config.jsonl cannot be misclassified as volatile via a bare startsWith() ancestry check.
- Add tests for '..' escape and Windows-style backslash inputs on the predicate itself.
- Retry cleanup between tar attempts now only silently ignores ENOENT; any other fs.rm error (EPERM/EACCES/etc.) is logged before the retry continues so temp-file hygiene issues are not hidden.

Built-by: Vulcan 🔨
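The normalization hardening from this commit can be sketched with a simplified ancestry check (`isUnder` is an illustrative helper name; the real predicate also consults the volatile plan):

```typescript
import * as path from "node:path";

// Sketch: normalize before the ancestry check so '..' segments collapse
// and backslash separators are unified, closing the traversal hole that a
// bare startsWith() check leaves open.
function isUnder(entryPath: string, dir: string): boolean {
  const normalized = path.posix.normalize(entryPath.replace(/\\/g, "/"));
  return normalized === dir || normalized.startsWith(dir + "/");
}

console.log(isUnder("sessions/s-1/a.jsonl", "sessions")); // true
console.log(isUnder("sessions/../config.jsonl", "sessions")); // false: '..' escapes the dir
console.log(isUnder("sessions\\s-1\\a.jsonl", "sessions")); // true: backslashes unified
```

Without the normalize step, `"sessions/../config.jsonl".startsWith("sessions/")` is true, and a non-volatile config file would be silently dropped from the archive.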
Real-world run of `openclaw backup create` on v2026.5.7 reproduced an
ENOENT race on a delivery-queue entry:
Error: ENOENT: no such file or directory, lstat
'$STATE_DIR/delivery-queue/3fac5e46-...json'
Files under `{stateDir}/delivery-queue/**/*.json` are transient IPC
state for the messaging delivery queue: they are created and consumed
continuously and routinely vanish between tar's directory read and
`lstat()`. They have no restoration value — capturing a partial tail
only risks tar corruption without benefit.
Extends the volatile filter with the same `isUnder` + `hasExtension`
rule style as the existing sessions / cron-runs / logs rules, and
keeps the JSDoc rule list in sync.
Tests: 44/44 green across backup-volatile-filter + backup-create.
Built-by: Vulcan 🔨
Force-pushed 78f2b61 to 3766457.
Landed via squash merge.
Thanks @abnershang!
Problem
`openclaw backup create` fails with `Error: did not encounter expected EOF` on any live installation. Root cause: node-tar's `WriteEntry` records `size` from the initial `lstat()`; if the file grows between `lstat()` and the end of `fs.read()`, the header no longer matches the consumed bytes, the stream errors, and `tar.c()` aborts the whole archive. Common culprits are append-only JSONL transcripts and `.log` files under the state directory.

Details and generic reproducer: #72249. Related but distinct races: #67417 (ENOENT when a session file is deleted mid-backup). Broader exclude-rule proposal: #67990.
Impact
Affects any live install where backup source files are written concurrently — session transcripts, cron run logs, gateway logs. Today the workaround is to stop the gateway before backing up, which is not viable for scheduled backups.
Fix
Default volatile-path filter in the backup archiver — no user config required:
- `{stateDir}/sessions/**/*.{jsonl,log}`
- `{stateDir}/cron/runs/**/*.log`
- `{stateDir}/logs/**/*.{jsonl,log}`
- `*.{sock,pid,tmp,lock}` anywhere

Rationale: these files are either actively appended to (logs), transient (sockets/pid/lock), or have no restoration value from a partial tail. Snapshotting them racily has no upside.
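A rough sketch of these rules as a predicate (the rule-table shape and helper names here are assumptions; the shipped `isVolatileBackupPath` in the backup filter module takes a precomputed plan and handles more cases):

```typescript
// Sketch: classify a path (relative to stateDir) as volatile using
// directory-scoped extension rules plus a set of anywhere-extensions.
const VOLATILE_DIR_RULES: Array<{ dir: string; exts: string[] }> = [
  { dir: "sessions", exts: [".jsonl", ".log"] },
  { dir: "cron/runs", exts: [".log"] },
  { dir: "logs", exts: [".jsonl", ".log"] },
];
const VOLATILE_ANYWHERE_EXTS = [".sock", ".pid", ".tmp", ".lock"];

function isVolatilePath(relToStateDir: string): boolean {
  const p = relToStateDir.replace(/\\/g, "/");
  const dot = p.lastIndexOf(".");
  const ext = dot === -1 ? "" : p.slice(dot);
  if (VOLATILE_ANYWHERE_EXTS.includes(ext)) return true;
  return VOLATILE_DIR_RULES.some(
    (rule) => p.startsWith(rule.dir + "/") && rule.exts.includes(ext),
  );
}

console.log(isVolatilePath("sessions/s-1/transcript.jsonl")); // true: session transcript
console.log(isVolatilePath("gateway.pid")); // true: anywhere-extension
console.log(isVolatilePath("openclaw.json")); // false: stable config stays in the archive
```

The point of the table shape is that adding a new volatile class (like the delivery-queue rule added later in this PR) is a one-line rule, not new branching logic.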
Retry on EOF-class errors (up to 3 attempts, 10s / 20s backoff) for residual races on non-filtered files. The partial temp archive is removed between attempts, so no half-written artifact escapes.
Error enrichment: on final failure, surface `err.path` (the offending file) and attempt count. Turns an opaque error into an actionable one.

`skippedVolatileCount` observability: extend `BackupCreateResult` with the skip count; surface it in both the stdout summary and the `--json` output.

Backwards compatibility
`--json` output gains a new optional field `skippedVolatileCount`. Existing consumers that ignore unknown fields are unaffected.
Added:
- `src/infra/backup-volatile-filter.test.ts` — predicate unit tests for each rule and the anywhere-patterns, plus negative cases.
- `src/infra/backup-create.test.ts` — retry tests using a mocked tar runner; `formatBackupCreateSummary` updated to include the skip-count line when > 0.

Verified locally: a previously-failing live-install backup (large state tree with actively written session and cron logs) now completes cleanly; the skipped-volatile count is reported and the archive is usable.
Future work (not in this PR)
A `--quiesce` flag that pauses cron and drains in-flight work before archival, for users who want stronger consistency guarantees than "retry-plus-filter".

Fixes #72249
See also #67417, #67990
Real behavior proof
Behavior or issue addressed:
`openclaw backup create` on a live install with actively mutating session/log files. The fix should let the backup complete by skipping known volatile files instead of aborting the whole archive on tar EOF races.

Real environment tested: Remote Linux VM (Blacksmith sandbox), exact PR commit `053c118900e6afddd05b899fba2b03f21f237131`, fresh `pnpm install --frozen-lockfile`, Node `v20.20.0`, pnpm `10.33.2`. No local gateway on this machine was restarted or used for the proof.

Exact steps or command run after this patch: `corepack prepare pnpm@10.33.0 --activate && pnpm install --frozen-lockfile`, then `node --import tsx /home/runner/proof-run/openclaw-backup-proof.mjs` from `/home/runner/crabbox-proof/openclaw`. The script appends to `state/sessions/s-live/transcript.jsonl` and `state/logs/gateway.jsonl`, keeps `state/gateway.pid` and `state/pending.tmp` present, runs the real `backupCreateCommand(..., { output: outDir, includeWorkspace: false, verify: true })` code path, and then inspects the resulting tarball entries.
Observed result after fix: The backup completed successfully and
backup verifypassed while the transcript and gateway log kept growing during the run (writesDuringBackup: 75, both byte counts increased). The resulting archive excluded the live-mutation targets (transcript.jsonl,gateway.jsonl,gateway.pid,pending.tmp) and still kept stable state (meta.json,openclaw.json).What was not tested: I did not run the packaged
pnpm openclaw backup createwrapper on this sandbox; this proof used the real backup command path from the PR checkout vianode --import tsxafter a fresh install. I also did not run a Windows or macOS live proof in this pass.