Ensure atomic file store publication#826
Conversation
|
🤖 Claude Code Review Status: Complete Current Review: This PR implements atomic file publication for the blob store to prevent partially-written files from being visible. The implementation is well-designed and thoroughly tested. Key Changes:
Observations: The code quality is strong with excellent documentation. The PR author has:
No blocking issues identified. The implementation follows Go best practices and maintains backward compatibility while fixing the production issue of partial file exposure. Note: CodeQL alerts visible in the PR comments appear to be addressed by the os.Root usage throughout the new helper methods (storeRelPath, openStoreRoot, removeStorePath). These ensure paths cannot escape the store root via symlinks or path traversal. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-13 14:39 UTC |
ReviewOverviewThe stated goal — atomic publication of blobs and checksum sidecars in the file store — is solid, and the implementation is well thought through. New helpers ( However the diff is 66 files, ~1965+/-695, of which only 2 commits (~370 lines under 🔴 Major: Scope / ReviewabilityThe PR description acknowledges the bundling, but acknowledgement does not fix the reviewability problem:
Recommendation: land the aerospike commits in their own PR against File Store Changes — In-Scope ReviewStrengths
Concerns1. if err := root.Rename(tmpRel, finalRel); err != nil {
if fileInfo, statErr := root.Stat(finalRel); statErr == nil && !fileInfo.IsDir() {
return errors.NewBlobAlreadyExistsError(...)
}
return errors.NewStorageError(...)
}POSIX 2. If a caller sets 3. Minor:
4. Verification status in PR description. The description notes that repo-wide Tests
Aerospike Changes — Out of Scope (Brief)Not reviewing in depth because these do not belong in this PR. Skim notes:
Recommendation
The file-store change itself is good work and addresses a real production bug. The packaging is what needs fixing. |
oskarszoon
left a comment
There was a problem hiding this comment.
Requesting changes. The file-store work is good and addresses a real production bug, the aerospike native-op design is sound, and Simon's review covers the file-store-scope findings correctly — I won't duplicate any of that. Three new items below, plus a small amount of supporting evidence for his split recommendation.
Merge blocker — go.mod is missing the BSV Aerospike fork in the direct-require block. The PR swaps the upstream Aerospike client for github.com/bsv-blockchain/aerospike-client-go/v8 across 16+ files in stores/utxo/aerospike/, but the new module isn't in go.mod's require block. go.sum has the hash for v8.7.1-bsv2, so MVS will resolve it on the fly, but the version is unpinned in the source of truth. go build ./stores/utxo/aerospike/ fails locally with "no required module provides package". Fix: add github.com/bsv-blockchain/aerospike-client-go/v8 v8.7.1-bsv2 to the direct-require block and run go mod tidy. Worth confirming whether the build happened to work on someone else's machine (GOPATH/GOWORK differences could mask it).
CodeQL #119 is a real defense-in-depth gap. Of the seven CodeQL alerts on stores/blob/file/file.go, six (118, 120–124) are confirmed false positives — the user-provided "key" is hex.EncodeToString(reverse(key)) (charset [0-9a-f]) before reaching any path expression, and the four-layer containment (ConstructFilename → validatePathWithinBase → storeRelPath → os.Root + filepath.IsLocal) handles the rest. Those six are safe to dismiss-with-justification.
Alert #119 is different: os.Remove(dir) at file.go:1159 (empty hash-prefix directory cleanup in Del) is the only path operation in the file that isn't wrapped in os.Root.Remove. It's currently guarded by dir != s.path + len(filepath.Base(dir)) <= 2 heuristic. In practice the hex encoding upstream prevents anything bad landing here, but the os.Root containment story you've otherwise built is intentionally consistent across all path ops in this file — leaving one os.Remove outside the sandbox weakens that story. Trivial fix: wrap in root.Remove(rel) for consistency.
NFS / network-attached storage operational risk. The new atomic-publication path adds per-write fsync(blob) + fsync(checksum) + fsync(dir). On NVMe this is ~200ms cumulatively per 1000-subtree block — fine. On NFS this becomes 20–200 s per block depending on round-trip latency — catastrophic for block-assembly throughput. Could you confirm no production file stores currently sit on NFS / network-attached storage? If any do, the rollout needs filesystem-aware gating (or the fsync needs to be opt-out via a setting for those deployments). Worth a one-line note in the PR description either way.
Two smaller observations, neither blocking:
-
The PR claims native operate-path will deliver a real speedup over the UDF executor by eliminating the per-call Lua interpreter overhead (3–10 µs per record). I agree the design will yield that, but the Verification section has no benchmark numbers backing the implicit 2–5× speedup claim — and we have no CI benches that exercise this path (the aerospike benches require
-tags aerospike+ live container). A before/after bench from a local Aerospike container, posted in the PR body, would close that gap. Same goes for the file-store fsync overhead — noBenchmark*functions exist instores/blob/file/today. -
The
syncParentDirBestEffortdoc framing ("best-effort because support varies by platform and filesystem") is accurate but understates that on ext4/xfs/btrfs the dir-fsync is correctness-critical for final-name durability after crash, not just "nice to have". A future maintainer reading "best-effort" might disable it without realising they're regressing crash safety on the dominant production filesystems. Worth a doc tweak.
Supporting Simon's split recommendation with technical evidence: the six aerospike commits stand on their own — commits 1–2 introduce helpers that are functionally no-op until the probe enables them, commit 3 mechanically switches imports, commits 4–6 iterate on the probe. They have no functional dependency on the two file-store commits, and the spend.go rewrite (445+/125-) is genuinely better INDEPENDENT of native-op (defensive nil-checks, structured ParseLuaMapResponse replacing inline map[interface{}]interface{} type assertions, describeAerospikeBatchRecord diagnostic helpers) — it would be valuable as a standalone diagnostics PR. The BSV fork at v8.7.1-bsv2 is verified published (the proxy.golang.org info endpoint returns 200) and is a strict superset of upstream (adds wire opcodes 200/201 only). Splitting is technically clean.
Verified non-issues worth surfacing: Go 1.26.0 is pinned in go.mod and all CI workflows so os.OpenRoot (Go 1.24+) is safely available; the DAH-migration concern doesn't materialise on upgrade because readDAHFromFile/writeDAHToFile were already dead code in main with the scheduler authoritative; native op type 200 hits the same Aerospike ACL gate as standard ops, and the probe verifies a side-effect rather than just a response code (so MITM-spoofing the probe response is blocked by TLS to the cluster); the BSV Aerospike fork lives under the bsv-blockchain org, same trust boundary as Teranode itself.
I'd be happy to re-review once the go.mod fix lands and the CodeQL #119 wrap is in (the others can be addressed in follow-ups).
f4fa990 to
be3a508
Compare
|
Force-pushed Addressed by the force-push:
Working on now (separate commit(s)):
The aerospike commits will land as a separate PR. |
Re-review after splitRe-reviewed the rebased branch. Scope is now exactly what the title promises: 2 commits, Verified locally on the PR head: CI on this branch is fully green (test, golangci-lint, smoketest, sequential-*, prunertest, chainintegrity, legacy-sync, CodeQL, all benches). Resolved from previous review
Still open (minor, non-blocking)
VerdictThe file-store change is good and addresses the production bug cleanly. The scope split addresses my main concern. Items above are non-blocking — happy for this to merge once (4) is tidied up. Items (1)–(3) are fine as follow-ups. |
Round-trip changes from the review comments left after the initial atomic-publication PR: - CodeQL #119: route the post-Del hash-prefix directory removal through os.Root so every path operation in this file stays inside the store sandbox. - renameTempFile: cross-platform semantics tightened. The ErrBlobAlreadyExists branch is documented as non-POSIX only (rename(2) atomically replaces on POSIX), and an allowOverwrite parameter lets non-POSIX callers explicitly remove the destination and retry instead of receiving a spurious already-exists error when they opted in to overwriting. - syncParentDirBestEffort: doc clarifies that the directory fsync is correctness-critical on ext4/xfs/btrfs after rename, not an optional "nice to have". "Best-effort" applies to failure handling, not to whether the call should run. - createTempSibling: replace big.NewInt(1<<63-1) with big.NewInt(math.MaxInt64) for readability. - fsyncMode opt-out via the ?fsyncMode=full|data|none URL parameter. Default remains full (current behaviour, safe on local disks). Operators running the file store over NFS or other network-attached storage can drop to data (skip the directory fsync) or none (skip both) to avoid the ~10-20 ms per-write cost; the trade-off is weakened crash safety as documented at the fsyncMode declaration. Tests added: - TestSetFromReader_ConcurrentWritersNeverExposePartialFinal: eight concurrent SetFromReader calls on the same key with AllowOverwrite. A reader poller asserts that any successful read of the final filename is byte-identical to the expected header+body payload, never a partial prefix. - TestRenameTempFile_OverwriteAndRejectSemantics: documents the observable POSIX behaviour (rename replaces on AllowOverwrite=true, errorOnOverwrite stops the call earlier when false) so a regression on either branch shows up in CI. - TestParseFsyncMode / TestNewWithFsyncMode: cover the new URL parameter (parsing, end-to-end Set/Get on each mode, invalid value rejection). - BenchmarkSetFromReader_FsyncModes: measures the cost of the atomic-publication path across three payload sizes and the three fsync modes. Local-disk numbers on an Apple M3 Max show fsyncModeFull at ~20 ms/op, fsyncModeData at ~10 ms/op, fsyncModeNone at <1 ms/op — gives operators a concrete frame for evaluating the trade-off on their target filesystem.
|
Pushed From @ordishs
From @oskarszoon
|
|
Re-review after
|
oskarszoon
left a comment
There was a problem hiding this comment.
Every blocking item from the original CHANGES_REQUESTED is resolved in the current code. Two minor follow-ups below, neither blocking.
Per-item verification
1. go.mod / Aerospike fork — RESOLVED
git log upstream/main..HEAD is exactly three commits (64043d59c, be3a508e0, cec145fc0), all scoped to stores/blob/file/. No aerospike commits.
The git diff upstream/main..HEAD -- go.mod output shows one line removed (pgregory.net/rapid v1.3.0), which looks alarming at first but is a red herring: rapid is absent from go.mod at the branch point (merge base adf9e66c5) and absent at HEAD — it was added to upstream/main after this branch was cut, by PR #848. The diff is upstream having moved forward, not this PR touching go.mod. Confirmed: no go.mod or go.sum changes attributable to this PR.
2. CodeQL #119 — os.Root sandbox — MOSTLY RESOLVED, two raw calls remain in Del
The write path is fully sandboxed. createTempSibling uses root.OpenFile, renameTempFile uses root.Rename / root.Remove, removeStorePath uses root.Remove, and syncParentDirBestEffort uses root.Open + dir.Sync(). The hash-prefix directory cleanup that was the original CodeQL site is now routed through openStoreRoot() + root.Remove(rel) at file.go:1208–1214, with the comment explicitly crediting the CodeQL fix. That specific issue is resolved.
However, Del still has two raw os.Remove calls that bypass os.Root:
- file.go:1187:
os.Remove(checksumFile)— the checksum sidecar delete. - file.go:1191:
os.Remove(fileName)— the blob delete.
Both checksumFile and fileName are constructed from merged.ConstructFilename(s.path, key, fileType), so they're derived from the store path, and options-layer path traversal validation happens upstream. But CodeQL #119 was specifically about raw os.Remove being outside the os.Root sandbox — these two are the same pattern as the original finding, just in Del rather than the post-delete cleanup. It's not clear why removeStorePath wasn't used consistently here while the rest of Del uses root.Remove. Worth checking whether CodeQL re-runs against these two.
This is the one item where "addressed" is accurate for the original finding but the same class of issue survives elsewhere in the file. Not a new regression from this PR — Del was unchanged from before — but if CodeQL fired on the hash-prefix dir remove, it may fire here too.
3. NFS / fsyncMode URL param — RESOLVED
parseFsyncMode exists at file.go:79–90. It parses "" and "full" both as fsyncModeFull, "data" as fsyncModeData, "none" as fsyncModeNone, and returns a ConfigurationError for anything else (case-insensitive via strings.ToLower). Default is full — confirmed: parseFsyncMode("") returns fsyncModeFull, nil.
The three modes are documented at the type declaration (file.go:55–77) with explicit operator guidance: what each mode trades off, when to use data (NFS with expensive dir fsync), when to accept none (operator opts out of crash safety). The comment is accurate and appropriately explicit about the tradeoff.
Behaviour verified against code:
fsyncModeNone:syncAndCloseTempFileskipsfile.Sync()at file.go:1292 (if s.fsyncMode != fsyncModeNone).syncParentDirBestEffortreturns immediately at file.go:1326 (if s.fsyncMode != fsyncModeFull). Both fsyncs skipped. Correct.fsyncModeData:syncAndCloseTempFilerunsfile.Sync()(condition is!= None, which passes).syncParentDirBestEffortreturns immediately (!= Full). File content durable, directory entry not. Correct.fsyncModeFull: both run. Correct.
BenchmarkSetFromReader_FsyncModes (file_test.go:767) calls real f.Set() for all three modes across 256B, 4KB, 64KB payloads. It uses a real temp dir and real fsync; on Linux CI the full/data modes invoke actual file.Sync(). The benchmark is meaningful for regression detection of the fsync schedule.
4. Bench numbers for native-op claim — SKIPPED (moot after scope split, as noted)
5. syncParentDirBestEffort doc framing — RESOLVED
The godoc at file.go:1309–1324 is clear. The first two paragraphs establish that on ext4/xfs/btrfs this is correctness-critical, not optional. The third paragraph explicitly scopes "best-effort" to failure handling only:
"The 'best-effort' framing applies only to failure handling: directory sync is not supported uniformly across platforms... so a failure to open or sync the parent directory is logged at debug level rather than turning an already-successful rename into a failed blob write."
This directly addresses the concern. The comment also preemptively warns against removal: "Removing this call regresses crash safety on those filesystems; future maintainers should not delete it under the assumption that 'best effort' implies optional."
ordishs' post-cec145fc0 items
renameTempFile cross-platform semantics: the test at file_test.go:673 (TestRenameTempFile_OverwriteAndRejectSemantics) documents the POSIX contract in the test comment and exercises both allowOverwrite values. The renameTempFile godoc at file.go:1355–1377 explains the POSIX vs non-POSIX divergence and names the TOCTOU window explicitly. Ordishs' concern is addressed.
Concurrent-writer test: TestSetFromReader_ConcurrentWritersNeverExposePartialFinal (file_test.go:588) runs 8 writer goroutines against a single key with AllowOverwrite=true for 500ms while a reader polls os.ReadFile on the final path and asserts byte-identical reads. This does exercise the atomic-publication invariant. The reader uses os.ReadFile rather than the store's Get, which is actually stronger: it verifies the kernel-visible atomicity of the rename, not just the store's API contract. One note: 500ms wall-clock is short; a heavily loaded CI machine could give this test very few reader-observed-writes. It's not a weak test, but it's also not a deterministic stress test — it'll pass even with zero successful reads as long as nothing errors. Worth keeping in mind.
big.NewInt(math.MaxInt64): at file.go:1257, rand.Int(rand.Reader, big.NewInt(math.MaxInt64)) generates a random int64-range integer for temp file naming. rand.Int returns [0, max), so the upper bound is math.MaxInt64 - 1. This is fine — the value is used only as a filename discriminator, not for cryptographic purposes, and the range is large enough to make collisions negligible with 10 retries. Ordishs flagged it as something to double-check; it's correct.
New observations
Raw os.Remove in Del (filing separately from item 2): Both os.Remove(checksumFile) (line 1187) and os.Remove(fileName) (line 1191) bypass os.Root. The rest of Del routes through removeStorePath via root.Remove. This inconsistency is pre-existing and not introduced by this PR, but it's the same pattern CodeQL fired on. If CodeQL scans Del, expect it to reopen the finding. The fix would be replacing those two calls with s.removeStorePath(...).
Health check uses raw os.Remove: lines 553 and 574 both call os.Remove on paths constructed from s.path and the system temp name. These are health-check temp files created in the store root, so the path is controlled, but they're still outside the os.Root boundary. Again pre-existing, not from this PR.
Sonar 64.8% coverage gate: ordishs' framing is correct — renameTempFile's non-POSIX branch (the allowOverwrite remove-and-retry path), syncAndCloseTempFile's close-after-sync-failure branch, and syncParentDirBestEffort's error-on-open branch are all only reachable on non-Linux platforms or via injected I/O errors. A tempfile that fails to sync or close is hard to trigger without fault injection; the renameTempFile Windows branch is physically unreachable on Linux. These aren't coverage gaps the author can close without either platform-specific builds or error-injection hooks. The 64.8% gate failure is a structural issue with the gate threshold, not with the code or tests.
Verdict
Lifting CHANGES_REQUESTED. The three blocking items are resolved as claimed. Two non-blocking follow-ups:
-
Non-blocking:
Del's two rawos.Removecalls for the blob file and checksum file (file.go:1187, 1191) are the same pattern as CodeQL #119 in a different method. Pre-existing, not introduced here. Consider routing throughremoveStorePathin a follow-up PR to close the CodeQL class consistently — the hash-prefix dir cleanup in this same method already does it correctly. -
Non-blocking: concurrent-writer test is real but time-bounded (500ms); it passes with zero observed reads. Consider adding a read counter assertion (
require.Greater(t, reads, 0)) to ensure the race actually runs before the timeout elapses.
|
Re: "merge blocker — go.mod missing the BSV Aerospike fork in the direct-require block." This doesn't reproduce against current PR HEAD (ac99a44). The direct require is present at Verification on the PR head: History also lines up — commit Most likely the failing local build was against a stale checkout from before The other items in that review (CodeQL #119 |


Summary
Fixes the file-backed blob store so final blob and checksum filenames are not exposed until their contents have been fully written, synced, closed, and atomically renamed into place.
This addresses production observations of partly written files appearing under their final filenames.
Changes
.dahhelper code and tests; blob delete-at-height behavior remains in the scheduler/pruner flow.Branch context
This branch has been rebased onto
mainand the PR targetsmain. Since the branch previously sat onfeat/teranode-native-ops, this PR also includes the native-Aerospike commit stack from that base branch in addition to the file-store commit.Notes
os.Renamepublishes within one filesystem.SetDAHremains available for current block assembly and block persister call sites, but the file store implementation schedules or cancels deletion through the blob deletion scheduler. The pruner service owns deletion execution.Verification
Passed after rebasing onto
main:go test -count=1 ./stores/blob/filego test -race -count=1 ./stores/blob/filego test -count=1 ./stores/blob/...go vet ./stores/blob/...git diff --checkPreviously passed before the rebase:
Repository-wide checks were also attempted earlier.
go test ./...andgo vet ./...are not green in this local environment due to unrelated existing or environment-heavy failures, including version-format validation, txmetacache timeout, toxiproxy/Docker/Postgres/SV-node/Kafka/e2e failures, and existing copy-lock warnings in test utilities.