Skip to content

[SharovBot] fix data race in FilesItem.closeFilesAndRemove#21384

Closed
erigon-copilot[bot] wants to merge 1 commit into
mainfrom
erigon-copilot/fix-filesitem-closerace
Closed

[SharovBot] fix data race in FilesItem.closeFilesAndRemove#21384
erigon-copilot[bot] wants to merge 1 commit into
mainfrom
erigon-copilot/fix-filesitem-closerace

Conversation

@erigon-copilot

Copy link
Copy Markdown
Contributor

Summary

  • Fix data race in (*FilesItem).closeFilesAndRemove() detected by -race in TestHistoryVerification_SimpleBlocks
  • Root cause: TOCTOU race between deleteMergeFile (sets canDelete=true, checks refcount.Load()==0) and refcntDecrement (does refcount.Add(-1)==0 && canDelete.Load()), allowing both paths to concurrently enter closeFilesAndRemove() on the same FilesItem
  • Fix: add sync.Once field to FilesItem and wrap closeFilesAndRemove() body in Once.Do(), ensuring at-most-once execution regardless of concurrent callers

Test plan

  • go test -race -run TestHistoryVerification_SimpleBlocks -count=5 ./execution/verify/... — all 5 runs pass, no DATA RACE warnings
  • go build ./... passes
  • No test files modified

🤖 Generated with Claude Code

Use sync.Once to ensure closeFilesAndRemove executes at most once per
FilesItem, preventing concurrent close from deleteMergeFile and
refcntDecrement paths racing on the same instance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-authored-by: Giulio Rebuffo <giulio.rebuffo@gmail.com>

@AskAlexSharov AskAlexSharov left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we need better solution for this problem, or revert:
#20462
#20490

@yperbasis

Copy link
Copy Markdown
Member

Closing in favour of PR #21397

@yperbasis yperbasis closed this May 27, 2026
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 27, 2026
## Context

`Aggregator.BeginFilesRo()` was made lock-free in erigontech#20462/erigontech#20490, but
physical
file deletion stayed gated by two per-`FilesItem` atomics (`refcount` +
`canDelete`).
Two atomics guarding one destructive action (`closeFilesAndRemove`) is
the TOCTOU
double-free behind erigontech#21384, and a per-file refcount taken *after* the
snapshot
pointer is loaded can't protect the load→pin window by itself.

## What this does

Replaces per-file `refcount`/`canDelete` with MVCC reclamation gated by
a refcount
on the published bundle (`aggregatorVisible`) — the MDBX freelist model
(a page
freed at txnid `T` is reclaimable once the oldest live reader's txnid `>
T`),
realized in Go by reference-counting the generation object instead of
each file.

- Published bundles form an oldest→newest chain; a reader pins exactly
one via
`refcnt`. `refcnt` only grows while a bundle is current, only shrinks
once superseded.
- `BeginFilesRo` does validate-after-pin (one atomic add + re-check),
closing the
  load→pin window. One add instead of dozens of per-file increments.
- Files removed from `dirtyFiles` by a merge/prune are attached to the
outgoing
generation's `retired` set and physically deleted only once that
generation
  (and every older one) drains — reclaimed oldest-first, single owner of
  `closeFilesAndRemove`, no per-file flag, no double-free.
- `DebugBeginDirtyFilesRo` (BuildMissedAccessors) pins the generation
the same way,
so its captured dirty files — including unindexed ones absent from the
visible
  set — are protected for the duration of the accessor build.

`FilesItem.refcount`/`canDelete` are now used only by the forkable
subsystem
(out of scope here).

Design + file lifecycle:
`docs/plans/20260525-lockfree-file-reclamation-spec.md`.

## Status

WIP. Validated locally: `db/state/...` under `-race` (no data races),
`make lint`,
`make erigon integration`.

---------

Co-authored-by: milen <94537774+taratorio@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants