recsplit: sharded FuseFilter by AskAlexSharov · Pull Request #20644 · erigontech/erigon

AskAlexSharov · 2026-04-18T09:31:51Z

Fixes #20560.

Problem:
bloatnet 3B keys .efi file building: FuseFIlter used t2hash=29GB and reverseOrder=26GB
we can't mmap FuseFilter without forking it. But maybe it's okey. Maybe it's better to shard FuseFilter rather than mmap large thing. Maybe sharding will improve data-locality.

--
Solution:
shard fusefilter by first byte of keyHash. it will reduce building allocation buffers 256x times. And it will not add reading overhead. (keyHash>>56 is not same as byte(keyHash). it's hi/lo bytes.)

Writer: re-using 1 xorfilter.BuildBinaryFuse to build all shards
ShardedReader is a [256]Reader. Lookup: shards[keyHash>>56].ContainsHash(hash)

Depends on: #20722

Copilot

Pull request overview

Adds support for a new RecSplit index format (v2) that uses a sharded fusefilter existence filter to reduce false positives and potentially improve lookup efficiency, while extending test coverage to exercise the new version.

Changes:

Introduce sharded fusefilter writer/reader (256 shards by top byte of hash) and wire it into RecSplit/Index for dataStructureVersion ≥ 2.
Refactor fusefilter blob header handling and add regression tests around header field offsets and sharded round-trips.
Extend RecSplit tests to run the existing two-layer index test suite against v2.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
db/recsplit/recsplit_test.go	Runs the two-layer index test for RecSplit version 2.
db/recsplit/recsplit.go	Creates/uses a sharded existence filter writer for v2 and closes it on cleanup.
db/recsplit/index.go	Adds sharded existence filter reader initialization and lookup gating for v2+.
db/datastruct/fusefilter/fusefilter_writer_test.go	Adds round-trip, truncation, and regression tests for sharded writer/reader and header parsing.
db/datastruct/fusefilter/fusefilter_writer.go	Adds sharded writer implementation and refactors header serialization helpers.
db/datastruct/fusefilter/fusefilter_reader.go	Adds sharded reader implementation and refactors header parsing.

Comments suppressed due to low confidence (1)

db/datastruct/fusefilter/fusefilter_reader.go:101

NewReaderOnBytes slices m into header/data without checking len(m) >= filterBlobHeaderSize, and then slices data[:fingerprintsLen] without validating fingerprintsLen against len(data). With truncated/corrupt input this will panic instead of returning an error; please add explicit bounds checks and return a descriptive error.

func NewReaderOnBytes(m []byte, fName string) (*Reader, int, error) {
	filter := &xorfilter.BinaryFuse[uint8]{}

	const headerSize = filterBlobHeaderSize
	header, data := m[:headerSize], m[headerSize:]

	v, features, err := parseHeaderFeatures(header, fName)
	if err != nil {
		return nil, 0, err
	}

	filter.SegmentCount = binary.BigEndian.Uint32(header[4:])
	filter.SegmentCountLength = binary.BigEndian.Uint32(header[4+4:])
	filter.Seed = binary.BigEndian.Uint64(header[4+4+4:])
	filter.SegmentLength = binary.BigEndian.Uint32(header[4+4+4+8:])
	filter.SegmentLengthMask = binary.BigEndian.Uint32(header[4+4+4+8+4:])
	fingerprintsLen := int(binary.BigEndian.Uint64(header[4+4+4+8+4+4:]))

	filter.Fingerprints = data[:fingerprintsLen]
	return &Reader{inner: filter, version: v, features: features, m: m}, headerSize + fingerprintsLen, nil
}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

This PR extends RecSplit’s “less false positives” existence filtering by adding a sharded fusefilter implementation and wiring it into RecSplit/Index for data structure version 2.

Changes:

Add v2 existence filter plumbing in recsplit (writer side) and index (reader side) using a new sharded fusefilter format.
Implement WriterSharded / ReaderSharded for fusefilter blobs and refactor common header/feature parsing.
Expand test coverage to exercise RecSplit v2 and the new sharded fusefilter round-trips.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
db/recsplit/recsplit_test.go	Adds TestTwoLayerIndex coverage for version 2.
db/recsplit/recsplit.go	Adds v2 existence filter writer and writes it during build.
db/recsplit/index.go	Adds v2 existence filter reader and uses it during Lookup/ForceInMem.
db/datastruct/fusefilter/fusefilter_writer_test.go	Adds new sharded writer/reader tests and a regression test for segment count fields.
db/datastruct/fusefilter/fusefilter_writer.go	Refactors header feature init; implements `WriterSharded` and shared serialization helper.
db/datastruct/fusefilter/fusefilter_reader.go	Refactors header parsing; implements `ReaderSharded`.

Comments suppressed due to low confidence (1)

db/datastruct/fusefilter/fusefilter_reader.go:100

NewReaderOnBytes slices m[:headerSize] and later data[:fingerprintsLen] without any length checks. A truncated/corrupted fusefilter blob (or a shard blob inside NewReaderShardedOnBytes) will panic with an out-of-bounds slice instead of returning an error. Please add upfront bounds checks (len(m) >= headerSize, and headerSize+fingerprintsLen <= len(m)) and return a descriptive error when the blob is too small or claims an impossible fingerprints length.

func NewReaderOnBytes(m []byte, fName string) (*Reader, int, error) {
	filter := &xorfilter.BinaryFuse[uint8]{}

	const headerSize = filterBlobHeaderSize
	header, data := m[:headerSize], m[headerSize:]

	v, features, err := parseHeaderFeatures(header, fName)
	if err != nil {
		return nil, 0, err
	}

	filter.SegmentCount = binary.BigEndian.Uint32(header[4:])
	filter.SegmentCountLength = binary.BigEndian.Uint32(header[4+4:])
	filter.Seed = binary.BigEndian.Uint64(header[4+4+4:])
	filter.SegmentLength = binary.BigEndian.Uint32(header[4+4+4+8:])
	filter.SegmentLengthMask = binary.BigEndian.Uint32(header[4+4+4+8+4:])
	fingerprintsLen := int(binary.BigEndian.Uint64(header[4+4+4+8+4+4:]))

	filter.Fingerprints = data[:fingerprintsLen]
	return &Reader{inner: filter, version: v, features: features, m: m}, headerSize + fingerprintsLen, nil

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

… constant

Fixes #20560. Problem: bloatnet 3B keys .efi file building: FuseFIlter used `t2hash=29GB` and `reverseOrder=26GB` we can't mmap FuseFilter without forking it. But maybe it's okey. Maybe it's better to shard FuseFilter rather than mmap large thing. Maybe sharding will improve data-locality. -- Solution: shard `fusefilter` by first byte of `keyHash`. it will reduce building allocation buffers 256x times. And it will not add reading overhead. (`keyHash>>56` is not same as `byte(keyHash)`. it's `hi/lo` bytes.) - Writer: re-using 1 `xorfilter.BuildBinaryFuse` to build all shards - `ShardedReader` is a `[256]Reader`. Lookup: `shards[keyHash>>56].ContainsHash(hash)` Depends on: #20722 --------- Co-authored-by: JkLondon <me@ilyamikheev.com>

Cherry-pick of [be36d4f](be36d4f) (original PR #20644) onto `performance`. ## Summary - Shards `fusefilter` by `keyHash>>56` (256 shards) to cut FuseFilter build-time RAM ~256× on large `.efi` files (3B-key bloatnet case). - Writer reuses one `xorfilter.BuildBinaryFuse` to build all shards; `ShardedReader` is a `[256]Reader` with lookup `shards[keyHash>>56].ContainsHash(hash)`. - Adds `ExistenceFilterVersion = 2` (sharded format), keeps v1 (monolithic) and v0 (legacy) intact. `recsplit.RecSplitArgs.Version` is now `version.DataStructureVersion` (was `uint8`). - Wires `snaptype.BuildIndex` to pick `recsplit.ExistenceFilterVersion` whenever the index version is not `v1.0` (v1.0 keeps inner v=0 for compat). - Brings the full new test set: sharded writer/reader tests, corruption tests, fuzz tests, real-file tests, and benchmarks + `scripts/fusefilter-bench/`. ## Cherry-pick status Clean cherry-pick. No manual conflict resolutions required — `performance` already has all the prerequisites the original commit assumed: - `xorfilter v0.5.1` (provides `MakeBinaryFuseBuilder` / `BuildBinaryFuse`) - Package-scope `headerSize` in `db/datastruct/fusefilter/fusefilter_reader.go` - `offsetFile *os.File` / `offsetWriter *bufio.Writer` in `recsplit.RecSplit` (from PR #20722) Auto-merge handled minor textual deltas in `.gitignore`, `fusefilter_reader.go`, `recsplit/index.go`, and `db/state/domain.go` without conflicts. ## Test plan - [ ] `go build` of affected packages: passes locally (`db/recsplit/...`, `db/datastruct/fusefilter/...`, `db/snaptype/...`) - [ ] `go test ./db/datastruct/fusefilter/... ./db/recsplit/...` - [ ] CI on the branch > Note: `./db/state/statecfg/...` fails to build with `undefined: InitSchemas` on `origin/performance` independently of this PR — pre-existing issue on the base branch, unrelated to the cherry-pick. ## References - Original PR: #20644 - Original issue: #20560 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com> Co-authored-by: JkLondon <me@ilyamikheev.com>

## Summary Re-enable the sharded FuseFilter existence-filter format by setting `ExistenceFilterVersion` back from `1` to `2` in `db/recsplit/recsplit.go`. This is the switch that erigontech#21164 turned off after erigontech#20644 introduced the format. New index files (`.kvi`/`.efi`) will again embed the sharded BinaryFuse filter (256 shards by `keyHash>>56`), which cuts existence-filter build-time RAM ~256× (the motivation was bloatnet 3B-key `.efi` builds). ## Why it's safe - **Read path is version-aware.** `db/recsplit/index.go` reads the embedded `dataStructureVersion` per file and dispatches to v0/v1/v2 readers, so existing v1 files stay readable and new v2 files are read by the matching `case 2` path. - **Accessor/existence files are built locally, not seeded.** `SeedableV3Extensions()` publishes only `.kv/.v/.ef/.ap`; `.kvi/.efi/.kvei` are rebuilt locally via `BuildMissedAccessors`, so this does not affect the shared snapshot set. ## Test plan - [ ] CI green Co-authored-by: JkLondon <me@ilyamikheev.com>

AskAlexSharov added 11 commits April 18, 2026 14:01

save

731bf48

save

567454c

save

e791d9e

save

c115749

save

58642e3

save

0e457b2

save

c5aee04

save

73cdc2c

save

0feced0

save

18cfed7

Merge remote-tracking branch 'origin/main' into alex/sharded_fuse2_35

c3554c7

AskAlexSharov requested a review from sudeepdino008 as a code owner April 18, 2026 09:31

AskAlexSharov requested review from Copilot and removed request for sudeepdino008 April 18, 2026 09:32

Copilot started reviewing on behalf of AskAlexSharov April 18, 2026 09:32 View session

Copilot AI reviewed Apr 18, 2026

View reviewed changes

Comment thread db/recsplit/index.go

Comment thread db/datastruct/fusefilter/fusefilter_reader.go

AskAlexSharov added 3 commits April 18, 2026 17:02

save

fdd36a8

save

6f07469

Merge remote-tracking branch 'origin/main' into alex/sharded_fuse2_35

9218f8d

AskAlexSharov requested a review from Copilot April 18, 2026 10:08

Copilot started reviewing on behalf of AskAlexSharov April 18, 2026 10:08 View session

Copilot AI reviewed Apr 18, 2026

View reviewed changes

Comment thread db/datastruct/fusefilter/fusefilter_reader.go Outdated

Comment thread db/datastruct/fusefilter/fusefilter_reader.go Outdated

AskAlexSharov added 5 commits April 18, 2026 17:26

save

7700b37

save

565fe41

Merge branch 'main' into alex/sharded_fuse2_35

2ec478a

save

27e9a64

Merge branch 'main' into alex/sharded_fuse2_35

5cde2eb

AskAlexSharov requested a review from wmitsuda April 20, 2026 11:39

AskAlexSharov mentioned this pull request Apr 20, 2026

experiment: systems-level shard layer above RecSplit buckets #20687

Closed

Merge branch 'main' into alex/sharded_fuse2_35

2ded51a

AskAlexSharov and others added 16 commits May 4, 2026 13:48

db/recsplit: fix newExistenceFilterWriter to use instance version not…

c24a87b

… constant

commit

0327a30

commit

2410378

commit

802b7a5

commit

561b119

commit

a6919f3

commit

8cab667

Merge branch 'main' into pr-20644

1faee63

commit

7baa21c

commit

cfd5efb

commit

5d382b8

commit

d1ffbcc

commit

4b90b8d

commit

a7f1d5f

commit

f4d3b14

commit

57cdb6c

AskAlexSharov mentioned this pull request May 12, 2026

[performance] Horrible performance on commitment merge of bloatnet #20010

Closed

JkLondon changed the title ~~[wip] recsplit: sharded FuseFilter~~ recsplit: sharded FuseFilter May 12, 2026

Merge branch 'main' into alex/sharded_fuse2_35

76fefde

JkLondon enabled auto-merge May 12, 2026 09:26

JkLondon added this pull request to the merge queue May 12, 2026

JkLondon self-assigned this May 12, 2026

Merged via the queue into main with commit be36d4f May 12, 2026
37 checks passed

JkLondon deleted the alex/sharded_fuse2_35 branch May 12, 2026 10:43

JkLondon mentioned this pull request May 12, 2026

recsplit: sharded FuseFilter (cherry-pick #20644) #21144

Merged

3 tasks

JkLondon mentioned this pull request Jun 13, 2026

db/recsplit: enable sharded FuseFilter by default #21793

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

recsplit: sharded FuseFilter#20644

recsplit: sharded FuseFilter#20644
JkLondon merged 52 commits into
mainfrom
alex/sharded_fuse2_35

AskAlexSharov commented Apr 18, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

AskAlexSharov commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

AskAlexSharov commented Apr 18, 2026 •

edited

Loading