Skip to content

fix: correct RocksDB MaxOpenFiles formula to divide by DB count#10736

Merged
benaadams merged 6 commits into
masterfrom
fix/rocksdb-max-open-files
Mar 9, 2026
Merged

fix: correct RocksDB MaxOpenFiles formula to divide by DB count#10736
benaadams merged 6 commits into
masterfrom
fix/rocksdb-max-open-files

Conversation

@kamilchodola

@kamilchodola kamilchodola commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fix RocksDB MaxOpenFiles auto-limit formula to properly divide by estimated database count (~15)
  • The OS file descriptor limit is shared across all databases, but MaxOpenFiles is set per database

Problem

PR #9504 introduced auto-detection of the system open file limit to prevent macOS crashes during sync. The formula perDbLimit = 80% × systemLimit was applied per database, but with ~15 databases the combined limit was 15 × 0.8 × limit = 12× the OS limit. This:

  1. Didn't actually prevent the macOS crash it was designed to fix
  2. Caused a ~43% P90 block processing regression by setting an unnecessarily high per-DB limit that still forced RocksDB into LRU file handle caching mode

Fix

Divide by estimated DB count: perDbLimit = max(128, systemLimit × 0.8 / 15)

System limit Before (per DB) Before (combined) After (per DB) After (combined)
256 (macOS) 256 3840 ❌ 128 1920 ✅
1024 (low Linux) 819 12285 ❌ 128 1920 ✅
65536 (server) 52428 786420 ❌ 3495 52425 ✅

Changes

  • Bug fix (non-breaking change which fixes an issue)

Test plan

  • Updated RocksDbConfigFactoryTests — all 11 tests pass
  • Benchmark validation: run realblocks payload set to verify P90 recovery

The auto-limit logic (PR #9504) set MaxOpenFiles to 80% of the system
ulimit per database, but with ~15 databases the combined limit was
12x the OS total. This didn't prevent the macOS crash it was designed
to fix, and caused a ~43% P90 block processing regression on Linux
by forcing RocksDB to use an LRU file handle cache instead of keeping
all SST handles open.

Fix: properly divide by the estimated database count (~15) so the
combined file handle usage stays within the OS limit.
@kamilchodola kamilchodola force-pushed the fix/rocksdb-max-open-files branch from 9bb3d5a to 181c224 Compare March 6, 2026 09:42
@asdacap

asdacap commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

Dont reduce the limit dammit. And dont make it unlimited.

@kamilchodola kamilchodola changed the title fix: only restrict RocksDB MaxOpenFiles on low system limits fix: correct RocksDB MaxOpenFiles formula to divide by DB count Mar 6, 2026
@asdacap

asdacap commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

You cant divide it by num of DB. Some db take up a lot of file, others dont. Thats why per db it is 80%.

@LukaszRozmej

LukaszRozmej commented Mar 6, 2026

Copy link
Copy Markdown
Member

You cant divide it by num of DB. Some db take up a lot of file, others dont. Thats why per db it is 80%.

should we tweak it more explicitly for some dbs? I expect performance here is mostly about state db.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts the auto-derived RocksDB MaxOpenFiles value to account for the fact that the OS file-descriptor limit is shared across multiple DB instances, aligning the configuration with the original goal of preventing FD exhaustion (notably on macOS).

Changes:

  • Update RocksDbConfigFactory to divide the 80%-of-system-limit budget by an estimated DB count (~15) when auto-setting MaxOpenFiles.
  • Update RocksDbConfigFactoryTests to validate the new per-DB limit calculation for low and high system limits.
  • Adjust the minimum enforced MaxOpenFiles expectation in tests.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/Nethermind/Nethermind.Db.Rocks/Config/RocksDbConfigFactory.cs Changes the auto-limit formula to divide the FD budget by an estimated number of DBs.
src/Nethermind/Nethermind.Db.Test/RocksDbConfigFactoryTests.cs Updates/extends unit tests to reflect the revised MaxOpenFiles calculation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines 31 to 32
int perDbLimit = Math.Max(128, (int)(systemLimit * 0.8 / EstimatedDbCount));

Copilot AI Mar 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new formula can still choose a per-DB MaxOpenFiles (128) that exceeds the computed per-DB budget (systemLimit * 0.8 / EstimatedDbCount). For low system limits this makes the total across ~15 DBs far above the intended 80% process FD budget, which undermines the crash-prevention goal and makes the preceding comment about reserving 20% inaccurate. Consider clamping the final value so that (perDbLimit * EstimatedDbCount) does not exceed systemLimit * 0.8 (and/or logging a warning when the system limit is too low to satisfy the chosen minimum).

Suggested change
int perDbLimit = Math.Max(128, (int)(systemLimit * 0.8 / EstimatedDbCount));
const int MinPerDbLimit = 128;
double totalBudget = systemLimit * 0.8;
int budgetPerDb = (int)(totalBudget / EstimatedDbCount);
int perDbLimit;
if (budgetPerDb < MinPerDbLimit)
{
// Clamp to the per-DB budget so that EstimatedDbCount * perDbLimit does not exceed 80% of the system limit.
perDbLimit = budgetPerDb;
if (_logger.IsWarn)
{
_logger.Warn($"System open files limit {systemLimit} is too low to allocate the desired minimum of {MinPerDbLimit} MaxOpenFiles per database across {EstimatedDbCount} databases. Using reduced per-database limit of {perDbLimit} to stay within 80% of the process budget.");
}
}
else
{
// Use the budget-based per-DB limit; this keeps total usage within ~80% of the system limit.
perDbLimit = budgetPerDb;
}

Copilot uses AI. Check for mistakes.
Comment on lines +103 to +107
// Very low system limit should still give minimum of 128
var factory = new RocksDbConfigFactory(dbConfig, new PruningConfig(), new TestHardwareInfo(0, 100), LimboLogs.Instance);
IRocksDbConfig config = factory.GetForDatabase("State0", null);
// With system limit of 1000, would be 1000 * 0.8 = 800
config.MaxOpenFiles.Should().Be(800);
// With system limit of 100: max(128, 100 * 0.8 / 15) = max(128, 5) = 128
config.MaxOpenFiles.Should().Be(128);

Copilot AI Mar 6, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test asserts that when the detected system FD limit is 100, MaxOpenFiles is set to 128. That means the auto-derived value can exceed the OS-reported limit, which contradicts the intent of auto-limiting and can make the log message misleading. If the implementation is updated to clamp the per-DB limit to the available budget, update this test to assert the clamped value (or adjust the scenario to a realistic system limit where the minimum does not exceed the OS limit).

Copilot uses AI. Check for mistakes.
Setting any finite MaxOpenFiles forces RocksDB into LRU table cache
mode, adding overhead on every read. The original PR #9504 applied
80% of ulimit per-DB unconditionally, causing ~4% AVG / ~6% P90
regression on Linux servers with high ulimits (1048576).

Only cap MaxOpenFiles when the system limit is below 10,000 (macOS
default 256, restricted Docker 1024). On high-limit systems (typical
Linux servers), leave unlimited for best performance. This preserves
the original crash prevention for low-limit systems while eliminating
the performance regression on production servers.
@kamilchodola kamilchodola marked this pull request as ready for review March 6, 2026 19:13
@github-actions

github-actions Bot commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

EXPB Benchmark Comparison

Run: View workflow run

superblocks

Scenario: nethermind-halfpath-superblocks-fix-rocksdb-max-open-files-delay0s

Metric PR Master (cached) Delta PR vs Master
AVG (ms) 1445.357000 1435.671900 +0.67%
MEDIAN (ms) 1130.620000 1117.195000 +1.20%
P90 (ms) 2519.62 2502.35 +0.69%
P95 (ms) 2650.64 2601.30 +1.90%
P99 (ms) 4024.15 3878.72 +3.75%
MIN (ms) 751.67 751.50 +0.02%
MAX (ms) 4324.64 4529.10 -4.51%

realblocks

Scenario: nethermind-halfpath-realblocks-fix-rocksdb-max-open-files-delay0s

Metric PR Master (cached) Delta PR vs Master
AVG (ms) 40.607081 41.825554 -2.91%
MEDIAN (ms) 23.770000 24.900000 -4.54%
P90 (ms) 109.32 104.82 +4.29%
P95 (ms) 143.12 143.15 -0.02%
P99 (ms) 208.92 203.41 +2.71%
MIN (ms) 0.52 0.55 -5.45%
MAX (ms) 4109.01 4747.22 -13.44%

@benaadams benaadams merged commit 221335a into master Mar 9, 2026
111 checks passed
@benaadams benaadams deleted the fix/rocksdb-max-open-files branch March 9, 2026 23:19
@kamilchodola

Copy link
Copy Markdown
Contributor Author

@claude can you do review of this PR once more to check if it can introduce any regresison on RocksDB Get method on Ubuntu 24?

@claude

claude Bot commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Claude finished @kamilchodola's task in 2m 45s —— View job


Review: RocksDB MaxOpenFiles Fix — Ubuntu 24 Regression Check

Short answer: No, this PR does not introduce a regression on RocksDB Get on Ubuntu 24. For typical server deployments it should improve Get performance.


How MaxOpenFiles affects RocksDB Get

When MaxOpenFiles is a finite value, RocksDB wraps all SST file handles in an LRU table cache. Every Get that touches a file whose handle was evicted must reopen that file, adding latency. When MaxOpenFiles is unlimited (null), RocksDB keeps all SST handles permanently open — no eviction overhead on the read path.


Ubuntu 24 scenarios

Deployment Typical soft RLIMIT_NOFILE Before PR (via #9504) After this PR Get impact
User process (default) 1 024 max(128, 819) = 819 1 024 < 10 000 → same 819 unchanged
systemd service (LimitNOFILE=65536) 65 536 max(128, 52 428) = 52 428 65 536 ≥ 10 000unlimited ✅ improved
Docker container (default) 1 048 576 max(128, 838 860) = 838 860 ≥ 10 000 → unlimited ✅ improved

On Ubuntu 24, the most common Nethermind deployment mode is a systemd service with LimitNOFILE far above the 10 000 threshold. This PR changes that path from a large-but-finite limit to unlimited — removing the LRU table-cache overhead on every Get.

The benchmark results confirm this:

  • realblocks AVG −2.91%, MEDIAN −4.54% — improvement
  • realblocks P90 +4.29% — minor; within noise range for a stochastic workload

Issues found

Low — Pre-existing: RLIMIT_NOFILE = 7 is wrong for macOS
src/Nethermind/Nethermind.Core/HardwareInfo.cs:35

The code uses the hardcoded constant 7 for both Linux and macOS, with the comment "Same value for both macOS and Linux." This is incorrect: Linux uses 7 ✓ but macOS uses 8. On macOS, getrlimit(7, ...) actually queries RLIMIT_NPROC (max processes), returning the process limit instead of the FD limit. This may cause the macOS crash-prevention path introduced in #9504 to read a wrong limit. Ubuntu 24 is unaffected (Linux correctly uses 7). This is a pre-existing bug not introduced by this PR, but worth tracking.

Low — Non-atomic initialization guard
src/Nethermind/Nethermind.Db.Rocks/Config/RocksDbConfigFactory.cs:20

private bool _maxOpenFilesInitialized;

_maxOpenFilesInitialized is read/written without any synchronization primitive (volatile, Interlocked, or lock). If GetForDatabase is called concurrently from multiple threads during initialization, two threads could both observe false and both set dbConfig.MaxOpenFiles. Since they'd compute the same value, this is a benign race in practice — but it contradicts the intent of "only once" and could cause confusing debug logs if the log line fires twice. On Ubuntu 24 this is not a performance or correctness regression.

Low — Threshold edge (10 000) silently switches to unlimited
If a deployment has exactly LimitNOFILE=10000 (or anything from 10 000 to ~65 000), this PR switches from a finite per-DB limit to unlimited with no log message. The Debug log fires but only when MaxOpenFiles is left at null. If those DBs collectively open many SST files on a 10 001-FD system they could exhaust the budget. RocksDB handles open() failures gracefully via its table cache, so no crash is expected — but silent FD exhaustion could slow down Gets unexpectedly. Consider logging the decision at Info level for this case too.


Verdict

  • No Critical, High, or Medium findings
  • The PR does not introduce a RocksDB Get regression on Ubuntu 24
  • Server deployments (Ubuntu 24 + systemd) will likely see a measurable Get improvement due to the unlimited table cache handles
  • The three Low findings above are worth tracking but do not affect correctness or performance on Ubuntu 24

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants