fix: correct RocksDB MaxOpenFiles formula to divide by DB count#10736
Conversation
The auto-limit logic (PR #9504) set MaxOpenFiles to 80% of the system ulimit per database, but with ~15 databases the combined limit was 12x the OS total. This didn't prevent the macOS crash it was designed to fix, and caused a ~43% P90 block processing regression on Linux by forcing RocksDB to use an LRU file handle cache instead of keeping all SST handles open. Fix: properly divide by the estimated database count (~15) so the combined file handle usage stays within the OS limit.
9bb3d5a to
181c224
Compare
|
Dont reduce the limit dammit. And dont make it unlimited. |
|
You cant divide it by num of DB. Some db take up a lot of file, others dont. Thats why per db it is 80%. |
should we tweak it more explicitly for some dbs? I expect performance here is mostly about state db. |
There was a problem hiding this comment.
Pull request overview
Adjusts the auto-derived RocksDB MaxOpenFiles value to account for the fact that the OS file-descriptor limit is shared across multiple DB instances, aligning the configuration with the original goal of preventing FD exhaustion (notably on macOS).
Changes:
- Update
RocksDbConfigFactoryto divide the 80%-of-system-limit budget by an estimated DB count (~15) when auto-settingMaxOpenFiles. - Update
RocksDbConfigFactoryTeststo validate the new per-DB limit calculation for low and high system limits. - Adjust the minimum enforced
MaxOpenFilesexpectation in tests.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/Nethermind/Nethermind.Db.Rocks/Config/RocksDbConfigFactory.cs |
Changes the auto-limit formula to divide the FD budget by an estimated number of DBs. |
src/Nethermind/Nethermind.Db.Test/RocksDbConfigFactoryTests.cs |
Updates/extends unit tests to reflect the revised MaxOpenFiles calculation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| int perDbLimit = Math.Max(128, (int)(systemLimit * 0.8 / EstimatedDbCount)); | ||
|
|
There was a problem hiding this comment.
The new formula can still choose a per-DB MaxOpenFiles (128) that exceeds the computed per-DB budget (systemLimit * 0.8 / EstimatedDbCount). For low system limits this makes the total across ~15 DBs far above the intended 80% process FD budget, which undermines the crash-prevention goal and makes the preceding comment about reserving 20% inaccurate. Consider clamping the final value so that (perDbLimit * EstimatedDbCount) does not exceed systemLimit * 0.8 (and/or logging a warning when the system limit is too low to satisfy the chosen minimum).
| int perDbLimit = Math.Max(128, (int)(systemLimit * 0.8 / EstimatedDbCount)); | |
| const int MinPerDbLimit = 128; | |
| double totalBudget = systemLimit * 0.8; | |
| int budgetPerDb = (int)(totalBudget / EstimatedDbCount); | |
| int perDbLimit; | |
| if (budgetPerDb < MinPerDbLimit) | |
| { | |
| // Clamp to the per-DB budget so that EstimatedDbCount * perDbLimit does not exceed 80% of the system limit. | |
| perDbLimit = budgetPerDb; | |
| if (_logger.IsWarn) | |
| { | |
| _logger.Warn($"System open files limit {systemLimit} is too low to allocate the desired minimum of {MinPerDbLimit} MaxOpenFiles per database across {EstimatedDbCount} databases. Using reduced per-database limit of {perDbLimit} to stay within 80% of the process budget."); | |
| } | |
| } | |
| else | |
| { | |
| // Use the budget-based per-DB limit; this keeps total usage within ~80% of the system limit. | |
| perDbLimit = budgetPerDb; | |
| } |
| // Very low system limit should still give minimum of 128 | ||
| var factory = new RocksDbConfigFactory(dbConfig, new PruningConfig(), new TestHardwareInfo(0, 100), LimboLogs.Instance); | ||
| IRocksDbConfig config = factory.GetForDatabase("State0", null); | ||
| // With system limit of 1000, would be 1000 * 0.8 = 800 | ||
| config.MaxOpenFiles.Should().Be(800); | ||
| // With system limit of 100: max(128, 100 * 0.8 / 15) = max(128, 5) = 128 | ||
| config.MaxOpenFiles.Should().Be(128); |
There was a problem hiding this comment.
This test asserts that when the detected system FD limit is 100, MaxOpenFiles is set to 128. That means the auto-derived value can exceed the OS-reported limit, which contradicts the intent of auto-limiting and can make the log message misleading. If the implementation is updated to clamp the per-DB limit to the available budget, update this test to assert the clamped value (or adjust the scenario to a realistic system limit where the minimum does not exceed the OS limit).
Setting any finite MaxOpenFiles forces RocksDB into LRU table cache mode, adding overhead on every read. The original PR #9504 applied 80% of ulimit per-DB unconditionally, causing ~4% AVG / ~6% P90 regression on Linux servers with high ulimits (1048576). Only cap MaxOpenFiles when the system limit is below 10,000 (macOS default 256, restricted Docker 1024). On high-limit systems (typical Linux servers), leave unlimited for best performance. This preserves the original crash prevention for low-limit systems while eliminating the performance regression on production servers.
EXPB Benchmark ComparisonRun: View workflow run superblocksScenario:
realblocksScenario:
|
|
@claude can you do review of this PR once more to check if it can introduce any regresison on RocksDB Get method on Ubuntu 24? |
|
Claude finished @kamilchodola's task in 2m 45s —— View job Review: RocksDB MaxOpenFiles Fix — Ubuntu 24 Regression CheckShort answer: No, this PR does not introduce a regression on RocksDB Get on Ubuntu 24. For typical server deployments it should improve Get performance. How
|
| Deployment | Typical soft RLIMIT_NOFILE |
Before PR (via #9504) | After this PR | Get impact |
|---|---|---|---|---|
| User process (default) | 1 024 | max(128, 819) = 819 |
1 024 < 10 000 → same 819 |
unchanged |
systemd service (LimitNOFILE=65536) |
65 536 | max(128, 52 428) = 52 428 |
65 536 ≥ 10 000 → unlimited |
✅ improved |
| Docker container (default) | 1 048 576 | max(128, 838 860) = 838 860 |
≥ 10 000 → unlimited | ✅ improved |
On Ubuntu 24, the most common Nethermind deployment mode is a systemd service with LimitNOFILE far above the 10 000 threshold. This PR changes that path from a large-but-finite limit to unlimited — removing the LRU table-cache overhead on every Get.
The benchmark results confirm this:
realblocksAVG −2.91%, MEDIAN −4.54% — improvementrealblocksP90 +4.29% — minor; within noise range for a stochastic workload
Issues found
Low — Pre-existing: RLIMIT_NOFILE = 7 is wrong for macOS
src/Nethermind/Nethermind.Core/HardwareInfo.cs:35
The code uses the hardcoded constant 7 for both Linux and macOS, with the comment "Same value for both macOS and Linux." This is incorrect: Linux uses 7 ✓ but macOS uses 8. On macOS, getrlimit(7, ...) actually queries RLIMIT_NPROC (max processes), returning the process limit instead of the FD limit. This may cause the macOS crash-prevention path introduced in #9504 to read a wrong limit. Ubuntu 24 is unaffected (Linux correctly uses 7). This is a pre-existing bug not introduced by this PR, but worth tracking.
Low — Non-atomic initialization guard
src/Nethermind/Nethermind.Db.Rocks/Config/RocksDbConfigFactory.cs:20
private bool _maxOpenFilesInitialized;_maxOpenFilesInitialized is read/written without any synchronization primitive (volatile, Interlocked, or lock). If GetForDatabase is called concurrently from multiple threads during initialization, two threads could both observe false and both set dbConfig.MaxOpenFiles. Since they'd compute the same value, this is a benign race in practice — but it contradicts the intent of "only once" and could cause confusing debug logs if the log line fires twice. On Ubuntu 24 this is not a performance or correctness regression.
Low — Threshold edge (10 000) silently switches to unlimited
If a deployment has exactly LimitNOFILE=10000 (or anything from 10 000 to ~65 000), this PR switches from a finite per-DB limit to unlimited with no log message. The Debug log fires but only when MaxOpenFiles is left at null. If those DBs collectively open many SST files on a 10 001-FD system they could exhaust the budget. RocksDB handles open() failures gracefully via its table cache, so no crash is expected — but silent FD exhaustion could slow down Gets unexpectedly. Consider logging the decision at Info level for this case too.
Verdict
- No Critical, High, or Medium findings
- The PR does not introduce a RocksDB Get regression on Ubuntu 24
- Server deployments (Ubuntu 24 + systemd) will likely see a measurable Get improvement due to the unlimited table cache handles
- The three Low findings above are worth tracking but do not affect correctness or performance on Ubuntu 24
Summary
MaxOpenFilesauto-limit formula to properly divide by estimated database count (~15)MaxOpenFilesis set per databaseProblem
PR #9504 introduced auto-detection of the system open file limit to prevent macOS crashes during sync. The formula
perDbLimit = 80% × systemLimitwas applied per database, but with ~15 databases the combined limit was15 × 0.8 × limit = 12× the OS limit. This:Fix
Divide by estimated DB count:
perDbLimit = max(128, systemLimit × 0.8 / 15)Changes
Test plan
RocksDbConfigFactoryTests— all 11 tests passrealblockspayload set to verify P90 recovery