
perf: Remove unnecessary caches #5439

Merged
bthomee merged 41 commits into develop from bthomee/disable-cache
Feb 6, 2026

Conversation

@bthomee
Collaborator

@bthomee bthomee commented May 20, 2025

High Level Overview of Change

This change removes the cache in DatabaseNodeImp and simplifies the caching logic in SHAMapStoreImp.

Context of Change

The codebase contains many caches, and this PR is one of several that refactor the codebase with the goal of improving performance, such as by reducing memory use, to support future account, trust line, and transaction growth. In this case, as NuDB and RocksDB already use caches internally, additional caches in the code add little value and may even be unnecessary. A number of performance analyses demonstrated improved performance, and more extensive performance analyses will be performed before this change is released.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Performance (increase or change in throughput and/or latency)
  • Tests (you added tests for code that already exists, or your new feature included in this PR)
  • Documentation update
  • Chore (no impact to binary, e.g. .gitignore, formatting, dropping support for older tooling)
  • Release

@bthomee bthomee added the QE test required label (RippleX QE Team must look at this PR) May 20, 2025
@bthomee bthomee marked this pull request as ready for review May 20, 2025 14:15
@bthomee bthomee requested review from a team and vlntb May 20, 2025 14:15
@codecov

codecov bot commented May 20, 2025

Codecov Report

❌ Patch coverage is 29.62963% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.9%. Comparing base (25d7c2c) to head (32d1414).
⚠️ Report is 1 commit behind head on develop.

Files with missing lines Patch % Lines
src/libxrpl/nodestore/DatabaseNodeImp.cpp 20.8% 19 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           develop   #5439   +/-   ##
=======================================
  Coverage     79.9%   79.9%           
=======================================
  Files          840     840           
  Lines        65549   65483   -66     
  Branches      7266    7250   -16     
=======================================
- Hits         52352   52318   -34     
+ Misses       13197   13165   -32     
Files with missing lines Coverage Δ
include/xrpl/nodestore/Database.h 69.2% <ø> (ø)
include/xrpl/nodestore/detail/DatabaseNodeImp.h 72.7% <ø> (-3.7%) ⬇️
...nclude/xrpl/nodestore/detail/DatabaseRotatingImp.h 66.7% <ø> (ø)
src/libxrpl/nodestore/DatabaseRotatingImp.cpp 63.9% <ø> (+0.8%) ⬆️
src/xrpld/app/main/Application.cpp 70.9% <ø> (+0.1%) ⬆️
src/xrpld/app/misc/SHAMapStoreImp.cpp 76.2% <100.0%> (+0.4%) ⬆️
src/xrpld/app/misc/SHAMapStoreImp.h 95.7% <ø> (ø)
src/libxrpl/nodestore/DatabaseNodeImp.cpp 36.6% <20.8%> (+2.5%) ⬆️

... and 6 files with indirect coverage changes



@PeterChen13579
Contributor

Just looping in @kuznetsss before approving. I don't believe we do any node-level DB caching in Clio, right? So no changes should be needed on our side?

@kuznetsss
Contributor

As I understand it, this is rippled's internal database, so no changes are required on Clio's side.

@bthomee bthomee added this to the 2.6.0 (Q3 2025) milestone May 27, 2025
@bthomee bthomee removed this from the 2.6.0 (Q3 2025) milestone Aug 13, 2025
@sublimator
Contributor

I was looking into memory usage a few weeks ago too and noticed some oddities with the node store cache. Sharing my notes.


So regarding the NodeStore cache, like you, I'm of the opinion that it is actually largely pointless.

  • Peer messages use the SHAMap as source of truth and use a custom wire encoding
  • There's already a TreeNodeCache for SHAMap nodes that you may want to have lying around

Why was it ever there!? Well, it seems like back in the day, for some reason the TreeNodeCache was actually keyed by an id representing a node's position in the tree. This was also before NuDB. Think LevelDB and SQLite. So you had the tree node cache often missing, and a slow k/v store, so I'm guessing it was a backup cache.

It's kind of crazy: we read from the node store, then create a DecodedBlob from decompressed data, only to immediately make a copy of that, and then cache it "because reasons".

Nowadays? Shrug. It seems you can disable it by setting cache_age=0 and cache_size=0 in the [node_db] configuration.
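
As a concrete illustration, those two settings would go in the [node_db] stanza of rippled.cfg. A minimal sketch, where the type and path values are just placeholders; only the two cache keys matter here:

```ini
[node_db]
type=NuDB
path=./db/nudb
# Setting both of these to 0 leaves the NodeObject cache unconstructed
cache_size=0
cache_age=0
```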


Attached are my Claude notes from when I did the investigation.


NodeObject Cache Investigation

Summary of Findings

After investigating the NodeObject cache in rippled/xahaud, it appears to serve no clear purpose in the current architecture. Evidence suggests it might be a remnant from an earlier design that's now just adding overhead to node fetch operations. But without access to original design discussions, we can't be completely certain.

Historical Investigation

A Theory: TreeNodeCache Was Originally Keyed by NodeID

Looking at the July 2013 codebase (commit c645901), there's interesting evidence that might explain why the NodeObject cache exists. The code suggests there may have been a fundamental issue with how the TreeNodeCache was implemented.

What the 2013 Code Shows:

  1. The mTNByID Map (ripple_SHAMap.h:225):

    boost::unordered_map<SHAMapNode, SHAMapTreeNode::pointer> mTNByID;

    The map name itself - "Tree Nodes By ID" - suggests it was keyed by NodeID, not hash.

  2. The HashedObjectStore Cache (ripple_HashedObjectStore.h):

    class HashedObjectStore
    {
    public:
        HashedObjectStore (int cacheSize, int cacheAge);
        
        HashedObject::pointer retrieve (uint256 const& hash)
        {
            if (mLevelDB)
                return retrieveLevelDB (hash);
            return retrieveSQLite (hash);
        }
    
    private:
        TaggedCache<uint256, HashedObject, UptimeTimerAdapter> mCache;
        KeyCache <uint256, UptimeTimerAdapter> mNegativeCache;

    The HashedObjectStore had its own cache keyed by hash (uint256), separate from the TreeNodeCache.

  3. How getNode() Worked (ripple_SHAMap.cpp:204):

    SHAMapTreeNode::pointer node = checkCacheNode(id);  // Lookup by NodeID
    if (node) {
    #ifdef DEBUG
        if (node->getNodeHash() != hash)  // Then verify hash matches
            throw std::runtime_error("invalid node");
    #endif

    It seems to look up by NodeID first, then check if the hash matches - which could be problematic.

  4. The Fallback Path (ripple_SHAMap.cpp:820):

    SHAMapTreeNode::pointer SHAMap::fetchNodeExternalNT(const SHAMapNode& id, uint256 const& hash)
    {
        // When mTNByID misses...
        HashedObject::pointer obj(theApp->getHashedObjectStore().retrieve(hash));

    When the ID-based lookup failed, it fell back to a hash-based lookup in the HashedObjectStore.

  5. How HashedObjectStore Retrieved Objects (ripple_HashedObjectStore.cpp):

    HashedObject::pointer HashedObjectStore::retrieveLevelDB (uint256 const& hash)
    {
        HashedObject::pointer obj = mCache.fetch (hash);
        
        if (obj || mNegativeCache.isPresent (hash) || !theApp || !theApp->getHashNodeLDB ())
            return obj;
        
        // ... fetch from LevelDB if not in cache
        obj = LLRetrieve (hash, theApp->getHashNodeLDB ());
        
        if (!obj)
        {
            mNegativeCache.add (hash);
            return obj;
        }
        
        mCache.canonicalize (hash, obj);
        return obj;
    }

    The HashedObjectStore maintained its own cache using the hash as the key, which would succeed even when mTNByID missed.

This could have created an interesting problem:

  • The same content (same hash) might exist at different positions (different IDs) in different trees
  • An ID-based cache would miss when looking for that content at a different position
  • A hash-based cache would find it

Could NodeObject Cache Have Been a Workaround?

If the above theory is correct, the NodeObject cache (keyed by hash) might have been added to catch what the TreeNodeCache (keyed by ID) missed. This would explain why there were two caches doing seemingly similar things.

It's worth noting that in 2013, the backend databases (SQLite and LevelDB) would have been considerably slower than modern options like NuDB or RocksDB. Given these performance characteristics, having a secondary cache to avoid repeated database lookups when the TreeNodeCache missed would have made more sense.

What the 2014 Documentation Says

By July 2014, the team seemed aware of these issues. From the SHAMap README (commit ea44497):

"When we navigate the tree (say, like SHAMap::walkTo()) we currently ask each node for information that we could determine locally... The next refactor should remove all calls to SHAMapTreeNode::GetID()... Then we can change the SHAMap::mTNByID member to be mTNByHash."

This suggests they wanted to move away from ID-based lookups entirely.

Current State

Today's code uses direct pointer traversal:

std::shared_ptr<SHAMapTreeNode> node = parent->getChild(branch);
if (node || !backed_)
    return node;

It seems like the architecture has evolved significantly, but the NodeObject cache remains - possibly as a remnant of the old workaround.

Current Caching Layers

Looking at the current code, there appear to be three caching layers:

1. TreeNodeCache

  • Caches fully parsed tree nodes
  • Seems useful for historical ledger queries
  • Uses the modern hash-based lookup

2. NodeObject Cache

  • Caches raw serialized bytes WITHOUT the 9-byte prefix (the prefix is stripped during the copy)
  • Unclear what benefit it provides today
  • Forces an extra copy operation on every fetch to remove the prefix

3. In-Memory SHAMap

  • Direct pointer tree structure for current ledgers
  • Handles most operations without any cache lookups

Observed Inefficiencies

The Copy Problem

Every node fetch from disk appears to do an unnecessary copy:

// In DecodedBlob::createObject()
Blob data(m_objectData, m_objectData + m_dataBytes);  // COPY!

This happens because:

  • The decompressed data includes a 9-byte prefix
  • NodeObject doesn't want the prefix
  • So it copies everything except the first 9 bytes

Where NodeObject Cache Doesn't Seem to Help

  1. Current Ledger Operations: Use direct pointers, no cache needed
  2. Historical Queries: TreeNodeCache already has the parsed nodes
  3. P2P Syncing: Uses its own wire format
  4. Startup: Cache is empty, so it's just overhead
  5. RPC Queries: TreeNodeCache handles these better

Questions and Uncertainties

  • Was the NodeObject cache really a workaround for the ID-based TreeNodeCache? The evidence suggests it, but we can't be certain.
  • Are there any edge cases where the NodeObject cache provides value that we're missing?
  • Why wasn't it removed when the architecture changed? Fear of breaking something? Or does it serve some purpose we haven't identified?

Potential Impact of Removal

If the NodeObject cache truly serves no purpose, removing it could:

  • Eliminate the unnecessary copy on every disk read
  • Reduce memory usage
  • Simplify the code
  • Possibly improve performance

But without comprehensive testing, it's hard to be certain there aren't hidden dependencies.

Disabling the Cache

Interestingly, the NodeObject cache is already optional in the codebase. The cache initialization logic in DatabaseNodeImp.h shows how it can be disabled:

// Lines 46-76 from DatabaseNodeImp.h
std::optional<int> cacheSize, cacheAge;

if (config.exists("cache_size"))
{
    cacheSize = get<int>(config, "cache_size");
    if (cacheSize.value() < 0)
    {
        Throw<std::runtime_error>(
            "Specified negative value for cache_size");
    }
}

if (config.exists("cache_age"))
{
    cacheAge = get<int>(config, "cache_age");
    if (cacheAge.value() < 0)
    {
        Throw<std::runtime_error>(
            "Specified negative value for cache_age");
    }
}

if (cacheSize != 0 || cacheAge != 0)
{
    cache_ = std::make_shared<TaggedCache<uint256, NodeObject>>(
        "DatabaseNodeImp",
        cacheSize.value_or(0),
        std::chrono::minutes(cacheAge.value_or(0)),
        stopwatch(),
        j);
}

The key line is if (cacheSize != 0 || cacheAge != 0) - the cache is only created if at least one of these values is non-zero. Setting both to 0 leaves cache_ as nullptr.

Throughout DatabaseNodeImp.cpp, the code consistently checks if cache_ exists before using it:

// Line 55: storing objects
if (cache_ && !skipCache)

// Line 71: fetching objects  
if (cache_)

// Line 98: conditional fetch
cache_ ? cache_->fetch(hash) : nullptr

This means the cache can be disabled through configuration by setting both cache_age and cache_size to 0. When disabled:

  • All cache operations are skipped
  • The code falls back to direct backend access
  • No performance penalty from the unnecessary copy operation

This provides an easy way to test whether the cache provides any real benefit - just disable it and measure the impact.

Additional Problems: P2P Cache Pollution

Another major issue with the NodeObject cache is that peer requests pollute it with one-off data. When peers request nodes via TMGetObjectByHash, the code fetches from NodeStore, which adds the data to the cache even though it's quite possibly never going to be used again, especially if it's for a historical ledger.

This means if you have the cache configured, it will continuously grow with useless data every time peers request nodes from you. The cache fills with garbage that has near-zero probability of reuse, evicting any potentially useful entries. It's essentially uncontrolled memory growth triggered by external peer requests. See .ai-docs/nodeobject-cache-p2p-fix.md for a proposed solution.

Recommendations for Further Investigation

  1. Test with cache_size=0 and cache_age=0 to disable the cache completely
  2. Run benchmarks comparing performance with and without the cache
  3. Monitor memory usage and query latency in both configurations
  4. Look for any unexpected behavior when cache is disabled
  5. Investigate P2P request patterns to quantify cache pollution

The fact that the cache is already optional suggests the developers recognized it might not always be necessary. Testing with it disabled would be a low-risk way to validate whether it's truly vestigial.

@bthomee
Collaborator Author

bthomee commented Aug 14, 2025

@sublimator I loved the historical context and the results from your deep dive - the amount of detail is very much appreciated!

As for your recommendations, our old performance analysis platform didn't let me modify the rippled.cfg, hence I created this PR in which I completely removed the code so I could evaluate its memory and cpu usage. It did show memory improvements at a slight increase in CPU. However, as our platform turned out to have some shortcomings (for one, it measured system memory rather than process memory, while it was being run on a machine in the cloud that can be shared with an arbitrary number of other services), we have been working on a new platform. This new platform is just about ready, so I'll be rerunning the experiment there to ensure there indeed performance improvements without any ill side-effects.

@sublimator
Contributor

@bthomee

You're welcome, I assumed you could recontextualize/summarize it! Those were my Claude notes.

netOPs_ = &app_.getOPs();
ledgerMaster_ = &app_.getLedgerMaster();
fullBelowCache_ = &(*app_.getNodeFamily().getFullBelowCache());
treeNodeCache_ = &(*app_.getNodeFamily().getTreeNodeCache());
Contributor


@bthomee

It seems to be used by the SHAMap code itself. The tree node cache is more than a simple cache and is used for canonicalizing nodes. I was just deep diving into rippled's SHAMap recently, and then I recalled this PR.

   void SHAMap::canonicalize(SHAMapHash const& hash, std::shared_ptr<SHAMapTreeNode>& node) const
   {
       assert(backed_);
       assert(node->cowid() == 0);  // Only canonical nodes can be cached
       assert(node->getHash() == hash);
   
       f_.getTreeNodeCache(ledgerSeq_)
           ->canonicalize_replace_client(hash.as_uint256(), node);
   }

It's what is used to make sure there's only one instance of a given node when structurally sharing trees.

All you're doing here is removing the SHAMapStoreImp reference, but maybe that clear() is actually useful? I'm not sure when/why/how clearCaches() is being called, but it might actually be doing something beneficial to keep memory usage down there.

Contributor


Oh, wait, sorry it's calling freshenCache/freshenCaches

Contributor


    template <class CacheInstance>
    bool
    freshenCache(CacheInstance& cache)
    {
        std::uint64_t check = 0;

        for (auto const& key : cache.getKeys())
        {
            dbRotating_->fetchNodeObject(
                key, 0, NodeStore::FetchType::synchronous, true); // <- duplicate = true
            if (!(++check % checkHealthInterval_) && healthWait() == stopping)
                return true;
        }

        return false;
    }

By my reading, freshenCache is actually forcing any memory-resident nodes to be written with duplicate=true to the new backend during rotation. It's simply using the cache keys to determine which map nodes are in memory, rather than to "freshen" that particular cache.

See:
https://github.com/XRPLF/rippled/blob/cb52c9af001316a906665a10283f6f5dd271ebb3/src/xrpld/nodestore/detail/DatabaseRotatingImp.cpp#L132C10-L189

Given you can't actually remove the tree node cache, as it's relied upon for canonicalizing nodes, the question becomes whether you want the new backend to have those keys.

Contributor

@vlntb vlntb Sep 16, 2025


Great research, @sublimator. I followed your notes and agree excluding (i.e. shrinking/zero-sizing) treeNodeCache_ can hurt rotation performance. The freshenCache pass uses the cache’s keys to warm the new backend—if the cache is empty, we lose that warm-start and we’ll see more cold misses immediately after rotation.

We should keep freshening the cache using the tree node cache, but we don’t need to keep it as a class member variable (dropping the member pointer doesn’t save memory—just fetch at the call site and pass *ptr).

The other realisation from your comment is that we’re currently trying to make the memory footprint of rippled smaller, which may lead to increased internal latency / decreased throughput—similar to the behaviour we see around DB rotation.

Collaborator Author


Thanks for the great feedback and discussion - I restored the freshenCaches.

Contributor


Just saw these comments, thanks :)

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.



Comment on lines +73 to 80
    auto results = backend_->fetchBatch(batch).first;
    for (size_t i = 0; i < results.size(); ++i)
    {
        auto nObj = std::move(results[i]);
        size_t index = indexMap[cacheMisses[i]];
        auto const& hash = hashes[index];

        if (nObj)
        {
            // Ensure all threads get the same object
            if (cache_)
                cache_->canonicalize_replace_client(hash, nObj);
        }
        else
        {
            JLOG(j_.error()) << "fetchBatch - "
                             << "record not found in db or cache. hash = "
                             << strHex(hash);
            if (cache_)
            {
                auto notFound = NodeObject::createObject(hotDUMMY, {}, hash);
                cache_->canonicalize_replace_client(hash, notFound);
                if (notFound->getType() != hotDUMMY)
                    nObj = std::move(notFound);
            }
        }
    }

Copilot AI Feb 3, 2026


fetchBatch now returns backend_->fetchBatch(batch).first directly. This can return a vector whose size does not match hashes.size() (e.g., the NullBackend implementation returns an empty pair), which is a behavioral change from the prior code that always returned hashes.size() entries. Consider enforcing/normalizing the result size (and/or handling a non-ok status) so callers can safely assume positional correspondence with the input hashes.

Collaborator Author


Added an assert to confirm. However, implementations in MemoryFactory, NuDBFactory, and RocksDBFactory show that the same number of output objects are created as input hashes are provided.

Collaborator


Added an assert to confirm. However, implementations in MemoryFactory, NuDBFactory, and RocksDBFactory show that the same number of output objects are created as input hashes are provided.

I think @copilot is wrong here. The original code did a move assignment, which updates the size() of results to match whatever was returned. The initial size is essentially ignored.

You can see what I'm talking about with this unit test case snippet:

        std::vector<std::string> a{std::size_t{1024}};
        std::vector<std::string> b{std::size_t{2}};
        std::vector<std::string> c{std::size_t{2048}};
        BEAST_EXPECT(a.size() == 1024);
        BEAST_EXPECT(b.size() == 2);
        BEAST_EXPECT(c.size() == 2048);
        a = b;
        BEAST_EXPECT(a.size() == 2);
        BEAST_EXPECT(b.size() == 2);
        BEAST_EXPECT(c.size() == 2048);
        c = std::move(b);
        BEAST_EXPECT(a.size() == 2);
        BEAST_EXPECT(b.size() == 0);
        BEAST_EXPECT(c.size() == 2);

I'd go so far as to say that the assert is wrong, because as it said, NullBackend intentionally returns an empty vector.

Collaborator


I just looked back again at the original code, and noticed that it guaranteed that results had the same size as hashes by creating it first, then copying the individual items over by index after canonicalizing. (I also realized that there was no bounds checking, so that if the backend_ returned a larger array, it would try to write past the end of results.)

To preserve that behavior, I suggest adding results.resize(hashes.size()); after the vector is created.

Collaborator Author


Done.

Note that there's an implicit assumption that results[i] refers to the same element as hashes[i] (both in the new and old code). If the backend gives us back more or fewer results than input hashes, then perhaps it would be useful to raise a bigger stink about it - if the number of results is different, then how can we trust they are even in the right order?

Collaborator


Note that there's an implicit assumption that results[i] refers to the same element as hashes[i] (both in the new and old code). If the backend gives us back more or fewer results than input hashes, then perhaps it would be useful to raise a bigger stink about it - if the number of results is different, then how can we trust they are even in the right order?

That's a good question, in both the old and new code. Perhaps have an assertion before resizing that results.size() == hashes.size() || results.empty(). If it's empty, then none of the items were found. Or do you think it should be a bigger stink like a throw or LogicError?

Collaborator Author


I added the assertion. I prefer to play it safe and find the error in the logs, rather than have all nodes on the network crash if we missed an edge case.

Contributor

Copilot AI commented Feb 3, 2026

@ximinez I've opened a new pull request, #6323, to work on those changes. Once the pull request is ready, I'll request review from you.

Collaborator

@ximinez ximinez left a comment


I left some comments on existing threads

Copilot AI review requested due to automatic review settings February 3, 2026 22:29
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.



Copilot AI review requested due to automatic review settings February 4, 2026 19:01
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.



Collaborator

@ximinez ximinez left a comment


One question, but either way this looks good.

// Also clear the FullBelowCache so its generation counter is bumped.
// This prevents stale "full below" markers from persisting across
// backend rotation/online deletion and interfering with SHAMap sync.
app_.getNodeFamily().getFullBelowCache()->clear();
Collaborator


Should you check that getFullBelowCache() doesn't return nullptr? I don't think the original implementation did.

Collaborator Author

@bthomee bthomee Feb 5, 2026


I see that the full below cache is created in the NodeFamily constructor, which in turn is created in the constructor of ApplicationImp, and that is called via make_Application in Main.cpp.

In no place do I see where the full below cache could become invalidated, since the only functions that can reach the cache are in the application, and the node family that holds the cache will only release it once the application is destroyed.

(If we should check for nullptr then there are many other places where it can go wrong too - see the freshenCaches function below.)

@bthomee bthomee added and then removed the QE test required label (RippleX QE Team must look at this PR) Feb 5, 2026
@bthomee bthomee merged commit 677758b into develop Feb 6, 2026
27 checks passed
@bthomee bthomee deleted the bthomee/disable-cache branch February 6, 2026 14:42
gregtatcam added a commit to gregtatcam/rippled that referenced this pull request Mar 7, 2026
* chore: Set ColumnLimit to 120 in clang-format (XRPLF#6288)

This change updates the ColumnLimit from 80 to 120, and applies clang-format to reformat the code.

* chore: Remove unnecessary `boost::system` requirement from conanfile (XRPLF#6290)

* chore: Add cmake-format pre-commit hook (XRPLF#6279)

This change adds `cmake-format` as a pre-commit hook. The style file closely matches that in Clio, and they will be made to be equivalent over time. For now, some files have been excluded, as those need some manual adjustments, which will be done in future changes.

* chore: Format all cmake files without comments (XRPLF#6294)

* ci: Update hashes of XRPLF/actions (XRPLF#6316)

This updates the hashes of all XRPLF/actions to their latest versions.

* chore: Add upper-case match for ARM64 in CompilationEnv (XRPLF#6315)

* fix: Restore config changes that broke standalone mode (XRPLF#6301)

When support was added for `xrpld.cfg` in addition to `rippled.cfg` in XRPLF#6098, as part of an effort to rename occurrences of ripple(d) to xrpl(d), the clearing and creation of the data directory were modified for what, at the time, seemed to result in an equivalent code flow. This has turned out to not be true, which is why this change restores two modifications to `Config.cpp` that currently break running the binary in standalone mode.

* docs: Update API changelog, add APIv2+APIv3 version documentation (XRPLF#6308)

This change cleans up the `API-CHANGELOG.md` file. It moves the version-specific documentation to other files and fleshes out the changelog with all the API-related changes in each version.

* chore: Add .zed editor config directory to .gitignore (XRPLF#6317)

This change adds the project configuration directory to `.gitignore` for the `zed` editor. 

As per the [documentation](https://zed.dev/docs/remote-development?highlight=.zed#zed-settings), the project configuration files are stored in the `.zed` directory at the project root dir.

* fix: Deletes expired NFToken offers from ledger (XRPLF#5707)

This change introduces the `fixExpiredNFTokenOfferRemoval` amendment that allows expired offers to pass through `preclaim()` and be deleted in `doApply()`, following the same pattern used for expired credentials.

* refactor: Add ServiceRegistry to help modularization (XRPLF#6222)

Currently we're passing the `Application` object around, whereby the `Application` class acts more like a service registry that gives other classes access to other services. In order to allow modularization, we should replace `Application` with a service registry class so that modules depending on `Application` for other services can be moved easily. This change adds the `ServiceRegistry` class.

* chore: Remove unity builds (XRPLF#6300)

Unity builds were intended to speed up builds, by bundling multiple files into compilation units. However, now that ccache is available on all platforms, there is no need for unity builds anymore, as ccache stores compiled individual build objects for reuse. This change therefore removes the ability to make unity builds.

* refactor: Replace include guards by '#pragma once' (XRPLF#6322)

This change replaces all include guards in the `src/` and `include/` directories by `#pragma once`.

* chore: Remove unnecessary script (XRPLF#6326)

* chore: Update secp256k1 and openssl (XRPLF#6327)

* fix typo in LendingHelpers unit-test (XRPLF#6215)

* fix: Increment sequence when accepting new manifests (XRPLF#6059)

The `ManifestCache::applyManifest` function was returning early without incrementing `seq_`. `OverlayImpl` uses this sequence to identify/invalidate a cached `TMManifests` message, which is exchanged with peers on connection. Depending on network size, startup sequencing, and topology, this can cause syncing issues. This change therefore increments `seq_` when a new manifest is accepted.

* refactor: Update secp256k1 to 0.7.1 (XRPLF#6331)

The latest secp256k1 release, 0.7.1, contains bug fixes that we may benefit from, see https://github.com/bitcoin-core/secp256k1/blob/master/CHANGELOG.md.

* chore: Restore unity builds (XRPLF#6328)

In certain cases, such as when modifying headers used by many compilation units, performing a unity build is slower than when performing a regular build with `ccache` enabled. There is also a benefit to a unity build in that it can detect things such as macro redefinitions within the group of files that are compiled together as a unit. This change therefore restores the ability to perform unity builds. However, instead of running every configuration with and without unity enabled, it is now only enabled for a single configuration to maintain lower computational use.

As part of restoring the code, it became clear that currently two configurations have coverage enabled: since the check doesn't specifically target Debian Bookworm, it also applies to Debian Trixie. This has been fixed in this change as well.

* perf: Remove unnecessary caches (XRPLF#5439)

This change removes the cache in `DatabaseNodeImp` and simplifies the caching logic in `SHAMapStoreImp`. As NuDB and RocksDB already maintain caches internally, the additional caches in the code add little value or may even be unnecessary, as preliminary performance analyses also confirmed.

* chore: Remove CODEOWNERS (XRPLF#6337)

* test: Add file and line location to Env (XRPLF#6276)

This change uses `std::source_location` to output the file and line location of the call that triggered a failed transaction.

* refactor: Fix spelling issues in tests (XRPLF#6199)

This change removes the `src/tests` exception from the `cspell` config and fixes all the issues that arise as a result. No functionality/test change.

* refactor: Change main thread name to `xrpld-main` (XRPLF#6336)

This change builds on the thread-renaming PR (XRPLF#6212), by renaming the main thread name to reduce ambiguity in performance monitoring tools.

* fix: Update invariant checks for Permissioned Domains (XRPLF#6134)

* refactor: Modularize WalletDB and Manifest (XRPLF#6223)

This change modularizes `WalletDB` and `Manifest`. Note that the wallet DB has nothing to do with account wallets; it stores node configuration, which is why it depends on the manifest code.

* refactor: Modularize RelationalDB (XRPLF#6224)

The rdb module was not properly designed, which is fixed in this change. The module had three classes:
1) The abstract class `RelationalDB`.
2) The abstract class `SQLiteDatabase`, which inherited from `RelationalDB` and added some pure virtual methods.
3) The concrete class `SQLiteDatabaseImp`, which inherited from `SQLiteDatabase` and implemented all methods.

The updated code simplifies this as follows:
* The `SQLiteDatabaseImp` has become `SQLiteDatabase`, and
* The former `SQLiteDatabase` has merged with `RelationalDatabase`.

* chore: Fix `gcov` lib coverage build failure on macOS (XRPLF#6350)

For coverage builds, we try to link against the `gcov` library (specific to the environment). But since macOS doesn't have this library (nor the coverage tools to generate reports), the coverage builds on that platform were failing at the link step.

We actually don't need to explicitly force this linking, as the `CodeCoverage` file already has correct detection logic (currently on lines 177-193), which is invoked when the `--coverage` flag is provided:
* AppleClang: Uses `xcrun -f llvm-cov` to set `GCOV_TOOL="llvm-cov gcov"`.
* Clang: Finds `llvm-cov` to set `GCOV_TOOL="llvm-cov gcov"`.
* GCC: Finds `gcov` to set `GCOV_TOOL="gcov"`.
The `GCOV_TOOL` is then passed to `gcovr` on line 416, so the correct tool is used for processing coverage data.

This change therefore removes the `gcov` suffix from lines 473 and 475 in the `CodeCoverage.cmake` file.
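
The per-compiler detection described above can be sketched in CMake as follows (variable names here are illustrative, not copied from `CodeCoverage.cmake`):

```cmake
# Sketch of the detection logic: pick the right gcov-compatible tool
# per compiler, then hand it to gcovr. No explicit link against a gcov
# library is needed for this to work.
if(CMAKE_CXX_COMPILER_ID STREQUAL "AppleClang")
  execute_process(COMMAND xcrun -f llvm-cov
                  OUTPUT_VARIABLE LLVM_COV
                  OUTPUT_STRIP_TRAILING_WHITESPACE)
  set(GCOV_TOOL "${LLVM_COV} gcov")
elseif(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
  find_program(LLVM_COV llvm-cov)
  set(GCOV_TOOL "${LLVM_COV} gcov")
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
  find_program(GCOV_BIN gcov)
  set(GCOV_TOOL "${GCOV_BIN}")
endif()
# GCOV_TOOL is later passed to gcovr for processing coverage data.
```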

* refactor: Modularize the NetworkOPs interface (XRPLF#6225)

This change moves the NetworkOPs interface into `libxrpl`, leaving its implementation in `xrpld`.

* chore: Fix minor issues in comments (XRPLF#6346)

* refactor: Modularize `HashRouter`, `Conditions`, and `OrderBookDB` (XRPLF#6226)

This change modularizes additional components by moving code to `libxrpl`.

* chore: Update clang-format to 21.1.8 (XRPLF#6352)

* refactor: Decouple app/tx from `Application` and `Config` (XRPLF#6227)

This change decouples app/tx from `Application` and `Config` to clear the way to moving transactors to `libxrpl`.

* refactor: Modularize app/tx (XRPLF#6228)

* ci: Add clang tidy workflow to ci (XRPLF#6369)

* chore: Set cmake-format width to 100 (XRPLF#6386)

* chore: Set clang-format width to 100 in config file (XRPLF#6387)

* chore: Apply clang-format width 100 (XRPLF#6387)

* Fix tautological assertion (XRPLF#6393)

* ci: Add dependabot config (XRPLF#6379)

* ci: Build docs in PRs and in private repos (XRPLF#6400)

* ci: [DEPENDABOT] bump codecov/codecov-action from 5.4.3 to 5.5.2 (XRPLF#6398)

Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5.4.3 to 5.5.2.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@18283e0...671740a)

---
updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-version: 5.5.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix merge conflicts, format, spelling,
replace all include guards by #pragma once

* Fix non-unity build

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Ayaz Salikhov <mathbunnyru@users.noreply.github.com>
Co-authored-by: Ed Hennis <ed@ripple.com>
Co-authored-by: Bart <bthomee@users.noreply.github.com>
Co-authored-by: Mayukha Vadari <mvadari@ripple.com>
Co-authored-by: Vito Tumas <5780819+Tapanito@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Jingchen <a1q123456@users.noreply.github.com>
Co-authored-by: Niq Dudfield <ndudfield@gmail.com>
Co-authored-by: Valentin Balaschenko <13349202+vlntb@users.noreply.github.com>
Co-authored-by: Olek <115580134+oleks-rip@users.noreply.github.com>
Co-authored-by: Pratik Mankawde <3397372+pratikmankawde@users.noreply.github.com>
Co-authored-by: nuxtreact <nuxtreact@outlook.com>
Co-authored-by: Sergey Kuznetsov <skuznetsov@ripple.com>
Co-authored-by: Ayaz Salikhov <asalikhov@ripple.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@bthomee (Collaborator, Author) commented Mar 20, 2026:

/ai-review

@xrplf-ai-reviewer bot left a comment:

Two null pointer dereference risks identified in cache operations.

Review by Claude Opus 4.6 · Prompt: V12

```cpp
// Also clear the FullBelowCache so its generation counter is bumped.
// This prevents stale "full below" markers from persisting across
// backend rotation/online deletion and interfering with SHAMap sync.
app_.getNodeFamily().getFullBelowCache()->clear();
```
Null pointer risk: Guard cache access before clear() operation:

```cpp
if (freshenCache(*treeNodeCache_))
    return;
if (freshenCache(app_.getMasterTransaction().getCache()))
if (freshenCache(*app_.getNodeFamily().getTreeNodeCache()))
```
Null pointer dereference risk in cache access - add safety check:


Labels: QE test required (RippleX QE Team must look at this PR), Triaged (Issue/PR has been triaged for viability, liveliness, etc.)
