Add Caching for RepositoryData in BlobStoreRepository#52341
Add Caching for RepositoryData in BlobStoreRepository#52341original-brownbear merged 13 commits intoelastic:masterfrom original-brownbear:cache-latest-repository-data
Conversation
|
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
| public class MockSecureSettings implements SecureSettings { | ||
|
|
||
| private Map<String, SecureString> secureStrings = new HashMap<>(); | ||
| private Map<String, String> secureStrings = new HashMap<>(); |
There was a problem hiding this comment.
Just fixing a bug here that prevented node restarts in tests for nodes that used secure settings. If we always return the same SecureString for getString then if a node ever closes that SecureString it couldn't get the setting again.
| containsString("Could not read repository data because the contents of the repository do not match its expected state.")); | ||
| } | ||
|
|
||
| private void fullRestart() throws Exception { |
There was a problem hiding this comment.
This is admittedly pretty dirty now test-wise. Obviously, it also illustrates that this change slightly weakens our ability to detect concurrent repository writes from multiple clusters.
IMO, this is a fair trade-off though and one we've already been making over and over in other spots (anytime we optimised away a loading of RepositoryData we weakened the ability to detect concurrent load obviously).
|
|
||
| @Override | ||
| public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) { | ||
| cacheRepositoryData(filteredRepositoryData.withGenId(newGen)); |
There was a problem hiding this comment.
This is a little dirty and we talked about it in other PRs .. the generation here is -1 due to the weird way the repository data is loaded initially and we have to eventually set it. I'll clean that up in a follow up, for now just setting it here where it matters (it doesn't matter in the above code when serializing filteredRepositoryData to the repo) seemed the driest.
There was a problem hiding this comment.
Not sure what to even put in it though. In the end, this is simply the way RepositoryData works for now. We have the same kind of code for the cluster state version as well I guess. I don't have a good idea for a better abstract yet :(
| public IndexId(final String name, final String id) { | ||
| this.name = name; | ||
| this.id = id; | ||
| this.name = name.intern(); |
There was a problem hiding this comment.
Moving to interning here and for the snapshot id actually makes the on heap RepositoryData about the same size as the serialized one (unfortunately for the time being we serialize it as json uncompressed) => I think the safety check for 500kb I added is valid.
There was a problem hiding this comment.
I would prefer to avoid JVM String interning. If we can deduplicate the Strings ourselves, that's fine, but adding more and more stuff to be internalized feels dangerous (another source of OOM).
| try { | ||
| final int len = | ||
| BytesReference.bytes(updated.snapshotsToXContent(XContentFactory.jsonBuilder(), true)).length(); | ||
| if (len > ByteSizeUnit.KB.toBytes(500)) { |
There was a problem hiding this comment.
For context:
on heap 1k shards in a snapshot means about 25kb of shard generations which themselves make up ~50% of the size of RepositoryData on heap with the interning changes I added.
The number of snapshots doesn't matter much by comparison. So in the Cloud case of 100 snapshots, this would allow for caching snapshots of up to ~ 15k shards which should work fine for most users I'm assuming.
| public IndexId(final String name, final String id) { | ||
| this.name = name; | ||
| this.id = id; | ||
| this.name = name.intern(); |
There was a problem hiding this comment.
I would prefer to avoid JVM String interning. If we can deduplicate the Strings ourselves, that's fine, but adding more and more stuff to be internalized feels dangerous (another source of OOM).
| private RepositoryData safeRepositoryData(long repositoryStateId, Map<String, BlobMetaData> rootBlobs) { | ||
| final long generation = latestGeneration(rootBlobs.keySet()); | ||
| final long genToLoad; | ||
| final RepositoryData cached; |
There was a problem hiding this comment.
maybe we can just cache the serialized (and compressed) bytes instead of the object. Decompressing should still be fast.
There was a problem hiding this comment.
Yea I experimented a little with this and it's really sweet so I moved to that now. The compression ration is massive (more than a factor of 10x for 100 snapshots with 1k shards each) and I can hardly imagine a repo that could break 500kb for the cached size (100 snapshots of 1k shards is only ~20kb and adding additional snapshots is cheap as well since each new snapshot effectively just adds that snapshots uuid + name if it doesn't contain new shards).
| return; | ||
| } | ||
| } catch (IOException e) { | ||
| throw new AssertionError("Impossible, no IO happens here", e); |
There was a problem hiding this comment.
If the node can't serialize the RepositoryData, it would die here. Perhaps just catch and log (while having an assert as well for our tests)
There was a problem hiding this comment.
Sure thing, though I figured this is actually impossible because the same code must have serialized that repository data to even make it visible here (since we only cache what we previously wrote to the repo).
| if (bestEffortConsistency == false) { | ||
| try { | ||
| final int len = | ||
| BytesReference.bytes(updated.snapshotsToXContent(XContentFactory.jsonBuilder(), true)).length(); |
There was a problem hiding this comment.
Maybe make RepositoryData implements Accountable? Should we craft an estimating function instead of serializing the whole RepositoryData back again? If we were serializing back to XContent we could pass a FilterOutputStream to the XContentBuilder so that it just count bytes and not build everything in heap.
Or if we follow Yannick's suggestion on caching the serialized and compressed bytes then maybe we should keep track of the length of the original blob.
There was a problem hiding this comment.
Jup, went with Yannicks approach now :)
|
|
||
| @Override | ||
| public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) { | ||
| cacheRepositoryData(filteredRepositoryData.withGenId(newGen)); |
ywelsch
left a comment
There was a problem hiding this comment.
A few more small comments. Looking very good already though.
| } | ||
| }))); | ||
|
|
||
| // Master failover to clear RepositoryData cache |
There was a problem hiding this comment.
I think I would prefer an explicit undocumented setting to disable the cache. This might also turn out to be useful if we see any issues with this new functionality
There was a problem hiding this comment.
Yea that's much nicer indeed :) added that setting now and used it in tests.
| throw new RepositoryException(metadata.name(), "concurrent modification of the index-N file, expected current generation [" + | ||
| repositoryStateId + "], actual current generation [" + genToLoad + "]"); | ||
| } | ||
| if (cached != null && cached.v1() == genToLoad && cached.v2() != null) { |
There was a problem hiding this comment.
Should cached.v2() always be non-null if cached != null?
There was a problem hiding this comment.
Jup now it is.
|
|
||
| // Best effort cache of the latest known repository data and its generation, cached serialized as compressed json | ||
| private final AtomicReference<Tuple<Long, BytesReference>> latestKnownRepositoryData = | ||
| new AtomicReference<>(new Tuple<>(RepositoryData.EMPTY_REPO_GEN, null)); |
There was a problem hiding this comment.
perhaps just initialize to null?
| " repository behavior going forward.", metadata.name()); | ||
| } | ||
| // Set empty repository data to not waste heap for an outdated cached value | ||
| latestKnownRepositoryData.set(new Tuple<>(RepositoryData.EMPTY_REPO_GEN, null)); |
| } | ||
|
|
||
| private RepositoryData repositoryDataFromCachedEntry(Tuple<Long, BytesReference> cacheEntry) throws IOException { | ||
| if (cacheEntry.v1() == RepositoryData.EMPTY_REPO_GEN) { |
There was a problem hiding this comment.
I think this optimization is unnecessary, and complicating the checks on null (see my suggestions above)
|
Thanks Yannick, all addressed in 533b4f8 now I hope :) |
|
Thanks Yannick + Tanguy! |
Cache latest `RepositoryData` on heap when it's absolutely safe to do so (i.e. when the repository is in strictly consistent mode). `RepositoryData` can safely be assumed to not grow to a size that would cause trouble because we often have at least two copies of it loaded at the same time when doing repository operations. Also, concurrent snapshot API status requests currently load it independently of each other and so on, making it safe to cache on heap and assume as "small" IMO. The benefits of this move are: * Much faster repository status API calls * listing all snapshot names becomes instant * Other operations are sped up massively too because they mostly operate in two steps: load repository data then load multiple other blobs to get the additional data * Additional cloud cost savings * Better resiliency, saving another spot where an IO issue could break the snapshot * We can simplify a number of spots in the current code that currently pass around the repository data in tricky ways to avoid loading it multiple times in follow ups.
Cache latest
RepositoryDataon heap when it's absolutely safe to do so (i.e. when the repository is in strictly consistent mode).RepositoryDatacan safely be assumed to not grow to a size that would cause trouble because we often have at least two copies of it loaded at the same time when doing repository operations. Also, concurrent snapshot API status requests currently load it independently of each other and so on, making it safe to cache on heap and assume as "small" IMO.The benefits of this move are:
I know we are thinking about caching other repository metadata in an internal index, but I think this blob is better served from heap since it's so performance critical (this is irrelevant now to some degree but more interesting for concurrent repository operations).