Skip to content

Reuse MappingMetadata instances in Metadata class.#80348

Merged
martijnvg merged 33 commits intoelastic:masterfrom
martijnvg:reuse_mapping_metadata
Nov 25, 2021
Merged

Reuse MappingMetadata instances in Metadata class.#80348
martijnvg merged 33 commits intoelastic:masterfrom
martijnvg:reuse_mapping_metadata

Conversation

@martijnvg
Copy link
Copy Markdown
Member

@martijnvg martijnvg commented Nov 4, 2021

Hash the mapping source of a MappingMetadata instance and then
cache it in Metadata class. A mapping with the same hash
will use a cached MappingMetadata instance. This can
significantly reduce the number of MappingMetadata instances
for data streams and index patterns.

Idea originated from #69772, but just focusses on the jvm heap memory savings.
And hashes the mapping instead of assigning it an uuid.

Relates to #77466

Hash the mapping source of a MappingMetadata instance and then
cache it in Metadata class. A mapping with the same hash
will use a cached MappingMetadata instance. This can
significantly reduce the number of MappingMetadata instances
for data streams and index patterns.

Relates to elastic#77466
@martijnvg
Copy link
Copy Markdown
Member Author

Locally I have been testing this improvement with the script at the end of this comment. This script creates a data stream and rolls it over concurrently (1000 rollovers). With this change, the number of MappingMetadata instances is 3 (running the script in en empty cluster) and without this change over 2000 instances.

import threading
from elasticsearch import Elasticsearch

ds_base_name = 'logs-mysql-'
es = Elasticsearch(['http://localhost:9200'], http_auth=('elastic-admin', 'elastic-password'), timeout=600)

def rollover_test(ds_name):
    es.indices.delete_data_stream(name=ds_name, ignore=[400, 404])
    es.indices.create_data_stream(name=ds_name)
    for i in range(200):
        es.indices.rollover(alias=ds_name,wait_for_active_shards=0)


if __name__ == "__main__":
    es.cluster.put_settings(body={"persistent": {"cluster.max_shards_per_node": 100000}})
    es.indices.put_index_template(name='my-template', body={
        "index_patterns": ["logs-*-*"],
        "priority": 200,
        "data_stream": {},
        "composed_of": [
            "logs-mappings",
            "data-streams-mappings"
        ],
        "template": {
            "settings": {
                "index.number_of_replicas": 0
            }
        },
        "allow_auto_create": True
    })

    for i in range(5):
        t = threading.Thread(target=rollover_test, args=(ds_base_name + str(i),))
        t.start()

@martijnvg martijnvg force-pushed the reuse_mapping_metadata branch from c5b17aa to bbd688a Compare November 5, 2021 16:44
@martijnvg martijnvg marked this pull request as ready for review November 8, 2021 08:24
@martijnvg martijnvg added :Data Management/Indices APIs DO NOT USE. Use ":Distributed/Indices APIs" or ":StorageEngine/Templates" instead. v8.0.0 labels Nov 8, 2021
@elasticmachine elasticmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Nov 8, 2021
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@martijnvg martijnvg added >enhancement and removed Team:Data Management (obsolete) DO NOT USE. This team no longer exists. labels Nov 8, 2021
}
}

private void cleanupUnusedEntry(IndexMetadata previous) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the cache is passed down to new Metadata instances, a cleanup mechanism is needed.
This approach with counting the times a MappignMetata instance is used and removing entries when there are no usages seems cheaper compared to checking each metadata is build whether all entries are used and then removed the unused entries. But this approach is also more complex. I'm also okay to change the cleanup logic to be simpler and checking each cache entry whether it is used in any of the indices and if unused remove.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the simpler version doesn't come with problematic runtime overhead it might be fine? It seems like it should be cheap now that we can cheaply compare mappings?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is true. Also I find this implementation a bit fragile. I will change the cleanup implementation to just check for each cache entry whether it is used by any IndexMetadata.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented: 15b05af

Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Martijn, this looks really nice and should be pretty impactful. I just have a couple of questions, suggestions on the plumbing around this mostly :) Happy to discuss this on another channel if that's quicker as well.

String mappingAsString = Strings.toString((builder, params) -> builder.mapContents(mapping));
try {
this.source = new CompressedXContent((builder, params) -> builder.mapContents(mapping));
this.source = new CompressedXContent(mappingAsString);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use this.source = new CompressedXContent((builder, params) -> builder.mapContents(mapping)); here still and compute the digest as we compress for example since we stream the bytes in that case anyway?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On that note, I wonder if we should just make the Sha256 value a part of the CompressedXContent? I see how we do some magic here to make sure we build a deterministic version based on a map, but we could have that by just adding a static constructor to this thing? For BwC we can compute the CRC or SHA1 dynamically where/if needed I guess.
Also, that would allow us to make the equals in CompressedXContent almost free by just turning it into a string comparison :) WDYT?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea and in fact I did think about this while working on this. I will try this out and see whether it is not too difficult.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the first part of this: c770523
I will work on making the sha256 really part of of CompressedXContent, just like crc32 checksum.
Just to double check, do we want to keep the crc32 checksum? I think so.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double check, do we want to keep the crc32 checksum? I think so.

Not sure tbh. It seems we're only using it to speed up comparisons and as a hashCode. Both of which we can have from the sha256 value can't we?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I thought there was another usage of the crc32, but it only used for hashcode and checking consistency in assertions. Both can be done with sha256 too. I will replace crc32 completely with sha256.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed: 1961453

Map<String, Object> routingNode = (Map<String, Object>) withoutType.get("_routing");
for (Map.Entry<String, Object> entry : routingNode.entrySet()) {
String fieldName = entry.getKey();
Map<?, ?> routingNode = (Map<?, ?>) withoutType.get("_routing");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we have String keys here anymore?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit annoyed with the SuppressWarnings("unchecked") and so removed it. I can undo this change and make this change in another pr.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, this one is a little too impactful to have additional noise in it I think :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will undo this bit here in the pr.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this change: de25c02

}
}

private void cleanupUnusedEntry(IndexMetadata previous) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the simpler version doesn't come with problematic runtime overhead it might be fine? It seems like it should be cheap now that we can cheaply compare mappings?

@@ -43,43 +43,43 @@ public final class CompressedXContent {
private static final ThreadLocal<InflaterAndBuffer> inflater1 = ThreadLocal.withInitial(InflaterAndBuffer::new);
private static final ThreadLocal<InflaterAndBuffer> inflater2 = ThreadLocal.withInitial(InflaterAndBuffer::new);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martijnvg I just realized, we can probably remove all of this infrastructure here now that we have efficient equals? I added this only to speed up comparison a while back, but now it seems irrelevant and can be simplified?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've removed inflater2 and equalsWhenUncompressed() method: c05791c

inflater1 and a few other things are still in use by production code.

@martijnvg
Copy link
Copy Markdown
Member Author

@elasticmachine run elasticsearch-ci/part-2

Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I think I'd like the index-metadata constructor to be set up differently just to enable follow-ups but it's your call, I can deal with that myself in a follow-up when I work on it again :)
Thanks so much for the iterations on this one!

assert numberOfShards * routingFactor == routingNumShards : routingNumShards + " must be a multiple of " + numberOfShards;
}

IndexMetadata(IndexMetadata indexMetadata, MappingMetadata mapping) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we maybe make this just a method:

IndexMetadata withMappingMetadata(MappingMetadata mapping) to give us an obvious copy constructor. I also wonder, even though it's slightly more code, maybe it's better if this would just invoke the existing constructor so we don't have two constructors to maintain for this thing (given how additional fields are incoming from settings moving to fields in this class)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I agree this is cleaner. I've made this change: 7d09282


private void purgeMappingCache(ImmutableOpenMap<String, IndexMetadata> indices) {
final var iterator = cache.entrySet().iterator();
while (iterator.hasNext()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ perfect

@martijnvg martijnvg merged commit c67b470 into elastic:master Nov 25, 2021
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💔 Backport failed

Status Branch Result
8.0 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 80348

martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Nov 25, 2021
Backporting elastic#80348 to 8.0 branch.

Hash the mapping source of a MappingMetadata instance and then
cache it in Metadata class. A mapping with the same hash
will use a cached MappingMetadata instance. This can
significantly reduce the number of MappingMetadata instances
for data streams and index patterns.

Idea originated from elastic#69772, but just focusses on the jvm heap memory savings.
And hashes the mapping instead of assigning it an uuid.

Relates to elastic#77466
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Nov 25, 2021
martijnvg added a commit that referenced this pull request Nov 25, 2021
Backporting #80348 to 8.0 branch.

Hash the mapping source of a MappingMetadata instance and then
cache it in Metadata class. A mapping with the same hash
will use a cached MappingMetadata instance. This can
significantly reduce the number of MappingMetadata instances
for data streams and index patterns.

Idea originated from #69772, but just focusses on the jvm heap memory savings.
And hashes the mapping instead of assigning it an uuid.

Relates to #77466
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Nov 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Data Management/Indices APIs DO NOT USE. Use ":Distributed/Indices APIs" or ":StorageEngine/Templates" instead. >enhancement v8.0.0-rc1 v8.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants