Deduplicate mappings by probakowski · Pull Request #69772 · elastic/elasticsearch

probakowski · 2021-03-02T09:23:46Z

This change modifies cluster state to avoid duplication of mappings between indices. This helps when many indices share the same large mappings. It should lower memory consumption and speed up cluster state updates, as there will be fewer things to write each time. Deduplication is done on two levels:

at runtime, by adding a cache to org.elasticsearch.cluster.metadata.Metadata.Builder that makes sure identical mappings use the same instance
on disk, by storing mapping metadata separately from index metadata when saving cluster state (index metadata stores only the mapping id)
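
The runtime level can be pictured with a small canonicalizing cache. This is only a sketch of the general technique, not the code in this PR: SketchMapping and SketchMappingCache are hypothetical stand-ins, and the real MappingMetadata and Metadata.Builder classes are not used here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a mapping; equality is based on the mapping source.
final class SketchMapping {
    private final String source;

    SketchMapping(String source) {
        this.source = source;
    }

    String source() {
        return source;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SketchMapping && ((SketchMapping) o).source.equals(source);
    }

    @Override
    public int hashCode() {
        return source.hashCode();
    }
}

// Hypothetical builder-side cache: hands back one canonical instance per distinct mapping.
final class SketchMappingCache {
    private final Map<SketchMapping, SketchMapping> canonical = new HashMap<>();

    SketchMapping dedup(SketchMapping mapping) {
        // putIfAbsent returns the previously stored instance, or null if this one is new.
        SketchMapping existing = canonical.putIfAbsent(mapping, mapping);
        return existing != null ? existing : mapping;
    }

    int distinctMappings() {
        return canonical.size();
    }
}
```

With 500 indices created from one template, a cache like this would hold a single entry and every index would reference the same object.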

Using the following test on master vs this branch, there is a 2.34x reduction in cluster state size (1508kB vs 644kB):

curl https://gist.githubusercontent.com/probakowski/f14623080d7e28e74e572f564db5bff3/raw/583b32091985a315304ad1cf7564242a82273973/auditbeat.json > auditbeat.json
curl -XPUT  -u elastic-admin:elastic-password 'http://localhost:9200/_template/auditbeat-7.8.0'  --header 'Content-Type: application/json' --data-binary "@auditbeat.json"
for i in $(seq 1 500); do curl -XPUT -u elastic-admin:elastic-password http://localhost:9200/auditbeat-$i; done

Possible changes/improvements:

store the mappings cache in Metadata instead of building it in Metadata.Builder so we don't have to rebuild it every time (it's a rather quick process, though)
move the id to CompressedXContent so we can deduplicate mappings in IndexMetadata and IndexTemplateMetadata (not a big gain if there are many indices using the same template)
deduplicate settings: the same idea may be applied to the index settings, which should produce another significant reduction in cluster state size

probakowski · 2021-03-02T12:09:53Z

@elasticmachine update branch

DaveCTurner

Just a few small comments on the cluster state persistence changes

DaveCTurner · 2021-03-02T11:11:50Z

server/src/main/java/org/elasticsearch/gateway/PersistedClusterStateService.java

    private static final String DATA_FIELD_NAME = "data";
    private static final String GLOBAL_TYPE_NAME = "global";
    private static final String INDEX_TYPE_NAME = "index";
+    private static final String MAPPING_TYPE_NAME = "mapping";


Please remember to update the comment at the top describing the new schema :)

DaveCTurner · 2021-03-02T11:12:23Z

server/src/main/java/org/elasticsearch/gateway/PersistedClusterStateService.java

+
+        Map<String, MappingMetadata> mappings = new HashMap<>();
+        consumeFromType(searcher, MAPPING_TYPE_NAME, bytes -> {
+            MappingMetadata mappingMetadata = MappingMetadata.fromXContent(XContentFactory.xContent(XContentType.SMILE)


Could we include trace logging for loading each mapping too?

DaveCTurner · 2021-03-02T11:13:53Z

server/src/main/java/org/elasticsearch/gateway/PersistedClusterStateService.java

            if (indexUUIDs.add(indexMetadata.getIndexUUID()) == false) {
                throw new IllegalStateException("duplicate metadata found for " + indexMetadata.getIndex() + " in [" + dataPath + "]");
            }
+            if (indexMetadata.mapping() != null && mappings.containsKey(indexMetadata.mapping().id())) {


Can we enforce some stronger invariant here? E.g. if the mapping has an ID then it must be in mappings? Maybe also that the rest of the mapping serialized with the index is empty?
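
One way to read the first suggested invariant, as a rough sketch only: when index metadata carries a mapping id, loading fails unless that id resolves in the separately stored mappings. SketchMappingResolver and its parameters are hypothetical, and mappings are plain strings here rather than MappingMetadata.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating a strict lookup of a mapping by id at load time.
final class SketchMappingResolver {

    // Returns the mapping for an index, or null when the index has no mapping at all.
    // A non-null id that cannot be resolved is treated as corrupt on-disk state.
    static String resolve(String indexName, String mappingId, Map<String, String> mappingsById) {
        if (mappingId == null) {
            return null;
        }
        String mapping = mappingsById.get(mappingId);
        if (mapping == null) {
            throw new IllegalStateException(
                "mapping [" + mappingId + "] referenced by index [" + indexName + "] was not found");
        }
        return mapping;
    }
}
```

Compared with a containsKey check that silently skips unknown ids, this variant surfaces a dangling reference as an error.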

nik9000 · 2021-03-05T16:47:47Z

This change modifies cluster state to avoid duplication of mappings between indices.

It looks like this transparently detects when the mappings are the same. From the "outside" this is invisible. This makes me quite happy.

Hash the mapping source of a MappingMetadata instance and then cache it in the Metadata class. A mapping with the same hash will use a cached MappingMetadata instance. This can significantly reduce the number of MappingMetadata instances for data streams and index patterns. The idea originated from #69772, but this change focuses only on the JVM heap memory savings and hashes the mapping instead of assigning it a UUID. Relates to #77466
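
The hash-based variant described above might look roughly like the following. It is a sketch under assumptions: the choice of SHA-256, the Base64 encoding of the digest, and the class name SketchHashedMappingCache are illustrative and are not taken from #80348.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache keyed by a digest of the mapping source instead of an assigned id.
final class SketchHashedMappingCache {
    private final Map<String, String> mappingsByHash = new HashMap<>();

    // Assumption: SHA-256 over the UTF-8 bytes of the source, Base64-encoded.
    static String hash(String mappingSource) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(mappingSource.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(bytes);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Returns the cached instance for this hash, storing the argument if it is the first one seen.
    String intern(String mappingSource) {
        return mappingsByHash.computeIfAbsent(hash(mappingSource), h -> mappingSource);
    }

    int size() {
        return mappingsByHash.size();
    }
}
```

Because the key is derived from the content, two nodes compute the same key for the same mapping without coordinating on an id.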

Backporting elastic#80348 to the 8.0 branch. Hash the mapping source of a MappingMetadata instance and then cache it in the Metadata class. A mapping with the same hash will use a cached MappingMetadata instance. This can significantly reduce the number of MappingMetadata instances for data streams and index patterns. The idea originated from elastic#69772, but this change focuses only on the JVM heap memory savings and hashes the mapping instead of assigning it a UUID. Relates to elastic#77466


probakowski added 9 commits February 23, 2021 12:13

dedup mappings

b01bdb2

id -> string

035ef22

fix delete

fe452bc

cleanup

f8d9186

cleanup

9679c40

fix mapping removal

761d9f9

fix NPE

091c318

fix loading

a345bc0

fix error

12f74d2

Merge branch 'master' into dedup_mappings

2ea22aa

DaveCTurner reviewed Mar 2, 2021


martijnvg mentioned this pull request Nov 5, 2021

Reuse MappingMetadata instances in Metadata class. #80348

Merged

martijnvg mentioned this pull request Nov 25, 2021

[8.0] Reuse MappingMetadata instances in Metadata class. #81036

Merged

probakowski closed this Jan 17, 2022

probakowski deleted the dedup_mappings branch January 17, 2022 19:35

probakowski wants to merge 10 commits into elastic:master from probakowski:dedup_mappings
