[BUG] Inefficient approach for checking whether an unassigned shard is from a batch being processed #14532

@SwethaGuptha

Description

Describe the bug

Problem:

PrimaryShardBatchAllocator: at https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/gateway/PrimaryShardBatchAllocator.java#L107, a List.contains check against the list of unassigned shards in the batch is performed for every unassigned shard in the cluster. Each call is a linear scan of the batch list, so the check is expensive when both collections are large.

Identified regression

Primary allocations:

`---ts=2024-06-25 09:25:40;thread_name=opensearch[cce76896e008802de8384c1a0cea2d6a][clusterManagerService#updateTask][T#1];id=291;is_daemon=true;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@531d72ca
    `---[19021.204849ms] org.opensearch.gateway.PrimaryShardBatchAllocator:allocateUnassignedBatch()
        +---[0.00% 0.001395ms ] java.util.List:size() #85
        +---[0.00% 9.6E-4ms ] org.apache.logging.log4j.Logger:debug() #85
        +---[0.00% 6.4E-4ms ] java.util.HashMap:<init>() #86
        +---[0.00% 5.5E-4ms ] java.util.ArrayList:<init>() #87
        +---[0.00% 5.17E-4ms ] java.util.ArrayList:<init>() #88
        +---[0.00% 5.83E-4ms ] java.util.List:iterator() #90
        +---[0.01% min=4.02E-4ms,max=7.47E-4ms,total=1.672409ms,count=4001] java.util.Iterator:hasNext() #90
        +---[0.01% min=4.34E-4ms,max=0.008025ms,total=1.825077ms,count=4000] java.util.Iterator:next() #90
        +---[0.02% min=7.46E-4ms,max=0.001255ms,total=3.144428ms,count=4000] org.opensearch.gateway.PrimaryShardBatchAllocator:getInEligibleShardDecision() #91
        +---[0.01% min=4.26E-4ms,max=0.012168ms,total=1.787079ms,count=4000] java.util.List:add() #96
        +---[0.01% 1.454972ms ] org.opensearch.gateway.PrimaryShardBatchAllocator:fetchData() #101
        +---[0.00% 9.27E-4ms ] org.opensearch.cluster.routing.allocation.RoutingAllocation:routingNodes() #103
        +---[0.00% 7.63E-4ms ] org.opensearch.cluster.routing.RoutingNodes:unassigned() #103
        +---[0.00% 0.003914ms ] org.opensearch.cluster.routing.RoutingNodes$UnassignedShards:iterator() #103
        +---[1.13% min=5.08E-4ms,max=0.019036ms,total=214.624591ms,count=382367] org.opensearch.cluster.routing.RoutingNodes$UnassignedShards$UnassignedIterator:hasNext() #104
        +---[1.41% min=5.74E-4ms,max=0.01646ms,total=268.380958ms,count=382366] org.opensearch.cluster.routing.RoutingNodes$UnassignedShards$UnassignedIterator:next() #105
        +---[92.38% min=5.09E-4ms,max=12.265055ms,total=17572.545705ms,count=382366] java.util.List:contains() #108
        +---[0.01% min=5.17E-4ms,max=9.19E-4ms,total=2.21695ms,count=4000] org.opensearch.cluster.routing.ShardRouting:shardId() #110
        +---[0.01% min=4.51E-4ms,max=7.63E-4ms,total=1.905109ms,count=4000] java.util.HashMap:containsKey() #110
        +---[0.02% min=6.89E-4ms,max=0.001904ms,total=3.423754ms,count=4000] org.opensearch.gateway.PrimaryShardBatchAllocator:adaptToNodeShardStates() #113
        +---[0.03% min=0.001042ms,max=0.01842ms,total=6.486992ms,count=4000] org.opensearch.gateway.PrimaryShardBatchAllocator:getAllocationDecision() #114
        +---[0.72% min=0.001321ms,max=0.074331ms,total=137.576431ms,count=4000] org.opensearch.gateway.PrimaryShardBatchAllocator:executeDecision() #116
        +---[0.00% 6.15E-4ms ] java.util.List:size() #119
        `---[0.00% 7.06E-4ms ] org.apache.logging.log4j.Logger:debug() #119
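
To put the primary trace above in perspective, the following sketch works through the arithmetic. The call count and total time are copied from the `List:contains()` line of the trace; the batch size of 4,000 is inferred from the `List:add()` count and is an assumption about how many elements each `contains` call may scan. The class and method names are hypothetical, and the comparison figure is a worst-case upper bound, not a measurement.

```java
public class ContainsCostEstimate {

    // Upper bound on equals() comparisons if every contains() call scans the whole batch list.
    static long worstCaseComparisons(long containsCalls, long batchSize) {
        return containsCalls * batchSize;
    }

    // Mean wall-clock time per contains() call, in milliseconds.
    static double averageMillisPerCall(double totalMillis, long calls) {
        return totalMillis / calls;
    }

    public static void main(String[] args) {
        long containsCalls = 382_366L;      // count from the List:contains() trace line
        long batchSize = 4_000L;            // assumed: inferred from the List:add() count
        double totalMillis = 17_572.545705; // total from the List:contains() trace line

        System.out.println("Worst-case comparisons: " + worstCaseComparisons(containsCalls, batchSize));
        System.out.printf("Average ms per contains() call: %.4f%n", averageMillisPerCall(totalMillis, containsCalls));
    }
}
```

Under these assumptions a single batch can trigger on the order of 1.5 billion comparisons, which is consistent with `List:contains()` accounting for 92.38% of the 19-second call.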

Replica allocations:

`---ts=2024-06-25 09:37:12;thread_name=opensearch[cce76896e008802de8384c1a0cea2d6a][clusterManagerService#updateTask][T#1];id=291;is_daemon=true;priority=5;TCCL=jdk.internal.loader.ClassLoaders$AppClassLoader@531d72ca
    `---[97.588697ms] org.opensearch.gateway.ReplicaShardBatchAllocator:allocateUnassignedBatch()
        +---[0.00% 0.001641ms ] java.util.List:size() #120
        +---[0.00% 7.38E-4ms ] org.apache.logging.log4j.Logger:debug() #120
        +---[0.00% 6.15E-4ms ] java.util.ArrayList:<init>() #121
        +---[0.00% 5.17E-4ms ] java.util.ArrayList:<init>() #122
        +---[0.00% 6.73E-4ms ] java.util.HashMap:<init>() #123
        +---[0.00% 6.73E-4ms ] java.util.List:iterator() #125
        +---[1.76% min=4.1E-4ms,max=8.86E-4ms,total=1.721501ms,count=4001] java.util.Iterator:hasNext() #125
        +---[1.85% min=4.26E-4ms,max=9.52E-4ms,total=1.806896ms,count=4000] java.util.Iterator:next() #125
        +---[3.59% min=8.45E-4ms,max=0.006646ms,total=3.505268ms,count=4000] org.opensearch.gateway.ReplicaShardBatchAllocator:getUnassignedShardAllocationDecision() #126
        +---[1.93% min=4.43E-4ms,max=0.022982ms,total=1.87881ms,count=4000] java.util.List:add() #129
        +---[2.79% min=5.16E-4ms,max=0.149802ms,total=2.718945ms,count=4000] java.util.Map:put() #130
        +---[3.18% 3.104627ms ] org.opensearch.gateway.ReplicaShardBatchAllocator:fetchData() #137
        +---[0.00% 0.001855ms ] java.util.List:stream() #139
        +---[0.00% 8.12E-4ms ] java.util.stream.Stream:map() #139
        +---[0.00% 5.66E-4ms ] java.util.stream.Collectors:toList() #139
        +---[0.21% 0.201313ms ] java.util.stream.Stream:collect() #139
        +---[0.00% 7.22E-4ms ] org.opensearch.cluster.routing.allocation.RoutingAllocation:routingNodes() #140
        +---[0.00% 7.3E-4ms ] org.opensearch.cluster.routing.RoutingNodes:unassigned() #140
        +---[0.00% 0.003873ms ] org.opensearch.cluster.routing.RoutingNodes$UnassignedShards:iterator() #140
        +---[2.20% min=5.08E-4ms,max=0.009994ms,total=2.145579ms,count=4001] org.opensearch.cluster.routing.RoutingNodes$UnassignedShards$UnassignedIterator:hasNext() #141
        +---[2.71% min=5.83E-4ms,max=0.010535ms,total=2.6432ms,count=4000] org.opensearch.cluster.routing.RoutingNodes$UnassignedShards$UnassignedIterator:next() #142
        +---[2.08% min=4.59E-4ms,max=0.015689ms,total=2.029472ms,count=4000] org.opensearch.cluster.routing.ShardRouting:primary() #146
        +---[2.21% min=5.08E-4ms,max=0.012275ms,total=2.1536ms,count=4000] org.opensearch.cluster.routing.ShardRouting:shardId() #146
        +---[30.93% min=4.51E-4ms,max=0.033813ms,total=30.187663ms,count=4000] java.util.List:contains() #146
        +---[2.56% min=4.76E-4ms,max=0.014458ms,total=2.499516ms,count=4000] java.util.Map:containsKey() #148
        +---[2.08% min=4.75E-4ms,max=0.013211ms,total=2.026379ms,count=4000] java.util.Map:get() #149
        +---[5.86% min=9.84E-4ms,max=0.012866ms,total=5.714058ms,count=4000] org.opensearch.gateway.ReplicaShardBatchAllocator:executeDecision() #160
        +---[0.00% 7.3E-4ms ] java.util.List:size() #163
        `---[0.00% 0.001846ms ] org.apache.logging.log4j.Logger:debug() #163

Related component

Cluster Manager

To Reproduce

N/A

Expected behavior

Use a HashSet instead of a List for the batch membership check, so that each lookup takes constant time on average rather than a linear scan.
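
A minimal sketch of the suggested change, not the actual OpenSearch code: the class, method names, and the use of plain strings as stand-ins for shard routings are all hypothetical. It assumes the element type implements `equals` and `hashCode` consistently, which a hash-based lookup requires.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BatchMembershipSketch {

    // Pattern described in the issue: one List.contains() per unassigned shard.
    // Each call scans the batch list, so the total cost is O(n * m).
    static <T> List<T> filterWithList(Iterable<T> allUnassigned, List<T> batch) {
        List<T> inBatch = new ArrayList<>();
        for (T shard : allUnassigned) {
            if (batch.contains(shard)) {
                inBatch.add(shard);
            }
        }
        return inBatch;
    }

    // Suggested pattern: copy the batch into a HashSet once, then look up each
    // unassigned shard in expected constant time, for O(n + m) overall.
    static <T> List<T> filterWithSet(Iterable<T> allUnassigned, List<T> batch) {
        Set<T> batchSet = new HashSet<>(batch);
        List<T> inBatch = new ArrayList<>();
        for (T shard : allUnassigned) {
            if (batchSet.contains(shard)) {
                inBatch.add(shard);
            }
        }
        return inBatch;
    }

    public static void main(String[] args) {
        // Strings stand in for shard routings in this illustration.
        List<String> batch = List.of("logs[0]", "logs[1]");
        List<String> allUnassigned = List.of("logs[0]", "metrics[0]", "logs[1]", "metrics[1]");

        // Both variants select the same shards; only the lookup cost differs.
        System.out.println(filterWithList(allUnassigned, batch));
        System.out.println(filterWithSet(allUnassigned, batch));
    }
}
```

The set is built once per batch, so its construction cost is paid a single time instead of on every iteration over the unassigned shards.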

Additional Details

OpenSearch version: 2.13
Please list all plugins currently enabled.

Labels

Cluster Manager, bug, v2.16.0
