Autoscaling reactive storage decider #65520

Merged
henningandersen merged 40 commits into elastic:master from
henningandersen:enhance_reactive_storage_autoscaler_pr_final
Dec 13, 2020

Conversation

@henningandersen
Contributor

The reactive storage decider will request additional capacity
proportional to the size of shards that are either:

  • unassigned and unable to be allocated to any node, with storage
    being the only blocking reason
  • unable to remain on their current node, with storage being the only
    reason, and unable to be allocated anywhere else
  • unable to remain on their current node and unable to be allocated
    to any node, with at least one node refusing allocation for storage
    as its only reason.

The reactive storage decider does not try to look into the future, so
at the time it asks to scale up, the cluster is already in need of more
storage.

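Under stated assumptions, the sizing rule above can be sketched in plain Java. Everything here is invented for illustration (the real decider works against `RoutingAllocation` and `ClusterInfo`, not a simple record): the three bullet conditions collapse into a single "blocked only by storage" flag, and the requested extra capacity is the sum of the affected shard sizes.

```java
import java.util.List;

// Invented, heavily simplified model: the decider requests extra capacity
// equal to the total size of shards whose only allocation blocker is storage.
record Shard(long sizeInBytes, boolean blockedOnlyByStorage) {}

public class ReactiveStorageSketch {
    public static long requiredExtraBytes(List<Shard> problemShards) {
        return problemShards.stream()
                .filter(Shard::blockedOnlyByStorage) // conditions 1-3 above collapse to this flag
                .mapToLong(Shard::sizeInBytes)
                .sum();
    }
}
```

Note the "reactive" framing: the number is derived purely from the cluster's current state, with no forecasting of future growth.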
@henningandersen henningandersen added >non-issue v8.0.0 :Distributed/Autoscaling Automatically adding or removing nodes in a cluster v7.11.0 labels Nov 25, 2020
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Nov 26, 2020
Extracted DiskUsageIntegTestCase from DiskThresholdDeciderIT to allow
other tests to easily test functionality relying on disk usage.

Relates elastic#65520
allocationDeciders are now given to the service at construction time.
A few test fixes.
Remove context.roles().
Fix unmovable test.
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Dec 4, 2020
henningandersen added a commit that referenced this pull request Dec 4, 2020
@henningandersen
Contributor Author

@elasticmachine update branch

@henningandersen
Contributor Author

@elasticmachine update branch

Member

@jasontedor left a comment


This is really great work. I left a few minor comments, but no need for another round.


ClusterInfo info();

SnapshotShardSizeInfo snapshotShardSizeInfo();
Member


Could you add Javadocs to these new methods, and also state?
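A sketch of what such Javadocs might look like. The enclosing interface name, the stub types, and the Javadoc wording below are all invented for illustration; only `info()` and `snapshotShardSizeInfo()` come from the snippet above.

```java
// Stub types so the sketch compiles standalone; the real ones live in Elasticsearch.
interface ClusterInfo {}
interface SnapshotShardSizeInfo {}

interface AllocationDeciderContext {
    /**
     * @return the most recently observed disk usage and shard size information
     *         for the cluster
     */
    ClusterInfo info();

    /**
     * @return sizes of shards being restored from snapshots, for use when no
     *         size is available from {@link ClusterInfo} yet
     */
    SnapshotShardSizeInfo snapshotShardSizeInfo();
}
```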

Contributor Author


👍, 22456be

DataTier.DATA_CONTENT_NODE_ROLE,
DataTier.DATA_HOT_NODE_ROLE,
DataTier.DATA_WARM_NODE_ROLE,
DataTier.DATA_COLD_NODE_ROLE
Member


We'll probably want a test that collects all the roles that return true for DiscoveryNodeRole#canContainData and ensure they are returned in this list. I'm thinking of when we add a role for frozen, ensuring that this list is maintained properly.
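A sketch of such a guard test, using stand-in types (the real test would iterate the actual `DiscoveryNodeRole` instances and check `canContainData`; the `Role` enum and names here are invented): if a new data-capable role such as frozen is added but not listed in the decider's role set, the check fails.

```java
import java.util.EnumSet;
import java.util.Set;

// Stand-in for DiscoveryNodeRole; only the canContainData flag matters here.
enum Role {
    MASTER(false), DATA_CONTENT(true), DATA_HOT(true), DATA_WARM(true), DATA_COLD(true);

    final boolean canContainData;
    Role(boolean canContainData) { this.canContainData = canContainData; }
}

public class DeciderRolesGuard {
    // The list the decider hard-codes; a newly added data role would fail
    // the check below until it is added here too.
    static final Set<Role> DECIDER_ROLES =
            EnumSet.of(Role.DATA_CONTENT, Role.DATA_HOT, Role.DATA_WARM, Role.DATA_COLD);

    public static boolean coversAllDataRoles() {
        for (Role role : Role.values()) {
            if (role.canContainData && DECIDER_ROLES.contains(role) == false) {
                return false;
            }
        }
        return true;
    }
}
```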

Contributor Author


👍, 555991a

}

static boolean isDiskOnlyNoDecision(Decision decision) {
// we consider throttling==yes, throttling should be temporary.
Member


👍
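The rule the comment in the snippet describes can be sketched with stand-in types (the `SubDecision` record, the decider name string, and the enum here are invented; the real code inspects Elasticsearch `Decision` objects): a shard counts as storage-blocked only when the disk threshold decider is the sole decider answering NO, and THROTTLE is treated like YES because throttling should be temporary.

```java
import java.util.List;

// Stand-ins for the real Decision types (invented for illustration).
enum DecisionType { YES, THROTTLE, NO }
record SubDecision(String decider, DecisionType type) {}

public class DiskOnlySketch {
    // True when allocation is blocked and the disk threshold decider is the
    // only decider saying NO; THROTTLE does not count as a blocker.
    public static boolean isDiskOnlyNoDecision(List<SubDecision> decisions) {
        boolean diskSaysNo = false;
        for (SubDecision d : decisions) {
            if (d.type() == DecisionType.NO) {
                if ("disk_threshold".equals(d.decider()) == false) {
                    return false; // something other than storage blocks this shard
                }
                diskSaysNo = true;
            }
        }
        return diskSaysNo;
    }
}
```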

*/
private boolean cannotAllocateDueToStorage(ShardRouting shard, RoutingAllocation allocation) {
assert allocation.debugDecision() == false;
allocation.debugDecision(true);
Member


Can you leave a comment explaining why we need to enable allocation debugging here?
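One plausible reading of the pattern in the snippet, sketched with stand-in types (the real code uses `RoutingAllocation`; the `Allocation` class below is invented): deciders only record explanations for their answers when debug mode is enabled, so it is switched on around the check and restored afterwards since the allocation object is shared.

```java
public class DebugDecisionSketch {
    // Minimal stand-in for RoutingAllocation's debug flag.
    static class Allocation {
        private boolean debug;
        boolean debugDecision() { return debug; }
        void debugDecision(boolean debug) { this.debug = debug; }
    }

    static boolean checkWithDebugEnabled(Allocation allocation) {
        assert allocation.debugDecision() == false;
        allocation.debugDecision(true); // collect per-decider explanations
        try {
            // run the allocation deciders and inspect individual NO reasons here
            return allocation.debugDecision();
        } finally {
            allocation.debugDecision(false); // always restore the shared state
        }
    }
}
```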

Contributor Author


👍, 6a9c5cb

assert assigned >= 0;
assert unassigned >= 0;
assert maxShard >= 0;
String message = unassigned > 0 || assigned > 0 ? "not enough storage available, needs " + (unassigned + assigned) : "storage ok";
Member


I wonder if this should be human readable bytes? So new ByteSizeValue(unassigned + assigned).toString()?

Member


And if not, bytes should be appended to the message.
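For context, a rough plain-Java stand-in for the suggested formatting (the helper below is invented for illustration; `ByteSizeValue` is the real Elasticsearch class, which renders counts like `1.5gb`):

```java
import java.util.Locale;

public class HumanBytes {
    private static final String[] UNITS = {"b", "kb", "mb", "gb", "tb", "pb"};

    // Roughly mimics ByteSizeValue#toString: largest fitting unit, one decimal.
    public static String format(long bytes) {
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        return unit == 0 ? bytes + "b" : String.format(Locale.ROOT, "%.1f%s", value, UNITS[unit]);
    }
}
```

With this, a message like `needs 1610612736` becomes `needs 1.5gb`, which is far easier to read in an autoscaling explanation.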

Contributor Author


👍, b58d294

@henningandersen henningandersen merged commit 5e20c0a into elastic:master Dec 13, 2020
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Dec 13, 2020
* elastic/master:
  Autoscaling reactive storage decider (elastic#65520)
  Fix TranslogTests#testStats (elastic#66227)
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Dec 13, 2020
henningandersen added a commit that referenced this pull request Dec 14, 2020

Labels

:Distributed/Autoscaling Automatically adding or removing nodes in a cluster >non-issue Team:Distributed Meta label for distributed team. v7.11.0 v8.0.0-alpha1

4 participants