Skip to content

Handle negative free disk space in deciders#48392

Merged
DaveCTurner merged 5 commits intoelastic:masterfrom
DaveCTurner:2019-10-23-handle-negative-free-space
Oct 23, 2019
Merged

Handle negative free disk space in deciders#48392
DaveCTurner merged 5 commits intoelastic:masterfrom
DaveCTurner:2019-10-23-handle-negative-free-space

Conversation

@DaveCTurner
Copy link
Copy Markdown
Member

Today it is possible that the total size of all relocating shards exceeds the
total amount of free disk space. For instance, this may be caused by another
user of the same disk increasing their disk usage, or may be due to how
Elasticsearch double-counts relocations that are nearly complete particularly
if there are many concurrent relocations in progress.

The DiskThresholdDecider treats negative free space similarly to zero free
space, but it then fails when rendering the messages that explain its decision.
This commit fixes its handling of negative free space.

Fixes #48380

Today it is possible that the total size of all relocating shards exceeds the
total amount of free disk space. For instance, this may be caused by another
user of the same disk increasing their disk usage, or may be due to how
Elasticsearch double-counts relocations that are nearly complete particularly
if there are many concurrent relocations in progress.

The `DiskThresholdDecider` treats negative free space similarly to zero free
space, but it then fails when rendering the messages that explain its decision.
This commit fixes its handling of negative free space.

Fixes elastic#48380
@DaveCTurner DaveCTurner added >bug :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.5.0 v7.6.0 v7.4.2 v6.8.5 labels Oct 23, 2019
@DaveCTurner DaveCTurner requested a review from ywelsch October 23, 2019 12:44
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (:Distributed/Allocation)

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether we would like DiskUsage.freeBytes() to ever return a negative value. Its serialization as a VLong e.g. suggests that that value can never be negative. I know that we only expect this to be negative if we have locally create a DiskUsage instance that is not serialized, but that seems trappy either way. I wonder if we should instead add a new class that represents a DiskUsage object with relocations taken into account and use that object for the decision making and logging here. We can do this in a follow-up though.

@DaveCTurner
Copy link
Copy Markdown
Member Author

DiskUsageTests mention cases where the disk usage is negative, although looking at ESFileStore I don't think that this can really happen any more. But you're certainly right that negative disk usage stats can't be serialised and this seems bad.

I added a new class in 9ea918f.

@DaveCTurner DaveCTurner requested a review from ywelsch October 23, 2019 14:36
@ywelsch
Copy link
Copy Markdown
Contributor

ywelsch commented Oct 23, 2019

DiskUsageTests mention cases where the disk usage is negative, although looking at ESFileStore I don't think that this can really happen any more.

Should we fix the tests then?

I would like to have these assertions in DiskUsage checking that values >= 0?

long getFreeBytes() {
return diskUsage.getFreeBytes() - relocatingShardSize;
try {
return Math.subtractExact(diskUsage.getFreeBytes(), relocatingShardSize);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could this overflow?
Isn't relocatingShardSize >= 0 and 0 <= diskUsage.getFreeBytes() <= Long.MAX_VALUE? Is this to capture the case where diskUsage.getFreeBytes() == 0 && relocatingShardSize == Long.MAX_VALUE?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, relocatingShardSize is the net size of all relocations, i.e. shards in minus shards out, so can indeed be negative :(

@DaveCTurner
Copy link
Copy Markdown
Member Author

I opened #48413 to investigate other sources of negative sizes because it needs a deeper look than I can offer right now.

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner DaveCTurner merged commit 36b03a2 into elastic:master Oct 23, 2019
@DaveCTurner DaveCTurner deleted the 2019-10-23-handle-negative-free-space branch October 23, 2019 15:58
DaveCTurner added a commit that referenced this pull request Oct 23, 2019
Today it is possible that the total size of all relocating shards exceeds the
total amount of free disk space. For instance, this may be caused by another
user of the same disk increasing their disk usage, or may be due to how
Elasticsearch double-counts relocations that are nearly complete particularly
if there are many concurrent relocations in progress.

The `DiskThresholdDecider` treats negative free space similarly to zero free
space, but it then fails when rendering the messages that explain its decision.
This commit fixes its handling of negative free space.

Fixes #48380
DaveCTurner added a commit that referenced this pull request Oct 23, 2019
Today it is possible that the total size of all relocating shards exceeds the
total amount of free disk space. For instance, this may be caused by another
user of the same disk increasing their disk usage, or may be due to how
Elasticsearch double-counts relocations that are nearly complete particularly
if there are many concurrent relocations in progress.

The `DiskThresholdDecider` treats negative free space similarly to zero free
space, but it then fails when rendering the messages that explain its decision.
This commit fixes its handling of negative free space.

Fixes #48380
DaveCTurner added a commit that referenced this pull request Oct 23, 2019
Today it is possible that the total size of all relocating shards exceeds the
total amount of free disk space. For instance, this may be caused by another
user of the same disk increasing their disk usage, or may be due to how
Elasticsearch double-counts relocations that are nearly complete particularly
if there are many concurrent relocations in progress.

The `DiskThresholdDecider` treats negative free space similarly to zero free
space, but it then fails when rendering the messages that explain its decision.
This commit fixes its handling of negative free space.

Fixes #48380
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 23, 2019
In elastic#48392 we added a second computation of the sizes of the relocating shards
in `canRemain()` but passed the wrong value for `subtractLeavingShards`. This
fixes that. It also removes some unnecessary logging in a test case added in
the same commit.
DaveCTurner added a commit that referenced this pull request Oct 24, 2019
In #48392 we added a second computation of the sizes of the relocating shards
in `canRemain()` but passed the wrong value for `subtractLeavingShards`. This
fixes that. It also removes some unnecessary logging in a test case added in
the same commit.
DaveCTurner added a commit that referenced this pull request Oct 24, 2019
In #48392 we added a second computation of the sizes of the relocating shards
in `canRemain()` but passed the wrong value for `subtractLeavingShards`. This
fixes that. It also removes some unnecessary logging in a test case added in
the same commit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v6.8.5 v7.5.0 v7.6.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IllegalArgumentException: Values less than -1 bytes are not supported on DiskThresholdDecider

4 participants