Switch indices read-only if a node runs out of disk space#25541

Merged
s1monw merged 8 commits into elastic:master from s1monw:issues/24299 on Jul 5, 2017

Conversation

@s1monw
Contributor

@s1monw s1monw commented Jul 4, 2017

Today when we run out of disk, all kinds of crazy things can happen, and nodes
become hard to maintain once out-of-disk is hit. While we try to move shards
away when we hit watermarks, this might not be possible in many situations.
Based on the discussion in #24299, this change monitors disk utilization and
adds a flood-stage watermark that causes all indices allocated on a node
hitting the flood-stage mark to be switched read-only (with the option to be
deleted). This allows users to react to the low-disk situation while
subsequent write requests are rejected. Users can switch individual indices
back to read-write once the situation is sorted out. There is no automatic
read-write switch once the node has enough space again; this requires user
interaction.

The flood-stage watermark is set to `95%` utilization by default.

Closes #24299
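For readers following along, the flood-stage decision described above can be sketched as a minimal, self-contained check. The threshold values and names here are assumptions for illustration; the real logic lives in the disk threshold monitoring code and reads its values from the cluster settings:

```java
// Hypothetical sketch of the flood-stage watermark check described in this PR.
// The thresholds are illustrative (95% used by default, plus a made-up
// absolute free-bytes floor), not the actual DiskThresholdSettings values.
public class FloodStageCheck {

    static final double FREE_DISK_PERCENT_FLOOD_STAGE = 5.0;        // i.e. 95% used
    static final long FREE_BYTES_FLOOD_STAGE = 1024L * 1024 * 1024; // hypothetical 1gb floor

    /** True if the node crossed the flood-stage watermark on either measure. */
    static boolean exceedsFloodStage(long freeBytes, double freeDiskAsPercentage) {
        return freeBytes < FREE_BYTES_FLOOD_STAGE
            || freeDiskAsPercentage < FREE_DISK_PERCENT_FLOOD_STAGE;
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1024 * 1024 * 1024;
        // only 2% free: flood stage hit, indices on this node get the read-only block
        System.out.println(exceedsFloodStage(tenGb, 2.0));
        // 20% free: plenty of headroom, nothing happens
        System.out.println(exceedsFloodStage(tenGb, 20.0));
    }
}
```

Once the check fires for a node, every index with a shard allocated on it is switched read-only; recovering write access afterwards is an explicit user action.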

Contributor

@bleskes bleskes left a comment

LGTM. Left a bunch of small nits.

if (usage.getFreeBytes() < diskThresholdSettings.getFreeBytesThresholdFloodStage().getBytes() ||
        usage.getFreeDiskAsPercentage() < diskThresholdSettings.getFreeDiskThresholdFloodStage()) {
    RoutingNode routingNode = state.getRoutingNodes().node(node);
    if (routingNode != null) { // this might happen if we haven't got the full cluster-state yet?!
Contributor

Sadly, this happens when you have a non-data node. It's been on the hit list to remove (it's annoying, imo, and a leftover of the node-as-client days); no one has gotten to it yet.


protected void markIndicesReadOnly(Set<String> indicesToMarkReadOnly) {
    // set read-only block but don't block on the response
    client.admin().indices().prepareUpdateSettings(indicesToMarkReadOnly.toArray(Strings.EMPTY_ARRAY)).
Contributor

I think we should protect against errors here and log a warning. It doesn't seem like there is protection higher up the call stack.

Contributor Author

We do catch errors in InternalClusterInfoService and log a warning, so I think we are good here?

Contributor

You're right - I missed it.
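The fire-and-forget behavior discussed in this exchange, apply a read-only block per index and log failures rather than letting them propagate, could be sketched like this. This is a hypothetical, self-contained model: the `BiConsumer` stands in for the asynchronous ES admin-client call, and the block key is used here only as an illustration of what the settings update carries:

```java
// Hypothetical sketch of marking indices read-only without blocking on the
// response. The real code goes through client.admin().indices()
// .prepareUpdateSettings(...); a plain callback stands in for it here.
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

public class ReadOnlyMarker {

    static void markIndicesReadOnly(Set<String> indicesToMarkReadOnly,
                                    BiConsumer<String, Map<String, Object>> settingsUpdater) {
        // illustrative settings payload for the read-only (allow delete) block
        Map<String, Object> block = Map.of("index.blocks.read_only_allow_delete", true);
        for (String index : indicesToMarkReadOnly) {
            try {
                settingsUpdater.accept(index, block);
            } catch (RuntimeException e) {
                // log a warning instead of failing the disk monitor
                System.err.println("failed to mark [" + index + "] read-only: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        markIndicesReadOnly(Set.of("logs-2017.07.05"),
            (index, settings) -> System.out.println("updating " + index + " with " + settings));
    }
}
```

As noted above, in the actual PR the error handling sits in InternalClusterInfoService rather than in this method itself.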

(s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.watermark.high"),
Setting.Property.Dynamic, Setting.Property.NodeScope);
public static final Setting<String> CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_SETTING =
new Setting<>("cluster.routing.allocation.disk.floodstage", "95%",
Contributor

We're missing watermark in the setting path (i.e. disk.floodstage vs disk.watermark.floodstage). Feels a bit strange to me, but I'm not a native speaker. @dakrone wdyt?

Contributor Author

Yeah, my logic was: we have two watermarks (high and low) plus the flood stage, so I left watermark out of the key?!

Contributor Author

I am curious what @rjernst thinks; I am ok with whatever makes sense.

Member

I think the key should be cluster.routing.allocation.disk.watermark.flood_stage.

ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES_SETTING,
DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING,
DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING,
DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_SETTING,
Contributor

Can we add DISK to this name (with or without WATERMARK; see the other comment)? I.e. CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_DISK_WATERMARK or something?

injector.getInstance(RoutingService.class).start();
injector.getInstance(SearchService.class).start();
injector.getInstance(MonitorService.class).start();
nodeService.getMonitorService().start();
Contributor

👍

toClose.add(injector.getInstance(Discovery.class));
toClose.add(() -> stopWatch.stop().start("monitor"));
toClose.add(injector.getInstance(MonitorService.class));
toClose.add(nodeService.getMonitorService());
Contributor

This is redundant; it seems to already be closed by NodeService. It's a bit weird that NodeService closes the monitor service but doesn't own calling the stop/start methods. Not a big deal though.

Contributor Author

that's a leftover from a different path I tried...

@s1monw
Contributor Author

s1monw commented Jul 5, 2017

@bleskes today we don't run a reroute if we cross the flood stage; the question is whether we should? I mean, in reality we'd likely have moved shards away already if we could?

Member

@dakrone dakrone left a comment

LGTM, I left one super minor nit, thanks for doing this!

protected ClusterInfoService newClusterInfoService(Settings settings, ClusterService clusterService,
ThreadPool threadPool, NodeClient client) {
return new InternalClusterInfoService(settings, clusterService, threadPool, client);
ThreadPool threadPool, NodeClient client, Consumer<ClusterInfo> listeners) {
Member

Minor nit: listeners -> listener

@bleskes
Contributor

bleskes commented Jul 5, 2017

@bleskes today we don't run a reroute if we cross the flood stage; the question is whether we should? I mean, in reality we'd likely have moved shards away already if we could?

I think we can rely on the fact that if the flood stage is on then the high watermark is also on, which has already kicked off the reroute (as you say). I also think that flood stage is about making indices read-only (and only that). A follow-up reroute moving shards won't change that (as we don't automatically make the indices writable again). So +1 to keeping it as is and not adding a reroute.

@dakrone
Member

dakrone commented Jul 5, 2017

Oh, something else I forgot to mention: I think we need to document this feature and the new flood level too?

Setting.Property.Dynamic, Setting.Property.NodeScope);
public static final Setting<String> CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_SETTING =
new Setting<>("cluster.routing.allocation.disk.floodstage", "95%",
(s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.floodstage"),
Member

It seems that we have no validation that low <= high <= flood but I think that we should?

Contributor Author

I am happy to do this, but it should be a separate PR. It's also tricky, since we don't have a full picture of the new settings while they are being updated; essentially, I don't think we can easily protect against this today. To make it work, I think we need an extra validation round when settings are updated.

Member

For this I opened #25560 as a possible approach.
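The ordering constraint discussed here (low <= high <= flood stage, in percent used) could be sketched as a standalone validation. This is purely illustrative: the method name and shape are assumptions, and it deliberately ignores the hard part raised above, namely that during a dynamic settings update you don't yet have the full picture of all three values:

```java
// Hypothetical sketch of cross-watermark validation: reject configurations
// where the percent-used thresholds are not monotonically increasing.
// Names are illustrative, not the eventual implementation.
public class WatermarkOrdering {

    static void validate(double lowPercentUsed, double highPercentUsed, double floodStagePercentUsed) {
        if (lowPercentUsed > highPercentUsed) {
            throw new IllegalArgumentException(
                "low watermark [" + lowPercentUsed + "%] must not exceed high watermark ["
                    + highPercentUsed + "%]");
        }
        if (highPercentUsed > floodStagePercentUsed) {
            throw new IllegalArgumentException(
                "high watermark [" + highPercentUsed + "%] must not exceed flood stage ["
                    + floodStagePercentUsed + "%]");
        }
    }

    public static void main(String[] args) {
        validate(85.0, 90.0, 95.0); // the defaults: fine
        try {
            validate(90.0, 85.0, 95.0); // misordered: rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In practice this kind of check needs to run against the complete post-update settings snapshot, which is the extra validation round mentioned in the comment above.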

Member

@jasontedor jasontedor left a comment

I left a question. I think we need docs too.

@s1monw
Contributor Author

s1monw commented Jul 5, 2017

I was waiting for initial feedback before documenting this... will add docs soon.

@s1monw
Contributor Author

s1monw commented Jul 5, 2017

@jasontedor @dakrone I pushed docs and the requested changes

Member

@jasontedor jasontedor left a comment

LGTM.

@s1monw s1monw merged commit 6e5cc42 into elastic:master Jul 5, 2017
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jul 6, 2017
* master:
  Refactor PathTrie and RestController to use a single trie for all methods (elastic#25459)
  Switch indices read-only if a node runs out of disk space (elastic#25541)
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Cluster labels Feb 13, 2018

Labels

:Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >enhancement resiliency v6.0.0-beta1

6 participants