Switch indices read-only if a node runs out of disk space #25541
s1monw merged 8 commits into elastic:master from
Conversation
Today when we run out of disk all kinds of crazy things can happen, and nodes become hard to maintain once out-of-disk is hit. While we try to move shards away when we hit watermarks, this may not be possible in many situations. Based on the discussion in elastic#24299, this change monitors disk utilization and adds a flood-stage watermark that causes all indices allocated on a node hitting the flood-stage mark to be switched read-only (with the option to be deleted). This allows users to react to the low-disk situation while subsequent write requests are rejected. Users can switch individual indices back to read-write once the situation is sorted out. There is no automatic read-write switch once the node has enough space again; this requires user interaction. The flood-stage watermark is set to `95%` utilization by default. Closes elastic#24299
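The flood-stage behavior described above can be sketched roughly as follows. This is a minimal, illustrative Python sketch of the decision logic only (collect the indices hosted on any node whose free disk is below the watermark), not the actual Java implementation; all names here are made up for the example.

```python
# Illustrative sketch of the flood-stage check: if a node's free disk is
# below either the absolute-bytes or the percentage threshold, every index
# with a shard on that node is a candidate to be marked read-only.

def indices_to_mark_read_only(disk_usages, shards_by_node,
                              flood_stage_free_bytes, flood_stage_free_percent):
    """Return the set of indices hosted on nodes past the flood stage."""
    indices = set()
    for node, usage in disk_usages.items():
        below_bytes = usage["free_bytes"] < flood_stage_free_bytes
        below_percent = usage["free_percent"] < flood_stage_free_percent
        if below_bytes or below_percent:
            indices.update(shards_by_node.get(node, ()))
    return indices

usages = {
    "node1": {"free_bytes": 2_000_000, "free_percent": 3.0},   # under 5% free
    "node2": {"free_bytes": 80_000_000, "free_percent": 40.0},
}
shards = {"node1": {"logs-2017", "metrics"}, "node2": {"other"}}

# A 95% utilization watermark corresponds to 5% free disk.
print(sorted(indices_to_mark_read_only(usages, shards, 1_000_000, 5.0)))
# ['logs-2017', 'metrics']
```

Note there is deliberately no reverse path in this sketch: nothing clears the read-only state when disk frees up, matching the PR's "requires user interaction" design.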
bleskes
left a comment
LGTM. Left a bunch of small nits.
    if (usage.getFreeBytes() < diskThresholdSettings.getFreeBytesThresholdFloodStage().getBytes() ||
            usage.getFreeDiskAsPercentage() < diskThresholdSettings.getFreeDiskThresholdFloodStage()) {
        RoutingNode routingNode = state.getRoutingNodes().node(node);
        if (routingNode != null) { // this might happen if we haven't got the full cluster-state yet?!
Sadly this happens when you have a non-data node. It's been on the hit list to remove (it's annoying imo and is a leftover of the node-as-client days), but no one has gotten to it yet.
    protected void markIndicesReadOnly(Set<String> indicesToMarkReadOnly) {
        // set read-only block but don't block on the response
        client.admin().indices().prepareUpdateSettings(indicesToMarkReadOnly.toArray(Strings.EMPTY_ARRAY)).
I think we should protect against errors here and log a warning. It doesn't seem like there is any protection higher up the call stack.
we do catch errors in InternalClusterInfoService and log a warning, so I think we are good here?
You're right - I missed it.
        (s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.watermark.high"),
        Setting.Property.Dynamic, Setting.Property.NodeScope);
    public static final Setting<String> CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_SETTING =
        new Setting<>("cluster.routing.allocation.disk.floodstage", "95%",
we're missing the watermark in the setting path (i.e. disk.floodstage vs disk.watermark.floodstage). Feels a bit strange to me, but I'm not a native speaker. @dakrone wdyt?
yeah so my logic was: we have 2 watermarks (high and low) plus the flood stage, so I left watermark out of the key?!
I am curious what @rjernst thinks; I am ok with whatever makes sense.
I think the key should be cluster.routing.allocation.disk.watermark.flood_stage.
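For context on what a watermark setting like this accepts: the value can be either a percentage (`"95%"`) or an absolute byte size (`"500mb"`). Here is a hedged Python sketch of that dual-format parsing; it mirrors the idea behind `validWatermarkSetting`, not its actual code, and `parse_watermark` and `UNITS` are names invented for this example.

```python
# Illustrative sketch: a watermark value is valid if it is either a
# percentage in [0, 100] or an absolute byte size with a unit suffix.

UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

def parse_watermark(value, setting_name):
    value = value.strip().lower()
    if value.endswith("%"):
        percent = float(value[:-1])
        if not 0.0 <= percent <= 100.0:
            raise ValueError(f"{setting_name}: percentage out of range: {value}")
        return ("percent", percent)
    # try the longest unit suffixes first so "mb" is not mistaken for "b"
    for suffix, factor in sorted(UNITS.items(), key=lambda kv: -len(kv[0])):
        if value.endswith(suffix):
            return ("bytes", int(float(value[: -len(suffix)]) * factor))
    raise ValueError(f"{setting_name}: expected a percentage or byte size, got {value}")

print(parse_watermark("95%", "cluster.routing.allocation.disk.watermark.flood_stage"))
# ('percent', 95.0)
print(parse_watermark("500mb", "cluster.routing.allocation.disk.watermark.flood_stage"))
# ('bytes', 524288000)
```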
    ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES_SETTING,
    DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING,
    DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING,
    DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_SETTING,
can we add DISK to this name (with or without watermark - see other comment). i.e. CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_DISK_WATERMARK or something?
      injector.getInstance(RoutingService.class).start();
      injector.getInstance(SearchService.class).start();
    - injector.getInstance(MonitorService.class).start();
    + nodeService.getMonitorService().start();
      toClose.add(injector.getInstance(Discovery.class));
      toClose.add(() -> stopWatch.stop().start("monitor"));
    - toClose.add(injector.getInstance(MonitorService.class));
    + toClose.add(nodeService.getMonitorService());
this is redundant. It seems to be already closed by nodeService. It's a bit weird that NodeService closes the monitor service but doesn't own calling the stop/start methods. Not a big deal though.
that's a leftover from a different path I tried...
@bleskes today we don't run a reroute if we go across the flood stage; the question is whether we should? I mean in reality we'd likely have moved shards away already if we could?
dakrone
left a comment
LGTM, I left one super minor nit, thanks for doing this!
      protected ClusterInfoService newClusterInfoService(Settings settings, ClusterService clusterService,
    -                                                    ThreadPool threadPool, NodeClient client) {
    +                                                    ThreadPool threadPool, NodeClient client, Consumer<ClusterInfo> listeners) {
          return new InternalClusterInfoService(settings, clusterService, threadPool, client);
Minor nit: listeners -> listener
I think we can rely on the fact that if the flood stage is on then the high watermark is also on, which has already kicked off the reroute (as you say). I also think that the flood stage is about making indices read-only (and only that); a follow-up shard-moving reroute won't change that (as we don't automatically make the indices writable again). So +1 to keeping as is and not adding a reroute.
Oh, something else I forgot to mention: I think we need to document this feature and the new flood level too?
        Setting.Property.Dynamic, Setting.Property.NodeScope);
    public static final Setting<String> CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_SETTING =
        new Setting<>("cluster.routing.allocation.disk.floodstage", "95%",
            (s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.floodstage"),
It seems that we have no validation that low <= high <= flood, but I think that we should?
I am happy to do this, but it should be a separate PR. It's also tricky to do since we don't have the full picture of the new settings when they are updated; essentially I don't think we can easily protect against this today. To make it work I think we need an extra validation round when settings are updated.
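The cross-setting check being discussed is straightforward once you do have the full, merged picture of all three values at once, which is exactly what the per-setting validators lack. A minimal Python sketch, assuming all three values are percentages (names invented for the example):

```python
# Illustrative sketch of the low <= high <= flood-stage invariant. This
# only works because it sees all three settings together in one call --
# a per-setting validator cannot enforce an ordering across settings.

def validate_watermarks(low_pct, high_pct, flood_pct):
    if not (low_pct <= high_pct <= flood_pct):
        raise ValueError(
            f"expected low <= high <= flood stage, "
            f"got low={low_pct}% high={high_pct}% flood={flood_pct}%")

validate_watermarks(85.0, 90.0, 95.0)      # ok with the defaults
try:
    validate_watermarks(85.0, 96.0, 95.0)  # high above flood stage
except ValueError as e:
    print(e)
```

This also illustrates why the validation needs an extra round over the merged settings: when only one of the three values arrives in an update, the other two must be read from the current state before the comparison can run.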
jasontedor
left a comment
I left a question. I think we need docs too.
I was waiting for initial feedback before documenting this; will add docs soon.
@jasontedor @dakrone I pushed docs and the requested changes
* master:
  Refactor PathTrie and RestController to use a single trie for all methods (elastic#25459)
  Switch indices read-only if a node runs out of disk space (elastic#25541)