Switch indices read-only if a node runs out of disk space #25541
s1monw merged 8 commits into elastic:master from
Conversation
Today when we run out of disk all kinds of crazy things can happen, and nodes become hard to maintain once out-of-disk is hit. While we try to move shards away when we hit watermarks, this may not be possible in many situations. Based on the discussion in elastic#24299, this change monitors disk utilization and adds a flood-stage watermark that causes all indices allocated on a node hitting the flood-stage mark to be switched read-only (with the option to be deleted). This allows users to react to the low-disk situation while subsequent write requests are rejected. Users can switch individual indices back to read-write once the situation is sorted out. There is no automatic read-write switch once the node has enough space again; this requires user interaction. The flood-stage watermark is set to `95%` utilization by default. Closes elastic#24299
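The flood-stage behavior described above can be sketched roughly as follows. This is a minimal, illustrative Python sketch of the decision logic only (collect the indices hosted on any node whose free disk is below the watermark), not the actual Java implementation; all names here are made up for the example.

```python
# Illustrative sketch of the flood-stage check: if a node's free disk is
# below either the absolute-bytes or the percentage threshold, every index
# with a shard on that node is a candidate to be marked read-only.

def indices_to_mark_read_only(disk_usages, shards_by_node,
                              flood_stage_free_bytes, flood_stage_free_percent):
    """Return the set of indices hosted on nodes past the flood stage."""
    indices = set()
    for node, usage in disk_usages.items():
        below_bytes = usage["free_bytes"] < flood_stage_free_bytes
        below_percent = usage["free_percent"] < flood_stage_free_percent
        if below_bytes or below_percent:
            indices.update(shards_by_node.get(node, ()))
    return indices

usages = {
    "node1": {"free_bytes": 2_000_000, "free_percent": 3.0},   # under 5% free
    "node2": {"free_bytes": 80_000_000, "free_percent": 40.0},
}
shards = {"node1": {"logs-2017", "metrics"}, "node2": {"other"}}

# A 95% utilization watermark corresponds to 5% free disk.
print(sorted(indices_to_mark_read_only(usages, shards, 1_000_000, 5.0)))
# ['logs-2017', 'metrics']
```

Note there is deliberately no reverse path in this sketch: nothing clears the read-only state when disk frees up, matching the PR's "requires user interaction" design.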
bleskes
left a comment
LGTM. Left a bunch of small nits.
    if (usage.getFreeBytes() < diskThresholdSettings.getFreeBytesThresholdFloodStage().getBytes() ||
            usage.getFreeDiskAsPercentage() < diskThresholdSettings.getFreeDiskThresholdFloodStage()) {
        RoutingNode routingNode = state.getRoutingNodes().node(node);
        if (routingNode != null) { // this might happen if we haven't got the full cluster-state yet?!
Sadly this happens when you have a non-data node. It's been on the hit list to remove (it's annoying imo and is a leftover of the node-as-client days), but no one has gotten to it yet.
    protected void markIndicesReadOnly(Set<String> indicesToMarkReadOnly) {
        // set read-only block but don't block on the response
        client.admin().indices().prepareUpdateSettings(indicesToMarkReadOnly.toArray(Strings.EMPTY_ARRAY)).
I think we should protect against errors here and log a warning. It doesn't seem like there is any protection higher up the call stack.
we do catch errors in InternalClusterInfoService and log a warning, so I think we are good here?
You're right - I missed it.
        (s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.watermark.high"),
        Setting.Property.Dynamic, Setting.Property.NodeScope);
    public static final Setting<String> CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_SETTING =
        new Setting<>("cluster.routing.allocation.disk.floodstage", "95%",
we're missing the watermark in the setting path (i.e. disk.floodstage vs disk.watermark.floodstage). Feels a bit strange to me, but I'm not a native speaker. @dakrone wdyt?
yeah so my logic was: we have 2 watermarks (high and low) plus the flood stage, so I left watermark out of the key?!
I am curious what @rjernst thinks; I am ok with whatever makes sense.
I think the key should be cluster.routing.allocation.disk.watermark.flood_stage.
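For context on what a watermark setting like this accepts: the value can be either a percentage (`"95%"`) or an absolute byte size (`"500mb"`). Here is a hedged Python sketch of that dual-format parsing; it mirrors the idea behind `validWatermarkSetting`, not its actual code, and `parse_watermark` and `UNITS` are names invented for this example.

```python
# Illustrative sketch: a watermark value is valid if it is either a
# percentage in [0, 100] or an absolute byte size with a unit suffix.

UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

def parse_watermark(value, setting_name):
    value = value.strip().lower()
    if value.endswith("%"):
        percent = float(value[:-1])
        if not 0.0 <= percent <= 100.0:
            raise ValueError(f"{setting_name}: percentage out of range: {value}")
        return ("percent", percent)
    # try the longest unit suffixes first so "mb" is not mistaken for "b"
    for suffix, factor in sorted(UNITS.items(), key=lambda kv: -len(kv[0])):
        if value.endswith(suffix):
            return ("bytes", int(float(value[: -len(suffix)]) * factor))
    raise ValueError(f"{setting_name}: expected a percentage or byte size, got {value}")

print(parse_watermark("95%", "cluster.routing.allocation.disk.watermark.flood_stage"))
# ('percent', 95.0)
print(parse_watermark("500mb", "cluster.routing.allocation.disk.watermark.flood_stage"))
# ('bytes', 524288000)
```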
    ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES_SETTING,
    DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING,
    DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING,
    DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_SETTING,
can we add DISK to this name (with or without watermark - see other comment). i.e. CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_DISK_WATERMARK or something?
      injector.getInstance(RoutingService.class).start();
      injector.getInstance(SearchService.class).start();
    - injector.getInstance(MonitorService.class).start();
    + nodeService.getMonitorService().start();
      toClose.add(injector.getInstance(Discovery.class));
      toClose.add(() -> stopWatch.stop().start("monitor"));
    - toClose.add(injector.getInstance(MonitorService.class));
    + toClose.add(nodeService.getMonitorService());
this is redundant. It seems to be already closed by nodeService. It's a bit weird that NodeService closes the monitor service but doesn't own calling the stop/start methods. Not a big deal though.
that's a leftover from a different path I tried...
@bleskes today we don't run a reroute if we go across the flood stage; the question is whether we should? I mean in reality we'd likely have moved shards away already if we could?
dakrone
left a comment
LGTM, I left one super minor nit, thanks for doing this!
      protected ClusterInfoService newClusterInfoService(Settings settings, ClusterService clusterService,
    -                                                    ThreadPool threadPool, NodeClient client) {
    +                                                    ThreadPool threadPool, NodeClient client, Consumer<ClusterInfo> listeners) {
          return new InternalClusterInfoService(settings, clusterService, threadPool, client);
Minor nit: listeners -> listener
I think we can rely on the fact that if the flood stage is on then the high watermark is also on, which has already kicked off the reroute (as you say). I also think that the flood stage is about making indices read-only (and only that); a follow-up shard-moving reroute won't change that (as we don't automatically make the indices writable again). So +1 to keeping as is and not adding a reroute.
Oh, something else I forgot to mention: I think we need to document this feature and the new flood level too?
        Setting.Property.Dynamic, Setting.Property.NodeScope);
    public static final Setting<String> CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_SETTING =
        new Setting<>("cluster.routing.allocation.disk.floodstage", "95%",
            (s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.floodstage"),
It seems that we have no validation that low <= high <= flood, but I think that we should?
I am happy to do this, but it should be a separate PR. It's also tricky to do since we don't have the full picture of the new settings when they are updated; essentially I don't think we can easily protect against this today. To make it work I think we need an extra validation round when settings are updated.
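The cross-setting check being discussed is straightforward once you do have the full, merged picture of all three values at once, which is exactly what the per-setting validators lack. A minimal Python sketch, assuming all three values are percentages (names invented for the example):

```python
# Illustrative sketch of the low <= high <= flood-stage invariant. This
# only works because it sees all three settings together in one call --
# a per-setting validator cannot enforce an ordering across settings.

def validate_watermarks(low_pct, high_pct, flood_pct):
    if not (low_pct <= high_pct <= flood_pct):
        raise ValueError(
            f"expected low <= high <= flood stage, "
            f"got low={low_pct}% high={high_pct}% flood={flood_pct}%")

validate_watermarks(85.0, 90.0, 95.0)      # ok with the defaults
try:
    validate_watermarks(85.0, 96.0, 95.0)  # high above flood stage
except ValueError as e:
    print(e)
```

This also illustrates why the validation needs an extra round over the merged settings: when only one of the three values arrives in an update, the other two must be read from the current state before the comparison can run.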
jasontedor
left a comment
I left a question. I think we need docs too.
I was waiting for initial feedback before documenting this; will add docs soon.
@jasontedor @dakrone I pushed docs and the requested changes
* master:
  Refactor PathTrie and RestController to use a single trie for all methods (elastic#25459)
  Switch indices read-only if a node runs out of disk space (elastic#25541)