Shards with heavy indexing should get more of the indexing buffer #14121
mikemccand merged 24 commits into elastic:master from
Conversation
Is idle really needed any more?
Alas it's still needed. I removed it at first! And was so happy about the simplification :) But then it broke sync'd flush, which is triggered by that `onShardInactive` call.
can we keep it internally? I mean we could store it alongside the active boolean?
I wonder if we should ever ask for more refreshes than we need to get below the buffer. I think the old approach would keep the buffer's heap usage at or under its configured size, but this scheme will keep the buffer closer to full all the time. In the most degenerate case of a bunch of shards (n) being written to at the same rate, this implementation will end up with more heap "floating".
Actually Lucene's … If we wanted, we could add some hysteresis, e.g. when we cross full, drive the buffer back down to maybe 75% of full, to give it a sharper sawtooth pattern ... but I don't think this is really necessary. Typically the scheduled refresh (default: every 1s) is going to free the RAM well before …
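The hysteresis idea above (cross full, then free down to a lower-water mark for a sharper sawtooth) could be sketched roughly like this. This is an illustrative helper only, not code from the PR; the class name, the 75% fraction, and the byte-counting API are all assumptions.

```java
// Hypothetical sketch of the hysteresis discussed above: once total buffered
// bytes exceed the full budget (high-water mark), we free enough to get back
// down to a lower-water mark (e.g. 75% of full), instead of stopping right at
// the budget. Not the actual Elasticsearch implementation.
class BufferHysteresis {
    private final long highWaterBytes;
    private final long lowWaterBytes;

    BufferHysteresis(long budgetBytes, double lowWaterFraction) {
        this.highWaterBytes = budgetBytes;
        this.lowWaterBytes = (long) (budgetBytes * lowWaterFraction);
    }

    /** How many bytes to free given the current total, or 0 if we are under budget. */
    long bytesToFree(long totalBytes) {
        if (totalBytes <= highWaterBytes) {
            return 0; // below the high-water mark: nothing to do
        }
        return totalBytes - lowWaterBytes; // drive usage back to the low-water mark
    }
}
```

With a 100-byte budget and a 0.75 low-water fraction, a total of 90 bytes triggers nothing, while a total of 120 bytes asks for 45 bytes to be freed (down to 75).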
Yeah - I suspect so. I just thought about it and figured it was worth mentioning.
I chatted with @s1monw about this ... I think we can add an API to … I agree the stalling issue is important, so just pretending "0 bytes heap used" as soon as we start moving bytes to disk is dangerous. I'll change the PR to track "pending dirty bytes moving to disk", and if that pending count is too large vs. our budget, we need to throttle incoming indexing, hopefully just tapping into the index throttling we already have for when merges are falling behind. The OS will do its own scary back-pressure here (blocking any thread, or maybe/probably the whole process, that's attempting to write to disk) when its efforts to move dirty bytes to disk are falling behind.
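The "pending dirty bytes moving to disk" tracking described above could look something like the following sketch. All names here are hypothetical; the real PR hooks into Elasticsearch's existing index throttling rather than a standalone class like this.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical back-pressure sketch: count bytes that a refresh has moved out
// of the indexing buffer but that are still being written to disk. If that
// pending total grows too large relative to the node's indexing-buffer budget,
// throttle incoming indexing instead of pretending the heap is already free.
class PendingDirtyBytesThrottle {
    private final long maxPendingBytes;
    private final AtomicLong pendingBytes = new AtomicLong();

    PendingDirtyBytesThrottle(long nodeBudgetBytes) {
        // Assumption for illustration: allow pending writes up to half the budget.
        this.maxPendingBytes = nodeBudgetBytes / 2;
    }

    void onRefreshStarted(long bytesMovingToDisk) {
        pendingBytes.addAndGet(bytesMovingToDisk);
    }

    void onRefreshFinished(long bytesMovingToDisk) {
        pendingBytes.addAndGet(-bytesMovingToDisk);
    }

    /** True when incoming indexing should be throttled. */
    boolean shouldThrottleIndexing() {
        return pendingBytes.get() > maxPendingBytes;
    }
}
```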
I merged master and fixed IMC to "just" be another … However, a number of the CI jobs have been failing: https://build-us-01.elastic.co/view/Elasticsearch/job/elastic+elasticsearch+fair_indexing_buffers+periodic/ The failures are weird, like something is interrupting the build (…)
Can this be an `IndexingOperationListener`? That way we don't introduce a hard dependency on IMC.
@mikemccand awesome! I left a bunch of comments but this looks fantastic
LGTM |
The indexing buffer on a node (default: 10% of the JVM heap) is now a "shared pool" across all shards on that node. This way, shards doing intense indexing can use much more than other shards doing only light indexing, and only once the sum of all indexing buffers across all shards exceeds the node's indexing buffer will we ask shards to move recently indexed documents to segments on disk.
I removed 2.3.0 from this ... it's a big change, and its precursors haven't been ported to 2.3.0, so I think it should be in our next major release only.
Today we take the total indexing buffer (default: 10% of heap) and divide it equally across all active shards.
But this is a sub-optimal usage of RAM for indexing: maybe the node has a bunch of small shards (e.g. Marvel) which require hardly any indexing heap, but were assigned a big chunk of heap (which typically goes mostly unused), while other heavy indexing shards were assigned the same indexing buffer but could effectively make use of much more.
This problem is very nearly the same issue `IndexWriter` faces: being told it has an X MB overall indexing buffer to use and then having to manage the N separate in-memory segments (one per thread).
I think we (ES) should take the same approach as `IndexWriter`, except across shards on the node: tell Lucene each shard has an effectively unlimited indexing buffer, but then periodically sum the actual bytes used across all shards, and when the total is over the node's budget, ask the most-heap-consuming shard(s) to refresh to clear heap.
This should also reduce merge pressure across the node, since we'd typically be flushing fewer, larger segments, and it helps smooth out IO pressure somewhat (instead of N shards trying to write at once, we stage it over time).
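The "sum across shards, refresh the biggest consumers" policy described above can be sketched as follows. This is a simplified illustration, not the PR's actual `IndexingMemoryController` code; the class and method names are made up, and real shard buffer accounting is more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the shared-pool policy: each shard reports its actual
// indexing-buffer bytes; when the node-wide sum exceeds the budget, we pick the
// largest consumers first and ask them to refresh until we are back under budget.
class SharedIndexingBufferPolicy {
    private final long nodeBudgetBytes;

    SharedIndexingBufferPolicy(long nodeBudgetBytes) {
        this.nodeBudgetBytes = nodeBudgetBytes;
    }

    /** Returns shard ids that should refresh, biggest first, until we are under budget. */
    List<String> shardsToRefresh(Map<String, Long> bytesUsedByShard) {
        long total = bytesUsedByShard.values().stream().mapToLong(Long::longValue).sum();
        List<String> toRefresh = new ArrayList<>();
        if (total <= nodeBudgetBytes) {
            return toRefresh; // under budget: nobody has to move bytes to disk
        }
        // Sort shards by heap used, descending.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(bytesUsedByShard.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        for (Map.Entry<String, Long> e : sorted) {
            if (total <= nodeBudgetBytes) {
                break;
            }
            toRefresh.add(e.getKey());
            total -= e.getValue(); // refreshing frees (approximately) this shard's buffer
        }
        return toRefresh;
    }
}
```

For example, with a 100 MB budget and shards using 80/30/10 MB, only the 80 MB shard would be asked to refresh, bringing the total from 120 MB down to 40 MB; lightly indexing shards are left alone.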
I also removed all configuration associated with the translog buffer (`index.translog.fs.buffer_size`): it's now hardwired to 32 KB. I don't understand why this buffer needs to be tunable: let the OS manage the RAM assigned for IO write buffering / dirty pages.