
Shards with heavy indexing should get more of the indexing buffer #14121

Merged

mikemccand merged 24 commits into elastic:master from mikemccand:fair_indexing_buffers
Jan 12, 2016

Conversation

@mikemccand
Contributor

Today we take the total indexing buffer (default: 10% of heap) and divide it equally across all active shards.

But this is a sub-optimal use of RAM for indexing: the node may have a bunch of small shards (e.g. Marvel) that require hardly any indexing heap yet are each assigned a big chunk of it (which typically goes mostly unused), while heavy-indexing shards are assigned the same indexing buffer but could effectively make use of much more.

This is very nearly the same problem IndexWriter faces: it is told it has an X MB overall indexing buffer and must then manage the N separate in-memory segments (one per thread).

I think we (ES) should take the same approach as IndexWriter, except across shards on the node: tell Lucene each shard has an effectively unlimited indexing buffer, then periodically sum the actual bytes used across all shards, and when the total exceeds the node's budget, ask the most heap-consuming shard(s) to refresh to free that heap.

This should also reduce merge pressure across the node, since we'd typically be flushing fewer, larger segments, and it helps smooth out IO pressure somewhat (instead of N shards all trying to write at once, the writes are staged over time).
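The policy described above can be sketched roughly as follows. This is an illustrative toy, not the actual IndexingMemoryController code: the names `ShardBuffer`, `refresh`, and `enforceBudget` are hypothetical, and a real refresh writes the in-memory segment to disk rather than simply zeroing a counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed policy: each shard indexes with an effectively
// unlimited buffer, and a periodic check sums the actual heap used; when
// the total exceeds the node budget, the largest consumers are asked to
// refresh (freeing their heap) until we are back under budget.
public class SharedBufferSketch {
    static class ShardBuffer {
        final String id;
        long bytesUsed;
        ShardBuffer(String id, long bytesUsed) { this.id = id; this.bytesUsed = bytesUsed; }
        void refresh() { bytesUsed = 0; } // writing the segment frees the indexing heap
    }

    // Returns the ids of shards asked to refresh, biggest consumer first.
    static List<String> enforceBudget(List<ShardBuffer> shards, long budgetBytes) {
        long total = shards.stream().mapToLong(s -> s.bytesUsed).sum();
        List<ShardBuffer> byUse = new ArrayList<>(shards);
        byUse.sort((a, b) -> Long.compare(b.bytesUsed, a.bytesUsed));
        List<String> refreshed = new ArrayList<>();
        for (ShardBuffer s : byUse) {
            if (total <= budgetBytes) break;
            total -= s.bytesUsed;
            s.refresh();
            refreshed.add(s.id);
        }
        return refreshed;
    }

    public static void main(String[] args) {
        List<ShardBuffer> shards = Arrays.asList(
            new ShardBuffer("marvel-0", 1), // tiny, mostly idle shard
            new ShardBuffer("logs-0", 60),  // heavy indexing shard
            new ShardBuffer("logs-1", 50));
        // Budget of 100: only the single biggest shard needs to refresh.
        System.out.println(enforceBudget(shards, 100)); // -> [logs-0]
    }
}
```

Note how the small Marvel-style shard is never asked to refresh: unlike the old equal-split scheme, it simply consumes what it needs.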

I also removed all configuration associated with the translog buffer (index.translog.fs.buffer_size): it's now hardwired to 32 KB. I don't see why this buffer needs to be tunable; the OS can manage the RAM assigned to IO write buffering / dirty pages.

Member

Is idle really needed any more?

Contributor Author

Alas it's still needed. I removed it at first! And was so happy about the simplification :) But then it broke sync'd flush, which is triggered by that onShardInactive call.

Contributor

Can we keep it internal? I mean, we could store it alongside the active boolean?

@nik9000
Member

nik9000 commented Oct 14, 2015

I wonder if we should ever ask for more refreshes than we need to get below the buffer. I think the old scheme would keep the heap usage under or at the buffer's size, but this scheme will keep the buffer closer to full all the time. In the most degenerate case, with a bunch of shards (n) being written to at the same rate, this implementation will end up with more bytes "floating".

@mikemccand
Contributor Author

I think the old buffer would keep the heap usage of the buffer under or at the buffer's size but this scheme will keep the buffer closer to full all the time.

Actually Lucene's IndexWriter works just like this change, riding the buffer near full: it waits for heap used to cross the limit, then picks the biggest in-memory segment(s) and writes them until heap used is under the budget again.

If we wanted we could add some hysteresis, e.g. when we cross full, we drive the buffer back down to maybe 75% of full, to give it a sharper sawtooth pattern ... but I don't think this is really necessary.

Typically the scheduled refresh (default: every 1s) is going to free the RAM well before IndexingMemoryController logic here kicks in.
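The hysteresis idea mentioned above (drain down to ~75% once the limit is crossed, instead of stopping right at the limit) amounts to a tiny bit of arithmetic. A minimal sketch, with hypothetical names and thresholds, not code from this PR:

```java
// Toy illustration of hysteresis: once heap use crosses the hard limit,
// free enough to reach a lower watermark (e.g. 75% of the limit), giving
// a sharper sawtooth instead of hovering right at the limit.
public class HysteresisSketch {
    // Given current heap use, the hard limit, and a low watermark fraction,
    // return how many bytes to free (0 if we are still under the limit).
    static long bytesToFree(long used, long limit, double lowWatermark) {
        if (used <= limit) return 0;                  // under budget: do nothing
        long target = (long) (limit * lowWatermark);  // drain down to the watermark
        return used - target;
    }

    public static void main(String[] args) {
        System.out.println(bytesToFree(90, 100, 0.75));  // -> 0 (still under the limit)
        System.out.println(bytesToFree(120, 100, 0.75)); // -> 45 (drain 120 down to 75)
    }
}
```

As the comment says, this is probably unnecessary in practice, since the scheduled refresh usually frees the RAM first.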

@nik9000
Member

nik9000 commented Oct 14, 2015

Typically the scheduled refresh (default: every 1s) is going to free the RAM well before IndexingMemoryController logic here kicks in.

Yeah - I suspect so. I just thought about it and figured it was worth mentioning.

Contributor

good! :)

@mikemccand
Contributor Author

I chatted with @s1monw about this ... I think we can add an API to IndexWriter to give us more granular control (write just the biggest segment to disk), and more specific control (just write the segment to disk, don't refresh it) to move our dirty bytes to the OS so it can move them to disk. I opened https://issues.apache.org/jira/browse/LUCENE-6849 to work on this.

I agree the stalling issue is important, so pretending "0 bytes heap used" as soon as we start moving bytes to disk is dangerous. I'll change the PR to track "pending dirty bytes moving to disk"; if those pending bytes grow too large versus our budget, we need to throttle incoming indexing, hopefully just by tapping into the index throttling we already have for when merges fall behind. The OS will apply its own scary back-pressure here (blocking any thread, or maybe/probably the whole process, that attempts to write to disk) when its efforts to move dirty pages to disk fall behind.
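The pending-bytes accounting described above might look roughly like this. Everything here is hypothetical (the class name, the half-budget threshold, the method names), not the actual IndexingMemoryController implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of "pending dirty bytes" accounting: bytes being moved to disk
// are not treated as freed immediately, but tracked as pending; if the
// pending total grows too large relative to the node's indexing budget,
// incoming indexing should be throttled.
public class PendingBytesSketch {
    private final AtomicLong pendingDirtyBytes = new AtomicLong();
    private final long budgetBytes;

    PendingBytesSketch(long budgetBytes) { this.budgetBytes = budgetBytes; }

    void startWritingToDisk(long bytes) { pendingDirtyBytes.addAndGet(bytes); }
    void doneWritingToDisk(long bytes)  { pendingDirtyBytes.addAndGet(-bytes); }

    // Hypothetical threshold: throttle when pending writes exceed half the budget.
    boolean shouldThrottleIndexing() {
        return pendingDirtyBytes.get() > budgetBytes / 2;
    }

    public static void main(String[] args) {
        PendingBytesSketch imc = new PendingBytesSketch(100);
        imc.startWritingToDisk(60);
        System.out.println(imc.shouldThrottleIndexing()); // -> true
        imc.doneWritingToDisk(60);
        System.out.println(imc.shouldThrottleIndexing()); // -> false
    }
}
```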

@mikemccand
Contributor Author

I merged master and fixed IMC to "just" be another IndexOperationListener ... I think this is ready.

However, a number of the CI jobs have been failing: https://build-us-01.elastic.co/view/Elasticsearch/job/elastic+elasticsearch+fair_indexing_buffers+periodic/

The failures are weird, like something is interrupting the build (Thread.interrupt) rather than true test failures caused by this change, I think ... but I'll dig.

Contributor

Can this be an `IndexingOperationListener... listeners`? That way we don't introduce a hard dependency on IMC.

@s1monw
Contributor

s1monw commented Jan 11, 2016

@mikemccand awesome! I left a bunch of comments but this looks fantastic

@s1monw
Contributor

s1monw commented Jan 12, 2016

LGTM

mikemccand pushed a commit that referenced this pull request Jan 12, 2016
The indexing buffer on a node (default: 10% of the JVM heap) is now a "shared pool" across all shards on that node.  This way, shards doing intense indexing can use much more than other shards doing only light indexing, and only once the sum of all indexing buffers across all shards exceeds the node's indexing buffer will we ask shards to move recently indexed documents to segments on disk.
@mikemccand mikemccand merged commit b4a095d into elastic:master Jan 12, 2016
@mikemccand mikemccand removed the v2.3.0 label Jan 12, 2016
@mikemccand
Contributor Author

I removed 2.3.0 from this ... it's a big change, and its precursors haven't been ported to 2.3.0, so I think it should be in our next major release only.


Labels

:Core/Infra/Core Core issues without another label >enhancement v5.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants