Shards with heavy indexing should get more of the indexing buffer #14121
mikemccand merged 24 commits into elastic:master from
Conversation
Is idle really needed any more?
Alas it's still needed. I removed it at first! And was so happy about the simplification :) But then it broke sync'd flush, which is triggered by that `onShardInactive` call.
can we keep it internally? I mean we could store it alongside the active boolean?
I wonder if we should ever ask for more refreshes than we need to get below the buffer. I think the old approach would keep the buffer's heap usage at or under its configured size, but this scheme will keep the buffer closer to full all the time. In the most degenerate case of a bunch of shards (n) being written to at the same rate, this implementation will end up with more heap "floating".
Actually Lucene's … If we wanted, we could add some hysteresis, e.g. when we cross full, drive the buffer back down to maybe 75% of full, to give it a sharper sawtooth pattern ... but I don't think this is really necessary. Typically the scheduled refresh (default: every 1s) is going to free the RAM well before …
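The hysteresis idea above (cross full, then free down to a lower-water mark for a sharper sawtooth) could be sketched roughly like this. This is an illustrative helper only, not code from the PR; the class name, the 75% fraction, and the byte-counting API are all assumptions.

```java
// Hypothetical sketch of the hysteresis discussed above: once total buffered
// bytes exceed the full budget (high-water mark), we free enough to get back
// down to a lower-water mark (e.g. 75% of full), instead of stopping right at
// the budget. Not the actual Elasticsearch implementation.
class BufferHysteresis {
    private final long highWaterBytes;
    private final long lowWaterBytes;

    BufferHysteresis(long budgetBytes, double lowWaterFraction) {
        this.highWaterBytes = budgetBytes;
        this.lowWaterBytes = (long) (budgetBytes * lowWaterFraction);
    }

    /** How many bytes to free given the current total, or 0 if we are under budget. */
    long bytesToFree(long totalBytes) {
        if (totalBytes <= highWaterBytes) {
            return 0; // below the high-water mark: nothing to do
        }
        return totalBytes - lowWaterBytes; // drive usage back to the low-water mark
    }
}
```

With a 100-byte budget and a 0.75 low-water fraction, a total of 90 bytes triggers nothing, while a total of 120 bytes asks for 45 bytes to be freed (down to 75).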
Yeah - I suspect so. I just thought about it and figured it was worth mentioning.
I chatted with @s1monw about this ... I think we can add an API to … I agree the stalling issue is important, so just pretending "0 bytes heap used" as soon as we start moving bytes to disk is dangerous. I'll change the PR to track "pending dirty bytes moving to disk", and if that pending count is too large vs. our budget, we need to throttle incoming indexing, hopefully just tapping into the index throttling we already have for when merges are falling behind. The OS will do its own scary back-pressure here (blocking any thread, or maybe/probably the whole process, that's attempting to write to disk) when its efforts to move dirty bytes to disk are falling behind.
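The "pending dirty bytes moving to disk" tracking described above could look something like the following sketch. All names here are hypothetical; the real PR hooks into Elasticsearch's existing index throttling rather than a standalone class like this.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical back-pressure sketch: count bytes that a refresh has moved out
// of the indexing buffer but that are still being written to disk. If that
// pending total grows too large relative to the node's indexing-buffer budget,
// throttle incoming indexing instead of pretending the heap is already free.
class PendingDirtyBytesThrottle {
    private final long maxPendingBytes;
    private final AtomicLong pendingBytes = new AtomicLong();

    PendingDirtyBytesThrottle(long nodeBudgetBytes) {
        // Assumption for illustration: allow pending writes up to half the budget.
        this.maxPendingBytes = nodeBudgetBytes / 2;
    }

    void onRefreshStarted(long bytesMovingToDisk) {
        pendingBytes.addAndGet(bytesMovingToDisk);
    }

    void onRefreshFinished(long bytesMovingToDisk) {
        pendingBytes.addAndGet(-bytesMovingToDisk);
    }

    /** True when incoming indexing should be throttled. */
    boolean shouldThrottleIndexing() {
        return pendingBytes.get() > maxPendingBytes;
    }
}
```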
I merged master and fixed IMC to "just" be another … However, a number of the CI jobs have been failing: https://build-us-01.elastic.co/view/Elasticsearch/job/elastic+elasticsearch+fair_indexing_buffers+periodic/ The failures are weird, like something is interrupting the build (…)
Can this be an `IndexingOperationListener`? That way we don't introduce a hard dependency on IMC.
@mikemccand awesome! I left a bunch of comments but this looks fantastic
LGTM |
The indexing buffer on a node (default: 10% of the JVM heap) is now a "shared pool" across all shards on that node. This way, shards doing intense indexing can use much more than other shards doing only light indexing, and only once the sum of all indexing buffers across all shards exceeds the node's indexing buffer will we ask shards to move recently indexed documents to segments on disk.
I removed 2.3.0 from this ... it's a big change, and its precursors haven't been ported to 2.3.0, so I think it should be in our next major release only.
Today we take the total indexing buffer (default: 10% of heap) and divide it equally across all active shards.
But this is a sub-optimal usage of RAM for indexing: maybe the node has a bunch of small shards (e.g. Marvel) which require hardly any indexing heap, but were assigned a big chunk of heap (which typically goes mostly unused), while other heavy indexing shards were assigned the same indexing buffer but could effectively make use of much more.
This problem is very nearly the same issue `IndexWriter` faces: being told it has an X MB overall indexing buffer to use and then having to manage the N separate in-memory segments (one per thread).
I think we (ES) should take the same approach as `IndexWriter`, except across shards on the node: tell Lucene each shard has an effectively unlimited indexing buffer, but then periodically sum the actual bytes used across all shards, and when the total is over the node's budget, ask the most-heap-consuming shard(s) to refresh to clear heap.
This should also reduce merge pressure across the node, since we'd typically be flushing fewer, larger segments, and it helps smooth out IO pressure somewhat (instead of N shards trying to write at once, we stage it over time).
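The "sum across shards, refresh the biggest consumers" policy described above can be sketched as follows. This is a simplified illustration, not the PR's actual `IndexingMemoryController` code; the class and method names are made up, and real shard buffer accounting is more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the shared-pool policy: each shard reports its actual
// indexing-buffer bytes; when the node-wide sum exceeds the budget, we pick the
// largest consumers first and ask them to refresh until we are back under budget.
class SharedIndexingBufferPolicy {
    private final long nodeBudgetBytes;

    SharedIndexingBufferPolicy(long nodeBudgetBytes) {
        this.nodeBudgetBytes = nodeBudgetBytes;
    }

    /** Returns shard ids that should refresh, biggest first, until we are under budget. */
    List<String> shardsToRefresh(Map<String, Long> bytesUsedByShard) {
        long total = bytesUsedByShard.values().stream().mapToLong(Long::longValue).sum();
        List<String> toRefresh = new ArrayList<>();
        if (total <= nodeBudgetBytes) {
            return toRefresh; // under budget: nobody has to move bytes to disk
        }
        // Sort shards by heap used, descending.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(bytesUsedByShard.entrySet());
        sorted.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        for (Map.Entry<String, Long> e : sorted) {
            if (total <= nodeBudgetBytes) {
                break;
            }
            toRefresh.add(e.getKey());
            total -= e.getValue(); // refreshing frees (approximately) this shard's buffer
        }
        return toRefresh;
    }
}
```

For example, with a 100 MB budget and shards using 80/30/10 MB, only the 80 MB shard would be asked to refresh, bringing the total from 120 MB down to 40 MB; lightly indexing shards are left alone.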
I also removed all configuration associated with the translog buffer (`index.translog.fs.buffer_size`): it's now hardwired to 32 KB. I don't understand why this buffer needs to be tunable: let the OS manage the RAM assigned for IO write buffering / dirty pages.