Using BulkProcessor2 in RollupShardIndexer #94197

Merged
masseyke merged 11 commits into elastic:main from masseyke:using-bulkprocessor2-in-rollup
Mar 22, 2023

Conversation

@masseyke
Member

In #91238 we rewrote BulkProcessor to avoid deadlock that had been seen in the IlmHistoryStore. This PR ports rollup over to the new BulkProcessor2 implementation. BulkProcessor2 always runs asynchronously, meaning that RollupShardIndexer has to explicitly check for failures and throw an exception, rather than relying on the exception being thrown in-thread during the bulk indexing.

@masseyke masseyke added the :StorageEngine/Rollup, >non-issue, and Team:Analytics labels Feb 28, 2023
@masseyke masseyke marked this pull request as ready for review March 6, 2023 16:23
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

Member

@martijnvg martijnvg left a comment

I think this looks good. I asked a few questions just for my own understanding.

.setBulkActions(ROLLUP_BULK_ACTIONS)
.setBulkSize(ROLLUP_BULK_SIZE)
// execute the bulk request on the same thread
.setConcurrentRequests(0)
Member

I think this is a change in runtime behaviour? BulkProcessor2 would execute the bulk request on different threads and if multiple bulk requests exceed max bytes in flight, then a rejected exception is thrown. Just double checking, I think BulkProcessor2 would work well in this context.

Member Author

Yes you are correct that if searches are happening faster than bulk indexing then eventually we'd start getting EsRejectedExecutionExceptions and losing data. I've added bulkProcessorTooFullMonitor and logic around it to exert backpressure and avoid this. I also added DownsampleActionSingleNodeTests.testTooManyBytesInFlight() to show this problem (and fix).
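The backpressure approach described here can be sketched roughly like this (all names, including the monitor and byte counter, are illustrative stand-ins rather than the PR's actual code): the producing thread blocks on a monitor while too many bytes are in flight, and bulk-completion callbacks free capacity and wake it up, instead of letting the processor throw EsRejectedExecutionException.

```java
// Illustrative sketch only: block producers when in-flight bytes would exceed
// a limit, rather than rejecting the request and losing data.
public class BackpressureSketch {
    private final Object tooFullMonitor = new Object();
    private final long maxBytesInFlight;
    private long bytesInFlight = 0;

    public BackpressureSketch(long maxBytesInFlight) {
        this.maxBytesInFlight = maxBytesInFlight;
    }

    // Called by the search/indexing thread before handing a doc to the processor.
    public void addWithBackpressure(long docBytes) {
        synchronized (tooFullMonitor) {
            while (bytesInFlight + docBytes > maxBytesInFlight) {
                try {
                    tooFullMonitor.wait(500); // re-check capacity periodically
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("interrupted while waiting for capacity", e);
                }
            }
            bytesInFlight += docBytes;
        }
    }

    // Called from the bulk-response callback once a request completes.
    public void onBulkComplete(long docBytes) {
        synchronized (tooFullMonitor) {
            bytesInFlight -= docBytes;
            tooFullMonitor.notifyAll();
        }
    }

    public long bytesInFlight() {
        synchronized (tooFullMonitor) {
            return bytesInFlight;
        }
    }
}
```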

XContentBuilder doc = rollupBucketBuilder.buildRollupDocument();
indexBucket(doc);
}
bulkProcessor.flush();
Member

Not needed because this would happen during closing of BulkProcessor2?

Member Author

That's right. There's no need to explicitly flush BulkProcessor2 (it doesn't even have a public flush method).
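As a rough illustration of why callers need no explicit flush (this is a simplified stand-in, not BulkProcessor2's real implementation), a processor can drain whatever is still queued as part of close():

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Illustrative only: a processor whose close() drains queued work, so callers
// never need a public flush() method.
public class DrainOnCloseSketch implements AutoCloseable {
    private final Deque<String> queued = new ArrayDeque<>();
    private final Consumer<String> sink; // stand-in for the bulk-indexing path

    public DrainOnCloseSketch(Consumer<String> sink) {
        this.sink = sink;
    }

    public void add(String doc) {
        queued.add(doc);
    }

    public int pending() {
        return queued.size();
    }

    @Override
    public void close() {
        // Drain everything still queued as part of shutdown.
        while (!queued.isEmpty()) {
            sink.accept(queued.poll());
        }
    }
}
```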

TimeValue.timeValueMillis(System.currentTimeMillis() - startTime)
);

if (task.getNumFailed() > 0) {
Member

When this if statement is reached then all bulk requests have been executed?

Member Author

@masseyke masseyke Mar 8, 2023

Hmm, close only waits up to 30 seconds. So I guess it's possible that we don't have any failures here, and then by the time the next line runs we do, but task.getNumIndexed() == task.getNumSent() so we report success. I think if I swap these two blocks it will solve that, right? Also, do we want to wait more than 30 seconds here? That's effectively 30 seconds for the last 50 MB (the max amount of in-flight bytes we allow) to flush, which seems like plenty.
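The ordering concern here can be sketched as follows (the counters and method are hypothetical stand-ins for the task's accessors, not the PR's actual code): comparing sent vs. indexed counts before inspecting the failure count means that a failure arriving after close, or documents still pending when close times out, surface as an incomplete-count error rather than a false success.

```java
// Illustrative post-close validation. A failed or still-pending document never
// reaches the indexed count, so the count comparison catches late failures too.
public class CompletionCheckSketch {
    public static void checkCompletion(long sent, long indexed, long failed) {
        if (indexed != sent) {
            throw new IllegalStateException(
                "Indexed only " + indexed + " of " + sent + " documents (" + failed + " failures)"
            );
        }
        if (failed > 0) {
            throw new IllegalStateException(failed + " bulk indexing failures");
        }
    }
}
```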

@masseyke masseyke marked this pull request as draft March 7, 2023 13:49
@masseyke masseyke requested a review from martijnvg March 8, 2023 23:06
@masseyke masseyke marked this pull request as ready for review March 8, 2023 23:07
@masseyke
Member Author

masseyke commented Mar 8, 2023

@elasticmachine update branch

private final Rounding.Prepared rounding;
private final List<FieldValueFetcher> fieldValueFetchers;
private final RollupShardTask task;
/*
Member

Maybe there should be a different version of BulkProcessor2 that can execute on the same thread? (Or BulkProcessor2 should be modified to handle this?)

This change now adds quite some additional concurrency logic.

Member Author

We could move the old BulkProcessor into this package? I think BulkProcessor2 is faster (since it allows for multiple index requests to be in flight at once), but I don't actually have any performance tests (I don't know if there are any for this).

Member Author

@masseyke masseyke Mar 10, 2023

What if I add a new addWithBackpressure method in BulkProcessor2 (a separate method will allow us to pass in a Supplier to check whether RollupShardIndexer.abort is true), and move this code in there? If I understand correctly, your concern is not that we run everything on the current thread (which would be fairly difficult with BulkProcessor2), but that you don't want to get exceptions if the search code is running much faster than the index code, and you don't want the complexity (or the need to maintain it) in TSDB code. Right?

Member

> What if I add a new addWithBackpressure method in BulkProcessor2 (a separate method will allow us to pass in a Supplier to check whether RollupShardIndexer.abort is true), and move this code in there?

👍 this sounds good to me. The code added in this change to RollupShardSearcher isn't about rolling up data, but being able to bulk index on the current thread.
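The proposed addWithBackpressure could look roughly like this (the signature and the Supplier-based abort check follow the idea discussed above, but this is an illustrative standalone class, not the actual BulkProcessor2 method):

```java
import java.util.function.Supplier;

// Sketch of an addWithBackpressure-style method: block until there is
// capacity, bailing out early if the caller's abort condition becomes true
// (e.g. RollupShardIndexer.abort in the discussion above).
public class AddWithBackpressureSketch {
    private final Object monitor = new Object();
    private final long maxBytesInFlight;
    private long bytesInFlight = 0;

    public AddWithBackpressureSketch(long maxBytesInFlight) {
        this.maxBytesInFlight = maxBytesInFlight;
    }

    // Returns true if the document was accepted, false if the caller aborted
    // (or the thread was interrupted) while waiting for capacity.
    public boolean addWithBackpressure(long docBytes, Supplier<Boolean> shouldAbort) {
        synchronized (monitor) {
            while (bytesInFlight + docBytes > maxBytesInFlight) {
                if (shouldAbort.get()) {
                    return false;
                }
                try {
                    monitor.wait(500); // re-check capacity and the abort flag periodically
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            bytesInFlight += docBytes;
            return true;
        }
    }

    // Called from the bulk-response callback once a request completes.
    public void onBulkComplete(long docBytes) {
        synchronized (monitor) {
            bytesInFlight -= docBytes;
            monitor.notifyAll();
        }
    }
}
```

Keeping this loop inside the processor (rather than in RollupShardIndexer) means the TSDB code carries none of the concurrency logic.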

Member Author

@masseyke masseyke Mar 21, 2023

OK, I've added this change in #94599 (which drastically reduced the changes needed in RollupShardIndexer).

@masseyke masseyke requested a review from martijnvg March 21, 2023 18:53
Member

@martijnvg martijnvg left a comment

LGTM, thanks for iterating here.

@masseyke masseyke merged commit c97cccb into elastic:main Mar 22, 2023
@masseyke masseyke deleted the using-bulkprocessor2-in-rollup branch March 22, 2023 15:31
@martijnvg
Member

martijnvg commented Jun 12, 2023

It looks like this change also had a positive impact on the downsampling-to-1-minute-fixed-interval tsdb benchmark:

[benchmark chart: downsampling duration, March 16–27]

(The visualization covers March 16th through March 27th.)
On the day this was merged, downsampling the tsdb index to 1 minute interval buckets went from ~1,400,000 ms to ~800,000 ms.


Labels

>non-issue · :StorageEngine/Rollup · Team:Analytics · v8.8.0
