Skip to content

[ML] Ensure bulk requests are not over memory limit#60219

Merged
dimitris-athanasiou merged 5 commits intoelastic:masterfrom
dimitris-athanasiou:ensure-ml-bulk-requests-are-not-over-limit
Jul 28, 2020
Merged

[ML] Ensure bulk requests are not over memory limit#60219
dimitris-athanasiou merged 5 commits intoelastic:masterfrom
dimitris-athanasiou:ensure-ml-bulk-requests-are-not-over-limit

Conversation

@dimitris-athanasiou
Copy link
Copy Markdown
Contributor

Data frame analytics jobs that work with very large datasets
may produce bulk requests that are over the memory limit
for indexing. This commit adds a helper class that bundles
index requests in bulk requests that steer away from the
memory limit. We then use this class both from the results
joiner and the inference runner ensuring data frame analytics
jobs do not generate bulk requests that are too large.

Note the limit was implemented in #58885.

Data frame analytics jobs that work with very large datasets
may produce bulk requests that are over the memory limit
for indexing. This commit adds a helper class that bundles
index requests in bulk requests that steer away from the
memory limit. We then use this class both from the results
joiner and the inference runner ensuring data frame analytics
jobs do not generate bulk requests that are too large.

Note the limit was implemented in elastic#58885.
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent benwtrent self-requested a review July 27, 2020 14:38
Copy link
Copy Markdown
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to use this type of logic in the anomaly job result bulk indexer as well.

modelLoadingService = mock(ModelLoadingService.class);
processManager = new AnalyticsProcessManager(client, executorServiceForJob, executorServiceForProcess, processFactory, auditor,
trainedModelProvider, modelLoadingService, resultsPersisterService, 1);
processManager = new AnalyticsProcessManager(Settings.builder().build(), client, executorServiceForJob, executorServiceForProcess,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would Settings.EMPTY work here as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to change those and forgot. Thanks for spotting!


verifyNoMoreInteractions(resultsPersisterService);
List<BulkRequest> capturedBulkRequests = bulkRequestCaptor.getAllValues();
assertThat(capturedBulkRequests.size(), equalTo(1));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasSize matcher could be used here instead.


public class MlBulkIndexerTests extends ESTestCase {

private List<BulkRequest> executedBulkRequests = new ArrayList<>();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting reinitialised for each test so it cannot be final.

* that do exceed a 1000 operations or half the available memory
* limit for indexing.
*/
public class MlBulkIndexer implements AutoCloseable {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no ML-specific code in this class. Would it make sense to rename it to BulkIndexer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid calling it plainly BulkIndexer as it sounds too much like a basic ES component. I decided to prefix with Ml to indicate this is a utility designed to serve purposes within the ML plugin. If we end up using it in other places then we shall move it and rename it accordingly.

@dimitris-athanasiou
Copy link
Copy Markdown
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou dimitris-athanasiou merged commit 9528f9c into elastic:master Jul 28, 2020
@dimitris-athanasiou dimitris-athanasiou deleted the ensure-ml-bulk-requests-are-not-over-limit branch July 28, 2020 11:46
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jul 28, 2020
Data frame analytics jobs that work with very large datasets
may produce bulk requests that are over the memory limit
for indexing. This commit adds a helper class that bundles
index requests in bulk requests that steer away from the
memory limit. We then use this class both from the results
joiner and the inference runner ensuring data frame analytics
jobs do not generate bulk requests that are too large.

Note the limit was implemented in elastic#58885.

Backport of elastic#60219
dimitris-athanasiou added a commit that referenced this pull request Jul 28, 2020
…0283)

Data frame analytics jobs that work with very large datasets
may produce bulk requests that are over the memory limit
for indexing. This commit adds a helper class that bundles
index requests in bulk requests that steer away from the
memory limit. We then use this class both from the results
joiner and the inference runner ensuring data frame analytics
jobs do not generate bulk requests that are too large.

Note the limit was implemented in #58885.

Backport of #60219
dimitris-athanasiou added a commit that referenced this pull request Jul 28, 2020
…0284)

Data frame analytics jobs that work with very large datasets
may produce bulk requests that are over the memory limit
for indexing. This commit adds a helper class that bundles
index requests in bulk requests that steer away from the
memory limit. We then use this class both from the results
joiner and the inference runner ensuring data frame analytics
jobs do not generate bulk requests that are too large.

Note the limit was implemented in #58885.

Backport of #60219
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants