[ML] Limit in flight requests when indexing model download parts #112992
Conversation
Pinging @elastic/ml-core (Team:ML)
pxsalehi left a comment
LGTM for the newer changes, although ideally there should be a test showing that the number of parallel downloads is limited. In any case, I'm going to leave the final LGTM to ML, if that's ok.
```java
client.execute(PutTrainedModelDefinitionPartAction.INSTANCE, modelPartRequest).actionGet();
```
This is the part that is blocking and/or slowing down the indexing, correct? We'll wait for the client response rather than continuing asynchronously?
The gain comes from using 5 threads to stream the download and index the parts. The non-blocking write meant that we had more than 5 in-flight requests (the download is faster than the indexing), and that was causing the OOM. To limit the number of requests to at most 5 there has to be some element of blocking. Model download uses a dedicated thread pool, so the block does not starve other parts of the code of resources.
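The failure mode described above can be reproduced outside Elasticsearch. The following is a minimal, self-contained sketch (the class and names are hypothetical, not the PR's code): a fixed pool of 5 downloader threads hands each write off asynchronously, so the pool size no longer bounds the number of in-flight writes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncInFlightDemo {

    // Returns the peak number of simultaneously in-flight index writes.
    static int run() throws InterruptedException {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService downloaders = Executors.newFixedThreadPool(5); // the fixed download pool
        ExecutorService transport = Executors.newCachedThreadPool();   // stands in for async response threads
        CountDownLatch done = new CountDownLatch(50);
        for (int part = 0; part < 50; part++) {
            downloaders.execute(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                // The write is handed off asynchronously, so this downloader
                // thread is immediately free to fetch the next part.
                transport.execute(() -> {
                    try {
                        Thread.sleep(50); // indexing is slower than downloading
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    inFlight.decrementAndGet();
                    done.countDown();
                });
            });
        }
        done.await();
        downloaders.shutdown();
        transport.shutdown();
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak in-flight writes = " + run());
    }
}
```

With fast downloads and slow writes, the peak climbs well past the pool size of 5, which is the shape of the memory blow-up the revert addressed.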
```java
for (int i = 0; i < ranges.size() - 1; i++) {
    assertThat(ranges.get(i).rangeStart(), is(startBytes));
    long end = startBytes + ((long) ranges.get(i).numParts() * chunkSize) - 1;
    assertThat(ranges.get(i).rangeEnd(), is(end));
    long expectedNumBytesInRange = (long) chunkSize * ranges.get(i).numParts() - 1;
    assertThat(ranges.get(i).rangeEnd() - ranges.get(i).rangeStart(), is(expectedNumBytesInRange));
    assertThat(ranges.get(i).startPart(), is(startPartIndex));
```
@elasticmachine update branch
…stic#112992) Restores the changes from elastic#111684, which uses multiple streams to improve the time to download and install the built-in ML models. The first iteration had a problem where the number of in-flight requests was not properly limited, which is fixed here. Additionally, there are now circuit breaker checks on allocating the buffer used to store the model definition.
💔 Backport failed
You can use sqren/backport to manually backport by running
…2992) (#113514) Restores the changes from #111684, which uses multiple streams to improve the time to download and install the built-in ML models. The first iteration had a problem where the number of in-flight requests was not properly limited, which is fixed here. Additionally, there are now circuit breaker checks on allocating the buffer used to store the model definition.
…stic#112992) Restores the changes from elastic#111684, which uses multiple streams to improve the time to download and install the built-in ML models. The first iteration had a problem where the number of in-flight requests was not properly limited, which is fixed here. Additionally, there are now circuit breaker checks on allocating the buffer used to store the model definition.
# Conflicts:
# x-pack/plugin/ml-package-loader/src/main/java/org/elasticsearch/xpack/ml/packageloader/action/TransportLoadTrainedModelPackage.java
# x-pack/plugin/ml-package-loader/src/test/java/org/elasticsearch/xpack/ml/packageloader/action/TransportLoadTrainedModelPackageTests.java
…2992) (#113710) Restores the changes from #111684, which uses multiple streams to improve the time to download and install the built-in ML models. The first iteration had a problem where the number of in-flight requests was not properly limited, which is fixed here. Additionally, there are now circuit breaker checks on allocating the buffer used to store the model definition.
# Conflicts:
# x-pack/plugin/ml-package-loader/src/main/java/org/elasticsearch/xpack/ml/packageloader/action/TransportLoadTrainedModelPackage.java
# x-pack/plugin/ml-package-loader/src/test/java/org/elasticsearch/xpack/ml/packageloader/action/TransportLoadTrainedModelPackageTests.java
#111684 improved the model install time by using multiple streams and threads to download and write the model parts. The change was reverted in #112961 after it was discovered to be the cause of out-of-memory exceptions.
The design relied on a fixed size thread pool to limit the concurrent downloads and hence also manage memory usage. However, indexing of the downloaded parts was performed async, which meant a new download request would be forked and executed while the write request was still in flight, leading to large numbers of in-flight requests. The fix here is to block on the index write.
The first commit is the revert of the revert; the later commits introduce the blocking write and reuse a byte buffer that was previously recreated for every downloaded part. Allocating that byte buffer is now protected by a circuit breaker.
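The effect of the blocking write can be sketched with a minimal example (hypothetical names, not the PR's code): when each of the 5 pool threads blocks until its write completes, as `actionGet()` does, the pool size itself caps the in-flight requests.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BlockingWriteDemo {

    // Returns the peak number of simultaneously in-flight index writes.
    static int run() throws InterruptedException {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService downloaders = Executors.newFixedThreadPool(5); // the fixed download pool
        for (int part = 0; part < 50; part++) {
            downloaders.execute(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    // Stand-in for the blocking index write (actionGet()):
                    // this thread cannot start the next download until it returns.
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                inFlight.decrementAndGet();
            });
        }
        downloaders.shutdown();
        downloaders.awaitTermination(1, TimeUnit.MINUTES);
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak in-flight writes = " + run()); // never exceeds the pool size
    }
}
```

Since a thread cannot fork a new download while its own write is outstanding, the peak can never exceed the pool size of 5, at the cost of the download stalling briefly while each write completes.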
Labelled as a non-issue because the code that caused the OOM was reverted before it made it to a production environment.