
[ML] Fixing inference stats race condition #55163

Merged
benwtrent merged 21 commits into elastic:master from benwtrent:feature/ml-inference-stats-race
Apr 20, 2020

Conversation

@benwtrent
Member

`updateAndGet` could actually call the supplied function more than once on contention. The JavaDoc says:

`* @param updateFunction a side-effect-free function`

So it could apply multiple updates on contention, and since our function has side effects, we get a race condition where stats are double counted.
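A minimal, hypothetical demonstration of the retry behavior (illustrative names, not the Elasticsearch code): one logical update whose first CAS is invalidated runs the update function twice, so any side effect inside it is counted twice.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo: AtomicLong#updateAndGet retries its function when the
// underlying CAS fails, so a side-effecting function can run twice for a
// single logical update and double count.
public class UpdateAndGetRetryDemo {

    static long functionCallsForOneUpdate() {
        AtomicLong counter = new AtomicLong();
        AtomicLong functionCalls = new AtomicLong();
        AtomicBoolean firstCall = new AtomicBoolean(true);

        counter.updateAndGet(v -> {
            functionCalls.incrementAndGet();          // the "side effect"
            if (firstCall.compareAndSet(true, false)) {
                // Simulate a concurrent writer slipping in between the
                // read of `v` and the CAS, forcing the first CAS to fail.
                counter.incrementAndGet();
            }
            return v + 1;
        });
        return functionCalls.get();
    }

    public static void main(String[] args) {
        // One logical update, but the function runs twice.
        System.out.println("function calls: " + functionCallsForOneUpdate());
    }
}
```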

To fix, I am going to use a `ReadWriteLock`. The `LongAdder` objects allow fast, thread-safe writes in high-contention environments. These can be protected by the `ReadWriteLock::readLock`.

When stats are persisted, I need to call `reset` on all these adders. This is NOT thread safe if additions are taking place concurrently, so I am going to protect it with the `ReadWriteLock::writeLock`.

This should prevent race conditions while still allowing high(ish) throughput on the highly contended paths in inference.
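The locking scheme described above can be sketched as follows (class and field names are illustrative, not the actual Elasticsearch types): increments take the shared read lock so many threads can bump the adders at once, while the persist path takes the exclusive write lock so `sumThenReset` sees a quiesced set of adders and no increment is lost or double counted.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of the LongAdder + ReadWriteLock pattern.
public class StatsAccumulator {
    private final LongAdder inferenceCount = new LongAdder();
    private final LongAdder missingFieldCount = new LongAdder();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public void incInference() {
        lock.readLock().lock();   // shared: increments may run concurrently
        try {
            inferenceCount.increment();
        } finally {
            lock.readLock().unlock();
        }
    }

    public void incMissingField() {
        lock.readLock().lock();
        try {
            missingFieldCount.increment();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Called when stats are persisted. sumThenReset is not atomic with
    // respect to concurrent increments, hence the exclusive lock.
    public long[] sumThenResetAll() {
        lock.writeLock().lock();
        try {
            return new long[] {
                inferenceCount.sumThenReset(),
                missingFieldCount.sumThenReset()
            };
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        StatsAccumulator stats = new StatsAccumulator();
        stats.incInference();
        stats.incInference();
        stats.incMissingField();
        long[] snapshot = stats.sumThenResetAll();
        System.out.println(snapshot[0] + " " + snapshot[1]); // 2 1
    }
}
```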

I did some simple throughput tests and this change is not significantly slower and is simpler to grok (IMO).

closes #54786

@benwtrent benwtrent added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.0.0 v7.8.0 labels Apr 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Member Author

run elasticsearch-ci/2

3 similar comments
@benwtrent
Member Author

run elasticsearch-ci/2

@benwtrent
Member Author

run elasticsearch-ci/2

@benwtrent
Member Author

run elasticsearch-ci/2


@przemekwitek przemekwitek left a comment


LGTM

Member

@davidkyle davidkyle left a comment


LGTM

@benwtrent
Member Author

run elasticsearch-ci/2

```java
statsResponse = client().performRequest(
    new Request("GET", "_ml/inference/" + regressionModelId + "/_stats"));
assertThat(EntityUtils.toString(statsResponse.getEntity()),
    containsString("\"inference_count\":10"));
} catch (ResponseException ex) {
    // this could just mean shard failures.
```


Shouldn't this `catch` clause call `Assert.fail()`?
If it doesn't, I think the whole block won't be retried when the `ResponseException` happens...
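The reviewer's point can be illustrated with a toy retry helper (hypothetical, not the actual Elasticsearch `assertBusy` machinery): a retry loop only retries when the body throws, so catching the exception without failing makes the block look like it passed on the first attempt.

```java
// Hypothetical sketch: a retry helper retries only while the body throws.
// If the body swallows its exception, the helper sees "success" and stops.
public class RetryDemo {
    public interface CheckedRunnable { void run() throws Exception; }

    public static int attempts(CheckedRunnable body, int maxTries) {
        for (int i = 1; i <= maxTries; i++) {
            try {
                body.run();
                return i;            // body "passed": stop retrying
            } catch (Exception e) {
                // fall through and retry
            }
        }
        return maxTries;
    }

    public static void main(String[] args) {
        // Body that swallows its failure: helper stops after one attempt.
        int swallowed = attempts(() -> {
            try {
                throw new RuntimeException("simulated shard failure");
            } catch (RuntimeException ex) {
                // swallowed: the helper never sees a failure
            }
        }, 5);

        // Body that rethrows (e.g. via Assert.fail): helper keeps retrying.
        int rethrown = attempts(() -> {
            throw new RuntimeException("simulated shard failure");
        }, 5);

        System.out.println(swallowed + " vs " + rethrown); // 1 vs 5
    }
}
```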

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent
Member Author

run elasticsearch-ci/1

@benwtrent benwtrent merged commit 5fd2918 into elastic:master Apr 20, 2020
@benwtrent benwtrent deleted the feature/ml-inference-stats-race branch April 20, 2020 19:26
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Apr 20, 2020
benwtrent added a commit that referenced this pull request Apr 20, 2020

Labels

:ml Machine learning, >refactoring, >test Issues or PRs that are addressing/adding tests, v7.8.0, v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] InferenceIngestIT.testPipelineIngest occasionally fails

5 participants