
[ML] Fixing inference stats race condition #55163

Merged
benwtrent merged 21 commits into elastic:master from benwtrent:feature/ml-inference-stats-race
Apr 20, 2020

Conversation

@benwtrent
Member

`updateAndGet` could actually call the supplied function more than once on contention. The JavaDoc says:

`* @param updateFunction a side-effect-free function`

So it could apply multiple updates on contention, and since our function has side effects, we get a race condition where stats are double counted.
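A minimal, hypothetical demonstration of the retry behavior (illustrative names, not the Elasticsearch code): one logical update whose first CAS is invalidated runs the update function twice, so any side effect inside it is counted twice.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo: AtomicLong#updateAndGet retries its function when the
// underlying CAS fails, so a side-effecting function can run twice for a
// single logical update and double count.
public class UpdateAndGetRetryDemo {

    static long functionCallsForOneUpdate() {
        AtomicLong counter = new AtomicLong();
        AtomicLong functionCalls = new AtomicLong();
        AtomicBoolean firstCall = new AtomicBoolean(true);

        counter.updateAndGet(v -> {
            functionCalls.incrementAndGet();          // the "side effect"
            if (firstCall.compareAndSet(true, false)) {
                // Simulate a concurrent writer slipping in between the
                // read of `v` and the CAS, forcing the first CAS to fail.
                counter.incrementAndGet();
            }
            return v + 1;
        });
        return functionCalls.get();
    }

    public static void main(String[] args) {
        // One logical update, but the function runs twice.
        System.out.println("function calls: " + functionCallsForOneUpdate());
    }
}
```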

To fix, I am going to use a `ReadWriteLock`. The `LongAdder` objects allow fast, thread-safe writes in high-contention environments. These can be protected by the `ReadWriteLock::readLock`.

When stats are persisted, I need to call `reset` on all these adders. This is NOT thread safe if additions are taking place concurrently, so I am going to protect it with the `ReadWriteLock::writeLock`.

This should prevent race conditions while still allowing high(ish) throughput on the highly contended paths in inference.
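The locking scheme described above can be sketched as follows (class and field names are illustrative, not the actual Elasticsearch types): increments take the shared read lock so many threads can bump the adders at once, while the persist path takes the exclusive write lock so `sumThenReset` sees a quiesced set of adders and no increment is lost or double counted.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of the LongAdder + ReadWriteLock pattern.
public class StatsAccumulator {
    private final LongAdder inferenceCount = new LongAdder();
    private final LongAdder missingFieldCount = new LongAdder();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public void incInference() {
        lock.readLock().lock();   // shared: increments may run concurrently
        try {
            inferenceCount.increment();
        } finally {
            lock.readLock().unlock();
        }
    }

    public void incMissingField() {
        lock.readLock().lock();
        try {
            missingFieldCount.increment();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Called when stats are persisted. sumThenReset is not atomic with
    // respect to concurrent increments, hence the exclusive lock.
    public long[] sumThenResetAll() {
        lock.writeLock().lock();
        try {
            return new long[] {
                inferenceCount.sumThenReset(),
                missingFieldCount.sumThenReset()
            };
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        StatsAccumulator stats = new StatsAccumulator();
        stats.incInference();
        stats.incInference();
        stats.incMissingField();
        long[] snapshot = stats.sumThenResetAll();
        System.out.println(snapshot[0] + " " + snapshot[1]); // 2 1
    }
}
```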

I did some simple throughput tests and this change is not significantly slower and is simpler to grok (IMO).

closes #54786

@benwtrent benwtrent added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.0.0 v7.8.0 labels Apr 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Member Author

run elasticsearch-ci/2

3 similar comments
@benwtrent
Member Author

run elasticsearch-ci/2

@benwtrent
Member Author

run elasticsearch-ci/2

@benwtrent
Member Author

run elasticsearch-ci/2


@przemekwitek przemekwitek left a comment


LGTM

Member

@davidkyle davidkyle left a comment


LGTM

@benwtrent
Member Author

run elasticsearch-ci/2

```java
statsResponse = client().performRequest(
    new Request("GET", "_ml/inference/" + regressionModelId + "/_stats"));
assertThat(EntityUtils.toString(statsResponse.getEntity()),
    containsString("\"inference_count\":10"));
} catch (ResponseException ex) {
    // this could just mean shard failures.
```


Shouldn't this `catch` clause call `Assert.fail()`?
If it doesn't, I think the whole block won't be retried when the `ResponseException` happens...
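The reviewer's point can be illustrated with a toy retry helper (hypothetical, not the actual Elasticsearch `assertBusy` machinery): a retry loop only retries when the body throws, so catching the exception without failing makes the block look like it passed on the first attempt.

```java
// Hypothetical sketch: a retry helper retries only while the body throws.
// If the body swallows its exception, the helper sees "success" and stops.
public class RetryDemo {
    public interface CheckedRunnable { void run() throws Exception; }

    public static int attempts(CheckedRunnable body, int maxTries) {
        for (int i = 1; i <= maxTries; i++) {
            try {
                body.run();
                return i;            // body "passed": stop retrying
            } catch (Exception e) {
                // fall through and retry
            }
        }
        return maxTries;
    }

    public static void main(String[] args) {
        // Body that swallows its failure: helper stops after one attempt.
        int swallowed = attempts(() -> {
            try {
                throw new RuntimeException("simulated shard failure");
            } catch (RuntimeException ex) {
                // swallowed: the helper never sees a failure
            }
        }, 5);

        // Body that rethrows (e.g. via Assert.fail): helper keeps retrying.
        int rethrown = attempts(() -> {
            throw new RuntimeException("simulated shard failure");
        }, 5);

        System.out.println(swallowed + " vs " + rethrown); // 1 vs 5
    }
}
```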

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent
Member Author

run elasticsearch-ci/1

@benwtrent benwtrent merged commit 5fd2918 into elastic:master Apr 20, 2020
@benwtrent benwtrent deleted the feature/ml-inference-stats-race branch April 20, 2020 19:26
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Apr 20, 2020
benwtrent added a commit that referenced this pull request Apr 20, 2020

Labels

:ml Machine learning, >refactoring, >test Issues or PRs that are addressing/adding tests, v7.8.0, v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] InferenceIngestIT.testPipelineIngest occasionally fails

5 participants