Skip to content

Improve EQL Sequence circuit breaker precision#88538

Merged
luigidellaquila merged 7 commits intoelastic:mainfrom
luigidellaquila:improve_eql_seq_circuit_breaker_precision
Aug 1, 2022
Merged

Improve EQL Sequence circuit breaker precision#88538
luigidellaquila merged 7 commits intoelastic:mainfrom
luigidellaquila:improve_eql_seq_circuit_breaker_precision

Conversation

@luigidellaquila
Copy link
Copy Markdown
Contributor

@luigidellaquila luigidellaquila commented Jul 14, 2022

Fixes #88300

The problem seems to be very sporadic though, I couldn't find a reproducer.

There are two good candidates as a cause for this problem, this PR addresses both:

  • The new memory calculation is added to the total count even if the CB fails, so a cleanup after a failure could lead to a negative value of the CB
  • The delta memory is calculated only between the beginning and the end of match() execution, but memory consumption change also happens outside that method and the initial value could be greater than the final one, so the delta added to the CB could be negative. In general, this second approach makes the memory calculation more accurate.

[edit] It turned out the first hypothesis was correct: the problem was on cleanup.

* Returns false if the process needs to be stopped.
*/
boolean match(int stage, Iterable<Tuple<KeyAndOrdinal, HitReference>> hits) {
long ramBytesUsedInFlight = ramBytesUsedInFlight();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are > 0, if they reduce for some reason during the execution, at the end the memory diff will be negative.
Setting them to 0 as a global value, and avoiding to re-calculate them at the beginning of match(), guarantees that the memory added to the circuit breaker is always positive (unless Accountable objects return negative values, but checking the implementations it's not the case)

}

private void addMemory(long bytes, String label) {
totalRamBytesUsed += bytes;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of breaker failure, the memory is not added to the circuit breaker, so the addition to totalRamBytesUsed should be done only after addEstimateBytesAndMaybeBreak succeeds.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, should this be relatively easy reproducible?

@luigidellaquila luigidellaquila marked this pull request as ready for review July 14, 2022 12:24
@elasticmachine elasticmachine added the Team:QL (Deprecated) Meta label for query languages team label Jul 14, 2022
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:05
Copy link
Copy Markdown
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments.

Comment on lines +336 to +338
long bytesDiff = ramBytesUsedInFlight() - prevRamBytesUsedInFlight;
addMemory(bytesDiff, CB_INFLIGHT_LABEL);
prevRamBytesUsedInFlight += bytesDiff;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This'll leave prevRamBytesUsedInFlight == ramBytesUsedInFlight(), right? It might be clearer smth like:

Suggested change
long bytesDiff = ramBytesUsedInFlight() - prevRamBytesUsedInFlight;
addMemory(bytesDiff, CB_INFLIGHT_LABEL);
prevRamBytesUsedInFlight += bytesDiff;
long newRamBytesUsedInFlight = ramBytesUsedInFlight();
addMemory(newRamBytesUsedInFlight - prevRamBytesUsedInFlight, CB_INFLIGHT_LABEL);
prevRamBytesUsedInFlight = newRamBytesUsedInFlight;

Setting them to 0 as a global value, and avoiding to re-calculate them at the beginning of match(), guarantees that the memory added to the circuit breaker is always positive

Can the memory footprint not go down (trimmed?) in-between match() invocations? As that would still leave room for negative memory adjustments.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly the reason for this change: if we calculate the memory footprint change as the difference between the beginning and the end of match() execution, we risk to miss everything that happens outside of that.
Eg.

  • a match() is invoked, it calculates a delta of 10k: the real memory footprint is 10KB, the CB is updated with 10KB
  • a trim() is invoked: the new real memory footprint is 0KB (the trim released all the memory), the CB is still 10KB (!!!) as trim() does not update the CB
  • a match is invoked, it calculates another delta of 10k: the real memory footprint is 10KB again (the memory was cleared by the trim()), but the CB is updated to 20KB (!!!) because the new delta (10KB) is added to the CB value (10KB, not updated by trim)

Repeat this pattern for 100 times and at the end you'll have 10KB of real memory used, but the CB thinks you are using 1MB

The same could happen at the opposite, eg. something (outside of match() execution) allocates memory, but it does not update the CB (so the CB value is still zero!!!); after that, a match() execution REDUCES the memory footprint, so the delta is NEGATIVE. As soon as match() tries to update the CB with a negative value, you have the error we are chasing.

Not sure how likely it is to happen with current implementation, but I see it as a weakness of current algorithm.

With a global variable (that is what this PR does) you solve all of this: you just calculate the total memory every time and you compare it with the total memory calculated at previous round. As long as Accountable objects cannot return negative values, you are always guaranteed that the CB cannot go below zero

Comment on lines 325 to 327
prevRamBytesUsedInFlight = 0;
prevRamBytesUsedCompleted = 0;
totalRamBytesUsed = 0;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If keeping track of the in-flight and completed usage, totalRamBytesUsed can be removed (as it should always be the sum of the two).


private boolean headLimit = false;

// ---------- CIRCUIT BREAKER -----------
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: imo, this could be a single liner // circuit breaker or // circuit breaker accounting

@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@luigidellaquila
Copy link
Copy Markdown
Contributor Author

Finally reproduced (and added a test case), the problem happened on clearCircuitBreaker() after a circuit breaker failure happening in the parent circuit breaker service

Copy link
Copy Markdown
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good job on creating a test for this exact (hard to investigate) failure.
I've left two small comments.

}
}

public void testParentCircuitBreakerOnClean() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name imo is not descriptive enough. I'd suggest something like testEqlCBCleanedUp_on_ParentCBBreak.

Also, please add a comment to the method:
// test covering fix for https://github.com/elastic/elasticsearch/issues/88300

public void testParentCircuitBreakerOnClean() {
int sequenceFiltersCount = 2;
final int SEARCH_REQUESTS_EXPECTED_COUNT = 2;
List<BreakerSettings> eqlBreakerSettings = Collections.singletonList(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The eqlBreakerSettings can be extracted in a separate method. This code block is used in two places in this class.

@@ -305,34 +305,42 @@ public void clear() {
clearCircuitBreaker();
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to these two methods stating that they are protected for testing purposes.

Copy link
Copy Markdown
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch.
I've left one comment, otherwise lgtm!

Comment on lines +318 to +322
if (CB_COMPLETED_LABEL.equals(label)) {
prevRamBytesUsedCompleted += bytes;
} else {
prevRamBytesUsedInFlight += bytes;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't needed, as the values are assigned after the function call to the actual memory consumption.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, it's completely redundant. We don't even need an addMemory() method actually


public void testParentCircuitBreakerOnClean() {
int sequenceFiltersCount = 2;
final int SEARCH_REQUESTS_EXPECTED_COUNT = 2;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I guess this is c&p'd and that's why it's final/capitalized, but it looks a bit off that, comparatively, the var above is only effectively final and not actually so.

@luigidellaquila
Copy link
Copy Markdown
Contributor Author

@elasticmachine update branch

@astefan astefan added the v8.3.4 label Jul 29, 2022
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💚 Backport successful

Status Branch Result
8.3
8.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

eql_sequence circuit breaker sometimes has negative used value

6 participants