Improve EQL Sequence circuit breaker precision#88538
Improve EQL Sequence circuit breaker precision#88538luigidellaquila merged 7 commits intoelastic:mainfrom
Conversation
| * Returns false if the process needs to be stopped. | ||
| */ | ||
| boolean match(int stage, Iterable<Tuple<KeyAndOrdinal, HitReference>> hits) { | ||
| long ramBytesUsedInFlight = ramBytesUsedInFlight(); |
There was a problem hiding this comment.
These values are > 0, if they reduce for some reason during the execution, at the end the memory diff will be negative.
Setting them to 0 as a global value, and avoiding to re-calculate them at the beginning of match(), guarantees that the memory added to the circuit breaker is always positive (unless Accountable objects return negative values, but checking the implementations it's not the case)
| } | ||
|
|
||
| private void addMemory(long bytes, String label) { | ||
| totalRamBytesUsed += bytes; |
There was a problem hiding this comment.
In case of breaker failure, the memory is not added to the circuit breaker, so the addition to totalRamBytesUsed should be done only after addEstimateBytesAndMaybeBreak succeeds.
There was a problem hiding this comment.
This is a good point, should this be relatively easy reproducible?
|
Pinging @elastic/es-ql (Team:QL) |
|
Hi @luigidellaquila, I've created a changelog YAML for you. |
| long bytesDiff = ramBytesUsedInFlight() - prevRamBytesUsedInFlight; | ||
| addMemory(bytesDiff, CB_INFLIGHT_LABEL); | ||
| prevRamBytesUsedInFlight += bytesDiff; |
There was a problem hiding this comment.
This'll leave prevRamBytesUsedInFlight == ramBytesUsedInFlight(), right? It might be clearer smth like:
| long bytesDiff = ramBytesUsedInFlight() - prevRamBytesUsedInFlight; | |
| addMemory(bytesDiff, CB_INFLIGHT_LABEL); | |
| prevRamBytesUsedInFlight += bytesDiff; | |
| long newRamBytesUsedInFlight = ramBytesUsedInFlight(); | |
| addMemory(newRamBytesUsedInFlight - prevRamBytesUsedInFlight, CB_INFLIGHT_LABEL); | |
| prevRamBytesUsedInFlight = newRamBytesUsedInFlight; |
Setting them to 0 as a global value, and avoiding to re-calculate them at the beginning of match(), guarantees that the memory added to the circuit breaker is always positive
Can the memory footprint not go down (trimmed?) in-between match() invocations? As that would still leave room for negative memory adjustments.
There was a problem hiding this comment.
That's exactly the reason for this change: if we calculate the memory footprint change as the difference between the beginning and the end of match() execution, we risk to miss everything that happens outside of that.
Eg.
- a match() is invoked, it calculates a delta of 10k: the real memory footprint is 10KB, the CB is updated with 10KB
- a trim() is invoked: the new real memory footprint is 0KB (the trim released all the memory), the CB is still 10KB (!!!) as
trim()does not update the CB - a match is invoked, it calculates another delta of 10k: the real memory footprint is 10KB again (the memory was cleared by the trim()), but the CB is updated to 20KB (!!!) because the new delta (10KB) is added to the CB value (10KB, not updated by trim)
Repeat this pattern for 100 times and at the end you'll have 10KB of real memory used, but the CB thinks you are using 1MB
The same could happen at the opposite, eg. something (outside of match() execution) allocates memory, but it does not update the CB (so the CB value is still zero!!!); after that, a match() execution REDUCES the memory footprint, so the delta is NEGATIVE. As soon as match() tries to update the CB with a negative value, you have the error we are chasing.
Not sure how likely it is to happen with current implementation, but I see it as a weakness of current algorithm.
With a global variable (that is what this PR does) you solve all of this: you just calculate the total memory every time and you compare it with the total memory calculated at previous round. As long as Accountable objects cannot return negative values, you are always guaranteed that the CB cannot go below zero
| prevRamBytesUsedInFlight = 0; | ||
| prevRamBytesUsedCompleted = 0; | ||
| totalRamBytesUsed = 0; |
There was a problem hiding this comment.
If keeping track of the in-flight and completed usage, totalRamBytesUsed can be removed (as it should always be the sum of the two).
|
|
||
| private boolean headLimit = false; | ||
|
|
||
| // ---------- CIRCUIT BREAKER ----------- |
There was a problem hiding this comment.
Nit: imo, this could be a single liner // circuit breaker or // circuit breaker accounting
…_breaker_precision' into improve_eql_seq_circuit_breaker_precision
|
Finally reproduced (and added a test case), the problem happened on |
astefan
left a comment
There was a problem hiding this comment.
LGTM. Good job on creating a test for this exact (hard to investigate) failure.
I've left two small comments.
| } | ||
| } | ||
|
|
||
| public void testParentCircuitBreakerOnClean() { |
There was a problem hiding this comment.
The test name imo is not descriptive enough. I'd suggest something like testEqlCBCleanedUp_on_ParentCBBreak.
Also, please add a comment to the method:
// test covering fix for https://github.com/elastic/elasticsearch/issues/88300
| public void testParentCircuitBreakerOnClean() { | ||
| int sequenceFiltersCount = 2; | ||
| final int SEARCH_REQUESTS_EXPECTED_COUNT = 2; | ||
| List<BreakerSettings> eqlBreakerSettings = Collections.singletonList( |
There was a problem hiding this comment.
The eqlBreakerSettings can be extracted in a separate method. This code block is used in two places in this class.
| @@ -305,34 +305,42 @@ public void clear() { | |||
| clearCircuitBreaker(); | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
Add a comment to these two methods stating that they are protected for testing purposes.
bpintea
left a comment
There was a problem hiding this comment.
Nice catch.
I've left one comment, otherwise lgtm!
| if (CB_COMPLETED_LABEL.equals(label)) { | ||
| prevRamBytesUsedCompleted += bytes; | ||
| } else { | ||
| prevRamBytesUsedInFlight += bytes; | ||
| } |
There was a problem hiding this comment.
This isn't needed, as the values are assigned after the function call to the actual memory consumption.
There was a problem hiding this comment.
Good point, it's completely redundant. We don't even need an addMemory() method actually
|
|
||
| public void testParentCircuitBreakerOnClean() { | ||
| int sequenceFiltersCount = 2; | ||
| final int SEARCH_REQUESTS_EXPECTED_COUNT = 2; |
There was a problem hiding this comment.
Nit: I guess this is c&p'd and that's why it's final/capitalized, but it looks a bit off that, comparatively, the var above is only effectively final and not actually so.
|
@elasticmachine update branch |
Fixes #88300
The problem seems to be very sporadic though, I couldn't find a reproducer.There are two good candidates as a cause for this problem, this PR addresses both:
match()execution, but memory consumption change also happens outside that method and the initial value could be greater than the final one, so the delta added to the CB could be negative. In general, this second approach makes the memory calculation more accurate.[edit] It turned out the first hypothesis was correct: the problem was on cleanup.