Skip to content

Make Releasing Aggregation Buffers Safer#76741

Merged
original-brownbear merged 3 commits intoelastic:masterfrom
original-brownbear:aggs-buffer-leaks
Aug 23, 2021
Merged

Make Releasing Aggregation Buffers Safer#76741
original-brownbear merged 3 commits intoelastic:masterfrom
original-brownbear:aggs-buffer-leaks

Conversation

@original-brownbear
Copy link
Copy Markdown
Contributor

Fixing a few possible leaks and making the logic more obviously correct.
Did not label this >bug as non of these failure scenarios are confirmed.

Fixing a few possible leaks and making the logic more obviously correct.
@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 20, 2021
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

runningTask.compareAndSet(task, null);
onPartialMergeFailure.accept(exc);
List<MergeTask> toCancels = new ArrayList<>();
final MergeTask task = runningTask.getAndSet(null);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one seems to me to be the most likely place where something could go wrong and we should really guard against a cancel throwing in the loop.

}
success = true;
} finally {
if (success == false) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole thing is admittedly a little questionable, but IMO we shouldn't have an obvious leak during serialization bugs (that could always be triggered by some other bug in network buffer handling or so).

@arteam arteam self-requested a review August 20, 2021 09:12
}
success = true;
} finally {
if (success == false) {
Copy link
Copy Markdown
Contributor

@arteam arteam Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we want to clean the aggregations only in case of an error, wouldn't it be better to do that in a catch block, instead of creating an additional variable and tracking its status in finally?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured this was the approach we used in most other places and it helps in so far (relative to catching) that we wouldn't leak here if we hit an error (most likely by tripping an assertion) and thus make test debugging easier.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to add that not only does this pattern deal with the Error case without having to catch Throwable but it's also robust against future changes that introduce an early exit somewhere (e.g. a try-catch-log-and-return with smaller scope). Seems unlikely here but it's a good pattern anyway, I much prefer to signal the change of ownership explicitly like this than to do it implicitly in an exception handler.

toCancels.add(task);
toCancels.add(task::cancel);
}
for (MergeTask mergeTask : queue) {
Copy link
Copy Markdown
Contributor

@arteam arteam Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we could clear the queue and iterate it one go here with

while (queue.isEmpty() == false){
      toCancels.add(queue.removeFirst()::cancel);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure saves a line :) I suppose we can even just pollFirst in a loop :) Will do.

onPartialMergeFailure.accept(exc);
List<MergeTask> toCancels = new ArrayList<>();
final MergeTask task = runningTask.getAndSet(null);
final List<Releasable> toCancels = new ArrayList<>();
Copy link
Copy Markdown
Contributor

@arteam arteam Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to name the list releasables instead of toCancels? It sounds to me more logical because its consists from elements that implement Relesable not Closeable

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, left the name as is for now just because it's mostly cancelling here, probably doesn't change a whole lot right? :)

@arteam
Copy link
Copy Markdown
Contributor

arteam commented Aug 20, 2021

General comment: I think it would be nice to have additional tests in QueryPhaseResultConsumer for the failing merge part. It doesn't seem to be currently tested at all. But we could do that in a separate PR.

@nik9000
Copy link
Copy Markdown
Member

nik9000 commented Aug 20, 2021

I tried looking at this when I woke up this morning at 6:30 and just couldn't. QueryPhaseController needs full awake, I guess. I've added this to my list to look at too. I know things about QueryPhaseController - but a lot of what I know is "don't trust your instincts, it's always more complicated than you think".

Copy link
Copy Markdown
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little more paranoia. I like that.

Is it worth adding a unit test that we release things in the way we expect? My instinct is "yes", but if the unit it test is totally unreal it'd be more of pain than it's worth. Not sure. Depends on the mocking load, I guess.

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Thanks Artem + Nik!

Is it worth adding a unit test that we release things in the way we expect? My instinct is "yes", but if the unit it test is totally unreal it'd be more of pain than it's worth. Not sure. Depends on the mocking load, I guess.

I'll try to come up with a clean follow-up that stresses the error handling of this logic a little later in the week (can't do it quickly here I'm afraid, the mocking at least with my level of understanding of this code will take a little). I think you two are def. right that we should have coverage though.

@original-brownbear original-brownbear merged commit 57ba73c into elastic:master Aug 23, 2021
@original-brownbear original-brownbear deleted the aggs-buffer-leaks branch August 23, 2021 17:51
wjp719 added a commit to wjp719/elasticsearch that referenced this pull request Aug 24, 2021
* master: (21 commits)
  [Test] More robust assertions for sorting and pagination (elastic#76654)
  [Test] Fix filename check on Windows (elastic#76807)
  Upgrade build scan plugin to 3.6.4 (elastic#76784)
  Remove keystore initial_md5sum (elastic#76835)
  Don't export docker images on assemble (elastic#76817)
  Fix testMasterStatsOnSuccessfulUpdate (elastic#76844)
  AwaitsFix for elastic#76840
  Make Releasing Aggregation Buffers Safer (elastic#76741)
  Re-enable BWC tests after backport of elastic#76771 (elastic#76839)
  Dispatch large bulk requests to write thread  (elastic#76736)
  Disable BWC tests for elastic#76771
  Pull down beats artifacts when performing release tests
  Add timing stats to publication process (elastic#76771)
  Fix BanFailureLoggingTests some more (elastic#76668)
  Mention "warn threshold" in master service slowlog (elastic#76815)
  Fix DockerTests.test010Install
  Re-enable tests affected by elastic#75097 (elastic#76814)
  Fix testRecoveryIsCancelledAfterDeletingTheIndex (elastic#76644)
  Test fix -WildcardFieldMapperTests bad test data. (elastic#76819)
  Updating supported version after backporting the feature (elastic#76794)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/action/bulk/TransportBulkAction.java
original-brownbear added a commit that referenced this pull request Aug 24, 2021
Fixing a few possible leaks and making the logic more obviously correct.
original-brownbear added a commit that referenced this pull request Aug 24, 2021
Fixing a few possible leaks and making the logic more obviously correct.
@original-brownbear original-brownbear restored the aggs-buffer-leaks branch April 18, 2023 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Aggregations Aggregations >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v7.15.0 v7.16.0 v8.0.0-alpha2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants