Expand the lifecycle of the AggregationContext#94023
Expand the lifecycle of the AggregationContext#94023not-napoleon merged 20 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-analytics-geo (Team:Analytics) |
1 similar comment
|
Pinging @elastic/es-analytics-geo (Team:Analytics) |
| readerContext.setRescoreDocIds(rescoreDocIds); | ||
| // Since we are returning the raw QuerySearchResult here, there's no opportunity for the object that will be holding that | ||
| // reference to incRef it before our try block closes and decRef's it when it closes the search context | ||
| context.queryResult().incRef(); |
There was a problem hiding this comment.
I'm a little uncomfortable with this, as it makes a lot of assumptions about our caller. But I don't see a way around that, since we'll decRef when the try-with-resources closes, and if that drops us to 0 references (which it should), we'll release the agg context.
There was a problem hiding this comment.
It's common, and totally reasonable to assume (require?) that the caller will take ownership of the returned value, and release it when no longer needed. I think it's not clear from the code that this is required, so it might be worth documenting it in this method's JavaDocs. I would recommend against the inline comments here however, stuff like this seems useful when you write it but tends to just become clutter over time.
I think the caller does behave as expected in this case, right? It looks like we pass it (pretty) directly to a ChannelActionListener which does decRef() once it no longer needs it. (aside: we really should document this)
There's definitely a risk of leaks with this kind of API tho, so it's important that a leak in this area would be caught by tests. Is that the case here? For instance, if it's using potentially-recycled pages then we should be using a CountingPageCacheRecycler (preferred) or MockPageCacheRecycler (relies on GC).
There was a problem hiding this comment.
Yes, the caller does the right thing and the tests do catch it if we don't. Honestly, the testing around this has been very good so far.
And you're right, I should just write javadoc for these methods. I hadn't because I've really only been focusing on this one aspect, which is not the main thrust of what's going on here at all, but they currently have none at all, so I'm sure I can do better than that.
There was a problem hiding this comment.
Even thought this method is private, it makes sense to document the responsibility of the caller at the method level.
| readerContext.setRescoreDocIds(rescoreDocIds); | ||
| // Since we are returning the raw QuerySearchResult here, there's no opportunity for the object that will be holding that | ||
| // reference to incRef it before our try block closes and decRef's it when it closes the search context | ||
| searchContext.queryResult().incRef(); |
There was a problem hiding this comment.
As noted above, this is uncomfortable but I don't have a better solution.
There was a problem hiding this comment.
ScrollQueryFetchSearcResult is a thin wrapper over QueryFetchSearchResult, so it seems reasonable to delegate ref counting to the wrapped reference.
|
There's a lot of places where things reach through There's also the rather terrifying Finally, |
|
@elasticmachine run elasticsearch-ci/docs |
I think just before overwriting each releasable/refcounted field it should close/decRef the current value. Unless there's some (non-obvious-to-me) invariant which means that we're never meaningfully overwriting anything. But in that case we should be able to add assertions to that effect at least. |
martijnvg
left a comment
There was a problem hiding this comment.
I left a few small questions/comments. In general this looks good to me.
|
|
||
| private final RefCounted refCounted; | ||
|
|
||
| private List<Releasable> toRelease; |
There was a problem hiding this comment.
I think toRelease can be final?
| } | ||
|
|
||
| public void addReleasable(Releasable releasable) { | ||
| toRelease.add(releasable); |
There was a problem hiding this comment.
Are we sure toRelease is always assigned when this method is invoked?
There was a problem hiding this comment.
Currently? I'm pretty sure, but it's tricky. Right now, this class has basically three "modes" - data node, coordinating node, and null. The data node constructor path correctly initializes toRelease, but neither of the other to do. Coordinating mode could initialize it, and the associated ref counting, but we don't do anything with it there yet. The null case probably shouldn't? I'm not entirely clear on why we need that path, but it's definitely used. Seemed safer to throw (NPE) if we tried to add a releasable in a path we weren't expecting it, but I'm open to discussion if you'd rather do something else.
| messageFields.put("elasticsearch.slowlog.took_millis", TimeUnit.NANOSECONDS.toMillis(tookInNanos)); | ||
| if (context.queryResult().getTotalHits() != null) { | ||
| messageFields.put("elasticsearch.slowlog.total_hits", context.queryResult().getTotalHits()); | ||
| if (context.getTotalHits() != null) { |
There was a problem hiding this comment.
Just double checking, but the change to this file doesn't seem to be related with the core of this PR?
| } | ||
| } | ||
|
|
||
| public void releaseAggregationContext() { |
There was a problem hiding this comment.
I think this method can have a private visibility? Since it is only called from line 104?
| readerContext.setRescoreDocIds(rescoreDocIds); | ||
| // Since we are returning the raw QuerySearchResult here, there's no opportunity for the object that will be holding that | ||
| // reference to incRef it before our try block closes and decRef's it when it closes the search context | ||
| context.queryResult().incRef(); |
There was a problem hiding this comment.
Even thought this method is private, it makes sense to document the responsibility of the caller at the method level.
|
@elasticmachine update branch |
|
That test failure reproduced in main. I've opened #95386 to investigate, but I don't think it needs to block this PR. I'm going to re-run it to get a different random seed. |
|
@elasticmachine run elasticsearch-ci/part-1 |
…95860) Refactorings done in #94023. improved memory management but also made it possible to run into a race condition in DelegatingCircuitBreaker. This change makes DelegatingCircuitBreaker thread-safe. This change is a quick-fix with low risk. Eventually DelegatingCircuitBreaker should be removed. Work on that is underway: #89437. Under that circumstances the change is reasonable to me. The added locking overhead shouldn't be a problem, re-allocations happen rarely and there is just the disconnect case where 2 threads potentially access the structure at the same time. fixes #95681
…lastic#95860) Refactorings done in elastic#94023. improved memory management but also made it possible to run into a race condition in DelegatingCircuitBreaker. This change makes DelegatingCircuitBreaker thread-safe. This change is a quick-fix with low risk. Eventually DelegatingCircuitBreaker should be removed. Work on that is underway: elastic#89437. Under that circumstances the change is reasonable to me. The added locking overhead shouldn't be a problem, re-allocations happen rarely and there is just the disconnect case where 2 threads potentially access the structure at the same time. fixes elastic#95681
…95860) (#95879) Refactorings done in #94023. improved memory management but also made it possible to run into a race condition in DelegatingCircuitBreaker. This change makes DelegatingCircuitBreaker thread-safe. This change is a quick-fix with low risk. Eventually DelegatingCircuitBreaker should be removed. Work on that is underway: #89437. Under that circumstances the change is reasonable to me. The added locking overhead shouldn't be a problem, re-allocations happen rarely and there is just the disconnect case where 2 threads potentially access the structure at the same time. fixes #95681
This PR replaces and supersedes #93594
Relates to #89437
One big lesson we learned from the first prototype of a dense memory aggregations format, in order for aggregation
BigArraysto live into the serialization phase on the data nodes, we needed to not release the circuit breaker at the end of collection. TheAggregationContextcurrently manages all of aggregations memory. Rather than change that, this PR extends the life cycle of theAggregationContextso it isn't closed untilQuerySearchResultis closed, at which point we have serialized the aggregation information back to the coordinating node.Importantly, this PR enables ref counting on
QuerySearchResult, in what I think is a straightforward way. My earlier attempt, linked above, delegated that ref counting to a wrapped releasable, but ran into problems because theQuerySearchResultgets created before the things it needs to release. This iteration delegates directly to a ref counter, and maintains a list of items to release, which currently is just the aggregation context. That list can be appended to after creation, thus resolving the timing issue.