Skip to content

Long key bucket ords key iterator#95809

Merged
not-napoleon merged 7 commits intoelastic:mainfrom
not-napoleon:long-key-bucket-ords-key-iterator
May 15, 2023
Merged

Long key bucket ords key iterator#95809
not-napoleon merged 7 commits intoelastic:mainfrom
not-napoleon:long-key-bucket-ords-key-iterator

Conversation

@not-napoleon
Copy link
Copy Markdown
Member

@not-napoleon not-napoleon commented May 3, 2023

Relates to #89437

In order to merge aggregations using LongKeyedBucketOrds, we need a way to align the buckets by key. This adds an iterator to return all keys for a given owning bucket ordinal, in natural sorted order. This can then be used to merge without generating and sorting bucket objects directly.

This makes two big assumptions:

1 - the total set of keys is small enough that we don't need to put the key set in a Big Arrays backed data structure
2 - Building the tree set of keys (at reduce time) will not be excessively expensive.

We can mitigate 1 if we build a big arrays backed balanced binary tree. I didn't want to do that if I don't have to.

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 3, 2023
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)


public Iterator<Long> keyOrderedIterator(long owningBucketOrd) {
if (keySet == null) {
keySet = new TreeSet<>();
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also build this as we're adding values, but since add is very hot, it seemed safer to do it out of that loop.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also don't always call keyOrderedIterator.

Copy link
Copy Markdown
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach feels fine to me. I recommended an array and a different iterator, but otherwise I'm good.

*/
public abstract BucketOrdsEnum ordsEnum(long owningBucketOrd);

public Iterator<Long> keyOrderedIterator(long owningBucketOrd) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if PrimitiveIterator.OfLong is better. I think it's about the same to implement it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And could you add javadocs? I am still trying to digest what is going on here :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sorry, I should have marked this "draft". Was really just pushing it for some early feedback and CI testing. I'll add java doc today.

});
}

private TreeSet<Long> keySet = null;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once this is set you can't add any more things to it. It's probably worth asserting that.

I wonder if this is better as a long[] and then you sort it. That feels lighter than all of the tree stuff.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about a long[], but we'd still need duplicate removal. Which is doable, but seemed like enough complexity to justify the tree set. Open to trying it that way though.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, we need duplicate removal at insert time, otherwise it ends up being just as big as the ords array, basically by definition

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be good to nullify this variable during release?

@iverase
Copy link
Copy Markdown
Contributor

iverase commented May 4, 2023

I think this is fine, we might realise later that we need to optimise it / make it safer but as as POC for aggregation reduction looks good to me.

@not-napoleon not-napoleon requested review from iverase and nik9000 May 10, 2023 19:15
@not-napoleon
Copy link
Copy Markdown
Member Author

@iverase - yeah, I spent basically zero time on optimizing this. I did open a couple of tickets for low hanging fruit optimizations: #95961 and #95960

There's definitely more we could do, but I don't think it's worth investing too much into it until we can see where the bottle necks on the new reduce logic are.


public Iterator<Long> keyOrderedIterator(long owningBucketOrd) {
if (keySet == null) {
keySet = new TreeSet<>();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also don't always call keyOrderedIterator.

Copy link
Copy Markdown
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @not-napoleon!

@not-napoleon not-napoleon merged commit 4f4614a into elastic:main May 15, 2023
@not-napoleon not-napoleon deleted the long-key-bucket-ords-key-iterator branch May 15, 2023 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Aggregations Aggregations >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.9.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants