ESQL: Add a HashLookupOperator #107894
Conversation
Adds an operator that takes a `Block` on construction, builds a `BlockHash`, and uses it to resolve keys to row offsets. Example time! Say we hand construct the operator with the first two columns of this `Page`:

| a | b | v |
| -:| --:| --:|
| 1 | 11 | 21 |
| 2 | 12 | 22 |
| 2 | 14 | 23 |
| 2 | 11 | 24 |

If we then fire the first two columns of this `Page` into it, we'll get the third column:

| a | b | ord |
| -----:| --:| -----:|
| 2 | 14 | 2 |
| 1 | 11 | 0 |
| 3 | 11 | null |
| [1,2] | 11 | [0,3] |

This is the first half of the `Operator` side of a hash join. The second half is looking up values from those row offsets. That'd mean adding the `v` column like so:

| a | b | ord | v |
| -----:| --:| -----:| -------:|
| 2 | 14 | 2 | 23 |
| 1 | 11 | 0 | 21 |
| 3 | 11 | null | null |
| [1,2] | 11 | [0,3] | [21,24] |

And *that* is comparatively simple.

Notice that I said this is the *Operator* side of a hash join. There's no planning or distributed execution involved. Yet. And a hash join is something you'd distribute. This `Operator` can run on a data node or a coordinating node. It doesn't care. It just needs an input.
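To make the key-to-ordinal half concrete, here's a toy sketch of what the example above computes. The real operator uses a `BlockHash` over `Block`s; a `HashMap` over `(a, b)` pairs stands in for it here, and the class and method names are illustrative, not the real API. It also doesn't model the multivalue row `[1,2], 11 → [0,3]`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the key-to-row-offset half of the hash lookup.
class ToyHashLookup {
    private final Map<List<Integer>, Integer> keyToOrd = new HashMap<>();

    // "Construction": hash each key row, remembering its row offset.
    // Duplicate keys keep the first offset, matching the single-ord TODO below.
    ToyHashLookup(int[][] keyRows) {
        for (int ord = 0; ord < keyRows.length; ord++) {
            keyToOrd.putIfAbsent(List.of(keyRows[ord][0], keyRows[ord][1]), ord);
        }
    }

    // "Lookup": resolve an incoming key row to a row offset, or null on a miss.
    Integer lookup(int a, int b) {
        return keyToOrd.get(List.of(a, b));
    }
}
```

Feeding it the four key rows from the first table reproduces the `ord` column from the second.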
Pinging @elastic/es-analytical-engine (Team:Analytics)
```java
 */
public static BlockHash buildPackedValuesBlockHash(List<GroupSpec> groups, BlockFactory blockFactory, int emitBatchSize) {
    return new PackedValuesBlockHash(groups, blockFactory, emitBatchSize);
}
```
I need to force using a `PackedValuesBlockHash` for a bit, until I work out how to handle the different ordinals. I didn't want to make the class public, so I plumbed it through here.
```java
private final int nullTrackingBytes;
private final BytesRefBuilder bytes = new BytesRefBuilder();
private final Group[] groups;
private final List<GroupSpec> specs;
```
Ok. This was really fun to figure out. If `groups` lives on the object then the `add` and `lookup` methods aren't reentrant on any particular hash, and `lookup` can't be thread safe. Now, I don't try to use it in a multi-threaded way, but I do call `lookup` from inside of a callback for `add` - at least in tests. And they worked! But in a really muddled, sneaky, scary way. This has us building the `Group` objects on the fly. This is nicer because I can make them `Releasable`. It's more objects to build on `add`, but it's safer. Less crazy-making.
```java
AddWork(Page page, GroupingAggregatorFunction.AddInput addInput, int batchSize) {
    super(blockFactory, emitBatchSize, addInput);
    initializeGroupsForPage(page, batchSize);
```
This whole method just moves into the ctor for `Group`.
```java
            Releasables.close(mapped);
        }
    }
}
```
This is pretty much the core of the `ProjectOperator` lifted into a method on `Page`. I want to project during hash lookups.
I suppose I could avoid the projection by changing the hash lookup method, but I'm not sure it's worth doing.
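A projection in this sense just keeps the blocks for the requested channels, in the requested order, and drops the rest. A minimal sketch, assuming a simplified page that holds plain `int[]` columns; `ToyProject` and `projectBlocks` are illustrative names, not the real `Page` API, and the real code also has to manage block ref-counting:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of projecting a page down to selected channels.
class ToyProject {
    static List<int[]> projectBlocks(List<int[]> blocks, int[] channels) {
        List<int[]> projected = new ArrayList<>();
        for (int channel : channels) {
            // Keep only the requested channels, in the requested order.
            projected.add(blocks.get(channel));
        }
        return projected;
    }
}
```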
```java
/**
 * Maps a single {@link Page} into zero or more resulting pages.
 */
public abstract class AbstractPageMappingToIteratorOperator implements Operator {
```
This implements most of the lookup operator. I made it a separate subclass because:
- It's easier to test that way.
- We're going to reuse it for stuff like loading columns.
- We might be able to reuse it for mv_expand. I'm not sure.
- I think it's easier to read this way - the bits about managing `Iterator<Page>` or merging `Iterator<Block>` into an existing page get to live here and there's no need to think about where that iterator comes from. I genuinely can't hold the contents of `BlockHash#lookup` and this class in my head at the same time. So this thing just works with any `Iterator`.
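The shape of that pattern, stripped of all block machinery: each input page is turned into an iterator of result pages, which the operator then drains one output call at a time. A toy sketch using plain `int[]` pages, chunked at a fixed batch size; `ToyPageChunker` and `chunk` are hypothetical names, not the real `AbstractPageMappingToIteratorOperator` API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Toy sketch of mapping one "page" into zero or more result pages by
// chunking it at a fixed number of positions.
class ToyPageChunker {
    static Iterator<int[]> chunk(int[] page, int batchSize) {
        List<int[]> chunks = new ArrayList<>();
        for (int from = 0; from < page.length; from += batchSize) {
            // Each chunk covers at most batchSize positions of the input page.
            chunks.add(Arrays.copyOfRange(page, from, Math.min(from + batchSize, page.length)));
        }
        return chunks.iterator();
    }
}
```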
```java
hash.add(new Page(keys), new GroupingAggregatorFunction.AddInput() {
    @Override
    public void add(int positionOffset, IntBlock groupIds) {
        // TODO support multiple rows with the same keys
```
This is a huge TODO, innocently tucked in here. And it dovetails into the block comment above about using different types for the lookup. I don't know how to solve this one at the moment. So I'm saving it for a rainy day.
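For what it's worth, the shape of the problem: today each key resolves to a single row offset, so when two rows carry the same key, later ones are invisible to lookups. Supporting duplicates means resolving a key to *all* of its row offsets, something like this toy multimap. This is purely illustrative of the semantics, not a proposed fix:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy multimap from key to every row offset that carries it - the
// behavior the TODO is about: duplicate keys must map to all their rows.
class ToyMultiOrds {
    static Map<Long, List<Integer>> buildOrds(long[] keyCol) {
        Map<Long, List<Integer>> ords = new HashMap<>();
        for (int ord = 0; ord < keyCol.length; ord++) {
            ords.computeIfAbsent(keyCol[ord], k -> new ArrayList<>()).add(ord);
        }
        return ords;
    }
}
```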
```java
    return docIds;
}
```

```java
public void testHashLookup() {
```
I've not plugged the `HashLookupOperator` into anything yet, so good old `OperatorTests` will have to do. This has been here since long before I started working on ESQL and it's kind of a junk drawer. But it'll do until we get this wired in.
```java
 * subclass that appends {@code 1} and chunks the incoming {@link Page}
 * at {@code 100} positions.
 */
public class IteratorAppendPageTests extends OperatorTestCase {
```
This is quite like the behavior of `HashLookupOperator`, but easier to reason about because the code fits on the screen.
```java
 * Tests {@link AbstractPageMappingToIteratorOperator} against a test
 * subclass that removes every other page.
 */
public class IteratorRemovePageTests extends OperatorTestCase {
```
No other extensions to `AbstractPageMappingToIteratorOperator` can yield an empty `Iterator`. But this feels like an important edge case to cover.
```java
for (int i = 0; i < p.getBlockCount(); i++) {
    resultBlocks.add(p.getBlock(i));
}
```
I ended up here because I was debugging something I'd broken, and this was hard to read because it was doing stuff it doesn't really have to do. That `resultBlocks` list was just built and thrown away. I don't recall what it was for, but we don't need it.
```java
public class HashLookupOperator extends AbstractPageMappingToIteratorOperator {
    /**
     * Factory for {@link HashLookupOperator}. It's received {@link Block}s
```
Eventually, we need to release these blocks. Let's keep them as they are until we integrate with the planning.
Yeah! I've been playing around with limiting the size of the blocks and not tracking them. That's "simpler", but it's not good. It might be the right thing to do to get us moving.
This adds support for `LOOKUP`, a command that implements a sort of
inline `ENRICH`, using data that is passed in the request:
```
$ curl -uelastic:password -HContent-Type:application/json -XPOST \
'localhost:9200/_query?error_trace&pretty&format=txt' \
-d'{
"query": "ROW a=1::LONG | LOOKUP t ON a",
"tables": {
"t": {
"a:long": [ 1, 4, 2],
"v1:integer": [ 10, 11, 12],
"v2:keyword": ["cat", "dog", "wow"]
}
},
"version": "2024.04.01"
}'
v1 | v2 | a
---------------+---------------+---------------
10 |cat |1
```
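Conceptually, that request runs the whole pipeline described above: hash the inline table's key column, resolve each input key to a row offset, then fetch the value columns at that offset. A toy end-to-end sketch reproducing the example's result; `ToyLookup` and its signature are illustrative, not the ESQL implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Toy end-to-end LOOKUP: key column -> row offset -> value columns.
class ToyLookup {
    static String lookup(long[] keyCol, int[] v1Col, String[] v2Col, long key) {
        // Build phase: hash the table's key column to row offsets.
        Map<Long, Integer> ords = new HashMap<>();
        for (int ord = 0; ord < keyCol.length; ord++) {
            ords.putIfAbsent(keyCol[ord], ord);
        }
        // Lookup phase: resolve the key, then fetch values at that offset.
        Integer ord = ords.get(key);
        return ord == null ? null : v1Col[ord] + "|" + v2Col[ord];
    }
}
```

With the table from the request (`a: [1, 4, 2]`, `v1: [10, 11, 12]`, `v2: ["cat", "dog", "wow"]`), key `1` resolves to row 0 and yields `10|cat`, matching the `v1 | v2` columns in the response.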
This required these PRs:
* #107624
* #107634
* #107701
* #107762
* #107923
* #107894
* #107982
* #108012
* #108020
* #108169
* #108191
* #108334
* #108482
* #108696
* #109040
* #109045
Closes #107306