distsqlrun: lookup join still unboundedly buffers, but this time in a different way #39044
Description
#35950 fixed one kind of unbounded buffering in lookup join - the kind caused by the KV layer, when somebody asks for a span that has an unbounded number of keys inside, without a batch limit.
But this didn't remove the potential for the joinReader to buffer in an unbounded manner, unfortunately!
The proximal cause is that joinReader doesn't use memory-monitored containers. The fix should be simple: use a memory-monitored container instead of a regular slice. At the very least, we should account for the row copies we perform when buffering output rows.
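As a sketch of the accounting idea (self-contained and hypothetical - the real fix would use the monitoring infrastructure already in the codebase, e.g. a bound account, rather than these stand-in types):

```go
package main

import "fmt"

// memAccount is a hypothetical stand-in for a memory-monitored
// allocation account with a fixed budget.
type memAccount struct {
	used, limit int64
}

// grow registers size bytes against the budget, returning an error
// instead of letting usage grow without bound.
func (a *memAccount) grow(size int64) error {
	if a.used+size > a.limit {
		return fmt.Errorf("memory budget exceeded: %d > %d", a.used+size, a.limit)
	}
	a.used += size
	return nil
}

type row []byte

// bufferRow copies a looked-up row into the output buffer, accounting
// for the copy before performing it.
func bufferRow(buf *[]row, acc *memAccount, r row) error {
	if err := acc.grow(int64(len(r))); err != nil {
		return err
	}
	cp := make(row, len(r))
	copy(cp, r)
	*buf = append(*buf, cp)
	return nil
}

func main() {
	acc := memAccount{limit: 32}
	var buf []row
	for i := 0; i < 10; i++ {
		if err := bufferRow(&buf, &acc, row("8-byte--")); err != nil {
			fmt.Printf("buffered %d rows, then: %v\n", len(buf), err)
			return
		}
	}
}
```

The point is just that every buffered copy passes through the account, so a query that would buffer unboundedly fails with a budget error instead of OOMing the node.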
The root cause is that the algorithm the joinReader uses to keep the output ordered and to account for unmatched rows (in the case of left outer and anti join) requires that all results for an input batch are present before any results are emitted. This is problematic when, for example, a left-side table has a single row that matches a billion rows in a right-side table - the algorithm won't emit anything until all billion rows are buffered!
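A minimal sketch of why the batch must be fully drained before emitting (hypothetical types, not the real joinReader code): a row with no match seen so far might still match a later KV result, so "unmatched" is only knowable once all lookups for the batch are done.

```go
package main

import "fmt"

// lookupResult is one right-side row returned by a KV lookup,
// tagged with the buffered input row it matches.
type lookupResult struct {
	inputIdx int
	rightRow string
}

// emitBatch buffers every result for the batch, then emits in input
// order, NULL-extending inputs that never matched (left outer join).
func emitBatch(inputs []string, results []lookupResult) []string {
	matches := make([][]string, len(inputs))
	for _, r := range results { // must drain everything before emitting
		matches[r.inputIdx] = append(matches[r.inputIdx], r.rightRow)
	}
	var out []string
	for i, in := range inputs {
		if len(matches[i]) == 0 {
			out = append(out, in+" | NULL") // unmatched: NULL-extend
			continue
		}
		for _, m := range matches[i] {
			out = append(out, in+" | "+m)
		}
	}
	return out
}

func main() {
	inputs := []string{"a", "b", "c"}
	// Results arrive out of input order; b's match arrives last, so b
	// can't be declared matched or unmatched until the stream ends.
	results := []lookupResult{{2, "c1"}, {0, "a1"}, {1, "b1"}}
	fmt.Println(emitBatch(inputs, results))
}
```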
I'm not sure whether this algorithm can be made streaming. It currently needs to see the results for the whole batch because it can't guarantee that the results from the KV layer arrive in order, so it can never declare an input row "closed" and let its output be emitted early. There may be a way around this limitation, but it interacts in a funny way with the solution for #35950, which sorts all of the input row spans before sending them out, since KV requires sorted spans. That permitted limited KV batches, but it has painted us into a corner.
Imagine a scenario with two tables, a and b, where a contains the integers 0-100 and each value of a matches 1 million rows in b. Now feed the joinReader all of the rows of a in reverse sorted order (starting 100, 99, 98...). We're beholden to output matches for 100 first, since it reached the joinReader first and we must output matches in input order. But because we sorted the spans, the KV layer returns matches for 0 first, then 1, then 2..., so every match for 0-99 must be buffered before the first match for 100 even arrives.
So this will require us to buffer 100 * 1 million rows into memory, kaboom.
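A scaled-down simulation of that blow-up (hypothetical code, just counting rows): with k input values fed in reverse order, results arriving in ascending span order, and n matches per value, everything for the other k-1 values is buffered before the first output row can be produced.

```go
package main

import "fmt"

// rowsBufferedBeforeFirstEmit models the pathological case: input order
// is k-1, k-2, ..., 0, so the first output must be for value k-1, but
// span-sorted results arrive for value 0 first, then 1, and so on.
func rowsBufferedBeforeFirstEmit(k, n int) int {
	buffered := 0
	for v := 0; v < k; v++ { // results arrive in ascending value order
		if v == k-1 {
			// The first match for the first input row finally arrives.
			return buffered
		}
		buffered += n
	}
	return buffered
}

func main() {
	// The issue's numbers: 100 values, 1 million matches each.
	fmt.Println(rowsBufferedBeforeFirstEmit(100, 1_000_000)) // 99 * 1e6
}
```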
I believe that a solution for this requires that we upgrade the batch limit capabilities for the KV layer to not require sorted input spans, but I'm open to other ideas.