distsqlrun: lookup join still unboundedly buffers, but this time in a different way #39044
Description
#35950 fixed one kind of unbounded buffering in lookup join - the kind caused by the KV layer, when somebody asks for a span that has an unbounded number of keys inside, without a batch limit.
But this didn't remove the potential for the joinReader to buffer in an unbounded manner, unfortunately!
The proximal cause is that joinReader doesn't use memory-monitored containers. The fix should be simple: use a memory-monitored container instead of a regular slice. At the very least, we should account for the row copies we perform when buffering output rows.
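As a sketch of the accounting idea (self-contained and hypothetical - the real fix would use the monitoring infrastructure already in the codebase, e.g. a bound account, rather than these stand-in types):

```go
package main

import "fmt"

// memAccount is a hypothetical stand-in for a memory-monitored
// allocation account with a fixed budget.
type memAccount struct {
	used, limit int64
}

// grow registers size bytes against the budget, returning an error
// instead of letting usage grow without bound.
func (a *memAccount) grow(size int64) error {
	if a.used+size > a.limit {
		return fmt.Errorf("memory budget exceeded: %d > %d", a.used+size, a.limit)
	}
	a.used += size
	return nil
}

type row []byte

// bufferRow copies a looked-up row into the output buffer, accounting
// for the copy before performing it.
func bufferRow(buf *[]row, acc *memAccount, r row) error {
	if err := acc.grow(int64(len(r))); err != nil {
		return err
	}
	cp := make(row, len(r))
	copy(cp, r)
	*buf = append(*buf, cp)
	return nil
}

func main() {
	acc := memAccount{limit: 32}
	var buf []row
	for i := 0; i < 10; i++ {
		if err := bufferRow(&buf, &acc, row("8-byte--")); err != nil {
			fmt.Printf("buffered %d rows, then: %v\n", len(buf), err)
			return
		}
	}
}
```

The point is just that every buffered copy passes through the account, so a query that would buffer unboundedly fails with a budget error instead of OOMing the node.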
The root cause is that the algorithm the joinReader uses to keep the output ordered and to account for unmatched rows (in the case of left outer and anti join) requires that all results for an input batch are present before any results are emitted. This is problematic when, for example, a left-side table has a single row that matches a billion rows in a right-side table - the algorithm won't emit anything until all billion rows are buffered!
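A minimal sketch of why the batch must be fully drained before emitting (hypothetical types, not the real joinReader code): a row with no match seen so far might still match a later KV result, so "unmatched" is only knowable once all lookups for the batch are done.

```go
package main

import "fmt"

// lookupResult is one right-side row returned by a KV lookup,
// tagged with the buffered input row it matches.
type lookupResult struct {
	inputIdx int
	rightRow string
}

// emitBatch buffers every result for the batch, then emits in input
// order, NULL-extending inputs that never matched (left outer join).
func emitBatch(inputs []string, results []lookupResult) []string {
	matches := make([][]string, len(inputs))
	for _, r := range results { // must drain everything before emitting
		matches[r.inputIdx] = append(matches[r.inputIdx], r.rightRow)
	}
	var out []string
	for i, in := range inputs {
		if len(matches[i]) == 0 {
			out = append(out, in+" | NULL") // unmatched: NULL-extend
			continue
		}
		for _, m := range matches[i] {
			out = append(out, in+" | "+m)
		}
	}
	return out
}

func main() {
	inputs := []string{"a", "b", "c"}
	// Results arrive out of input order; b's match arrives last, so b
	// can't be declared matched or unmatched until the stream ends.
	results := []lookupResult{{2, "c1"}, {0, "a1"}, {1, "b1"}}
	fmt.Println(emitBatch(inputs, results))
}
```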
I'm not sure whether this algorithm can be made streaming. It currently needs to see the results for the whole batch because it can't guarantee that the results from the KV layer arrive in order, so it can never declare an input row "closed" and let its output be emitted early. There may be a way around this limitation, but it interacts in a funny way with the solution for #35950, which sorts all of the input row spans before sending them out, since KV requires sorted spans. That permitted limited KV batches, but it has painted us into a corner.
Imagine a scenario with two tables, a and b, where a contains the integers 0-100 and each value of a matches 1 million rows in b. Now feed the joinReader all of the rows of a in reverse sorted order (starting 100, 99, 98...). We're beholden to output matches for 100 first, since it reached the joinReader first and we must output matches in input order. But because we sorted the spans, the KV layer returns matches for 0 first, then 1, then 2..., so every match for 0-99 must be buffered before the first match for 100 even arrives.
So this will require us to buffer 100 * 1 million rows into memory, kaboom.
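A scaled-down simulation of that blow-up (hypothetical code, just counting rows): with k input values fed in reverse order, results arriving in ascending span order, and n matches per value, everything for the other k-1 values is buffered before the first output row can be produced.

```go
package main

import "fmt"

// rowsBufferedBeforeFirstEmit models the pathological case: input order
// is k-1, k-2, ..., 0, so the first output must be for value k-1, but
// span-sorted results arrive for value 0 first, then 1, and so on.
func rowsBufferedBeforeFirstEmit(k, n int) int {
	buffered := 0
	for v := 0; v < k; v++ { // results arrive in ascending value order
		if v == k-1 {
			// The first match for the first input row finally arrives.
			return buffered
		}
		buffered += n
	}
	return buffered
}

func main() {
	// The issue's numbers: 100 values, 1 million matches each.
	fmt.Println(rowsBufferedBeforeFirstEmit(100, 1_000_000)) // 99 * 1e6
}
```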
I believe that a solution for this requires that we upgrade the batch limit capabilities for the KV layer to not require sorted input spans, but I'm open to other ideas.