colexec: deeply reset datum-backed vectors in ResetInternalBatch

Previously, we would keep references to the old datums set in the datum-backed vectors until the datums were overwritten, unnecessarily extending the period during which the no-longer-used datums stayed live. We don't have to do this for any other type because for the others we can actually reuse the old space for new elements, so we want to keep them around (e.g. decimals can reuse the non-inlined coefficient). This commit makes it so that we deeply unset the datums in `ResetInternalBatch`.

Care had to be taken to keep the memory accounting up to date. In particular, after deeply resetting a datum-backed vector, we want to release the memory allocation that was taken up by the actual datums while keeping the overhead of the `tree.Datum` interface accounted for (since the `[]tree.Datum` slice is still fully live).

Release note: None
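To make the accounting subtlety concrete, here is a minimal Go sketch of the idea, not CockroachDB's actual code: `datum`, `datumVec`, `deepReset`, and the size constants are hypothetical stand-ins for `tree.Datum`, the datum-backed vector, and the real memory account.

```go
package main

import "fmt"

// datum stands in for tree.Datum. Its accounted footprint splits into the
// per-element interface overhead (always live, since the []datum slice
// itself stays allocated) and the footprint of the concrete value.
type datum interface{ size() int64 }

// Assumed per-element interface header size, for illustration only.
const datumInterfaceOverhead = 16

type datumVec struct {
	data []datum
}

// deepReset nils out every element so the old datums become garbage-
// collectible, and reports how much memory was released: the footprint of
// the concrete values, but not the interface overhead, because the
// []datum slice remains fully live.
func (v *datumVec) deepReset() (released int64) {
	for i, d := range v.data {
		if d != nil {
			released += d.size()
			v.data[i] = nil
		}
	}
	return released
}

type intDatum struct{ v int64 }

func (d intDatum) size() int64 { return 8 }

func main() {
	vec := datumVec{data: []datum{intDatum{1}, intDatum{2}, nil}}
	// Toy accounting: interface overhead for every slot plus the two values.
	accounted := int64(len(vec.data))*datumInterfaceOverhead + 2*8
	accounted -= vec.deepReset()
	// Only the slice's interface overhead remains accounted for.
	fmt.Println("still accounted after deep reset:", accounted)
}
```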
colexec: fix limiting the output batch size in several operators

In the cFetcher, the hash aggregator, and the ordered synchronizer we perform careful memory accounting in order to limit the size of the output batches. However, we had an incorrect assumption that could lead to batches being larger than desired.

Previously, the first time a batch exceeded the target size, we would remember its "capacity" (i.e. the number of rows that fit into that batch), and from then on we would always put up to that number of rows into the batch. That works well when each row is of roughly the same size throughout the lifetime of an operator; however, that is not always the case. Imagine we have 1000 small rows followed by 1000 large rows: previously, all 1000 large rows would be put into a single batch, significantly exceeding the target size.

This commit makes the limiting much saner: it removes the notion of "max capacity" and instead, after each row is set in the batch, checks whether the batch has exceeded the target size.

Release note: None
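The revised limiting scheme can be illustrated with a small self-contained Go sketch; `row`, `batch`, `fillBatch`, and `targetSize` are hypothetical stand-ins, not the operators' real types or the actual target-size setting.

```go
package main

import "fmt"

// row and batch are toy stand-ins for the vectorized engine's types.
type row []string

type batch struct {
	rows []row
	size int64 // accounted footprint in bytes
}

func (b *batch) set(r row) {
	var sz int64
	for _, col := range r {
		sz += int64(len(col))
	}
	b.rows = append(b.rows, r)
	b.size += sz
}

const targetSize = 64 // toy target in bytes

// fillBatch appends rows until the batch exceeds the target size.
// Crucially, the size check runs after *every* set, so a run of large
// rows ends the batch early instead of filling a remembered "max
// capacity" that was calibrated on small rows.
func fillBatch(b *batch, input []row) (consumed int) {
	for _, r := range input {
		b.set(r)
		consumed++
		if b.size >= targetSize {
			break
		}
	}
	return consumed
}

func main() {
	small := row{"aa"}                     // 2 bytes
	large := row{string(make([]byte, 40))} // 40 bytes
	var b batch
	n := fillBatch(&b, []row{small, small, large, large, large})
	// Stops after the second large row (2+2+40+40 = 84 >= 64).
	fmt.Println("rows emitted:", n, "batch size:", b.size)
}
```

With this per-row check, a batch can overshoot the target by at most one row's footprint, rather than by an arbitrary number of large rows admitted under a stale capacity estimate.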
The problem with the previous commit is that when bytes-like and/or decimal vectors are present, once the footprint of the batch exceeds the target size because of those vectors, `AccountForSet` will allow only a single row per batch after each `ResetMaybeReallocate` call. The root cause is that we hold onto the old bytes/decimal values in order to reuse their space, so our heuristic of disallowing further Sets once the reported usage exceeds the target size breaks down.

One idea is to not account for the bytes-like vectors in `ResetMaybeReallocate` and instead use a proportional size. The downside is that we would then not be accounting for possibly large allocations (for some time).

Another idea is to declare the batch full only when the reported memory usage exceeds the target size and there was an increase in memory usage. This works well for datum-backed vectors. For bytes-like/decimal vectors, however, we could see pathological behavior where the batch sizes keep growing until the `coldata.BatchSize()` capacity is reached.
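A toy Go sketch of that second idea, under stated assumptions: `accountForSet` and `targetSize` here are hypothetical and do not reflect the real `AccountForSet` signature, which is not shown in this thread.

```go
package main

import "fmt"

const targetSize = 64 // toy target in bytes

// accountForSet reports whether the batch should be considered full.
// prevUsage/newUsage are the accounted batch footprints before and after
// the latest Set. Old bytes-like/decimal values are retained for reuse,
// so newUsage can already exceed targetSize right after a reset without
// any row having been written; additionally requiring an actual increase
// avoids declaring such a batch full after a single row.
func accountForSet(prevUsage, newUsage int64) (full bool) {
	return newUsage > targetSize && newUsage > prevUsage
}

func main() {
	// After a reset, reused bytes vectors already account for 100 bytes;
	// a Set that fits entirely in the reused space adds nothing.
	fmt.Println(accountForSet(100, 100)) // false: no growth, keep appending
	fmt.Println(accountForSet(100, 120)) // true: grew past the target
}
```

As the comment notes, the weakness is the no-growth path: when Sets keep fitting into reused space, the batch is never declared full, so batch sizes can keep growing until the `coldata.BatchSize()` capacity cap kicks in.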