coldata: change Bytes vector implementation#76202
craig[bot] merged 5 commits into cockroachdb:master
Conversation
Force-pushed from b31554a to 61c086d

Neat, what is the significance of 19 bytes?
Force-pushed from 61c086d to d8124de
yuzefovich left a comment
I added the rationale in the commit message (not yet to the PR description), but my thinking is that 16 bytes is the minimum acceptable (so that UUIDs are inlined), and we have at least 3 bytes of padding which gives us 19 bytes. We could add more padding in order to increase inlinable space at the cost of larger allocations, but at the moment a single element is 32 bytes which seems just right to me. It's a bit unfortunate that we cannot use 8 bytes from the slice header, but reusing SliceHeader.Data field leads to issues with the GC.
Reviewable status:
complete! 0 of 0 LGTMs obtained
Force-pushed from e49e708 to d406fb5
cc @nvanbenschoten in case you're interested in taking a look at the first commit
Force-pushed from d406fb5 to 807572b
I think we can basically ignore the regression in speed of serialization/deserialization since it's still in the GB/s range, and I have some improvements for
michae2 left a comment
Spicy! 🌶️ 🙂
I buy your argument about the ArrowBatchConverter regressions. Being able to Bytes.Set() seems like a big win.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/col/coldata/bytes.go, line 61 at r2 (raw file):
```
// element was valid (namely, it'll contain a non-inlined value with an
// empty []byte).
nonInlined bool
```
You might be able to inline this, too. (Isn't data.Data always nil when the value is inlined and non-nil when the value is pointed to?)
pkg/col/coldata/bytes.go, line 62 at r2 (raw file):
```
// empty []byte).
nonInlined bool
lengthOrOffset int32
```
If this were only used by inlined data, you could also inline it. (E.g. you could use the first inline byte for the length of inlined data.)
pkg/col/coldata/bytes.go, line 164 at r2 (raw file):
```
		e.data = b.buffer[e.lengthOrOffset : e.lengthOrOffset+int32(len(e.data)) : e.lengthOrOffset+int32(cap(e.data))]
	}
}
```
If element.data were something like {Offset, Len, Cap} rather than {Ptr, Len, Cap} then you wouldn't need to do this update. And bonus, you might be able to use more of element for inlining. (This takes go GC out of the loop, of course, which could cause problems if we're relying on go GC to know when to free Bytes.buffer based on some element outliving its parent Bytes. But if an element never outlives its parent Bytes then it could be fine.)
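A minimal sketch of the `{Offset, Len, Cap}` idea, with hypothetical names: because the header stores an offset into the shared buffer rather than a pointer, it stays valid even when the buffer is reallocated, so no per-element update pass is needed.

```go
package main

import "fmt"

// sliceHeader references a region of a shared buffer by offset, so it
// survives reallocation of that buffer (unlike a pointer-based header).
type sliceHeader struct {
	offset, len, cap int
}

// get materializes the referenced region as a []byte with a full slice
// expression, so the caller cannot grow past the element's capacity.
func get(buf []byte, h sliceHeader) []byte {
	return buf[h.offset : h.offset+h.len : h.offset+h.cap]
}

func main() {
	buf := []byte("zeroone")
	h := sliceHeader{offset: 0, len: 4, cap: 4}
	// Growing the buffer may reallocate it; the header needs no update.
	buf = append(buf, "long_string"...)
	fmt.Printf("%s\n", get(buf, h)) // zero
}
```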
pkg/col/coldata/bytes.go, line 175 at r2 (raw file):
```
type Bytes struct {
	// elements contains all values stored in this Bytes vector.
	elements []element
```
I see, you already implemented the "second layer" @msirek was talking about today with this elements slice. I wonder how much of the performance changes are due to being able to Bytes.Set() using this second layer, rather than due to inlining? The inlining seems like a small optimization on top of this more fundamental change... but maybe I'm misjudging it?
Just curious, did you happen to run any benchmarks with only the second layer, no inlining? If not, would you be willing to try some benchmarks with BytesMaxInlineLength = -1?
yuzefovich left a comment
Thanks for the feedback! I like your suggestions, currently have a WIP to implement them, but will need to polish the code more. I'll put this PR aside for a couple of days and will likely pick it back up on Friday.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
Force-pushed from e513b29 to 97b3ac6
yuzefovich left a comment
I think it is RFAL. I added a separate commit with some optimizations (kept separate for ease of reviewing). The performance regressions for Append/Copy have been reduced to an acceptable level (and there are some speedups too); here are the overall changes of this PR on SortUUID and unordered distinct.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @msirek)
pkg/col/coldata/bytes.go, line 61 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
You might be able to inline this, too. (Isn't `data.Data` always nil when the value is inlined and non-nil when the value is pointed to?)
Refactored.
pkg/col/coldata/bytes.go, line 62 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
If this were only used by inlined data, you could also inline it. (E.g. you could use the first inline byte for the length of inlined data.)
I decided to keep a single byte for inlinedLength separately to reduce the complexity.
pkg/col/coldata/bytes.go, line 164 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
If `element.data` were something like `{Offset, Len, Cap}` rather than `{Ptr, Len, Cap}` then you wouldn't need to do this update. And bonus, you might be able to use more of `element` for inlining. (This takes go GC out of the loop, of course, which could cause problems if we're relying on go GC to know when to free `Bytes.buffer` based on some `element` outliving its parent `Bytes`. But if an `element` never outlives its parent `Bytes` then it could be fine.)
Nice! This is a very good point, and I implemented your suggestion. Now we have 30 bytes of inlinable space and don't have to update pointers for non-inlined elements when Bytes.buffer is reallocated.
We could have squeezed one more byte of inlinable space by merging inlinedLength and nonInlined boolean into a single byte, but I feel like that extra complexity is not worth it. Thoughts?
pkg/col/coldata/bytes.go, line 175 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
I see, you already implemented the "second layer" @msirek was talking about today with this `elements` slice. I wonder how much of the performance changes are due to being able to `Bytes.Set()` using this second layer, rather than due to inlining? The inlining seems like a small optimization on top of this more fundamental change... but maybe I'm misjudging it?

Just curious, did you happen to run any benchmarks with only the second layer, no inlining? If not, would you be willing to try some benchmarks with `BytesMaxInlineLength = -1`?
My guess is that inlining is quite important. I'll run some benchmarks with inlining disabled next week.
yuzefovich left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @msirek)
pkg/col/coldata/bytes.go, line 175 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
My guess is that inlining is quite important. I'll run some benchmarks with inlining disabled next week.
Hm, the benchmarks here show that there are some improvements and some regressions without any inlining when comparing against the current tip of this PR.
I think maybe the best approach will be with no inlining, but with preallocating buffer with some fixed width (say 16 bytes for each element) right away. I'll prototype that at some point, so hold off on reviewing for now.
Force-pushed from 97b3ac6 to 7939ccd
yuzefovich left a comment
@michae2 PTAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @msirek)
pkg/col/coldata/bytes.go, line 175 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Hm, the benchmarks here show that there are some improvements and some regressions without any inlining when comparing against the current tip of this PR.

I think maybe the best approach will be with no inlining, but with preallocating `buffer` with some fixed width (say 16 bytes for each element) right away. I'll prototype that at some point, so hold off on reviewing for now.
My initial WIP had a bug in the AppendSlice method which inflated the benchmark numbers. The updated numbers are here, and they are aligned with my expectation - the inlining is important, so I think the PR in its current form is the way to go.
Force-pushed from 3c4f066 to 11259dc
yuzefovich left a comment
Alright, done pushing, RFAL. The current benchmark numbers are here.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @msirek)
Force-pushed from 11259dc to e5a83de
michae2 left a comment
This is looking good. I'm still reading but have to pause for a bit.
Reviewed 6 of 15 files at r1, 1 of 17 files at r5, 5 of 8 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @yuzefovich)
pkg/col/coldata/bytes.go, line 164 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Nice! This is a very good point, and I implemented your suggestion. Now we have 30 bytes of inlinable space and don't have to update pointers for non-inlined elements when `Bytes.buffer` is reallocated.

We could have squeezed one more byte of inlinable space by merging `inlinedLength` and the `nonInlined` boolean into a single `byte`, but I feel like that extra complexity is not worth it. Thoughts?
This is fine. No need to squeeze any more byte juice.
nit: But it might be nice to reverse the meaning of the inlined boolean so that the zero element was a zero-length inlined []byte rather than a zero-length []byte at offset 0 of the buffer. (Maybe you meant to do that, since you wrote nonInlined instead of inlined in your comment above?)
pkg/col/coldata/bytes.go, line 175 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
My initial WIP had a bug in the `AppendSlice` method which inflated the benchmark numbers. The updated numbers are here, and they are aligned with my expectation - the inlining is important, so I think the PR in its current form is the way to go.
Sounds good!
pkg/col/coldata/bytes.go, line 86 at r9 (raw file):
```
// BytesMaxInlineLength is the maximum length of a []byte that can be inlined
// within element.
const BytesMaxInlineLength = int(ElementSize - 2)
```
nit: Might be a little "safer" to use unsafe.Offsetof(element.inlinedLength) instead of ElementSize - 2 in case go ever puts padding after or between the last two fields.
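The nit can be illustrated with a standalone sketch (struct shape assumed, not the actual coldata code): `unsafe.Offsetof` of the field that follows the inline area yields the usable inline length directly, so compiler-inserted padding can never silently disagree with a hard-coded `ElementSize - 2`.

```go
package main

import (
	"fmt"
	"unsafe"
)

// element mirrors the shape under discussion: inline storage followed by
// a length byte and an inlined flag.
type element struct {
	inlined       [30]byte
	inlinedLength byte
	isInlined     bool
}

// Deriving the constant from the layout itself, per the reviewer's nit:
// the offset of inlinedLength is exactly the number of inline bytes.
const bytesMaxInlineLength = int(unsafe.Offsetof(element{}.inlinedLength))

func main() {
	fmt.Println(bytesMaxInlineLength) // 30
}
```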
pkg/col/coldata/bytes.go, line 186 at r9 (raw file):
```
// TODO(yuzefovich): consider renaming it to Slice and allowing modification of
// "sliced" Bytes since we can now support semantics similar to Golang's slices.
func (b *Bytes) Window(start, end int) *Bytes {
```
Even with this PR I still don't think it is safe to modify a window, because calls to Set could cause a reallocation of Bytes.buffer. Go slices don't have this problem (assignments to index expressions will never cause a reallocation of the underlying array).
Random thought: it might be more effective for "Window" to be a separate type that is simply missing the modification methods. Then these checks would happen at compile time instead of at (test) runtime. (But that's not a suggestion for this PR; just a random thought.)
Code quote:
```
// TODO(yuzefovich): consider renaming it to Slice and allowing modification of
// "sliced" Bytes since we can now support semantics similar to Golang's slices.
```
pkg/col/coldata/bytes.go, line 304 at r9 (raw file):
```
// we can set the new length accordingly (the elements will be set
// below).
b.zero(newLen)
```
Why do we zero these? If we zero them then we won't be able to reuse spots in buffer for non-inlined elements.
pkg/col/coldata/bytes.go, line 307 at r9 (raw file):
```
	b.elements = b.elements[:newLen]
}
for i := 0; i < len(srcElementsToCopy); i++ {
```
We might also need the reverse loop case from CopySlice when b == src && destIdx > srcStartIdx. Come to think of it, seems like AppendSlice could simply call CopySlice after ensuring the capacity of b but maybe I'm overthinking it.
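The reverse-loop concern is the classic overlapping-copy problem; a small sketch (using plain ints, no real coldata types) shows why a forward loop is wrong when the destination starts after the source within the same slice:

```go
package main

import "fmt"

// copyWithin copies n elements from src to dest within the same slice.
// When dest > src the ranges overlap from the front, so a forward loop
// would clobber source elements before reading them; iterate in reverse.
func copyWithin(s []int, dest, src, n int) {
	if dest > src {
		for i := n - 1; i >= 0; i-- {
			s[dest+i] = s[src+i]
		}
		return
	}
	for i := 0; i < n; i++ {
		s[dest+i] = s[src+i]
	}
}

func main() {
	s := []int{1, 2, 3, 4, 5}
	copyWithin(s, 1, 0, 4) // shift the first four elements right by one
	fmt.Println(s)         // [1 1 2 3 4]
}
```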
pkg/sql/colexec/colexecspan/span_encoder_tmpl.go, line 127 at r9 (raw file):
```
func (op *_OP_STRING) next(batch coldata.Batch, startIdx, endIdx int) *coldata.Bytes {
	oldBytesSize := op.outputBytes.Size()
	if op.outputBytes == nil || op.outputBytes.Len() < endIdx-startIdx {
```
nit: Couldn't this check capacity instead of length? (But then I guess we would need a Cap method and a way to set the length?)
Force-pushed from d456f08 to 90ba480
yuzefovich left a comment
I added another cleanup commit to this PR. I could see the last two commits not passing the bar for being merged during the stability period, but personally I'd still like to get them in. Curious to hear your thoughts.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @michae2)
pkg/col/coldata/bytes.go, line 164 at r2 (raw file):
Previously, michae2 (Michael Erickson) wrote…
This is fine. No need to squeeze any more byte juice.
nit: But it might be nice to reverse the meaning of the `inlined` boolean so that the zero `element` was a zero-length inlined `[]byte` rather than a zero-length `[]byte` at offset 0 of the buffer. (Maybe you meant to do that, since you wrote `nonInlined` instead of `inlined` in your comment above?)
I did reverse the meaning from nonInlined to inlined in the latest set of changes (like a couple of days ago).
pkg/col/coldata/bytes.go, line 186 at r9 (raw file):
calls to `Set` could cause a reallocation of `Bytes.buffer`

I think everything would still work, no?
Let's think through an example (imagine we have zero inlinable space):
```
b := NewBytes(2)
b.Set(0, []byte("zero"))
b.Set(1, []byte("one"))
```
At this point we have:
```
buffer = []byte("zeroone")
elements = []element{{offset: 0, len: 4}, {offset: 4, len: 3}}
```
Now we do `b.Window(0, 1).Set(0, "long_string")` which will cause reallocation of the buffer:
```
buffer = []byte("zeroonelong_string")
elements = []element{{offset: 7, len: 11}, {offset: 4, len: 3}}
```
We have successfully overwritten the zeroth element in the window as well as in the original vector. The only thing is that buffer[:4] is now garbage and will never be used again, unless the vector is reset.
Am I missing something?
Random thought: it might be more effective for "Window" to be a separate type that is simply missing the modification methods. Then these checks would happen at compile time instead of at (test) runtime. (But that's not a suggestion for this PR; just a random thought.)
I've thought a bit about this previously too. We could imagine having something like an `ImmutableVec` interface that doesn't include any mutation methods and making the `Vec` interface embed `ImmutableVec` and add those mutation methods. This will probably require quite a bit of changes in the plumbing, and I thought it wasn't worth it. Additionally, it will likely come with some cost since we're performing all operations on the well-typed structs, so if we were to pull methods like `Get` and `Set` into the interface, we'd have to pay the interface dispatch cost. Alternatively, we would need to have separate well-typed structs - something like `ImmutableBytes` and `Bytes`.
OTOH we only have a handful of use cases where we're using these "windows" (like when we're spilling to disk from an in-memory operator in order to "export the buffered data"), and I believe if in those cases we actually did modify the window, then fakedist-disk logic test config would have crashed or returned incorrect results.
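The interface-embedding idea floated above can be sketched as follows (all names hypothetical, not the actual coldata API): a window handed out as the read-only interface cannot be mutated, and the mistake is caught at compile time rather than in tests.

```go
package main

import "fmt"

// ImmutableVec exposes only read access; Vec embeds it and adds mutation.
type ImmutableVec interface {
	Get(i int) []byte
}

type Vec interface {
	ImmutableVec
	Set(i int, v []byte)
}

// bytesVec is a toy vector implementing both interfaces.
type bytesVec struct{ vals [][]byte }

func (b *bytesVec) Get(i int) []byte    { return b.vals[i] }
func (b *bytesVec) Set(i int, v []byte) { b.vals[i] = v }

func main() {
	var w ImmutableVec = &bytesVec{vals: [][]byte{[]byte("zero")}}
	fmt.Printf("%s\n", w.Get(0)) // zero
	// w.Set(0, nil) would not compile: Set is not part of ImmutableVec.
}
```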
pkg/col/coldata/bytes.go, line 304 at r9 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Why do we zero these? If we zero them then we won't be able to reuse spots in buffer for non-inlined elements.
Hm. I think I wanted to maintain an invariant that all `element`s in the `b.elements[len:cap]` range are uninitialized, which makes `Reset` faster when `len < cap`.
Consider the following scenario with zero inlinable space:
```
b := NewBytes(2)
b.Set(0, "foo")
b.Set(1, "bar")
b.Set(0, "fooo")
// buffer = "foobarfooo"
// elements = {{offset: 6, len: 4}, {offset: 3, len: 3}}
// Truncate b to length 1.
b.AppendSlice(b, 1, 0, 0)
// buffer = "foobarfooo"
// elements = {{offset: 6, len: 4}, {offset: 3, len: 3}}, len(elements) = 1
// Reset with zeroing out elements only up to len(b.elements).
b.Reset()
// buffer: len = 0, cap = 10
// elements = {{}, {offset: 3, len: 3}}, len(elements) = 1
b.Set(0, "zero")
// buffer = "zero", cap = 10
// elements = {{offset: 0, len: 4}, {offset: 3, len: 3}}, len(elements) = 1
b2 := NewBytes(1)
b2.Set(0, "one")
// Here we will get corruption because b.elements[1] thinks it can reuse
// buffer[3:6], but the zeroth element uses buffer[0:4].
b.AppendSlice(b2, 1 /* destIdx */, 0 /* startIdx */, 1 /* endIdx */)
```
If this is right, then we either have to zero out all elements after newLen in AppendSlice or to zero out all elements up to cap(b.elements) in Reset.
Update: this is, indeed, a problem, and I converted this snippet into a unit test. I also decided to switch to zeroing out elements up to the capacity in Reset.
pkg/col/coldata/bytes.go, line 307 at r9 (raw file):
`AppendSlice` from the vector into itself is only allowed when `srcStartIdx == srcEndIdx`, in which case there is no overlap, so we don't have to be too smart about it. This is why I opened #76529 some time ago.
Come to think of it, seems like `AppendSlice` could simply call `CopySlice` after ensuring the capacity of b
That's a nice observation - I factored out a common helper method using the same optimization as we had in the append case. I also added some BCEs, and the benchmarks have improved for CopySlice.
pkg/sql/colexec/colexecspan/span_encoder_tmpl.go, line 127 at r9 (raw file):
Previously, michae2 (Michael Erickson) wrote…
nit: Couldn't this check capacity instead of length? (But then I guess we would need a `Cap` method and a way to set the length?)
Given that we're now using `Set`s in this file, checking the length is the right thing to do - `Set`s are only allowed in the [0, len] range.
michae2 left a comment
I think it would be fine to merge the last two commits.
Reviewed 1 of 15 files at r1, 2 of 8 files at r8, 8 of 8 files at r13, 5 of 5 files at r14, 2 of 2 files at r15, 3 of 3 files at r16, 17 of 17 files at r17, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/col/coldata/bytes.go, line 186 at r9 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
calls to `Set` could cause a reallocation of `Bytes.buffer`

I think everything would still work, no? Let's think through an example (imagine we have zero inlinable space):
```
b := NewBytes(2)
b.Set(0, []byte("zero"))
b.Set(1, []byte("one"))
```
At this point we have:
```
buffer = []byte("zeroone")
elements = []element{{offset: 0, len: 4}, {offset: 4, len: 3}}
```
Now we do `b.Window(0, 1).Set(0, "long_string")` which will cause reallocation of the buffer:
```
buffer = []byte("zeroonelong_string")
elements = []element{{offset: 7, len: 11}, {offset: 4, len: 3}}
```
We have successfully overwritten the zeroth element in the window as well as in the original vector. The only thing is that `buffer[:4]` is now garbage and will never be used again, unless the vector is reset. Am I missing something?

Random thought: it might be more effective for "Window" to be a separate type that is simply missing the modification methods. Then these checks would happen at compile time instead of at (test) runtime. (But that's not a suggestion for this PR; just a random thought.)

I've thought a bit about this previously too. We could imagine having something like an `ImmutableVec` interface that doesn't include any mutation methods and making the `Vec` interface embed `ImmutableVec` and add those mutation methods. This will probably require quite a bit of changes in the plumbing, and I thought it wasn't worth it. Additionally, it will likely come with some cost since we're performing all operations on the well-typed structs, so if we were to pull methods like `Get` and `Set` into the interface, we'd have to pay the interface dispatch cost. Alternatively, we would need to have separate well-typed structs - something like `ImmutableBytes` and `Bytes`.

OTOH we only have a handful of use cases where we're using these "windows" (like when we're spilling to disk from an in-memory operator in order to "export the buffered data"), and I believe if in those cases we actually did modify the window, then the `fakedist-disk` logic test config would have crashed or returned incorrect results.
Splitting this `b.Window(0, 1).Set(0, "long_string")` out into:
```
c := b.Window(0, 1)
c.Set(0, "long_string")
```
Won't the call to `c.Set` assign the reallocated buffer to `c.buffer` but not `b.buffer`?
Anyway, I'm not requesting any changes to the code 😆 just agreeing with the current immutableness of Window.
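The aliasing hazard pointed out here can be reproduced with a toy version of the structure (names hypothetical): once a mutation on the window reallocates the buffer, the window and the parent silently stop sharing storage.

```go
package main

import "fmt"

// bytesVec stands in for Bytes: a struct holding a shared byte buffer.
type bytesVec struct{ buffer []byte }

// window returns a new vector sharing the parent's buffer slice header.
func (b *bytesVec) window() *bytesVec { return &bytesVec{buffer: b.buffer} }

// grow appends to the buffer; if this reallocates, only the receiver's
// slice header is rebound to the new backing array.
func (b *bytesVec) grow(v []byte) { b.buffer = append(b.buffer, v...) }

func main() {
	b := &bytesVec{buffer: make([]byte, 0, 4)}
	c := b.window()
	c.grow([]byte("long_string")) // exceeds cap(4): c gets a new array
	// b still sees the old, empty buffer; c sees the reallocated one.
	fmt.Println(len(b.buffer), len(c.buffer)) // 0 11
}
```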
pkg/col/coldata/bytes.go, line 304 at r9 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Hm. I think I wanted to maintain an invariant that all `element`s in the `b.elements[len:cap]` range are uninitialized, which makes `Reset` faster when `len < cap`. Consider the following scenario with zero inlinable space:
```
b := NewBytes(2)
b.Set(0, "foo")
b.Set(1, "bar")
b.Set(0, "fooo")
// buffer = "foobarfooo"
// elements = {{offset: 6, len: 4}, {offset: 3, len: 3}}
// Truncate b to length 1.
b.AppendSlice(b, 1, 0, 0)
// buffer = "foobarfooo"
// elements = {{offset: 6, len: 4}, {offset: 3, len: 3}}, len(elements) = 1
// Reset with zeroing out elements only up to len(b.elements).
b.Reset()
// buffer: len = 0, cap = 10
// elements = {{}, {offset: 3, len: 3}}, len(elements) = 1
b.Set(0, "zero")
// buffer = "zero", cap = 10
// elements = {{offset: 0, len: 4}, {offset: 3, len: 3}}, len(elements) = 1
b2 := NewBytes(1)
b2.Set(0, "one")
// Here we will get corruption because b.elements[1] thinks it can reuse
// buffer[3:6], but the zeroth element uses buffer[0:4].
b.AppendSlice(b2, 1 /* destIdx */, 0 /* startIdx */, 1 /* endIdx */)
```
If this is right, then we either have to zero out all elements after `newLen` in `AppendSlice` or to zero out all elements up to `cap(b.elements)` in `Reset`.

Update: this is, indeed, a problem, and I converted this snippet into a unit test. I also decided to switch to zeroing out elements up to the capacity in `Reset`.
Interesting! Thank you for the example, I wasn't thinking about Reset at all (or using AppendSlice for truncation, ha). Zeroing up to capacity in Reset makes sense to me.
pkg/col/coldata/bytes.go, line 137 at r14 (raw file):
```
// Check if there was an old non-inlined value we can overwrite.
if !e.inlined && e.header.cap >= len(v) {
	copy(e.getNonInlined(b)[:len(v)], v)
```
nit: If `element.getNonInlined` took a length argument then we wouldn't need the `[:len(v)]` expression here. But maybe that's too finicky.
This commit changes the way the `Bytes` vector is implemented in order to
support the `Set` operation at arbitrary positions. Now, a `Bytes` vector is
a slice of `element`s.
`element` describes a single `[]byte` value. If a value doesn't exceed
30 bytes, then the whole value is stored within the `element`; otherwise,
the `element` "points" at a region in a common `Bytes.buffer` where the
non-inlined value is stored.
If the value is inlined, then the layout of `element` is used as follows:
```
24-byte header | 6-byte padding
element: .............................. | length | true
Bytes.buffer: N/A
```
where 30 dots describe the inlinable space followed by a single byte for the
length followed by a boolean `true` indicating an inlined value.
If the value is non-inlined, then the layout of `element` is used as follows:
```
padding
element: .offset. | ..len... | ..cap... | xxxxxx | x | false
Bytes.buffer: xxxxxxxx | offset .... | xxxxxxxx
```
where first 24 bytes contain our custom "header" of a byte slice that is
backed by `Bytes.buffer`. The following 7 bytes (the padding and the
`inlinedLength` byte) are not used, and then we have a boolean `false`
indicating a non-inlined value.
We have a custom "slice header" for the non-inlined values and don't
just use `[]byte` for two reasons:
1. we can easily overwrite the whole struct when a value is inlined without
causing any issues with the GC (because our slice header doesn't contain an
actual slice, GC doesn't have to track whether the backing array is live or
not);
2. our slice header remains valid even when `Bytes.buffer` is reallocated.
If we were to use `[]byte`, then whenever that reallocation occurred, we
would have to iterate over all non-inlined values and update the byte
slices to point to the new `Bytes.buffer`.
This new implementation was inspired by Velox library from Facebook
(https://facebookincubator.github.io/velox/develop/vectors.html#flat-vectors-scalar-types).
Our original implementation was based on the Arrow format, so we were
getting extremely cheap serialization/deserialization. With this change
some of the benchmarks have regressed significantly:
```
Append/bytes/AppendSimple/NullProbability=0.0-24 356MB/s ± 5% 221MB/s ± 1% -37.81% (p=0.000 n=9+8)
Append/bytes/AppendWithSel/NullProbability=0.0-24 241MB/s ± 2% 105MB/s ± 4% -56.49% (p=0.000 n=9+10)
Append/bytes/AppendSimple/NullProbability=0.2-24 354MB/s ± 6% 219MB/s ± 4% -38.11% (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24 229MB/s ± 3% 98MB/s ± 4% -57.20% (p=0.000 n=9+10)
Copy/bytes/CopySimple/NullProbability=0.0-24 39.7GB/s ± 1% 0.5GB/s ± 0% -98.67% (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24 753MB/s ± 1% 544MB/s ± 0% -27.83% (p=0.000 n=10+8)
Copy/bytes/CopySimple/NullProbability=0.2-24 28.9GB/s ± 1% 0.5GB/s ± 0% -98.19% (p=0.000 n=8+10)
Copy/bytes/CopyWithSel/NullProbability=0.2-24 634MB/s ± 0% 575MB/s ± 0% -9.37% (p=0.000 n=9+9)
```
```
ArrowBatchConverter/bytes/nullFraction=0.00/BatchToArrow-24 248GB/s ± 2% 3GB/s ± 3% -98.90% (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/BatchToArrow-24 240GB/s ± 2% 3GB/s ± 2% -98.87% (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.50/BatchToArrow-24 242GB/s ± 1% 3GB/s ± 2% -98.86% (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.00/ArrowToBatch-24 830GB/s ± 2% 3GB/s ± 2% -99.59% (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/ArrowToBatch-24 893GB/s ± 1% 3GB/s ± 1% -99.62% (p=0.000 n=10+8)
ArrowBatchConverter/bytes/nullFraction=0.50/ArrowToBatch-24 910GB/s ± 1% 3GB/s ± 1% -99.62% (p=0.000 n=10+9)
```
I'm not particularly worried about the regression in the serialization
(since the absolute speeds are still high enough for this not to be
a bottleneck), but the slowdown in the `Append/Copy` benchmarks is
worrisome. I do have some ideas of how to improve those operations to
reduce the regressions to acceptable levels, and I'll add those
optimizations in a follow-up commit.
As a proxy for change of performance of `Bytes` vector overall, I ran
`BenchmarkSortUUID` with results [here](https://gist.github.com/yuzefovich/baaf6b4096cb3e2f1504ad2b38431cc6)
and the unordered distinct benchmarks with results [here](https://gist.github.com/yuzefovich/d8b9166977ec56780cb1749e848f5935).
They show that the number of allocations has decreased in all cases,
although the total allocated space increased in some cases with a small
number of rows. As a result, there are some performance regressions on
small row counts. Overall, the results seem satisfactory.
Notably, as a result of this change the performance of top K sort has
improved drastically when bytes-like types are present in the input
columns (because `CopySlice` used in top K sorter no longer has to move
the data around).
It's worth calling out that we still use the same Arrow format for
serialization, so there is no need to bump the DistSQL version because
the change of this commit is purely to the in-memory representation.
An additional nice result of this change is that we no longer have to
maintain the non-decreasing offsets, which will allow us to remove calls
to `SetLength` in many places where the length of the batch is not
changed. This will be done in the follow-up commit.
Release note: None
Release justification: low risk, high benefit changes to existing
functionality.
This commit optimizes `Append`/`Copy` operations for `Bytes` vectors. The main idea is that a new primitive `copy` is introduced which allows for copying one `element` from one vector into another `element` of another vector, which removes the allocations when copying inlined values. An additional improvement is preallocating enough `element`s up front when copying/appending many values.
```
name                                               old speed      new speed      delta
Append/bytes/AppendSimple/NullProbability=0.0-24    226MB/s ± 7%   221MB/s ± 8%      ~     (p=0.143 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24   104MB/s ± 5%   189MB/s ± 1%   +81.57%  (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24    220MB/s ± 2%   229MB/s ± 7%    +4.23%  (p=0.004 n=8+10)
Append/bytes/AppendWithSel/NullProbability=0.2-24  98.4MB/s ± 5%  197.7MB/s ± 4% +100.98%  (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.0-24        521MB/s ± 0%  2981MB/s ± 1%  +471.63%  (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       539MB/s ± 0%  1020MB/s ± 0%   +89.32%  (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.2-24        518MB/s ± 0%  2913MB/s ± 0%  +462.62%  (p=0.000 n=10+8)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       560MB/s ± 0%   875MB/s ± 0%   +56.12%  (p=0.000 n=10+8)

name                                               old alloc/op   new alloc/op   delta
Append/bytes/AppendSimple/NullProbability=0.0-24     76.1kB ± 8%   118.9kB ± 7%   +56.23%  (p=0.000 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24     192kB ± 2%     131kB ± 1%   -32.15%  (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24     78.0kB ± 0%   117.2kB ± 3%   +50.27%  (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24     198kB ± 1%     105kB ± 3%   -47.21%  (p=0.000 n=10+9)
Copy/bytes/CopySimple/NullProbability=0.0-24          0.00B          0.00B           ~     (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.0-24         0.00B          0.00B           ~     (all equal)
Copy/bytes/CopySimple/NullProbability=0.2-24          0.00B          0.00B           ~     (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.2-24         0.00B          0.00B           ~     (all equal)
```
I'm not sure what's up with some of the increase in allocations since I don't see them locally on mac.

Release note: None

Release justification: low risk, high benefit changes to existing functionality.
Since the previous commit made it so that `Bytes` vectors support `Set`s at arbitrary positions, we no longer have to use `CopySlice` in the top K sort implementation.

Release note: None

Release justification: low risk, high benefit changes to existing functionality.
Previously, `Append`ing from the vector into itself was allowed when `srcStartIdx == srcEndIdx` in order to be able to truncate the vector. One of the previous commits removed the only place where this was actually used (when implementing `Append` for Bytes-like types when a selection vector is present), so this commit removes this no-longer-used feature.

Release note: None

Release justification: low risk cleanup.
Previously, we required that operators that might populate a `Bytes` vector set the length on the output batch (which was needed in order to maintain the invariant of non-decreasing offsets in the old implementation of the `Bytes` vector) even if the length of the batch didn't change. This is no longer needed given the updated implementation, so this commit removes those redundant calls to `SetLength`.

Release note: None

Release justification: low risk cleanup.
Force-pushed from 90ba480 to ec8e548
yuzefovich left a comment
Thanks for the reviews and the feedback!
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @michae2)
pkg/col/coldata/bytes.go, line 186 at r9 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Splitting this `b.Window(0, 1).Set(0, "long_string")` out into:
```
c := b.Window(0, 1)
c.Set(0, "long_string")
```
Won't the call to `c.Set` assign the reallocated buffer to `c.buffer` but not `b.buffer`?

Anyway, I'm not requesting any changes to the code 😆 just agreeing with the current immutableness of Window.
Oh, indeed. Removed the TODO.
pkg/col/coldata/bytes.go, line 137 at r14 (raw file):
Previously, michae2 (Michael Erickson) wrote…
nit: If `element.getNonInlined` took a length argument then we wouldn't need the `[:len(v)]` expression here. But maybe that's too finicky.
I refactored the code a bit to remove this.
Build succeeded:
    // bytes value begins.
    bufferOffset int
    len          int
    cap          int
Drive by comment: should we be explicit and use `int64` here? Since the size of `int` is platform-dependent, on a 32-bit platform the `sliceHeader` would be 12 bytes instead of the intended 24 bytes.
yuzefovich left a comment
pkg/col/coldata/bytes.go, line 77 at r18 (raw file):
Previously, Azhng (Archer Zhang) wrote…
Drive by comment: should we be explicit and use `int64` here? Since the size of `int` is platform-dependent, on a 32-bit platform the `sliceHeader` would be 12 bytes instead of the intended 24 bytes.
I thought about it and chose to keep using `int`.

Everything will still work correctly on 32-bit systems; we'll just have less inlinable space (18 bytes instead of 30), and we won't be using the whole size of the object in Go's size class (an `element` will take up 20 bytes and land in the 24-byte object class, so we could have used 4 bytes for more padding; alternatively, we could have used less padding to get into the 16-byte class).

I think that 32-bit systems are quite rare nowadays, and we usually just assume that we're running on a 64-bit system.
coldata: change Bytes vector implementation
This commit changes the way the `Bytes` vector is implemented in order to support the `Set` operation at arbitrary positions. Now, a `Bytes` vector is a slice of `element`s. An `element` describes a single `[]byte` value. If a value doesn't exceed 30 bytes, then the whole value is stored within the `element`; otherwise, the `element` "points" at a region in a common `Bytes.buffer` where the non-inlined value is stored.

If the value is inlined, then the `element` is laid out as 30 bytes of inlinable space, followed by a single byte for the length, followed by a boolean `true` indicating an inlined value.

If the value is non-inlined, then the first 24 bytes of the `element` contain our custom "header" of a byte slice that is backed by `Bytes.buffer`. The following 7 bytes (the padding and the `inlinedLength` byte) are not used, and then we have a boolean `false` indicating a non-inlined value.
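The two layouts can be sketched in runnable form. This is a simplification with assumed names (the real `element` reinterprets its bytes via a header rather than hand-encoding offsets, as done here for clarity):

```go
package main

import "fmt"

// element holds either an inlined value (up to 30 bytes) or, when the
// value is larger, an offset/length pair pointing into a shared buffer,
// hand-encoded into the first 8 bytes of data for simplicity.
type element struct {
	data          [30]byte
	inlinedLength byte
	inlined       bool
}

func (e *element) set(v []byte, buffer *[]byte) {
	if len(v) <= len(e.data) {
		copy(e.data[:], v)
		e.inlinedLength = byte(len(v))
		e.inlined = true
		return
	}
	offset := len(*buffer)
	*buffer = append(*buffer, v...)
	for i := 0; i < 4; i++ { // little-endian offset and length
		e.data[i] = byte(offset >> (8 * i))
		e.data[4+i] = byte(len(v) >> (8 * i))
	}
	e.inlined = false
}

func (e *element) get(buffer []byte) []byte {
	if e.inlined {
		return e.data[:e.inlinedLength]
	}
	var offset, length int
	for i := 0; i < 4; i++ {
		offset |= int(e.data[i]) << (8 * i)
		length |= int(e.data[4+i]) << (8 * i)
	}
	return buffer[offset : offset+length]
}

func main() {
	var buffer []byte
	var short, long element
	short.set([]byte("0123456789abcdef"), &buffer) // 16 bytes: inlined
	long.set([]byte("this value does not fit into thirty bytes"), &buffer)
	fmt.Println(string(short.get(buffer))) // "0123456789abcdef"
	fmt.Println(string(long.get(buffer)))
}
```

Note that setting the 16-byte value leaves the shared buffer untouched, which is why UUID-sized values become allocation-free.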
We have a custom "slice header" for the non-inlined values and don't just use `[]byte` for two reasons:
1. it avoids causing any issues with the GC (because our slice header doesn't contain an actual slice, the GC doesn't have to track whether the backing array is live or not);
2. it stays valid when `Bytes.buffer` is reallocated. If we were to use `[]byte`, then whenever that reallocation occurred, we would have to iterate over all non-inlined values and update the byte slices to point to the new `Bytes.buffer`.

This new implementation was inspired by the Velox library from Facebook
(https://facebookincubator.github.io/velox/develop/vectors.html#flat-vectors-scalar-types).
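The second reason can be demonstrated with plain Go slices (a standalone sketch, not CockroachDB code): after the buffer is reallocated, a stored `[]byte` still points at the old backing array, while an offset/length pair resolved lazily against the current buffer stays correct.

```go
package main

import "fmt"

func main() {
	buffer := make([]byte, 0, 4) // tiny capacity to force reallocation
	buffer = append(buffer, "abc"...)

	view := buffer[0:3]    // a real []byte into the current backing array
	offset, length := 0, 3 // our "slice header": resolved lazily

	// Appending past capacity moves the data to a new backing array.
	buffer = append(buffer, "0123456789"...)
	buffer[0] = 'X' // update the value in the new array

	fmt.Println(string(view))                           // "abc": stale
	fmt.Println(string(buffer[offset : offset+length])) // "Xbc": current
}
```

With the offset-based header there is nothing to fix up after a reallocation; the `[]byte` approach would require rewriting every non-inlined value.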
Our original implementation was based on the Arrow format, so we were
getting extremely cheap serialization/deserialization. With this change
some of the benchmarks have regressed significantly:
I'm not particularly worried about the regression in serialization (since the absolute speeds are still high enough for it not to be a bottleneck), but the slowdown in the `Append`/`Copy` benchmark is worrisome. I do have some ideas of how to improve those operations to reduce the regressions to acceptable levels, and I'll add those optimizations in a follow-up commit.
As a proxy for the change in performance of the `Bytes` vector overall, I ran `BenchmarkSortUUID` with results here and the unordered distinct benchmarks with results here.
They show that the number of allocations has decreased in all cases, although the total allocated space increased in some cases with a small number of rows. As a result, there are some regressions in performance on a small number of rows. Overall, the results seem satisfactory.
Notably, as a result of this change the performance of top K sort has
improved drastically when bytes-like types are present in the input
columns (because `CopySlice` used in the top K sorter no longer has to move the data around).
It's worth calling out that we still use the same Arrow format for
serialization, so there is no need to bump the DistSQL version because
the change of this commit is purely to the in-memory representation.
An additional nice result of this change is that we no longer have to
maintain the non-decreasing offsets, which will allow us to remove calls
to `SetLength` in many places where the length of the batch is not changed. This will be done in the follow-up commit.
Fixes: #75536.
Release note: None
Release justification: low risk, high benefit changes to existing
functionality.
coldata: optimize Append and Copy for Bytes vectors
This commit optimizes the `Append`/`Copy` operations for `Bytes` vectors. The main idea is that a new primitive `copy` is introduced which allows for copying one `element` from one vector into an `element` of another vector, which removes the allocations when copying inlined values. An additional improvement is preallocating enough `element`s to copy/append many values.
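The idea behind the `copy` primitive can be sketched under assumed names (this is not the actual implementation): an inlined `element` is copied by plain struct assignment, so no `[]byte` is ever materialized; only non-inlined payloads touch the destination buffer.

```go
package main

import "fmt"

// element is a simplified sketch: an inlined payload, or an
// offset/length pair into the owning vector's buffer.
type element struct {
	data    [30]byte
	inlined bool
	offset  int32
	length  int32
}

type bytesVec struct {
	elements []element
	buffer   []byte
}

func (v *bytesVec) set(i int, b []byte) {
	e := &v.elements[i]
	e.length = int32(len(b))
	if len(b) <= len(e.data) {
		copy(e.data[:], b)
		e.inlined = true
		return
	}
	e.inlined = false
	e.offset = int32(len(v.buffer))
	v.buffer = append(v.buffer, b...)
}

func (v *bytesVec) get(i int) []byte {
	e := v.elements[i]
	if e.inlined {
		return e.data[:e.length]
	}
	return v.buffer[e.offset : e.offset+e.length]
}

// copyElement copies src.elements[srcIdx] into dst.elements[dstIdx].
// An inlined element is copied by struct assignment alone; only a
// non-inlined payload is appended to dst's buffer, with its offset rewritten.
func copyElement(dst *bytesVec, dstIdx int, src *bytesVec, srcIdx int) {
	e := src.elements[srcIdx]
	if !e.inlined {
		payload := src.buffer[e.offset : e.offset+e.length]
		e.offset = int32(len(dst.buffer))
		dst.buffer = append(dst.buffer, payload...)
	}
	dst.elements[dstIdx] = e
}

func main() {
	src := &bytesVec{elements: make([]element, 2)}
	dst := &bytesVec{elements: make([]element, 2)}
	src.set(0, []byte("short"))
	src.set(1, []byte("a value that is definitely longer than thirty bytes"))
	copyElement(dst, 0, src, 0)
	copyElement(dst, 1, src, 1)
	fmt.Println(string(dst.get(0))) // "short"
	fmt.Println(string(dst.get(1)))
}
```

Preallocating the destination's `elements` slice up front, as the commit describes, means a bulk copy is just a loop over `copyElement` with no per-value allocation for inlined data.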
I'm not sure what's up with some of the increase in allocations since
I don't see them locally on mac.
Release note: None
Release justification: low risk, high benefit changes to existing
functionality.
colexec: cleanup top K with bytes-like types
Since the previous commit made it so that `Bytes` vectors support Sets at arbitrary positions, we no longer have to use `CopySlice` in the top K sort implementation.
Release note: None
Release justification: low risk, high benefit changes to existing
functionality.
coldata: prohibit Append from the vector into itself
Previously, `Append`ing from the vector into itself was allowed when `srcStartIdx == srcEndIdx` in order to be able to truncate the vector. One of the previous commits removed the only place where this was actually used (when implementing `Append` for Bytes-like types when a selection vector is present), so this commit removes this no-longer-used feature.
Release note: None
Release justification: low risk cleanup.
colexec: remove no longer needed calls to SetLength
Previously, we required that operators that might populate a `Bytes` vector set the length on the output batch (which was needed in order to maintain the invariant of non-decreasing offsets in the old implementation of the `Bytes` vector) even if the length of the batch didn't change. This is no longer needed given the updated implementation, so this commit removes those redundant calls to `SetLength`.

Release note: None
Release justification: low risk cleanup.