coldata: change Bytes vector implementation #76202

Merged
craig[bot] merged 5 commits into cockroachdb:master from yuzefovich:bytes-refactor
Mar 9, 2022
Conversation

@yuzefovich
Member

@yuzefovich yuzefovich commented Feb 7, 2022

coldata: change Bytes vector implementation

This commit changes the way the Bytes vector is implemented in order to
support the Set operation at arbitrary positions. Now, a Bytes vector is
a slice of elements.

element describes a single []byte value. If a value doesn't exceed
30 bytes, then the whole value is stored within the element; otherwise,
the element "points" at a region in a common Bytes.buffer where the
non-inlined value is stored.

If the value is inlined, then the layout of element is used as follows:

           24-byte header | 6-byte padding
  element: .............................. | length | true
  Bytes.buffer: N/A

where 30 dots describe the inlinable space followed by a single byte for the
length followed by a boolean true indicating an inlined value.

If the value is non-inlined, then the layout of element is used as follows:

                                            padding
  element: .offset. | ..len... | ..cap... | xxxxxx | x | false
  Bytes.buffer: xxxxxxxx | offset .... | xxxxxxxx

where the first 24 bytes contain our custom "header" of a byte slice that
is backed by Bytes.buffer. The following 7 bytes (the padding and the
inlinedLength byte) are not used, and then we have a boolean false
indicating a non-inlined value.

We have a custom "slice header" for the non-inlined values and don't
just use []byte for two reasons:

  1. we can easily overwrite the whole struct when a value is inlined without
    causing any issues with the GC (because our slice header doesn't contain an
    actual slice, GC doesn't have to track whether the backing array is live or
    not);
  2. our slice header remains valid even when Bytes.buffer is reallocated.
    If we were to use []byte, then whenever that reallocation occurred, we
    would have to iterate over all non-inlined values and update the byte
    slices to point to the new Bytes.buffer.

This new implementation was inspired by Velox library from Facebook
(https://facebookincubator.github.io/velox/develop/vectors.html#flat-vectors-scalar-types).

Our original implementation was based on the Arrow format, so we were
getting extremely cheap serialization/deserialization. With this change
some of the benchmarks have regressed significantly:

Append/bytes/AppendSimple/NullProbability=0.0-24    356MB/s ± 5%   221MB/s ± 1%    -37.81%  (p=0.000 n=9+8)
Append/bytes/AppendWithSel/NullProbability=0.0-24   241MB/s ± 2%   105MB/s ± 4%    -56.49%  (p=0.000 n=9+10)
Append/bytes/AppendSimple/NullProbability=0.2-24    354MB/s ± 6%   219MB/s ± 4%    -38.11%  (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24   229MB/s ± 3%    98MB/s ± 4%    -57.20%  (p=0.000 n=9+10)
Copy/bytes/CopySimple/NullProbability=0.0-24       39.7GB/s ± 1%   0.5GB/s ± 0%    -98.67%  (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       753MB/s ± 1%   544MB/s ± 0%    -27.83%  (p=0.000 n=10+8)
Copy/bytes/CopySimple/NullProbability=0.2-24       28.9GB/s ± 1%   0.5GB/s ± 0%    -98.19%  (p=0.000 n=8+10)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       634MB/s ± 0%   575MB/s ± 0%     -9.37%  (p=0.000 n=9+9)
ArrowBatchConverter/bytes/nullFraction=0.00/BatchToArrow-24   248GB/s ± 2%      3GB/s ± 3%     -98.90%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/BatchToArrow-24   240GB/s ± 2%      3GB/s ± 2%     -98.87%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.50/BatchToArrow-24   242GB/s ± 1%      3GB/s ± 2%     -98.86%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.00/ArrowToBatch-24   830GB/s ± 2%      3GB/s ± 2%     -99.59%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/ArrowToBatch-24   893GB/s ± 1%      3GB/s ± 1%     -99.62%  (p=0.000 n=10+8)
ArrowBatchConverter/bytes/nullFraction=0.50/ArrowToBatch-24   910GB/s ± 1%      3GB/s ± 1%     -99.62%  (p=0.000 n=10+9)

I'm not particularly worried about the regression in serialization
(since the absolute speeds are still high enough for it not to be
a bottleneck), but the slowdown in the Append/Copy benchmarks is
worrisome. I have some ideas for how to improve those operations to
reduce the regressions to acceptable levels, and I'll add those
optimizations in a follow-up commit.

As a proxy for the overall change in performance of the Bytes vector,
I ran BenchmarkSortUUID with results here and the unordered distinct
benchmarks with results here. They show that the number of allocations
has decreased in all cases, although the total allocated space increased
in some cases with a small number of rows. As a result, there are some
performance regressions on small row counts. Overall, the results seem
satisfactory.

Notably, as a result of this change, the performance of top K sort has
improved drastically when bytes-like types are present in the input
columns (because the CopySlice used in the top K sorter no longer has to
move the data around).

It's worth calling out that we still use the same Arrow format for
serialization, so there is no need to bump the DistSQL version because
the change of this commit is purely to the in-memory representation.

An additional nice result of this change is that we no longer have to
maintain the non-decreasing offsets, which will allow us to remove calls
to SetLength in many places where the length of the batch is not
changed. This will be done in the follow-up commit.

Fixes: #75536.

Release note: None

Release justification: low risk, high benefit changes to existing
functionality.

coldata: optimize Append and Copy for Bytes vectors

This commit optimizes the Append/Copy operations for Bytes vectors. The
main idea is that a new primitive, copy, is introduced which allows for
copying a single element of one vector into an element of another
vector, which removes the allocations when copying inlined values.

An additional improvement is preallocating enough elements up front when
copying/appending many values.
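The element-level copy idea described above can be sketched as follows (a minimal, hypothetical model, not the actual coldata code; only the inlined case is shown, where a copy is a plain 32-byte struct assignment with no allocation):

```go
package main

import "fmt"

// element models a fixed-size Bytes-vector element; the field names are
// illustrative and only the inlined case is modeled.
type element struct {
	data          [30]byte
	inlinedLength byte
	inlined       bool
}

// copyInlined copies selected inlined elements by plain struct
// assignment: one fixed-size copy per value, no per-value allocation.
func copyInlined(dst, src []element, sel []int) {
	for i, srcIdx := range sel {
		dst[i] = src[srcIdx]
	}
}

func main() {
	src := make([]element, 3)
	for i, s := range []string{"a", "bb", "ccc"} {
		copy(src[i].data[:], s)
		src[i].inlinedLength = byte(len(s))
		src[i].inlined = true
	}
	// Preallocate the destination once for the whole operation, rather
	// than growing it per value.
	dst := make([]element, 2)
	copyInlined(dst, src, []int{2, 0})
	fmt.Println(string(dst[0].data[:dst[0].inlinedLength])) // ccc
	fmt.Println(string(dst[1].data[:dst[1].inlinedLength])) // a
}
```

Non-inlined values would additionally need their bytes appended to the destination's shared buffer so the two vectors never alias each other's storage.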

name                                               old speed      new speed       delta
Append/bytes/AppendSimple/NullProbability=0.0-24    226MB/s ± 7%    221MB/s ± 8%      ~     (p=0.143 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24   104MB/s ± 5%    189MB/s ± 1%   +81.57%  (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24    220MB/s ± 2%    229MB/s ± 7%    +4.23%  (p=0.004 n=8+10)
Append/bytes/AppendWithSel/NullProbability=0.2-24  98.4MB/s ± 5%  197.7MB/s ± 4%  +100.98%  (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.0-24        521MB/s ± 0%   2981MB/s ± 1%  +471.63%  (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       539MB/s ± 0%   1020MB/s ± 0%   +89.32%  (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.2-24        518MB/s ± 0%   2913MB/s ± 0%  +462.62%  (p=0.000 n=10+8)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       560MB/s ± 0%    875MB/s ± 0%   +56.12%  (p=0.000 n=10+8)

name                                               old alloc/op   new alloc/op    delta
Append/bytes/AppendSimple/NullProbability=0.0-24     76.1kB ± 8%    118.9kB ± 7%   +56.23%  (p=0.000 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24     192kB ± 2%      131kB ± 1%   -32.15%  (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24     78.0kB ± 0%    117.2kB ± 3%   +50.27%  (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24     198kB ± 1%      105kB ± 3%   -47.21%  (p=0.000 n=10+9)
Copy/bytes/CopySimple/NullProbability=0.0-24          0.00B           0.00B           ~     (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.0-24         0.00B           0.00B           ~     (all equal)
Copy/bytes/CopySimple/NullProbability=0.2-24          0.00B           0.00B           ~     (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.2-24         0.00B           0.00B           ~     (all equal)

I'm not sure what's up with some of the increases in allocations since
I don't see them locally on macOS.

Release note: None

Release justification: low risk, high benefit changes to existing
functionality.

colexec: cleanup top K with bytes-like types

Since the previous commit made it so that Bytes vectors support Set at
arbitrary positions, we no longer have to use CopySlice in the top K
sort implementation.

Release note: None

Release justification: low risk, high benefit changes to existing
functionality.

coldata: prohibit Append from the vector into itself

Previously, appending from a vector into itself was allowed when
srcStartIdx == srcEndIdx in order to be able to truncate the vector.
One of the previous commits removed the only place where this was
actually used (the implementation of Append for bytes-like types when
a selection vector is present), so this commit removes this
no-longer-used feature.

Release note: None

Release justification: low risk cleanup.

colexec: remove no longer needed calls to SetLength

Previously, we required that operators that might populate a Bytes
vector set the length on the output batch (which was needed in order to
maintain the invariant of non-decreasing offsets in the old
implementation of the Bytes vector) even if the length of the batch
didn't change. This is no longer needed given the updated
implementation, so this commit removes those redundant calls to
SetLength.

Release note: None

Release justification: low risk cleanup.

@yuzefovich yuzefovich added the do-not-merge label (bors won't merge a PR with this label) Feb 7, 2022
@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the bytes-refactor branch 8 times, most recently from b31554a to 61c086d Compare February 8, 2022 17:09
@jordanlewis
Member

Neat, what is the significance of 19 bytes?

Member Author

@yuzefovich yuzefovich left a comment

I added the rationale in the commit message (not yet to the PR description), but my thinking is that 16 bytes is the minimum acceptable (so that UUIDs are inlined), and we have at least 3 bytes of padding, which gives us 19 bytes. We could add more padding in order to increase the inlinable space at the cost of larger allocations, but at the moment a single element is 32 bytes, which seems just right to me. It's a bit unfortunate that we cannot use 8 bytes from the slice header, but reusing the SliceHeader.Data field leads to issues with the GC.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@yuzefovich yuzefovich force-pushed the bytes-refactor branch 2 times, most recently from e49e708 to d406fb5 Compare February 8, 2022 21:29
@yuzefovich yuzefovich removed the do-not-merge label (bors won't merge a PR with this label) Feb 8, 2022
@yuzefovich yuzefovich marked this pull request as ready for review February 8, 2022 21:30
@yuzefovich yuzefovich requested a review from a team as a code owner February 8, 2022 21:30
@yuzefovich
Member Author

cc @nvanbenschoten in case you're interested in taking a look at the first commit

@yuzefovich
Member Author

I think we can basically ignore the regression in the speed of serialization/deserialization since it's still in the GB/s range, and I have some improvements for Append/Copy (the regressions in which are more worrisome); the numbers are here. I'll need more time to polish those up, and I think it's better not to include them in this PR for ease of reviewing anyway.

Collaborator

@michae2 michae2 left a comment

Spicy! 🌶️ 🙂

I buy your argument about the ArrowBatchConverter regressions. Being able to Bytes.Set() seems like a big win.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)


pkg/col/coldata/bytes.go, line 61 at r2 (raw file):

	// element was valid (namely, it'll contain a non-inlined value with an
	// empty []byte).
	nonInlined     bool

You might be able to inline this, too. (Isn't data.Data always nil when the value is inlined and non-nil when the value is pointed to?)


pkg/col/coldata/bytes.go, line 62 at r2 (raw file):

	// empty []byte).
	nonInlined     bool
	lengthOrOffset int32

If this were only used by inlined data, you could also inline it. (E.g. you could use the first inline byte for the length of inlined data.)


pkg/col/coldata/bytes.go, line 164 at r2 (raw file):

					e.data = b.buffer[e.lengthOrOffset : e.lengthOrOffset+int32(len(e.data)) : e.lengthOrOffset+int32(cap(e.data))]
				}
			}

If element.data were something like {Offset, Len, Cap} rather than {Ptr, Len, Cap} then you wouldn't need to do this update. And bonus, you might be able to use more of element for inlining. (This takes go GC out of the loop, of course, which could cause problems if we're relying on go GC to know when to free Bytes.buffer based on some element outliving its parent Bytes. But if an element never outlives its parent Bytes then it could be fine.)


pkg/col/coldata/bytes.go, line 175 at r2 (raw file):

type Bytes struct {
	// elements contains all values stored in this Bytes vector.
	elements []element

I see, you already implemented the "second layer" @msirek was talking about today with this elements slice. I wonder how much of the performance changes are due to being able to Bytes.Set() using this second layer, rather than due to inlining? The inlining seems like a small optimization on top of this more fundamental change... but maybe I'm misjudging it?

Just curious, did you happen to run any benchmarks with only the second layer, no inlining? If not, would you be willing to try some benchmarks with BytesMaxInlineLength = -1?

Member Author

@yuzefovich yuzefovich left a comment

Thanks for the feedback! I like your suggestions, currently have a WIP to implement them, but will need to polish the code more. I'll put this PR aside for a couple of days and will likely pick it back up on Friday.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)

@yuzefovich yuzefovich force-pushed the bytes-refactor branch 2 times, most recently from e513b29 to 97b3ac6 Compare February 12, 2022 06:05
Member Author

@yuzefovich yuzefovich left a comment

I think it is RFAL. I added a separate commit with some optimizations (kept separate for ease of reviewing). The performance regressions for Append/Copy have been reduced to an acceptable level (and there are some speedups too); here is the overall change of this PR on SortUUID and unordered distinct.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @msirek)


pkg/col/coldata/bytes.go, line 61 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

You might be able to inline this, too. (Isn't data.Data always nil when the value is inlined and non-nil when the value is pointed to?)

Refactored.


pkg/col/coldata/bytes.go, line 62 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

If this were only used by inlined data, you could also inline it. (E.g. you could use the first inline byte for the length of inlined data.)

I decided to keep a single byte for inlinedLength separately to reduce the complexity.


pkg/col/coldata/bytes.go, line 164 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

If element.data were something like {Offset, Len, Cap} rather than {Ptr, Len, Cap} then you wouldn't need to do this update. And bonus, you might be able to use more of element for inlining. (This takes go GC out of the loop, of course, which could cause problems if we're relying on go GC to know when to free Bytes.buffer based on some element outliving its parent Bytes. But if an element never outlives its parent Bytes then it could be fine.)

Nice! This is a very good point, and I implemented your suggestion. Now we have 30 bytes of inlinable space and don't have to update pointers for non-inlined elements when Bytes.buffer is reallocated.

We could have squeezed one more byte of inlinable space by merging inlinedLength and nonInlined boolean into a single byte, but I feel like that extra complexity is not worth it. Thoughts?


pkg/col/coldata/bytes.go, line 175 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

I see, you already implemented the "second layer" @msirek was talking about today with this elements slice. I wonder how much of the performance changes are due to being able to Bytes.Set() using this second layer, rather than due to inlining? The inlining seems like a small optimization on top of this more fundamental change... but maybe I'm misjudging it?

Just curious, did you happen to run any benchmarks with only the second layer, no inlining? If not, would you be willing to try some benchmarks with BytesMaxInlineLength = -1?

My guess is that inlining is quite important. I'll run some benchmarks with inlining disabled next week.

Member Author

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @msirek)


pkg/col/coldata/bytes.go, line 175 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

My guess is that inlining is quite important. I'll run some benchmarks with inlining disabled next week.

Hm, the benchmarks here show that there are some improvements and some regressions without any inlining when comparing against the current tip of this PR.

I think maybe the best approach will be with no inlining, but with preallocating buffer with some fixed width (say 16 bytes for each element) right away. I'll prototype that at some point, so hold off on reviewing for now.

Member Author

@yuzefovich yuzefovich left a comment

@michae2 PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @msirek)


pkg/col/coldata/bytes.go, line 175 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, the benchmarks here show that there are some improvements and some regressions without any inlining when comparing against the current tip of this PR.

I think maybe the best approach will be with no inlining, but with preallocating buffer with some fixed width (say 16 bytes for each element) right away. I'll prototype that at some point, so hold off on reviewing for now.

My initial WIP had a bug in the AppendSlice method which inflated the benchmark numbers. The updated numbers are here, and they are aligned with my expectations: inlining is important, so I think the PR in its current form is the way to go.

@yuzefovich yuzefovich force-pushed the bytes-refactor branch 2 times, most recently from 3c4f066 to 11259dc Compare March 6, 2022 02:44
Member Author

@yuzefovich yuzefovich left a comment

Alright, done pushing, RFAL. The current benchmark numbers are here.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @msirek)

Collaborator

@michae2 michae2 left a comment

This is looking good. I'm still reading but have to pause for a bit.

Reviewed 6 of 15 files at r1, 1 of 17 files at r5, 5 of 8 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @michae2, and @yuzefovich)


pkg/col/coldata/bytes.go, line 164 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Nice! This is a very good point, and I implemented your suggestion. Now we have 30 bytes of inlinable space and don't have to update pointers for non-inlined elements when Bytes.buffer is reallocated.

We could have squeezed one more byte of inlinable space by merging inlinedLength and nonInlined boolean into a single byte, but I feel like that extra complexity is not worth it. Thoughts?

This is fine. No need to squeeze any more byte juice.

nit: But it might be nice to reverse the meaning of the inlined boolean so that the zero element was a zero-length inlined []byte rather than a zero-length []byte at offset 0 of the buffer. (Maybe you meant to do that, since you wrote nonInlined instead of inlined in your comment above?)


pkg/col/coldata/bytes.go, line 175 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

My initial WIP had a bug where in the AppendSlice method which inflated the benchmark numbers. The updated numbers are here, and they are aligned with my expectation - the inlining is important, so I think the PR in its current form is the way to go.

Sounds good!


pkg/col/coldata/bytes.go, line 86 at r9 (raw file):

// BytesMaxInlineLength is the maximum length of a []byte that can be inlined
// within element.
const BytesMaxInlineLength = int(ElementSize - 2)

nit: Might be a little "safer" to use unsafe.Offsetof(element.inlinedLength) instead of ElementSize - 2 in case go ever puts padding after or between the last two fields.


pkg/col/coldata/bytes.go, line 186 at r9 (raw file):

// TODO(yuzefovich): consider renaming it to Slice and allowing modification of
// "sliced" Bytes since we can now support semantics similar to Golang's slices.
func (b *Bytes) Window(start, end int) *Bytes {

Even with this PR I still don't think it is safe to modify a window, because calls to Set could cause a reallocation of Bytes.buffer. Go slices don't have this problem (assignments to index expressions will never cause a reallocation of the underlying array).

Random thought: it might be more effective for "Window" to be a separate type that is simply missing the modification methods. Then these checks would happen at compile time instead of at (test) runtime. (But that's not a suggestion for this PR; just a random thought.)

Code quote:

// TODO(yuzefovich): consider renaming it to Slice and allowing modification of
// "sliced" Bytes since we can now support semantics similar to Golang's slices.

pkg/col/coldata/bytes.go, line 304 at r9 (raw file):

		// we can set the new length accordingly (the elements will be set
		// below).
		b.zero(newLen)

Why do we zero these? If we zero them then we won't be able to reuse spots in buffer for non-inlined elements.


pkg/col/coldata/bytes.go, line 307 at r9 (raw file):

		b.elements = b.elements[:newLen]
	}
	for i := 0; i < len(srcElementsToCopy); i++ {

We might also need the reverse-loop case from CopySlice when b == src && destIdx > srcStartIdx. Come to think of it, it seems like AppendSlice could simply call CopySlice after ensuring the capacity of b, but maybe I'm overthinking it.


pkg/sql/colexec/colexecspan/span_encoder_tmpl.go, line 127 at r9 (raw file):

func (op *_OP_STRING) next(batch coldata.Batch, startIdx, endIdx int) *coldata.Bytes {
	oldBytesSize := op.outputBytes.Size()
	if op.outputBytes == nil || op.outputBytes.Len() < endIdx-startIdx {

nit: Couldn't this check capacity instead of length? (But then I guess we would need a Cap method and a way to set the length?)

@yuzefovich yuzefovich force-pushed the bytes-refactor branch 2 times, most recently from d456f08 to 90ba480 Compare March 9, 2022 08:02
Member Author

@yuzefovich yuzefovich left a comment

I added another cleanup commit to this PR. I could see the last two commits not passing the bar for being merged during the stability period; still, personally I'd like to get them in. Curious to hear your thoughts.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @michae2)


pkg/col/coldata/bytes.go, line 164 at r2 (raw file):

Previously, michae2 (Michael Erickson) wrote…

This is fine. No need to squeeze any more byte juice.

nit: But it might be nice to reverse the meaning of the inlined boolean so that the zero element was a zero-length inlined []byte rather than a zero-length []byte at offset 0 of the buffer. (Maybe you meant to do that, since you wrote nonInlined instead of inlined in your comment above?)

I did reverse the meaning from nonInlined to inlined in the latest set of changes (like a couple of days ago).


pkg/col/coldata/bytes.go, line 186 at r9 (raw file):

calls to Set could cause a reallocation of Bytes.buffer

I think everything would still work, no?

Let's think through an example (imagine we have zero inlinable space):

b := NewBytes(2)
b.Set(0, []byte("zero"))
b.Set(1, []byte("one"))

At this point we have:

buffer = []byte("zeroone")
elements = []element{{offset: 0, len: 4}, {offset: 4, len: 3}}

Now we do b.Window(0, 1).Set(0, "long_string") which will cause reallocation of the buffer:

buffer = []byte("zeroonelong_string")
elements = []element{{offset: 7, len: 11}, {offset: 4, len: 3}}

We have successfully overwritten the zeroth element in the window as well as in the original vector. The only thing is that buffer[:4] is now garbage and will never be used again, unless the vector is reset.

Am I missing something?

Random thought: it might be more effective for "Window" to be a separate type that is simply missing the modification methods. Then these checks would happen at compile time instead of at (test) runtime. (But that's not a suggestion for this PR; just a random thought.)

I've thought a bit about this too previously. We could imagine having something ImmutableVec interface that doesn't include any mutation methods and making Vec interface embed ImmutableVec and add those mutation methods. This will - probably - require quite a bit of changes in the plumbing, and I thought it wasn't worth it. Additionally, it will likely come with some cost since we're performing all operations on the well-typed structs, so if we were to pull methods like Get and Set into the interface, we'd have to pay the interface dispatch cost. Alternatively, we would need to have separate well-typed structs - something like ImmutableBytes and Bytes.

OTOH we only have a handful of use cases where we're using these "windows" (like when we're spilling to disk from an in-memory operator in order to "export the buffered data"), and I believe if in those cases we actually did modify the window, then fakedist-disk logic test config would have crashed or returned incorrect results.


pkg/col/coldata/bytes.go, line 304 at r9 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Why do we zero these? If we zero them then we won't be able to reuse spots in buffer for non-inlined elements.

Hm. I think I wanted to maintain an invariant that all elements in b.elements[len:cap] range are uninitialized which makes Reset faster when len < cap.

Consider the following scenario with zero inlinable space:

b := NewBytes(2)
b.Set(0, "foo")
b.Set(1, "bar")
b.Set(0, "fooo")
// buffer = "foobarfooo"
// elements = {{offset: 6, len: 4}, {offset: 3, len: 3}}

// Truncate b to length 1.
b.AppendSlice(b, 1, 0, 0)
// buffer = "foobarfooo"
// elements = {{offset: 6, len: 4}, {offset: 3, len: 3}}, len(elements) = 1

// Reset with zeroing out elements only up to len(b.elements)
b.Reset()
// buffer: len = 0, cap = 10
// elements = {{}, {offset: 3, len: 3}}, len(elements) = 1

b.Set(0, "zero")
// buffer = "zero", cap = 10
// elements = {{offset: 0, len: 4}, {offset: 3, len: 3}}, len(elements) = 1

b2 := NewBytes(1)
b2.Set(0, "one")
// Here we will get corruption because b.elements[1] thinks it can reuse
// buffer[3:6], but the zeroth element uses buffer[0:4].
b.AppendSlice(b2, 1 /* destIdx */, 0 /* startIdx */, 1 /* endIdx */)

If this is right, then we either have to zero out all elements after newLen in AppendSlice or to zero out all elements up to cap(b.elements) in Reset.


Update: this is, indeed, a problem, and I converted this snippet into a unit test. I also decided to switch to zeroing out elements up to the capacity in Reset.


pkg/col/coldata/bytes.go, line 307 at r9 (raw file):
AppendSlice from the vector into itself is only allowed when srcStartIdx == srcEndIdx, in which case there is no overlap, so we don't have to be too smart about it. This is why I opened #76529 some time ago.

Come to think of it, seems like AppendSlice could simply call CopySlice after ensuring the capacity of b

That's a nice observation - I factored out a common helper method using the same optimization as we had in the append case. I also added some BCEs (bounds-check eliminations), and the benchmarks for CopySlice have improved.


pkg/sql/colexec/colexecspan/span_encoder_tmpl.go, line 127 at r9 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Couldn't this check capacity instead of length? (But then I guess we would need a Cap method and a way to set the length?)

Given that we're now using Sets in this file, checking the length is the right thing to do - Sets are only allowed in [0, len] range.

Collaborator

@michae2 michae2 left a comment

Nice work! :lgtm:

I think it would be fine to merge the last two commits.

Reviewed 1 of 15 files at r1, 2 of 8 files at r8, 8 of 8 files at r13, 5 of 5 files at r14, 2 of 2 files at r15, 3 of 3 files at r16, 17 of 17 files at r17, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)


pkg/col/coldata/bytes.go, line 186 at r9 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

calls to Set could cause a reallocation of Bytes.buffer

I think everything would still work, no?

Let's think through an example (imagine we have zero inlinable space):

b := NewBytes(2)
b.Set(0, []byte("zero"))
b.Set(1, []byte("one"))

At this point we have:

buffer = []byte("zeroone")
elements = []element{{offset: 0, len: 4}, {offset: 4, len: 3}}

Now we do b.Window(0, 1).Set(0, "long_string") which will cause reallocation of the buffer:

buffer = []byte("zeroonelong_string")
elements = []element{{offset: 7, len: 11}, {offset: 4, len: 3}}

We have successfully overwritten the zeroth element in the window as well as in the original vector. The only thing is that buffer[:4] is now garbage and will never be used again, unless the vector is reset.

Am I missing something?

Random thought: it might be more effective for "Window" to be a separate type that is simply missing the modification methods. Then these checks would happen at compile time instead of at (test) runtime. (But that's not a suggestion for this PR; just a random thought.)

I've thought a bit about this too previously. We could imagine having something ImmutableVec interface that doesn't include any mutation methods and making Vec interface embed ImmutableVec and add those mutation methods. This will - probably - require quite a bit of changes in the plumbing, and I thought it wasn't worth it. Additionally, it will likely come with some cost since we're performing all operations on the well-typed structs, so if we were to pull methods like Get and Set into the interface, we'd have to pay the interface dispatch cost. Alternatively, we would need to have separate well-typed structs - something like ImmutableBytes and Bytes.

OTOH we only have a handful of use cases where we're using these "windows" (like when we're spilling to disk from an in-memory operator in order to "export the buffered data"), and I believe if in those cases we actually did modify the window, then fakedist-disk logic test config would have crashed or returned incorrect results.
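For illustration, the interface split sketched above could look like the following (a hedged sketch: only the names ImmutableVec/Vec come from the discussion; bytesVec and its methods are made up here):

```go
package main

import "fmt"

// ImmutableVec exposes only the read-only methods; Vec embeds it and
// adds mutation. A "window" could then be typed as ImmutableVec so
// mutation attempts fail at compile time instead of at (test) runtime.
type ImmutableVec interface {
	Get(i int) []byte
}

type Vec interface {
	ImmutableVec
	Set(i int, v []byte)
}

// bytesVec is a made-up concrete type implementing both interfaces.
type bytesVec struct{ vals [][]byte }

func (b *bytesVec) Get(i int) []byte    { return b.vals[i] }
func (b *bytesVec) Set(i int, v []byte) { b.vals[i] = v }

func main() {
	var v Vec = &bytesVec{vals: make([][]byte, 1)}
	v.Set(0, []byte("x"))
	var w ImmutableVec = v // a "window": no Set method available here
	fmt.Println(string(w.Get(0)))
}
```

Note that, as the discussion points out, going through the interface would add dispatch cost on the Get/Set hot path compared to calling methods on the concrete structs.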

Splitting this b.Window(0, 1).Set(0, "long_string") out into:

```go
c := b.Window(0, 1)
c.Set(0, "long_string")
```

Won't the call to c.Set assign the reallocated buffer to c.buffer but not b.buffer?

Anyway, I'm not requesting any changes to the code 😆 just agreeing with the current immutableness of Window.
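The reason the reallocation wouldn't propagate back can be reduced to plain Go struct-copy semantics (a sketch; bytesVec is a made-up stand-in for coldata.Bytes, not the actual type):

```go
package main

import "fmt"

// bytesVec stands in for the vector: a Window is a shallow struct copy,
// so a reallocation of the copy's buffer is invisible to the original.
type bytesVec struct {
	buffer []byte
}

func main() {
	b := bytesVec{buffer: make([]byte, 0, 4)}
	b.buffer = append(b.buffer, "zero"...)

	c := b // the "window": a shallow struct copy sharing the backing array
	// Appending past capacity reallocates c.buffer; b.buffer still points
	// at the old backing array and is unchanged.
	c.buffer = append(c.buffer, "long_string"...)

	fmt.Println(string(b.buffer)) // zero
	fmt.Println(string(c.buffer)) // zerolong_string
}
```

This is exactly the hazard michae2 describes: after c.Set triggers a reallocation, c and b disagree about which buffer is current, so keeping Window immutable sidesteps the problem.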


pkg/col/coldata/bytes.go, line 304 at r9 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm. I think I wanted to maintain an invariant that all elements in b.elements[len:cap] range are uninitialized which makes Reset faster when len < cap.

Consider the following scenario with zero inlinable space:

```go
b := NewBytes(2)
b.Set(0, "foo")
b.Set(1, "bar")
b.Set(0, "fooo")
// buffer = "foobarfooo"
// elements = {{offset: 6, len: 4}, {offset: 3, len: 3}}

// Truncate b to length 1.
b.AppendSlice(b, 1, 0, 0)
// buffer = "foobarfooo"
// elements = {{offset: 6, len: 4}, {offset: 3, len: 3}}, len(elements) = 1

// Reset with zeroing out elements only up to len(b.elements).
b.Reset()
// buffer: len = 0, cap = 10
// elements = {{}, {offset: 3, len: 3}}, len(elements) = 1

b.Set(0, "zero")
// buffer = "zero", cap = 10
// elements = {{offset: 0, len: 4}, {offset: 3, len: 3}}, len(elements) = 1

b2 := NewBytes(1)
b2.Set(0, "one")
// Here we will get corruption because b.elements[1] thinks it can reuse
// buffer[3:6], but the zeroth element uses buffer[0:4].
b.AppendSlice(b2, 1 /* destIdx */, 0 /* startIdx */, 1 /* endIdx */)
```

If this is right, then we either have to zero out all elements after newLen in AppendSlice, or zero out all elements up to cap(b.elements) in Reset.


Update: this is, indeed, a problem, and I converted this snippet into a unit test. I also decided to switch to zeroing out elements up to the capacity in Reset.

Interesting! Thank you for the example, I wasn't thinking about Reset at all (or using AppendSlice for truncation, ha). Zeroing up to capacity in Reset makes sense to me.
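The stale-element mechanism in this scenario can be modeled with plain slices (a sketch; element here is a made-up two-field stand-in, not the real struct):

```go
package main

import "fmt"

// element models a non-inlined entry: where the value lives in the
// shared buffer and how long it is.
type element struct{ offset, length int }

func main() {
	elems := make([]element, 2)
	elems[0] = element{offset: 6, length: 4}
	elems[1] = element{offset: 3, length: 3}

	// "Truncate" to length 1: the second element stays in elems[len:cap].
	elems = elems[:1]

	// A Reset that zeroes only elems[:len(elems)] misses the stale entry.
	for i := range elems {
		elems[i] = element{}
	}

	// Later the vector grows back over its capacity range and the stale
	// element resurfaces, still claiming bytes that a live value may now
	// occupy in the shared buffer.
	elems = elems[:2]
	fmt.Println(elems[1]) // {3 3}
}
```

Zeroing up to cap(elems) in Reset (the fix chosen above) removes the stale entry before it can be re-exposed.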


pkg/col/coldata/bytes.go, line 137 at r14 (raw file):

```go
	// Check if there was an old non-inlined value we can overwrite.
	if !e.inlined && e.header.cap >= len(v) {
		copy(e.getNonInlined(b)[:len(v)], v)
```

nit: If element.getNonInlined took a length argument then we wouldn't need the [:len(v)] expression here. But maybe that's too finicky.

This commit changes the way `Bytes` vector is implemented in order to
support `Set` operation at arbitrary positions. Now, a `Bytes` vector is
a slice of `element`s.

`element` describes a single `[]byte` value. If a value doesn't exceed
30 bytes, then the whole value is stored within the `element`; otherwise,
the `element` "points" at a region in a common `Bytes.buffer` where the
non-inlined value is stored.

If the value is inlined, then the layout of `element` is used as follows:
```
           24-byte header | 6-byte padding
  element: .............................. | length | true
  Bytes.buffer: N/A
```
where the 30 dots represent the inlinable space, followed by a single byte for
the length and a boolean `true` indicating an inlined value.

If the value is non-inlined, then the layout of `element` is used as follows:
```
                                            padding
  element: .offset. | ..len... | ..cap... | xxxxxx | x | false
  Bytes.buffer: xxxxxxxx | offset .... | xxxxxxxx
```
where the first 24 bytes contain our custom "header" of a byte slice that is
backed by `Bytes.buffer`. The following 7 bytes (the padding and the
`inlinedLength` byte) are not used, and then we have a boolean `false`
indicating a non-inlined value.

We have a custom "slice header" for the non-inlined values and don't
just use `[]byte` for two reasons:
1. we can easily overwrite the whole struct when a value is inlined without
causing any issues with the GC (because our slice header doesn't contain an
actual slice, GC doesn't have to track whether the backing array is live or
not);
2. our slice header remains valid even when `Bytes.buffer` is reallocated.
If we were to use `[]byte`, then whenever that reallocation occurred, we
would have to iterate over all non-inlined values and update the byte
slices to point to the new `Bytes.buffer`.

This new implementation was inspired by the Velox library from Facebook
(https://facebookincubator.github.io/velox/develop/vectors.html#flat-vectors-scalar-types).

Our original implementation was based on the Arrow format, so we were
getting extremely cheap serialization/deserialization. With this change
some of the benchmarks have regressed significantly:
```
Append/bytes/AppendSimple/NullProbability=0.0-24    356MB/s ± 5%   221MB/s ± 1%    -37.81%  (p=0.000 n=9+8)
Append/bytes/AppendWithSel/NullProbability=0.0-24   241MB/s ± 2%   105MB/s ± 4%    -56.49%  (p=0.000 n=9+10)
Append/bytes/AppendSimple/NullProbability=0.2-24    354MB/s ± 6%   219MB/s ± 4%    -38.11%  (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24   229MB/s ± 3%    98MB/s ± 4%    -57.20%  (p=0.000 n=9+10)
Copy/bytes/CopySimple/NullProbability=0.0-24       39.7GB/s ± 1%   0.5GB/s ± 0%    -98.67%  (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       753MB/s ± 1%   544MB/s ± 0%    -27.83%  (p=0.000 n=10+8)
Copy/bytes/CopySimple/NullProbability=0.2-24       28.9GB/s ± 1%   0.5GB/s ± 0%    -98.19%  (p=0.000 n=8+10)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       634MB/s ± 0%   575MB/s ± 0%     -9.37%  (p=0.000 n=9+9)
```
```
ArrowBatchConverter/bytes/nullFraction=0.00/BatchToArrow-24   248GB/s ± 2%      3GB/s ± 3%     -98.90%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/BatchToArrow-24   240GB/s ± 2%      3GB/s ± 2%     -98.87%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.50/BatchToArrow-24   242GB/s ± 1%      3GB/s ± 2%     -98.86%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.00/ArrowToBatch-24   830GB/s ± 2%      3GB/s ± 2%     -99.59%  (p=0.000 n=10+10)
ArrowBatchConverter/bytes/nullFraction=0.25/ArrowToBatch-24   893GB/s ± 1%      3GB/s ± 1%     -99.62%  (p=0.000 n=10+8)
ArrowBatchConverter/bytes/nullFraction=0.50/ArrowToBatch-24   910GB/s ± 1%      3GB/s ± 1%     -99.62%  (p=0.000 n=10+9)
```
I'm not particularly worried about the regression in serialization
(the absolute speeds are still high enough that it's unlikely to become
a bottleneck), but the slowdown in the `Append`/`Copy` benchmarks is
worrisome. I have some ideas for how to improve those operations to
reduce the regressions to acceptable levels, and I'll add those
optimizations in a follow-up commit.

As a proxy for change of performance of `Bytes` vector overall, I ran
`BenchmarkSortUUID` with results [here](https://gist.github.com/yuzefovich/baaf6b4096cb3e2f1504ad2b38431cc6)
and the unordered distinct benchmarks with results [here](https://gist.github.com/yuzefovich/d8b9166977ec56780cb1749e848f5935).
They show that the number of allocations has decreased in all cases,
although the total allocated space increased in some cases with a small
number of rows. As a result, there are some performance regressions on
small row counts. Overall, the results seem satisfactory.

Notably, as a result of this change the performance of top K sort has
improved drastically when bytes-like types are present in the input
columns (because `CopySlice` used in top K sorter no longer has to move
the data around).

It's worth calling out that we still use the same Arrow format for
serialization, so there is no need to bump the DistSQL version because
the change of this commit is purely to the in-memory representation.

An additional nice result of this change is that we no longer have to
maintain the non-decreasing offsets, which will allow us to remove calls
to `SetLength` in many places where the length of the batch is not
changed. This will be done in the follow-up commit.

Release note: None

Release justification: low risk, high benefit changes to existing
functionality.
This commit optimizes the `Append`/`Copy` operations for `Bytes` vectors.
The main idea is a new primitive, `copy`, which copies a single `element`
of one vector into an `element` of another vector, removing the
allocations when copying inlined values.

An additional improvement is preallocating enough `element`s when
copying/appending many values.
```
name                                               old speed      new speed       delta
Append/bytes/AppendSimple/NullProbability=0.0-24    226MB/s ± 7%    221MB/s ± 8%      ~     (p=0.143 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24   104MB/s ± 5%    189MB/s ± 1%   +81.57%  (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24    220MB/s ± 2%    229MB/s ± 7%    +4.23%  (p=0.004 n=8+10)
Append/bytes/AppendWithSel/NullProbability=0.2-24  98.4MB/s ± 5%  197.7MB/s ± 4%  +100.98%  (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.0-24        521MB/s ± 0%   2981MB/s ± 1%  +471.63%  (p=0.000 n=10+10)
Copy/bytes/CopyWithSel/NullProbability=0.0-24       539MB/s ± 0%   1020MB/s ± 0%   +89.32%  (p=0.000 n=10+10)
Copy/bytes/CopySimple/NullProbability=0.2-24        518MB/s ± 0%   2913MB/s ± 0%  +462.62%  (p=0.000 n=10+8)
Copy/bytes/CopyWithSel/NullProbability=0.2-24       560MB/s ± 0%    875MB/s ± 0%   +56.12%  (p=0.000 n=10+8)

name                                               old alloc/op   new alloc/op    delta
Append/bytes/AppendSimple/NullProbability=0.0-24     76.1kB ± 8%    118.9kB ± 7%   +56.23%  (p=0.000 n=10+10)
Append/bytes/AppendWithSel/NullProbability=0.0-24     192kB ± 2%      131kB ± 1%   -32.15%  (p=0.000 n=10+8)
Append/bytes/AppendSimple/NullProbability=0.2-24     78.0kB ± 0%    117.2kB ± 3%   +50.27%  (p=0.000 n=8+9)
Append/bytes/AppendWithSel/NullProbability=0.2-24     198kB ± 1%      105kB ± 3%   -47.21%  (p=0.000 n=10+9)
Copy/bytes/CopySimple/NullProbability=0.0-24          0.00B           0.00B           ~     (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.0-24         0.00B           0.00B           ~     (all equal)
Copy/bytes/CopySimple/NullProbability=0.2-24          0.00B           0.00B           ~     (all equal)
Copy/bytes/CopyWithSel/NullProbability=0.2-24         0.00B           0.00B           ~     (all equal)
```
I'm not sure what's up with some of the increases in allocations since
I don't see them locally on macOS.
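As an illustration of the element-level copy primitive this commit message describes, here is a simplified, hypothetical model (all names and the layout are made up; the real code lives in pkg/col/coldata):

```go
package main

import "fmt"

// element is a toy model: inlined values live in data; non-inlined
// values reference a range of the owning vector's shared buffer.
type element struct {
	inlined bool
	data    [30]byte // inlinable space
	length  int
	offset  int // into the owning vector's buffer when !inlined
}

type bytesVec struct {
	buffer   []byte
	elements []element
}

func (b *bytesVec) get(i int) []byte {
	e := &b.elements[i]
	if e.inlined {
		return e.data[:e.length]
	}
	return b.buffer[e.offset : e.offset+e.length]
}

func (b *bytesVec) set(i int, v []byte) {
	e := &b.elements[i]
	if len(v) <= len(e.data) {
		*e = element{inlined: true, length: len(v)}
		copy(e.data[:], v)
		return
	}
	*e = element{offset: len(b.buffer), length: len(v)}
	b.buffer = append(b.buffer, v...)
}

// copyElement copies one element across vectors: inlined values move by
// a plain struct assignment (no allocation), while non-inlined values
// append their payload to dst's shared buffer and fix up the offset.
func copyElement(dst *bytesVec, dstIdx int, src *bytesVec, srcIdx int) {
	e := src.elements[srcIdx]
	if e.inlined {
		dst.elements[dstIdx] = e
		return
	}
	v := src.buffer[e.offset : e.offset+e.length]
	e.offset = len(dst.buffer)
	dst.buffer = append(dst.buffer, v...)
	dst.elements[dstIdx] = e
}

func main() {
	src := &bytesVec{elements: make([]element, 2)}
	src.set(0, []byte("short"))
	src.set(1, []byte("a value that is definitely longer than thirty bytes"))

	dst := &bytesVec{elements: make([]element, 2)}
	copyElement(dst, 0, src, 0)
	copyElement(dst, 1, src, 1)
	fmt.Println(string(dst.get(0)))
	fmt.Println(string(dst.get(1)))
}
```

The inlined branch is where the allocation savings come from: copying a 32-byte struct is a single memmove with no per-value []byte allocation.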

Release note: None

Release justification: low risk, high benefit changes to existing
functionality.
Since the previous commit made it so that `Bytes` vectors support `Set`
at arbitrary positions, we no longer have to use `CopySlice` in the top
K sort implementation.

Release note: None

Release justification: low risk, high benefit changes to existing
functionality.
Previously, `Append`ing from a vector into itself was allowed when
`srcStartIdx == srcEndIdx` in order to be able to truncate the vector.
One of the previous commits removed the only place where this was
actually used (the implementation of `Append` for bytes-like types when
a selection vector is present), so this commit removes this
no-longer-used feature.

Release note: None

Release justification: low risk cleanup.
Previously, we required that operators that might populate a `Bytes`
vector set the length on the output batch (which was needed in order to
maintain the invariant of non-decreasing offsets in the old
implementation of the `Bytes` vector) even if the length of the batch
didn't change. This is no longer needed given the updated
implementation, so this commit removes those redundant calls to
`SetLength`.

Release note: None

Release justification: low risk cleanup.
Member Author

@yuzefovich yuzefovich left a comment


Thanks for the reviews and the feedback!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis and @michae2)


pkg/col/coldata/bytes.go, line 186 at r9 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Splitting this b.Window(0, 1).Set(0, "long_string") out into:

```go
c := b.Window(0, 1)
c.Set(0, "long_string")
```

Won't the call to c.Set assign the reallocated buffer to c.buffer but not b.buffer?

Anyway, I'm not requesting any changes to the code 😆 just agreeing with the current immutableness of Window.

Oh, indeed. Removed the TODO.


pkg/col/coldata/bytes.go, line 137 at r14 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: If element.getNonInlined took a length argument then we wouldn't need the [:len(v)] expression here. But maybe that's too finicky.

I refactored the code a bit to remove this.

@craig
Contributor

craig bot commented Mar 9, 2022

Build succeeded:

@craig craig bot merged commit fe889cf into cockroachdb:master Mar 9, 2022
@yuzefovich yuzefovich deleted the bytes-refactor branch March 9, 2022 22:38
```go
	// bytes value begins.
	bufferOffset int
	len          int
	cap          int
```

Drive by comment, should we be explicit to use int64 here? Since the size of int is platform dependent. On 32-bit platform, this would mean the sliceHeader would be 12 bytes instead of 24 bytes intended.

Member Author

@yuzefovich yuzefovich left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/col/coldata/bytes.go, line 77 at r18 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Drive by comment, should we be explicit to use int64 here? Since the size of int is platform dependent. On 32-bit platform, this would mean the sliceHeader would be 12 bytes instead of 24 bytes intended.

I thought about it and chose to keep using int.

Everything will still work correctly on 32-bit systems; we'll just have less inlinable space (18 bytes instead of 30), and we won't be using the whole size of the object in Go's size classes (element will take up 20 bytes and land in the 24-byte size class, so we could have used 4 bytes for more padding; alternatively, we could have used less padding to fit into the 16-byte class).

I think that 32-bit systems are quite rare nowadays, and we usually just assume that we're running on a 64-bit system.

Successfully merging this pull request may close these issues.

colexec: poor performance of TopK sort when BYTES-like vectors are present
