sql: vectorized data encoding package#96321
Conversation
a939285 to a75e7b0
This PR is just for the last commit.
4e8cf32 to 787f997
eb1dcda to 2962e00
Hey @jordanlewis not looking for a thorough review here, but thought it would be appropriate for you to give this a once-over as someone with deep knowledge of the encoding logic; it definitely turned out to be a little more complicated than I thought going in! Thanks!
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I'm assuming since this isn't hooked up to the copy yet, the perf numbers aren't included, right?
No, that's the next patch.
pkg/col/coldata/nulls.go line 188 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: should we just call n.NullAt(i) here? I'd assume that call would be inlined.
👍
pkg/sql/colenc/encode.go line 90 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this truncation should be called out in a comment to the method.
I decided it would be better to just make this an error.
pkg/sql/colenc/encode.go line 118 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
This only checks that a non-nullable column is present in
colMap, but where do we check that each value in that column is non-NULL? Is it done later? Perhaps add a quick comment if so.
Done.
pkg/sql/colenc/encode.go line 148 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
This comment suggests that we should be reusing the slices above, but we do make a fresh allocation on each initBuffers call. Is this expected? My guess is that we only need to allocate some of these if b.rows became larger on this PrepareBatch call.
👍
pkg/sql/colenc/encode.go line 180 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: getMaxKeyCols and getMaxSuffixColumns seem to return the same values throughout the encoding process (unless there is a schema change). Should we compute it once in MakeEncoder?
I think now that I reuse that space across PrepareBatch calls it's fine the way it is.
pkg/sql/colenc/encode.go line 219 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
MaybeHasNulls, in theory, might return true if there are actually no NULLs in the vector. Not explicitly checking each relevant element can result in false positives. I'd guess it wouldn't be a noticeable slowdown to iterate over the relevant elements via a NullAt check.
Done.
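For context, the pattern being suggested above can be sketched as follows. This is a simplified stand-in, not the actual coldata.Nulls implementation; the `nulls` type and `hasNullInRange` helper are hypothetical:

```go
package main

import "fmt"

// nulls is a simplified stand-in for coldata.Nulls: a bitmap plus a
// maybeHasNulls flag that can remain true even when no element is NULL.
type nulls struct {
	maybeHasNulls bool
	nullBitmap    []bool
}

func (n *nulls) MaybeHasNulls() bool { return n.maybeHasNulls }
func (n *nulls) NullAt(i int) bool   { return n.nullBitmap[i] }

// hasNullInRange checks each element explicitly, avoiding the false
// positive that trusting MaybeHasNulls alone can produce.
func hasNullInRange(n *nulls, start, end int) bool {
	if !n.MaybeHasNulls() {
		return false
	}
	for i := start; i < end; i++ {
		if n.NullAt(i) {
			return true
		}
	}
	return false
}

func main() {
	// The flag is set but no element is actually NULL: a false positive
	// for MaybeHasNulls, caught by the per-element check.
	n := &nulls{maybeHasNulls: true, nullBitmap: make([]bool, 4)}
	fmt.Println(n.MaybeHasNulls())       // true
	fmt.Println(hasNullInRange(n, 0, 4)) // false
}
```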
pkg/sql/colenc/encode.go line 435 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Should we do b.checkMemory() here?
We do it right away when we return from this function.
pkg/sql/colenc/encode.go line 648 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: rather than allocating new DFloat and DDecimal objects on each isComposite call we could reuse the same ones. Perhaps this should be a method on some helper.
There's no heap allocations here:
~/vec-encoding [copy_demo|✚1…12] ❯❯❯ go test ./pkg/sql/colenc -gcflags="-m" 2>&1 | grep pkg/sql/colenc/encode.go:7
pkg/sql/colenc/encode.go:705:23: inlining call to tree.(*DFloat).IsComposite
pkg/sql/colenc/encode.go:705:23: inlining call to math.Float64bits
and
697ec01 to 89334a4
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/row/putter.go line 47 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Should we do a conversion like key.(roachpb.Key) and use the %s format directive to match what we have in row/inserter.go?
So what would we do if it wasn't a key? I'm inclined to leave it unless there's a reason to change it.
pkg/sql/row/putter.go line 81 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: not sure if it actually matters at all, but should we just inline one roachpb.Value directly into TracePutter and then reuse that value across all calls?
It's stack-allocated, so I think that wouldn't help and might hurt.
pkg/sql/row/putter.go line 186 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Should this be InitPut?
Done.
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colenc/encode_test.go line 150 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: consider adding a few tests for INT2 and INT4 types.
The Rand schema tests covered those cases well.
pkg/sql/colenc/encode_test.go line 181 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
The EncodedKeyFamily was introduced in #77501 to explicitly treat the index key of the inverted indexes as special (so that we don't attempt to decode it as bytes). It's just a light wrapper around []byte, so I think you could just generate a random DBytes and then wrap its contents in DEncodedKey.
Okay, so do I have it right that we never write these things to a table; they exist purely when we read an inverted index value via IndexFetchSpec for execution-time filter/join/aggregation purposes? And I should encode/decode them just like the row engine does for completeness even though it's effectively dead code? I'm fine with that, just wanted to confirm.
89334a4 to e93b5cd
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colenc/encode_test.go line 181 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Okay, so do I have it right that we never write these things to a table; they exist purely when we read an inverted index value via IndexFetchSpec for execution-time filter/join/aggregation purposes? And I should encode/decode them just like the row engine does for completeness even though it's effectively dead code? I'm fine with that, just wanted to confirm.
If I use a BYTES SQL type and pass a DEncodedKey, my test harness blows up because it's expecting a DBytes. Should I just not test it, and should I remove support for it in encodeKeys?
339b16a to 2af96a8
Wow! Thanks so much for the thorough review @yuzefovich, feeling much better about this code now and test coverage is up to 88%. Hopefully I responded adequately to all your feedback, let me know if I missed anything!
2af96a8 to 4569657
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colenc/key.go line 87 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I think it might be worth extracting the type switch outside of the for loop. This will result in the duplication of the loop three times but we'd avoid repetitive interface conversions and type switches on each row.
I'm actually leaning towards reverting this whole "push the for loop down" approach; it's so ugly and we aren't CPU bound (here).
pkg/sql/colenc/key.go line 202 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
It'd be good to mention how nulls are mutated. nit: missing period.
Done.
pkg/sql/colenc/value.go line 32 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
We do this interface conversion for each value in the column that we need to encode. Have you considered using coldata.TypedVecs to do the conversion only once per column?
Not really, the experience pushing the loop below the interface calls for keys left a bad taste in my mouth. ;-)
pkg/sql/colenc/value.go line 54 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I don't think this matters much but types.EncodedKeyFamily should also go in this case.
Like I decided for keys that would just be dead code I think.
pkg/sql/colenc/encode_test.go line 180 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I think the support of JSON in decodeTableKeyToCol is dead code - currently it should never execute.
Done.
pkg/sql/colenc/encode_test.go line 601 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Is it ok that we always sort here and in
buildVecKVs? Consider leaving a comment about that.
Yeah we have to sort them both or the comparison algorithm doesn't work. I'll comment.
4569657 to 9743e6a
yuzefovich
left a comment
I did another quick pass, and I think it looks good to me. I'll do another more careful pass once other commits merge.
Reviewed 5 of 10 files at r9, 14 of 21 files at r10, 2 of 4 files at r13, 4 of 4 files at r14, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @jordanlewis)
pkg/sql/colenc/encode.go line 648 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
There's no heap allocations here:
~/vec-encoding [copy_demo|✚1…12] ❯❯❯ go test ./pkg/sql/colenc -gcflags="-m" 2>&1 | grep pkg/sql/colenc/encode.go:7 pkg/sql/colenc/encode.go:705:23: inlining call to tree.(*DFloat).IsComposite pkg/sql/colenc/encode.go:705:23: inlining call to math.Float64bits and
Ack, thanks for checking.
pkg/sql/colenc/encode.go line 54 at r14 (raw file):
// Map of index id to a slice of bools that contain partial index predicates. partialIndexes map[descpb.IndexID][]bool // Column ID that might contains composite encoded values.
nit: s/might contains/might contain/.
pkg/sql/colenc/encode.go line 94 at r14 (raw file):
// PrepareBatch encodes a subset of rows from the batch to the given row.Putter. // The maximum batch size is 64k (determined by NewMemBatchNoCols).
Hm, I actually forgot about this limit. It was introduced in #35349, and I don't think there is any fundamental reason for it other than being conservative and preventing engineers from shooting themselves in the foot. The only fundamental limitation AFAICT is that int of the selection vector must be able to address every row in the batch, so assuming 64 bit architecture, we can go pretty high. So if you think we need to have higher batch sizes for copy, we can do that.
An additional improvement to coldata.NewMemBatchNoCols that we can make is not allocating the selection vector at all. Copy's execution pipeline won't have any filters, so we don't need those selection vectors. Perhaps leave it as a TODO.
pkg/sql/colenc/encode.go line 156 at r14 (raw file):
} // Amount of space to reserve per
nit: incomplete comment.
pkg/sql/colenc/key.go line 87 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
I'm actually leaning towards reverting this whole "push the for loop down" approach; it's so ugly and we aren't CPU bound (here).
Makes sense. Perhaps we're not CPU bound here in the Copy use case, but if / when we reuse this encoder logic elsewhere, that might not be the case, so I'd be in favor of keeping the current ugly-but-performant code. Up to you though if you want to pull the for loop out.
pkg/sql/colenc/key.go line 198 at r14 (raw file):
// encodeIndexKey is the vector version of rowenc.EncodeIndexKey. nulls will tell // the caller if theres null values in any of the columns and which rows as
nit: s/theres/there are/.
pkg/sql/row/putter.go line 47 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
So what would we do if it wasn't a key? I'm inclined to leave it unless there's a reason to change it.
I was thinking that %s with roachpb.Key.String would give us pretty representation whereas %v might not. When is this TracePutter used? Do we have any tests for it? Perhaps %v on interface{} already gives us the pretty representation.
pkg/sql/colenc/encode_test.go line 181 at r7 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
If I use a BYTES SQL type and pass a DEncodedKey in my test harness blows up because its expecting a DBytes. Should I just not test it, and should I remove support for it in encodeKeys?
Removing the code and any corresponding tests sounds good to me. It should be impossible for outside users to produce a datum of EncodedKeyFamily type, so support for it is not needed for Copy. If in the future we start using the vectorized encoder logic for internally-generated datums, then we might bring it back, but for now removing the support seems reasonable.
9743e6a to 9f04246
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colenc/encode.go line 94 at r14 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Hm, I actually forgot about this limit. It was introduced in #35349, and I don't think there is any fundamental reason for it other than being conservative and preventing engineers from shooting themselves in the foot. The only fundamental limitation AFAICT is that int of the selection vector must be able to address every row in the batch, so assuming 64 bit architecture, we can go pretty high. So if you think we need to have higher batch sizes for copy, we can do that. An additional improvement to coldata.NewMemBatchNoCols that we can make is not allocating the selection vector at all. Copy's execution pipeline won't have any filters, so we don't need those selection vectors. Perhaps leave it as a TODO.
I'm not seeing any need to go beyond 64k at this time but good to know! I'll add a TODO where the coldata.Batch is allocated.
pkg/sql/colenc/key.go line 87 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Makes sense. Perhaps we're not CPU bound here in the Copy use case, but if / when we reuse this encoder logic elsewhere, that might not be the case, so I'd be in favor of keeping the current ugly-but-performant code. Up to you though if you want to pull the for loop out.
I'm tempted to leave it as is and entertain the notion of code-generating these switch statements if Go doesn't grow a usable macro system soon.
pkg/sql/row/putter.go line 47 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I was thinking that %s with roachpb.Key.String would give us a pretty representation whereas %v might not. When is this TracePutter used? Do we have any tests for it? Perhaps %v on interface{} already gives us the pretty representation.
Yes I just put you on the review of the new copy tests that require TracePutter to work right.
pkg/sql/colenc/encode_test.go line 181 at r7 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Removing the code and any corresponding tests sounds good to me. It should be impossible for outside users to produce a datum of EncodedKeyFamily type, so support for it is not needed for Copy. If in the future we start using the vectorized encoder logic for internally-generated datums, then we might bring it back, but for now removing the support seems reasonable.
👍
ede7eab to 7795610
yuzefovich
left a comment
I have some more nits and I think I found a minor bug, but nice work!
Reviewed 20 of 20 files at r15, 17 of 17 files at r16, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @jordanlewis)
pkg/sql/colenc/encode.go line 62 at r16 (raw file):
// column family. keys []roachpb.Key // Slice of value we can reuse across each call to Prepare and between each
nit: s/value/values/.
pkg/sql/colenc/encode.go line 73 at r16 (raw file):
// count is just end - start. start, end, count int mutationQuotaCheck func() bool
nit: perhaps it'll be more clear if this function returned an error (with a boolean you need to know whether true indicates "ok" or "quota exceeded").
pkg/sql/colenc/encode.go line 173 at r16 (raw file):
extraBuf := make([]byte, b.count*b.extraKeysBufSize) // Values are copied so we can re-use.
nit: I'm confused by this comment - what is being reused? We seem to be reallocating both b.values and valBuf. I guess the reuse is in comparison to what is reset in resetBuffers. Perhaps this comment should be moved into resetBuffers or just removed altogether?
pkg/sql/colenc/encode.go line 188 at r16 (raw file):
} } else { // If we are doing less row shrink slices.
nit: s/less row/less rows/.
pkg/sql/colenc/encode.go line 209 at r16 (raw file):
} func intMax(a, b int) int {
Doesn't this function currently return minimum?
pkg/sql/colenc/inverted.go line 62 at r16 (raw file):
var keys [][]byte val := invertedColToDatum(vec, row+b.start) indexGeoConfig := index.GetGeoConfig()
nit: this probably can be extracted outside of the for loop.
pkg/sql/colenc/key.go line 180 at r16 (raw file):
kys[r] = b } default:
nit: perhaps add an assertion here (behind buildutil.CrdbTestBuild flag) that the "canonical type family" of typ is typeconv.DatumVecCanonicalTypeFamily.
pkg/sql/row/putter.go line 77 at r15 (raw file):
func (t *TracePutter) CPutTuplesEmpty(kys []roachpb.Key, values [][]byte) { for i, k := range kys {
nit: this loop is effectively repeated in five places, consider extracting it into a helper.
pkg/sql/row/putter.go line 83 at r15 (raw file):
var v roachpb.Value v.SetTuple(values[i]) log.VEventfDepth(t.Ctx, 1, 2, "CPut %s -> %v", k, v.PrettyPrint())
nit: PrettyPrint returns a string, so %s seems more natural (here and four other places below) and consistent with CPutValuesEmpty.
pkg/sql/row/putter.go line 261 at r15 (raw file):
} // KVBatchAdapter implements putter interface and adapts it to kv.Batch API.
nit: s/putter/Putter/.
In order to avoid code duplication, make parts of the row encoding code public; mostly this is the RowHelper. Epic: CRDB-18892 Informs: cockroachdb#91831 Release note: None
colenc is a new package that allows kv.Batches to be produced to encode tables using coldata.Batches as the input. Every attempt was made to avoid code duplication and delegate to row encoding code where possible. Epic: CRDB-18892 Informs: cockroachdb#91831 Release note: None
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach, @jordanlewis, and @yuzefovich)
pkg/sql/colenc/encode.go line 73 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: perhaps it'll be more clear if this function returned an error (with a boolean you need to know whether true indicates "ok" or "quota exceeded").
Agreed
pkg/sql/colenc/encode.go line 173 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I'm confused by this comment - what is being reused? We seem to be reallocating both b.values and valBuf. I guess the reuse is in comparison to what is reset in resetBuffers. Perhaps this comment should be moved into resetBuffers or just removed altogether?
Between the pk and each secondary index, values isn't reallocated, and across batches of rows, if the number of rows doesn't increase, values is reused. Agreed this comment isn't helping.
pkg/sql/colenc/encode.go line 209 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Doesn't this function currently return minimum?
Wow, nice catch, that just eliminated 30% of the memory allocations from BenchmarkTCPHLineItem:
BenchmarkTCPHLineItem-10 1340 889978 ns/op 209379 B/op 9310 allocs/op
vs.
BenchmarkTCPHLineItem-10 1390 856018 ns/op 267776 B/op 6982 allocs/op
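For reference, the helper as presumably intended (pre Go 1.21, which added a builtin max); the bug caught above had the comparison inverted, returning the minimum:

```go
package main

import "fmt"

// intMax returns the larger of two ints. The buggy version flagged in
// review used `a < b ? return a` logic, silently returning the minimum.
func intMax(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	fmt.Println(intMax(3, 7)) // 7
	fmt.Println(intMax(7, 3)) // 7
}
```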
pkg/sql/colenc/inverted.go line 62 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this probably can be extracted outside of the for loop.
👍
pkg/sql/colenc/key.go line 180 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: perhaps add an assertion here (behind the buildutil.CrdbTestBuild flag) that the "canonical type family" of typ is typeconv.DatumVecCanonicalTypeFamily.
👍
pkg/sql/row/putter.go line 77 at r15 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this loop is effectively repeated in five places, consider extracting it into a helper.
There are 3 "tuples" loops, 2 "bytes" loops and 1 values loop. I think I'll just leave it.
pkg/sql/row/putter.go line 83 at r15 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: PrettyPrint returns a string, so %s seems more natural (here and four other places below) and consistent with CPutValuesEmpty.
👍
bors r=yuzefovich
bors retry
Build succeeded:
