bench: fix colserde benchmarks #71892
Conversation
yuzefovich
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
// as quickly as possible as we copy each (or simply reference each) by
// coldata.Vecs below.
if clearBuffer {
Now we have an additional conditional on the production path that wasn't there before. It's likely a drop in the ocean, but I still think it's better to adjust the benchmark to perform the shallow copy rather than slow down the production path, even a little. We'll need to backport the fix to 21.1 and 21.2 anyway, so it's okay if the benchmarks get a bit slower on both branches.
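The benchmark-side shallow copy being suggested could look roughly like this. This is a minimal, self-contained sketch; `vec` and `batchFromArrow` are hypothetical stand-ins for the real converter types, not the actual colserde API:

```go
package main

import "fmt"

// vec stands in for one arrow data column; in the real code each entry
// of data references a potentially large buffer.
type vec []byte

// batchFromArrow mimics the production path: it references each column
// and nils out each entry as it goes so the GC can reclaim the buffers
// as quickly as possible.
func batchFromArrow(data []vec) []vec {
	out := make([]vec, len(data))
	for i := range data {
		out[i] = data[i] // shallow reference, as in the real converter
		data[i] = nil    // release the reference immediately
	}
	return out
}

func main() {
	data := []vec{{1, 2}, {3, 4}}
	// Benchmark-side fix: shallow-copy the slice headers so the original
	// data survives the nil'ing and can be reused on the next iteration,
	// without adding any branch to the production path.
	scratch := make([]vec, len(data))
	copy(scratch, data)
	_ = batchFromArrow(scratch)
	fmt.Println(data[0] != nil, scratch[0] == nil)
}
```

The point is that only `len(data)` slice headers are copied per benchmark iteration; the underlying buffers are never touched.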
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Now we have an additional conditional on the production path that wasn't there before. It's likely a drop in the ocean, but I still think it's better to adjust the benchmark to perform the shallow copy rather than slow down the production path, even a little. We'll need to backport the fix to 21.1 and 21.2 anyway, so it's okay if the benchmarks get a bit slower on both branches.
Why do we need to backport the fix to 21.1? Things work fine in that branch. I'm not in love with this; how about we take the data[i] = nil out completely, hoist it up to the call site, and clear the whole slice there? I doubt clearing as we go matters, and doing it once should be more efficient and cleaner. There are very few prod callers of these functions, so it shouldn't be too bad. Another alternative would be to introduce a new API that does the nil'ing out - like this change but without the parameterization and the per-loop branch.
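A minimal sketch of the hoisting alternative, assuming a hypothetical converter shape (the names here are illustrative, not the actual colserde API): the conversion loop leaves the input untouched, and the caller clears the whole slice once afterwards.

```go
package main

import "fmt"

// vec stands in for one arrow data column.
type vec []byte

// batchFromArrowNoClear references each column but leaves the input
// slice untouched; clearing is the caller's responsibility.
func batchFromArrowNoClear(data []vec) []vec {
	out := make([]vec, len(data))
	copy(out, data) // reference each column; do not modify data
	return out
}

func main() {
	data := []vec{{1}, {2}, {3}}
	batch := batchFromArrowNoClear(data)
	// Caller-side clearing, hoisted out of the conversion loop: one pass
	// over a small slice (one entry per column), done once per call.
	for i := range data {
		data[i] = nil
	}
	fmt.Println(len(batch), data[0] == nil)
}
```

This trades the per-iteration branch for a small amount of duplicated clearing logic at each production call site.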
05e0d64 to
1110c35
Compare
yuzefovich
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
I'm surprised that things work on 21.1 because the commit in question was backported there (#67609) - maybe the SHA you used for 21.1 was before that?
how about we take the data[i] = nil out completely and hoist it up to the callsite and clear the whole slice there? There's very few prod callers of these functions so it shouldn't be too bad.
I think there are two production callers that care about nilling things out, so we'd have to duplicate that clearing logic in two places (or use a helper function call).
I agree that clearing things as we go might not be the cleanest approach, but we definitely want to strive for removal of references as fast as possible, so I'm still not sold on the idea of modifying the production code to make the benchmarks happy. Is it a single benchmark that crashes or several?
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
we definitely want to strive for removal of references as fast as possible
I bet dollars to donuts this doesn't matter as much as you think it does. Incremental GCs have write barriers, so even if the marker sees the data in GC cycle N and we nil it out a moment later, the slice will probably be marked to be re-visited, so the GC will most likely catch the nil'ing in the same cycle. Furthermore, the garbage is swept lazily, so even if the GC frees this memory it's a coin toss whether it will be available for immediate re-use (although mutator work usually drives the sweeper, so heap-hungry apps will quickly drive the sweeper to completion).
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
we definitely want to strive for removal of references as fast as possible
I bet dollars to donuts this doesn't matter as much as you think it does. Incremental GCs have write barriers, so even if the marker sees the data in GC cycle N and we nil it out a moment later, the slice will probably be marked to be re-visited, so the GC will most likely catch the nil'ing in the same cycle. Furthermore, the garbage is swept lazily, so even if the GC frees this memory it's a coin toss whether it will be available for immediate re-use (although mutator work usually drives the sweeper, so heap-hungry apps will quickly drive the sweeper to completion).
Actually, ignore that - it makes no sense. But I still don't think the difference between nil'ing out once per call and once per loop iteration will matter unless the data slice can be huge. How big are these things typically?
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Actually, ignore that - it makes no sense. But I still don't think the difference between nil'ing out once per call and once per loop iteration will matter unless the data slice can be huge. How big are these things typically?
Another pertinent question is what it makes the most sense to benchmark: does the benchmark version with the data copying more accurately reflect what goes on in prod?
yuzefovich
left a comment
Reviewed 1 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
How big are these things typically?
We have two production use cases - sending the intermediate data between nodes (outbox-inbox path) or spilling the intermediate data to disk. In both cases we try to limit the size of coldata.Batch by distsql_workmem, which is 64MiB by default, so I'm guessing that these data[i] combined could be up to the 64MiB limit too. Note that we do have compression in place, but it is done on top of the arrow format, so I don't think it saves us here (i.e. data[i] is already the de-compressed stuff).
what does it make the most sense to be benchmarking, does the benchmark version with the data copying more accurately reflect what goes on in prod?
What data copying are you referring to?
I think that we mostly use micro benchmarks to find significant regressions between releases, and we typically don't care that much about the absolute numbers.
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
How big are these things typically?
We have two production use cases - sending the intermediate data between nodes (outbox-inbox path) or spilling the intermediate data to disk. In both cases we try to limit the size of coldata.Batch by distsql_workmem, which is 64MiB by default, so I'm guessing that these data[i] combined could be up to the 64MiB limit too. Note that we do have compression in place, but it is done on top of the arrow format, so I don't think it saves us here (i.e. data[i] is already the de-compressed stuff).
what does it make the most sense to be benchmarking, does the benchmark version with the data copying more accurately reflect what goes on in prod?
What data copying are you referring to?
I think that we mostly use micro benchmarks to find significant regressions between releases, and we typically don't care that much about the absolute numbers.
I was referring to your version of the benchmark, which copied the data - maybe that's more representative of the real workload? I'm curious how big the slice can get: each element of the slice is an array, and I'm guessing each array is pretty big while the slice itself is small?
yuzefovich
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
I was referring to your version of the benchmark which copied the data, maybe that's more representative of the real workload? I'm curious how big the slice can get, like each element of the slice is an array, I'm guessing each array is pretty big and the slice itself is small?
len(data) is equal to the width of the batch which is equal to the number of columns touched by the query, so it's a relatively small number. However, each data[i] could point to a memory region that is MBs in size. What I was suggesting was to do a shallow copy - this would only involve copying len(data) pointers - it would not touch the large memory regions at all.
In terms of what would be more realistic - I think that shallow copy wouldn't matter in the grand scheme of things since it should be very fast.
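A quick way to see why the shallow copy should be very fast is to benchmark it in isolation. This sketch (hypothetical setup, not the colserde benchmark itself) copies only the slice headers each iteration, never the 1MiB payloads behind them:

```go
package main

import (
	"fmt"
	"testing"
)

// BenchmarkShallowCopy measures copying len(data) slice headers, the
// per-iteration cost the proposed benchmark fix would add.
func BenchmarkShallowCopy(b *testing.B) {
	// Eight columns, each backed by a 1MiB buffer; the copy below moves
	// only the 8 headers, not the 8MiB of data.
	data := make([][]byte, 8)
	for i := range data {
		data[i] = make([]byte, 1<<20)
	}
	scratch := make([][]byte, len(data))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		copy(scratch, data)
	}
}

func main() {
	// testing.Benchmark lets us run the benchmark outside `go test`.
	res := testing.Benchmark(BenchmarkShallowCopy)
	fmt.Println(res.N > 0)
}
```

On any reasonable machine this measures a handful of nanoseconds per iteration, which is noise next to serializing MBs of column data.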
1110c35 to
c1eefa0
Compare
An optimization to aid the Go GC broke the benchmarks of Serialize and ArrowToBench; fix them by restoring the nil'd out arrays using a copy.
Fixes: cockroachdb#71886
Release note: None
cucaroach
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
len(data) is equal to the width of the batch, which is equal to the number of columns touched by the query, so it's a relatively small number. However, each data[i] could point to a memory region that is MBs in size. What I was suggesting was to do a shallow copy - this would only involve copying len(data) pointers - it would not touch the large memory regions at all. In terms of what would be more realistic - I think that shallow copy wouldn't matter in the grand scheme of things since it should be very fast.
+1
yuzefovich
left a comment
Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)
bors r+
Build succeeded:
blathers backport 21.2