
bench: fix colserde benchmarks#71892

Merged
craig[bot] merged 1 commit into cockroachdb:master from cucaroach:arrow-noclear
Oct 25, 2021

Conversation

@cucaroach
Contributor

@cucaroach cucaroach commented Oct 22, 2021

An optimization to aid the Go GC broke the Serialize and ArrowToBench
benchmarks; fix them by restoring the nil'd-out arrays using a copy.

Fixes: #71886

Release note: None

@cucaroach cucaroach requested a review from a team as a code owner October 22, 2021 21:16
@cucaroach cucaroach requested a review from yuzefovich October 22, 2021 21:16
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):

		// as quickly as possible as we copy each (or simply reference each) by
		// coldata.Vecs below.
		if clearBuffer {

Now we have an additional conditional on the production path that wasn't there before. It's likely a drop in the ocean, but I still think it's better to adjust the benchmark to perform the shallow copy rather than slow down the production path, even a little. We'll need to backport the fix to 21.1 and 21.2 anyway, so making the benchmarks a bit slower on both branches is OK.
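To make the suggestion concrete, here is a minimal, hypothetical Go sketch of the benchmark-side fix being proposed. `toBatch` and the `[][]byte` layout are stand-ins for illustration, not the real colserde API: the benchmark saves a shallow copy of the slice up front and restores the nil'd-out entries before each iteration, so no `clearBuffer` branch is needed in the production loop.

```go
package main

import "fmt"

// toBatch stands in for the real conversion: it consumes data destructively,
// nil'ing out each element as it goes (as the production code does to release
// references for the GC as quickly as possible).
func toBatch(data [][]byte) int {
	n := 0
	for i := range data {
		n += len(data[i])
		data[i] = nil
	}
	return n
}

func main() {
	data := [][]byte{make([]byte, 4), make([]byte, 8)}

	// Benchmark-side fix: save a shallow copy of the slice headers once...
	saved := make([][]byte, len(data))
	copy(saved, data)

	for iter := 0; iter < 2; iter++ {
		copy(data, saved) // ...and restore the nil'd-out entries each iteration
		fmt.Println(toBatch(data))
	}
}
```

The restore is just `len(data)` slice-header copies per iteration, so it should not meaningfully distort the benchmark numbers.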

Contributor Author

@cucaroach cucaroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Now we have an additional conditional on the production path that wasn't there before. It's likely a drop in the ocean, but I still think it's better to adjust the benchmark to perform the shallow copy rather than slow down the production path, even a little. We'll need to backport the fix to 21.1 and 21.2 anyway, so making the benchmarks a bit slower on both branches is OK.

Why do we need to backport the fix to 21.1? Things work fine on that branch. I'm not in love with this; how about we take the data[i] = nil out completely, hoist it up to the call site, and clear the whole slice there? I doubt clearing as we go matters, and doing it once should be more efficient and cleaner. There are very few prod callers of these functions, so it shouldn't be too bad. Another alternative would be to introduce a new API that does the nil'ing out, i.e. like this change but without the parameterization and per-loop branch.
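A minimal, hypothetical Go sketch of the hoisting alternative floated here (`convert` and `clearAll` are illustrative names, not the real colserde functions): the conversion loop stops nil'ing as it goes, and each production call site instead clears the whole slice once after the call.

```go
package main

import "fmt"

// convert is a stand-in for the conversion routine, now non-destructive:
// it no longer nils out elements inside its loop.
func convert(data [][]byte) int {
	n := 0
	for i := range data {
		n += len(data[i])
	}
	return n
}

// clearAll releases all references in one pass, hoisted to the call site.
func clearAll(data [][]byte) {
	for i := range data {
		data[i] = nil
	}
}

func main() {
	data := [][]byte{make([]byte, 4), make([]byte, 8)}
	n := convert(data)
	clearAll(data) // production caller drops references once, after the call
	fmt.Println(n, data[0] == nil && data[1] == nil)
}
```

Benchmarks would then simply skip the `clearAll` call, and the production hot loop carries no extra branch.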

Member

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):
I'm surprised that things work on 21.1 because the commit in question was backported there (#67609) - maybe the SHA you used for 21.1 was before that?

how about we take the data[i] = nil out completely, hoist it up to the call site, and clear the whole slice there? There are very few prod callers of these functions, so it shouldn't be too bad.

I think there are two production callers that care about nilling things out, so we'd have to duplicate that clearing logic in two places (or use a helper function call).

I agree that clearing things as we go might not be the cleanest approach, but we definitely want to strive for removal of references as fast as possible, so I'm still not sold on the idea of modifying the production code to make the benchmarks happy. Is it a single benchmark that crashes or several?

Contributor Author

@cucaroach cucaroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):

we definitely want to strive for removal of references as fast as possible

I bet dollars to donuts this doesn't matter as much as you think it does. Incremental GCs have write barriers, so even if the marker sees the data in GC cycle N and we nil it out the next moment, the slice will probably be marked for re-visiting, so the GC will most likely catch the nil'ing in the same cycle. Furthermore, garbage is swept lazily, so even if the GC frees this memory, it's a coin toss whether it will be available for immediate re-use (although mutator work usually drives the sweeper, so heap-hungry apps will quickly drive the sweeper to completion).

Contributor Author

@cucaroach cucaroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

we definitely want to strive for removal of references as fast as possible

I bet dollars to donuts this doesn't matter as much as you think it does. Incremental GCs have write barriers, so even if the marker sees the data in GC cycle N and we nil it out the next moment, the slice will probably be marked for re-visiting, so the GC will most likely catch the nil'ing in the same cycle. Furthermore, garbage is swept lazily, so even if the GC frees this memory, it's a coin toss whether it will be available for immediate re-use (although mutator work usually drives the sweeper, so heap-hungry apps will quickly drive the sweeper to completion).

Actually, ignore that; that makes no sense. But I still don't think nil'ing out once per call versus once per loop iteration will make any difference unless the data slice can be huge. How big are these things typically?

Contributor Author

@cucaroach cucaroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

Actually, ignore that; that makes no sense. But I still don't think nil'ing out once per call versus once per loop iteration will make any difference unless the data slice can be huge. How big are these things typically?

Another pertinent question is what it makes the most sense to benchmark: does the benchmark version with the data copying more accurately reflect what goes on in prod?

Member

@yuzefovich yuzefovich left a comment

Reviewed 1 of 3 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):

How big are these things typically?

We have two production use cases: sending the intermediate data between nodes (the outbox-inbox path) and spilling the intermediate data to disk. In both cases we try to limit the size of coldata.Batch by distsql_workmem, which is 64MiB by default, so I'm guessing that these data[i] combined could be up to the 64MiB limit too. Note that we do have compression in place, but it is done on top of the arrow format, so I don't think it saves us here (i.e. data[i] is already the decompressed stuff).

what does it make the most sense to be benchmarking, does the benchmark version with the data copying more accurately reflect what goes on in prod?

What data copying are you referring to?

I think that we mostly use micro benchmarks to find significant regressions between releases, and we typically don't care that much about the absolute numbers.

Contributor Author

@cucaroach cucaroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

How big are these things typically?

We have two production use cases: sending the intermediate data between nodes (the outbox-inbox path) and spilling the intermediate data to disk. In both cases we try to limit the size of coldata.Batch by distsql_workmem, which is 64MiB by default, so I'm guessing that these data[i] combined could be up to the 64MiB limit too. Note that we do have compression in place, but it is done on top of the arrow format, so I don't think it saves us here (i.e. data[i] is already the decompressed stuff).

what does it make the most sense to be benchmarking, does the benchmark version with the data copying more accurately reflect what goes on in prod?

What data copying are you referring to?

I think that we mostly use micro benchmarks to find significant regressions between releases, and we typically don't care that much about the absolute numbers.

I was referring to your version of the benchmark, which copied the data; maybe that's more representative of the real workload? I'm curious how big the slice can get: each element of the slice is an array, and I'm guessing each array is pretty big while the slice itself is small?

Member

@yuzefovich yuzefovich left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…

I was referring to your version of the benchmark, which copied the data; maybe that's more representative of the real workload? I'm curious how big the slice can get: each element of the slice is an array, and I'm guessing each array is pretty big while the slice itself is small?

len(data) is equal to the width of the batch, which is equal to the number of columns touched by the query, so it's a relatively small number. However, each data[i] could point to a memory region that is MBs in size. What I was suggesting was to do a shallow copy: this would only involve copying len(data) pointers; it would not touch the large memory regions at all.

In terms of what would be more realistic, I think the shallow copy wouldn't matter in the grand scheme of things since it should be very fast.
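A tiny illustration of why the shallow copy described above is cheap (hypothetical payload sizes, not the real colserde data): copying a [][]byte only copies len(data) slice headers (pointer/length/capacity triples), while the multi-MB regions each data[i] points at are never touched; both slices keep referencing the same backing memory.

```go
package main

import "fmt"

func main() {
	big := make([]byte, 1<<20) // a 1 MiB payload, standing in for one column's data
	data := [][]byte{big, big}

	shallow := make([][]byte, len(data))
	copy(shallow, data) // copies 2 slice headers, not 2 MiB of bytes

	// Both slices reference the same underlying memory, so a write through
	// one is visible through the other.
	shallow[0][0] = 42
	fmt.Println(data[0][0], &data[0][0] == &shallow[0][0])
}
```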

An optimization to aid the Go GC broke the Serialize and ArrowToBench
benchmarks; fix them by restoring the nil'd-out arrays using a copy.

Fixes: cockroachdb#71886

Release note: None
@cucaroach cucaroach changed the title from "bench: fix arrowbatch benchmarks" to "bench: fix colserde benchmarks" Oct 25, 2021
@cucaroach cucaroach reopened this Oct 25, 2021
Contributor Author

@cucaroach cucaroach left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/col/colserde/arrowbatchconverter.go, line 326 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

len(data) is equal to the width of the batch, which is equal to the number of columns touched by the query, so it's a relatively small number. However, each data[i] could point to a memory region that is MBs in size. What I was suggesting was to do a shallow copy: this would only involve copying len(data) pointers; it would not touch the large memory regions at all.

In terms of what would be more realistic, I think the shallow copy wouldn't matter in the grand scheme of things since it should be very fast.

+1

Member

@yuzefovich yuzefovich left a comment

:lgtm:

Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)

@cucaroach
Contributor Author

bors r+

@craig
Contributor

craig bot commented Oct 25, 2021

Build succeeded:

@craig craig bot merged commit dbf5853 into cockroachdb:master Oct 25, 2021
@yuzefovich
Member

blathers backport 21.2

Successfully merging this pull request may close these issues:

bench: arrow batch benchmarks broken