Skip to content

util/tracing: reduce allocations in EnsureForkSpan#76210

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:tracing.EnsureForkSpan-reduce-allocs
Feb 10, 2022
Merged

util/tracing: reduce allocations in EnsureForkSpan#76210
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:tracing.EnsureForkSpan-reduce-allocs

Conversation

@andreimatei
Copy link
Copy Markdown
Contributor

This patch pre-allocates capacity for the options slice in
EnsureForkSpan. Before, we were reallocating the slice storage
repeatedly, as it was growing from 0 to 1 and from 1 to 2 elements.

This patch also simplifies, and slighly speeds up, EnsureChildSpan. This
guy was using a sync.Pool for its options slice. That is not necessary,
since the slice storage does not escape.

In the benchmarks below, detached-child=false is EnsureChildSpan and
detached-child=true is EnsureForkSpan.

make bench PKG=./pkg/util/tracing BENCHES=BenchmarkSpanCreation TESTFLAGS="-count=5 -cpu=1,2,4,8,16"
name old time/op new time/op delta
SpanCreation/detached-child=false 1.70µs ± 1% 1.61µs ± 1% -5.06% (p=0.008 n=5+5)
SpanCreation/detached-child=false-2 1.17µs ± 7% 1.04µs ± 3% -10.70% (p=0.008 n=5+5)
SpanCreation/detached-child=false-4 958ns ± 2% 893ns ±10% -6.76% (p=0.008 n=5+5)
SpanCreation/detached-child=false-8 1.11µs ±38% 0.95µs ± 1% ~ (p=0.310 n=5+5)
SpanCreation/detached-child=false-16 925ns ± 1% 919ns ± 1% ~ (p=0.286 n=4+5)
SpanCreation/detached-child=true 2.27µs ± 2% 1.88µs ± 0% -17.27% (p=0.008 n=5+5)
SpanCreation/detached-child=true-2 1.82µs ±10% 1.33µs ±22% -26.81% (p=0.008 n=5+5)
SpanCreation/detached-child=true-4 1.29µs ± 2% 1.02µs ± 1% -21.35% (p=0.008 n=5+5)
SpanCreation/detached-child=true-8 1.16µs ± 1% 0.91µs ± 5% -21.56% (p=0.008 n=5+5)
SpanCreation/detached-child=true-16 1.14µs ± 2% 0.97µs ± 1% -15.08% (p=0.008 n=5+5)

name old alloc/op new alloc/op delta
SpanCreation/detached-child=false 288B ± 0% 288B ± 0% ~ (all equal)
SpanCreation/detached-child=false-2 288B ± 0% 288B ± 0% ~ (all equal)
SpanCreation/detached-child=false-4 288B ± 0% 288B ± 0% ~ (all equal)
SpanCreation/detached-child=false-8 288B ± 0% 288B ± 0% ~ (all equal)
SpanCreation/detached-child=false-16 288B ± 0% 288B ± 0% ~ (all equal)
SpanCreation/detached-child=true 768B ± 0% 288B ± 0% -62.50% (p=0.008 n=5+5)
SpanCreation/detached-child=true-2 768B ± 0% 288B ± 0% -62.50% (p=0.008 n=5+5)
SpanCreation/detached-child=true-4 768B ± 0% 288B ± 0% -62.50% (p=0.008 n=5+5)
SpanCreation/detached-child=true-8 768B ± 0% 288B ± 0% -62.50% (p=0.008 n=5+5)
SpanCreation/detached-child=true-16 769B ± 0% 288B ± 0% -62.55% (p=0.008 n=5+5)

name old allocs/op new allocs/op delta
SpanCreation/detached-child=false 6.00 ± 0% 6.00 ± 0% ~ (all equal)
SpanCreation/detached-child=false-2 6.00 ± 0% 6.00 ± 0% ~ (all equal)
SpanCreation/detached-child=false-4 6.00 ± 0% 6.00 ± 0% ~ (all equal)
SpanCreation/detached-child=false-8 6.00 ± 0% 6.00 ± 0% ~ (all equal)
SpanCreation/detached-child=false-16 6.00 ± 0% 6.00 ± 0% ~ (all equal)
SpanCreation/detached-child=true 16.0 ± 0% 6.0 ± 0% -62.50% (p=0.008 n=5+5)
SpanCreation/detached-child=true-2 16.0 ± 0% 6.0 ± 0% -62.50% (p=0.008 n=5+5)
SpanCreation/detached-child=true-4 16.0 ± 0% 6.0 ± 0% -62.50% (p=0.008 n=5+5)
SpanCreation/detached-child=true-8 16.0 ± 0% 6.0 ± 0% -62.50% (p=0.008 n=5+5)
SpanCreation/detached-child=true-16 16.0 ± 0% 6.0 ± 0% -62.50% (p=0.008 n=5+5)

Release note: None

This patch pre-allocates capacity for the options slice in
EnsureForkSpan. Before, we were reallocating the slice storage
repeatedly, as it was growing from 0 to 1 and from 1 to 2 elements.

This patch also simplifies, and slighly speeds up, EnsureChildSpan. This
guy was using a sync.Pool for its options slice. That is not necessary,
since the slice storage does not escape.

In the benchmarks below, detached-child=false is EnsureChildSpan and
detached-child=true is EnsureForkSpan.

> make bench PKG=./pkg/util/tracing BENCHES=BenchmarkSpanCreation TESTFLAGS="-count=5 -cpu=1,2,4,8,16"
name                                  old time/op    new time/op    delta
SpanCreation/detached-child=false       1.70µs ± 1%    1.61µs ± 1%   -5.06%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-2     1.17µs ± 7%    1.04µs ± 3%  -10.70%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-4      958ns ± 2%     893ns ±10%   -6.76%  (p=0.008 n=5+5)
SpanCreation/detached-child=false-8     1.11µs ±38%    0.95µs ± 1%     ~     (p=0.310 n=5+5)
SpanCreation/detached-child=false-16     925ns ± 1%     919ns ± 1%     ~     (p=0.286 n=4+5)
SpanCreation/detached-child=true        2.27µs ± 2%    1.88µs ± 0%  -17.27%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-2      1.82µs ±10%    1.33µs ±22%  -26.81%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-4      1.29µs ± 2%    1.02µs ± 1%  -21.35%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-8      1.16µs ± 1%    0.91µs ± 5%  -21.56%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-16     1.14µs ± 2%    0.97µs ± 1%  -15.08%  (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
SpanCreation/detached-child=false         288B ± 0%      288B ± 0%     ~     (all equal)
SpanCreation/detached-child=false-2       288B ± 0%      288B ± 0%     ~     (all equal)
SpanCreation/detached-child=false-4       288B ± 0%      288B ± 0%     ~     (all equal)
SpanCreation/detached-child=false-8       288B ± 0%      288B ± 0%     ~     (all equal)
SpanCreation/detached-child=false-16      288B ± 0%      288B ± 0%     ~     (all equal)
SpanCreation/detached-child=true          768B ± 0%      288B ± 0%  -62.50%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-2        768B ± 0%      288B ± 0%  -62.50%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-4        768B ± 0%      288B ± 0%  -62.50%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-8        768B ± 0%      288B ± 0%  -62.50%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-16       769B ± 0%      288B ± 0%  -62.55%  (p=0.008 n=5+5)

name                                  old allocs/op  new allocs/op  delta
SpanCreation/detached-child=false         6.00 ± 0%      6.00 ± 0%     ~     (all equal)
SpanCreation/detached-child=false-2       6.00 ± 0%      6.00 ± 0%     ~     (all equal)
SpanCreation/detached-child=false-4       6.00 ± 0%      6.00 ± 0%     ~     (all equal)
SpanCreation/detached-child=false-8       6.00 ± 0%      6.00 ± 0%     ~     (all equal)
SpanCreation/detached-child=false-16      6.00 ± 0%      6.00 ± 0%     ~     (all equal)
SpanCreation/detached-child=true          16.0 ± 0%       6.0 ± 0%  -62.50%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-2        16.0 ± 0%       6.0 ± 0%  -62.50%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-4        16.0 ± 0%       6.0 ± 0%  -62.50%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-8        16.0 ± 0%       6.0 ± 0%  -62.50%  (p=0.008 n=5+5)
SpanCreation/detached-child=true-16       16.0 ± 0%       6.0 ± 0%  -62.50%  (p=0.008 n=5+5)

Release note: None
@andreimatei andreimatei requested review from a team and nvb February 7, 2022 23:26
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


-- commits, line 15 at r1:
for my learning: it best practice to benchmark with many cpu values like this? what value does running with different values for -cpu values offer here? is the goal to show the consistency of the impact?gh


pkg/util/tracing/tracer.go, line 1496 at r1 (raw file):

) (context.Context, *Span) {
	// NB: Making the capacity dynamic, based on len(os), makes this allocate.
	// With a fixed length, it doesn't allocate.

trying to understand how this works...when you call make with a fixed integer length, no allocation is done, and the allocation happens when you append for the first time? but what if you called make([]SpanOption, 0, len(os)+1) wouldn't that give you exactly the capacity you need below?

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @nvanbenschoten)


-- commits, line 15 at r1:

Previously, dhartunian (David Hartunian) wrote…

for my learning: it best practice to benchmark with many cpu values like this? what value does running with different values for -cpu values offer here? is the goal to show the consistency of the impact?gh

For benchmarks that have been programmed for parallelism (b.RunParallel(func))), varying the number of cpus varies the concurrency. So, for example, you can see changes that only help or hurt at high concurrency. In this particular case, I probably could have run with a single cpu count; there's not much interesting happening as you vary it since span creation doesn't scale beyond two cores.


pkg/util/tracing/tracer.go, line 1496 at r1 (raw file):

when you call make with a fixed integer length, no allocation is done, and the allocation happens when you append for the first time?

When you use a fixed length/capacity, if that size is small enough and also if the storage for the slice doesn't need to escape to the heap (i.e. nobody's referencing slice elements beyond the lifetime of the function), then the storage will be allocated on the stack. append will not allocate while there's available capacity; it'll allocate once the original capacity is exhausted.

but what if you called make([]SpanOption, 0, len(os)+1) wouldn't that give you exactly the capacity you need below?

That would give me the capacity I want, but the storage would be allocated on the heap. As far as I can tell, the go compiler never allocates dynamically-sized objects on the stack. There's no alloca like in C.
That's why I made this capacity static, in the hope that it's not often exceeded.

@dhartunian dhartunian self-requested a review February 9, 2022 21:52
Copy link
Copy Markdown
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: TFTExplanation :)

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian and @nvanbenschoten)

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian and @nvanbenschoten)

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 10, 2022

Build succeeded:

@craig craig bot merged commit 9123557 into cockroachdb:master Feb 10, 2022
@andreimatei andreimatei deleted the tracing.EnsureForkSpan-reduce-allocs branch February 10, 2022 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants