
[dnm] tracing: clean up use of auto collection across tasks #59391

Closed
tbg wants to merge 2 commits into cockroachdb:master from tbg:trace-leak-take2

Conversation

Member

@tbg tbg commented Jan 25, 2021

Just posting this so it doesn't fall off the radar.

  • stop: don't auto-collect forked spans
  • tracing: limit number of direct children
  • tracing: use manual collection in ForkCtxSpan

@cockroach-teamcity
Member

This change is Reviewable

// sense for spans that extend past the lifetime of their
// parent. Nobody will collect the recording, but setting up
// a span is still helpful when an external tracer is set up.
ctx, sp = parentSp.Tracer().StartSpanCtx(ctx, taskName, tracing.WithParentAndManualCollection(parentSp.Meta()))
Member Author

@irfansharif this is actually obsolete with the other commit, which teaches ForkCtxSpan to do the right thing.

Unfortunately a lot of tracing-based tests break because they use this pattern:

sp := tr.StartSpan("foo")
stopper.RunAsyncTask("bar", func(ctx context.Context) {
  log.Event(ctx, "bar")
})
if !contains(sp.Recording(), "bar") { fail }

This is no longer reasonable because the span that contains bar is not a direct child of sp. Instead, I think such tests really want to use tracing as it is supposed to be used, where all spans are recorded somewhere and can be retrieved.

type RecordingCollector struct {
  mu struct {
    spans []*Span
  }
}

var rc tracing.RecordingCollector
sp := tr.StartSpan("foo", tracing.WithRecordingCollector(&rc))
// Child spans will all (also?) report to that collector, whether direct or through .Meta()

// ...
require.Contains(t, rc.GetRecording(), "foo")

Of course there are limits to in-memory collectors. They can't travel across process/RPC boundaries. So while Meta() could hold a reference to rc above, if we serialized it through an RPC we could not propagate it to the other end. But you could imagine collectors that could do it, for example a collector that is just an endpoint to which to send the recordings. That seems fine; we just need the collector to be able to tell the tracer how to serialize (if at all). There could even be hybrids: a local in-mem recording collector would serialize as a "send to this endpoint" collector.

The collector business would also replace the WithParentAnd{Auto,Manual}Collection options. The auto mode would become a WithParentCollector option (i.e. report to the parent directly), and the manual mode is basically no collection, which would be the default. The parent and the collection mode would be split, i.e. we'd add WithParent and WithParentMeta.

Curious what you think and whether there is a less invasive option we could use. We do need these tests to pass, unfortunately, but the changes in this PR are imo very worthwhile. We still leak 25mb of memory on long-running nodes because some long-running spans hold 1000 spans alive.

Member Author

If we take it a step further, there's even a way to make tracing more idiomatic and do away with that manual ImportRemoteSpans stuff altogether.

Conceptually, what I would always have preferred is if the gateway node (or wherever the root span is) opened some kind of mailbox, and then all of the spans would report there. The problem I always saw with that was that you couldn't really tell when to stop waiting for new spans. But I think the problems we're seeing with the spans-as-record approach are pretty annoying too (think about how easy memory management could be if spans didn't hold on to each other and messages, etc, but just shipped things out to a central location!). Really the only benefit they give us is that the trace is done the moment the requests have executed and the root span finishes.

I don't think this can happen now, but it's worth keeping in mind as an ultimate goal. Curious what @andreimatei thinks.

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/util/stop/stopper.go, line 350 at r1 (raw file):

// NB: we don't use auto collection because it does not make
// sense for spans that extend past the lifetime of their
// parent. Nobody will collect the recording, but setting up

It does make sense, though. I like seeing those [async] spans in recordings - even if the respective span might not be complete.

I generally like the existing "recording" model, where spans link to their children and if you have your hands on a root, you can get all of its recording. Speaking at a very high level, one of my main two or three problems with trace collectors is that it's hard to find what you're looking for, even when what you're looking for has been collected.

From a parent's perspective, there's not that much of a difference between a ChildOf child and a FollowsFrom child, in my opinion. There are perhaps some async tasks that are completely uninteresting to the parent, but others are interesting enough. I think the RunAsyncTask should offer an option of not creating a FollowsFrom child at all.

We still leak 25mb of memory on long-running nodes because some long-running spans hold 1000 spans alive.

And those 1000 children are async tasks?
Regardless, in my opinion, if a span is long-running, it runs the risk of birthing many children, sync or async. And so the solution for memory management should ideally be uniform across the two. Isn't a limit on the number of stored children good enough?

think about how easy memory management could be if spans didn't hold on to each other and messages

Would it be all that easy? If you want full traces to be available, you have the same problem: you need to tie together the lifetimes of different spans.

The spans created by ForkCtxSpan are expected to outlive the parent, so
automatic collection makes little sense here. It also causes the child
to be linked into the parent, which can delay memory reclamation; some
parent spans are effectively ephemeral.

As an example of this, consider this usage:

https://github.com/cockroachdb/cockroach/blob/e1f9e34e/pkg/kv/kvserver/replica_application_decoder.go#L162

The incoming context has a long-lived span, created below, and similar
problems will generally hold across all workers.

https://github.com/cockroachdb/cockroach/blob/8c5253bb/pkg/kv/kvserver/scheduler.go#L205-L204

Release note: None
Contributor

@irfansharif irfansharif left a comment

PTAL; I ran out of time, so this needs a bit of clean-up -- will get to it tomorrow.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/util/stop/stopper.go, line 350 at r1 (raw file):

Curious what you think and whether there is a less invasive option we could use.

Short of introducing this in-memory collector (not disagreeing that it's a good idea, if anything it's a closer approximation to the centralized tracing storage prototype we were talking about elsewhere: #59310), I think we can do something simpler for the tests that want to assert on traces from forked spans -- in tests we can opt-in and configure the Tracer to simply link forked spans back to their parents. There are only two instances where we seem to need such a thing, so feels easier to do.

We still leak 25mb of memory on long-running nodes because some long-running spans hold 1000 spans alive.

Separately we can introduce a more aggressive per-span limit for select long-running spans that we know are problematic, building on top of #59370. Whatcha think?

It does make sense, though. I like seeing those [async] spans in recordings - even if the respective span might not be complete.

One other thing we can do to not regress on our story here, is to set the appropriate parent span ID for forked off spans (without needing to hold on to a reference to the parent, should be easy enough to do):

func (opts *spanOptions) parentSpanID() uint64 {
	if opts.Parent != nil && !opts.Parent.isNoop() {
		return opts.Parent.crdb.spanID
	} else if opts.RemoteParent != nil {
		return opts.RemoteParent.spanID
	}
	return 0
}

That way, and admittedly it's kind of annoying now, you could hit the vtable (#59492) to grab a handle on the live spans and include async spans that derived from it.

I generally like the existing "recording" model, where spans link to their children and if you have your hands on a root, you can get all of its recording.

Looking further ahead (not in 21.1), when we get to #59310, you could imagine grabbing a handle on the root node, and consulting the centralized storage for all spans that derived from the given parent's span ID (without needing to build out an in-memory graph of span relations). If the centralized store was memory bound, and internally dropped spans as needed (certainly older, Finish()-ed ones, but also live ones if needed), you could also retrieve recordings for Finish()-ed children.

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/util/stop/stopper.go, line 350 at r1 (raw file):

We still leak 25mb of memory on long-running nodes because some long-running spans hold 1000 spans alive.

Separately we can introduce a more aggressive per-span limit for select long-running spans that we know are problematic, building on top of #59370. Whatcha think?

But why separately? Why are these async spans singled out for this treatment? What's the argument for not just treating them as regular children? In particular, what's the case that started all this? What long-running span was getting tons of these async children without also getting tons of regular children?

Member Author

@tbg tbg left a comment

Unfortunately, I don't think the LinkForkedSpans() option is a solid approach. It essentially makes tracing work "differently" in a very large class of tests (logic tests) and these also exercise the shape of traces, which is only conserved because of this magic option.

I think we should make a subscription option to the Tracer instead, where a func(*Span) is invoked every time a Span is removed from the tracer's registry. This function can then pull the recording if it wants to. In most of the explicit tests I see here, we would

a) set up the tracer with an interceptor
b) start a verbose span
c) set the interceptor to capture the span recordings for any spans with the right traceID
d) make the test work that way.

This isn't going to work the same way for the logic tests, so I'd like to know first if we will run into problems we can't solve there.

Also, the discussion with Andrei below makes me think that perhaps we should not be doing this now, if you think the alternative I proposed is reasonable.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @irfansharif)


pkg/util/stop/stopper.go, line 350 at r1 (raw file):

In particular, what's the case that started all this? What long-running span was getting tons of these async children without also getting tons of regular children?

The one that concretely made us revert the always-on tracing setting is this one:

} else {
	cmd.ctx, cmd.sp = tracing.ForkCtxSpan(ctx, opName)
}

The span in ctx belonged to (*raftScheduler).worker and lived forever. It was holding on to an extra span for roughly each raft command applied, and ran us out of memory quickly. But this isn't to say that there aren't others. Note for example that we keep a tracing span open during ./cockroach start, which sets up lots of long-running moving parts through RunAsyncTask. We can't afford to play whack-a-mole with this. I agree that memory accounting is still necessary either way, but adding memory accounting doesn't give us a license to play fast and loose.

Why are these async spans singled out for this treatment? What's the argument for not just treating them as regular children?

I would hold this the other way around - the special snowflake is the auto-collected span. A regular child is one across an RPC boundary, and you can't in-memory link the two. So why not do the same thing everywhere else? The only answer I have is that it's too much work to completely change trace collection now, especially with SQL's convoluted use and reuse of trace spans.

There is one other option though, which is that we dial back always-on tracing even more. We've already agreed that we're going to keep the noop span and the ForceRealSpan option around to limit the creation of real spans to where it is explicitly requested. We could additionally add a new recording type as indicated by this TODO:

const (
	// RecordingOff means that the Span discards all events handed to it.
	// Child spans created from it similarly won't be recording by default.
	RecordingOff RecordingType = iota
	// RecordingVerbose means that the Span is adding events passed in via LogKV
	// and LogData to its recording and that derived spans will do so as well.
	RecordingVerbose
	// TODO(tbg): add RecordingBackground for always-on tracing.
)

And simply make sure that the SQL spans always have this recording type. This would keep the changes to which spans we create between 20.2 and 21.1 minimal while still achieving the goal of letting contention metadata flow towards the gateway: we were already creating a real span for the gateway in 20.2; now the only difference is that its RPCs to the KV layer will also create spans on the KV side by default (that only propagate structured meta). I think we can achieve this without encoding the RecordingBackground mode on the wire: if a span is encoded, and the snowball baggage item isn't set, then the recording mode is RecordingBackground. (If the RecordingMode was RecordingOff, the span wouldn't ever have been put onto the wire.)
This should be reasonably good, though not perfect, at avoiding leaks. Paired with per-local-root-span memory accounting, it's hopefully good enough.


pkg/util/tracing/tracer.go, line 648 at r3 (raw file):

	// XXX: Document. When forking, if parent span was told to link the child
	// span (primarily in tests), we should do it.
	collection := WithParentAndManualCollection(sp.Meta())

This one allocates, so better to flip it and default to the auto collection one and then override if necessary.


pkg/util/tracing/tracer.go, line 736 at r3 (raw file):

	ctx context.Context, opName string,
) (_ context.Context, getRecording func() Recording, cancel func()) {
	tr := NewTracer()

Huh, hadn't noticed that. That we are using a new tracer here is pretty much not acceptable even outside of the forking business (this tracer doesn't respect cluster settings). Can you phase out this method in favor of your new one (I'd just add a tracer into the signature here)?

@irfansharif
Contributor

Superseded by #59815.
