Skip to content

kvserver: use non-cancelable contexts in raft application #75656

@tbg

Description

@tbg

Describe the problem

As of d064059, raft application on the proposal happens under the proposer's context, which may well be canceled before or mid-way through the application. However, raft application should not care about the cancellation, and we just diagnosed a bug where it did.

Also, we had problems with cancellation on the raft goroutine in the past and check against it, but evidently not well enough yet.

Expected behavior

The raft apply loop should use a ctx that is not cancelable. It makes sense to derive certain parts from the proposer - such as a trace span - but not the cancellation policy. That is, in

// createTracingSpans creates and assigns a new tracing span for each decoded
// command. If a command was proposed locally, it will be given a tracing span
// that follows from its proposal's span.
func (d *replicaDecoder) createTracingSpans(ctx context.Context) {
const opName = "raft application"
var it replicatedCmdBufSlice
for it.init(&d.cmdBuf); it.Valid(); it.Next() {
cmd := it.cur()
if cmd.IsLocal() {
cmd.ctx, cmd.sp = tracing.ChildSpan(cmd.proposal.ctx, opName)
} else if cmd.raftCmd.TraceData != nil {
// The proposal isn't local, and trace data is available. Extract
// the remote span and start a server-side span that follows from it.
spanMeta, err := d.r.AmbientContext.Tracer.ExtractMetaFrom(tracing.MapCarrier{
Map: cmd.raftCmd.TraceData,
})
if err != nil {
log.Errorf(ctx, "unable to extract trace data from raft command: %s", err)
} else {
cmd.ctx, cmd.sp = d.r.AmbientContext.Tracer.StartSpanCtx(
ctx,
opName,
// NB: Nobody is collecting the recording of this span; we have no
// mechanism for it.
tracing.WithRemoteParent(spanMeta),
tracing.WithFollowsFrom(),
)
}
} else {
cmd.ctx, cmd.sp = tracing.ChildSpan(ctx, opName)
}
}
}

we should not use a child context of cmd.proposal.ctx but create one derived from ctx (the worker ctx) or context.Background().

Additional data / screenshots

Should also write a test that uses a TestingPostApplyHook to ensure that the ctx is not cancelable, or at least that the cancellation doesn't carry over to the child. (Or more simply, that a value stashed in the proposer context is not visible at apply time).

@erikgrinaker didn't we put in some invariant check about this sometime recently?

Jira issue: CRDB-12769

Metadata

Metadata

Assignees

Labels

A-kv-replicationRelating to Raft, consensus, and coordination.C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions