Skip to content

bulkio: error from ingestKvs causes deadlock #49989

@nvb

Description

@nvb

Running with #49979, I'm still seeing the stall that the change was hoping to fix.

Here's a dump of the goroutines: https://gist.github.com/nvanbenschoten/3cffff71d9847e09e366bfaeed0f15b4.

Notably, we still have a collection of importWorkers all stuck in (*DatumRowConverter).SendBatch. However, we have no one pulling from this channel. After reading through this code a bit, I believe this channel is supposed to be consumed by this goroutine:

group.GoCtx(func(ctx context.Context) error {
summary, err = ingestKvs(ctx, flowCtx, spec, progCh, kvCh)
if err != nil {
return err
}
var prog execinfrapb.RemoteProducerMetadata_BulkProcessorProgress
prog.ResumePos = make(map[int32]int64)
prog.CompletedFraction = make(map[int32]float32)
for i := range spec.Uri {
prog.CompletedFraction[i] = 1.0
prog.ResumePos[i] = math.MaxInt64
}
progCh <- prog
return nil
})

That goroutine is not in the stacktrace. runImport is pretty suspicious. It links together two ctxgroup instances here:

// This group links together the producers (via producerGroup) and the KV ingester.
group := ctxgroup.WithContext(ctx)
group.Go(func() error {
defer close(kvCh)
return producerGroup.Wait()
})

However, it only seems to care about producerGroup hitting an error before group (consumerGroup), not the other way around. If ingestKvs throws an error, I don't see anything that would stop this from deadlocking.

Metadata

Metadata

Assignees

Labels

A-disaster-recoveryC-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions