kv: Rangefeeds appear to be stuck #86818
Description
This has been observed in large-scale deployments; see https://github.com/cockroachlabs/support/issues/1729.
The cause of changefeed/rangefeed stuckness is not understood. However, stacks like the following should never occur:
```
goroutine 2980779476 [select, 107 minutes]:
google.golang.org/grpc/internal/transport.(*recvBufferReader).readClient(0xc06d9f8cd0, {0xc04cda7a98, 0x5, 0x5})
google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:190 +0xaa
google.golang.org/grpc/internal/transport.(*recvBufferReader).Read(0xc06d9f8cd0, {0xc04cda7a98, 0xc06bf566f0, 0xc08a7d3128})
google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:170 +0x147
google.golang.org/grpc/internal/transport.(*transportReader).Read(0xc016293e60, {0xc04cda7a98, 0xc08a7d31a0, 0xa4f2c7})
google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:484 +0x32
io.ReadAtLeast({0x62a59c0, 0xc016293e60}, {0xc04cda7a98, 0x5, 0x5}, 0x5)
GOROOT/src/io/io.go:328 +0x9a
io.ReadFull(...)
GOROOT/src/io/io.go:347
google.golang.org/grpc/internal/transport.(*Stream).Read(0xc00cf6cfc0, {0xc04cda7a98, 0x5, 0x5})
google.golang.org/grpc/internal/transport/external/org_golang_google_grpc/internal/transport/transport.go:468 +0xa5
google.golang.org/grpc.(*parser).recvMsg(0xc04cda7a88, 0x7fffffff)
google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:559 +0x47
google.golang.org/grpc.recvAndDecompress(0x58, 0xc00cf6cfc0, {0x0, 0x0}, 0x7fffffff, 0xc08a7d3458, {0x62e5030, 0x9b3b708})
google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:690 +0x66
google.golang.org/grpc.recv(0x62c4688, {0x7f92ab1e4980, 0xc000483a90}, 0x7f92a1156cf8, {0x0, 0x0}, {0x4d32d40, 0xc06ce1db60}, 0xb, 0xc08a7d3458, ...)
google.golang.org/grpc/external/org_golang_google_grpc/rpc_util.go:756 +0x6e
google.golang.org/grpc.(*csAttempt).recvMsg(0xc025850840, {0x4d32d40, 0xc06ce1db60}, 0x0)
google.golang.org/grpc/external/org_golang_google_grpc/stream.go:975 +0x2b0
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0x0)
google.golang.org/grpc/external/org_golang_google_grpc/stream.go:826 +0x25
google.golang.org/grpc.(*clientStream).withRetry(0xc020214b00, 0xc08a7d3590, 0xc08a7d3560)
google.golang.org/grpc/external/org_golang_google_grpc/stream.go:680 +0x2f6
google.golang.org/grpc.(*clientStream).RecvMsg(0xc020214b00, {0x4d32d40, 0xc06ce1db60})
google.golang.org/grpc/external/org_golang_google_grpc/stream.go:825 +0x11f
github.com/cockroachdb/cockroach/pkg/util/tracing.(*tracingClientStream).RecvMsg(0xc0398cfc60, {0x4d32d40, 0xc06ce1db60})
github.com/cockroachdb/cockroach/pkg/util/tracing/grpc_interceptor.go:440 +0x37
github.com/cockroachdb/cockroach/pkg/roachpb.(*internalRangeFeedClient).Recv(0xc07aff97c0)
github.com/cockroachdb/cockroach/pkg/roachpb/bazel-out/k8-opt/bin/pkg/roachpb/roachpb_go_proto_/github.com/cockroachdb/cockroach/pkg/roachpb/api.pb.go:9284 +0x4c
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).singleRangeFeed(0xc00067ed80, {0x6345010, 0xc054e938c0}, {{0xc03f46e980, 0xf, 0x10}, {0xc0148a2580, 0xf, 0x10}}, {0x170dfa486f5eae51, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:465 +0xae3
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).partialRangeFeed(0xc00067ed80, {0x6345010, 0xc054e938c0}, 0xc08d0462a0, {{0xc03f46e980, 0xf, 0x10}, {0xc0148a2580, 0xf, 0x10}}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:315 +0x6fb
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).RangeFeed.func1.1({0x6345010, 0xc054e938c0})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go:110 +0xbe
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:169 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:57 +0x67
created by golang.org/x/sync/errgroup.(*Group).Go
golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:54 +0x92
```
The DistSender should never be blocked in (gRPC) Recv for 107 minutes, since each range should be producing either events or
range checkpoints (at least once per kv.closed_timestamp.side_transport_interval).
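As a rough illustration (not the actual DistSender implementation), the sketch below shows one way a client-side watchdog could bound how long a stream Recv is allowed to stay silent, using the checkpoint cadence as the budget. The `event`, `eventStream`, `recvWithWatchdog`, and `maxQuiet` names are hypothetical stand-ins for `roachpb.RangeFeedEvent`, the generated gRPC client stream, and a threshold derived from `kv.closed_timestamp.side_transport_interval`:

```go
package rangefeedsketch

import (
	"context"
	"errors"
	"time"
)

// event stands in for roachpb.RangeFeedEvent, which carries either a
// value or a range checkpoint.
type event struct{ checkpoint bool }

// eventStream stands in for the generated gRPC RangeFeed client stream.
type eventStream interface {
	Recv() (*event, error)
}

// errStuck is returned when the stream produces nothing for too long.
var errStuck = errors.New("rangefeed stream appears stuck")

// recvWithWatchdog wraps a blocking Recv with a deadline derived from
// the checkpoint cadence: a healthy range emits a checkpoint at least
// once per side-transport interval, so several quiet intervals suggest
// a zombie stream that should be torn down rather than waited on.
func recvWithWatchdog(
	ctx context.Context, s eventStream, maxQuiet time.Duration,
) (*event, error) {
	type result struct {
		ev  *event
		err error
	}
	// Buffered so the pumping goroutine can deliver its result and exit
	// even after we have given up waiting. The goroutine only unblocks
	// once the stream's context is canceled, so the caller must cancel
	// that context whenever errStuck is returned.
	ch := make(chan result, 1)
	go func() {
		ev, err := s.Recv()
		ch <- result{ev, err}
	}()

	timer := time.NewTimer(maxQuiet)
	defer timer.Stop()

	select {
	case r := <-ch:
		return r.ev, r.err
	case <-timer.C:
		return nil, errStuck
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```

Canceling the stream's context is what actually unblocks the underlying transport read; the `recvBufferReader.readClient` frame in the stack above is selecting on the context's done channel.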
This is a repeat of an issue we observed about a year ago for the same customer.
It seems to happen during periods of significant activity, when ranges are being split or moved (possibly to different
nodes and/or stores). There appears to be some sort of race in which the rangefeed is not disconnected: it remains in a zombie state where matching goroutines exist on the server side, but nothing is emitted, leaving the consumer stuck.
We should add a defense-in-depth mechanism to the DistSender (being worked on),
and we should also figure out the root cause.
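Building on the sketch above (same hypothetical names), such a defense-in-depth consumer loop might tear down and re-establish a single-range stream whenever the watchdog trips, instead of blocking in Recv indefinitely. Here `openStream` is a hypothetical stand-in for the stream setup done in `DistSender.singleRangeFeed`:

```go
// runSingleRangeFeed is a hypothetical retry loop, not DistSender's actual
// logic: each stream gets its own cancelable context so that a stuck Recv
// (and its pumping goroutine) is released on teardown.
func runSingleRangeFeed(
	ctx context.Context,
	openStream func(context.Context) (eventStream, error),
	maxQuiet time.Duration,
	handle func(*event),
) error {
	for {
		streamCtx, cancel := context.WithCancel(ctx)
		s, err := openStream(streamCtx)
		if err != nil {
			cancel()
			return err
		}
		for {
			ev, err := recvWithWatchdog(streamCtx, s, maxQuiet)
			if errors.Is(err, errStuck) {
				cancel() // unblock the zombie stream's pending Recv
				break    // and retry with a fresh stream
			}
			if err != nil {
				cancel()
				return err
			}
			handle(ev)
		}
	}
}
```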
Jira issue: CRDB-18946