Fatal error: concurrent map read and map write - CrashLoopBackOff #689

@ujjwal

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-1045-gke
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): GKE
  • GPU Operator Version: 22.9.1

2. Issue or feature description

The GPU Operator has been reporting the error "fatal error: concurrent map read and map write" and crash looping. This happens sporadically and prevents new GPU nodes from being added to the cluster.

{"level":"info","ts":1711652616.7035823,"logger":"controllers.ClusterPolicy","msg":"Reconciliate ClusterPolicies after node label update","nb":1}
{"level":"info","ts":1711652616.703655,"logger":"controllers.ClusterPolicy","msg":"Kubernetes version detected","version":"v1.27.10-gke.1055000"}
fatal error: concurrent map read and map write

goroutine 216 [running]:
k8s.io/apimachinery/pkg/runtime.(*Scheme).New(0xc0002401c0, {{0x1d7bcdf, 0xa}, {0x1d762e6, 0x2}, {0x1905e27, 0xd}})
	/workspace/vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:296 +0x65
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).objectTypeForListObject(0xc00049d710, {0x2073490?, 0xc0002cbb90})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:119 +0x3dd
sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).List(0xc00049d710, {0x206a408, 0xc00024cdc0}, {0x2073490, 0xc0002cbb90}, {0x2f8bbc0, 0x0, 0x0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/cache/informer_cache.go:75 +0x65
sigs.k8s.io/controller-runtime/pkg/client.(*client).List(0xc0004b86c0, {0x206a408, 0xc00024cdc0}, {0x2073490?, 0xc0002cbb90?}, {0x2f8bbc0, 0x0, 0x0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:365 +0x4c5
github.com/NVIDIA/gpu-operator/controllers.addWatchNewGPUNode.func1({0x206a408, 0xc00024cdc0}, {0xc001a09e20?, 0x424f05?})
	/workspace/controllers/clusterpolicy_controller.go:264 +0x8c
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue(0xc00160db40?, {0x206a408?, 0xc00024cdc0?}, {0x2073cc0, 0xc0007463a0}, {0x20821a8?, 0xc000e2a440?}, 0xc00160dbc8?)
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:81 +0x59
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).Create(0x206a408?, {0x206a408, 0xc00024cdc0}, {{0x20821a8?, 0xc000e2a440?}}, {0x2073cc0, 0xc0007463a0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/handler/enqueue_mapped.go:58 +0xe5
sigs.k8s.io/controller-runtime/pkg/internal/source.(*EventHandler).OnAdd(0xc0003c4140, {0x1d402e0?, 0xc000e2a440})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/source/event_handler.go:88 +0x27c
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
	/workspace/vendor/k8s.io/client-go/tools/cache/controller.go:243
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
	/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:973 +0x13e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0005ddf38?, {0x204fec0, 0xc001602000}, 0x1, 0xc001600000)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x3a6b222c7d7d7b3a?, 0x3b9aca00, 0x0, 0x69?, 0x227b3a227d225c67?)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0005d4990)
	/workspace/vendor/k8s.io/client-go/tools/cache/shared_informer.go:967 +0x69
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4f
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 181
	/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73

goroutine 1 [select]:
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start(0xc000622820, {0x206a408, 0xc000482aa0})
	/workspace/vendor/sigs.k8s.io/controller-runtime/pkg/manager/internal.go:509 +0x825
main.main()
	/workspace/main.go:176 +0xea8
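
For context on the error itself: the Go runtime aborts the process with this message when it detects one goroutine reading a plain map while another writes to it, and the abort cannot be caught with recover. In the trace above, the reading goroutine is inside Scheme.New, reached from the node watch handler (addWatchNewGPUNode); the writing goroutine is not shown in this excerpt, so where the write comes from is not established here. The snippet below is only a minimal sketch of the general pattern and one common way to make it safe, not the operator's code or its fix. The registry type and its register/lookup methods are hypothetical names invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for any structure that keeps
// type information in a plain Go map shared between goroutines.
type registry struct {
	mu    sync.RWMutex
	kinds map[string]string
}

func newRegistry() *registry {
	return &registry{kinds: make(map[string]string)}
}

// register writes to the map under the exclusive lock.
func (r *registry) register(kind, version string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.kinds[kind] = version
}

// lookup reads from the map under the shared lock, so reads can run
// in parallel with each other but never overlap a write.
func (r *registry) lookup(kind string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.kinds[kind]
	return v, ok
}

func main() {
	r := newRegistry()
	r.register("Node", "v1")

	// Several goroutines write and read at the same time. Without the
	// mutex, this access pattern can trigger the fatal error from the
	// issue title.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			r.register(fmt.Sprintf("Kind%d", i), "v1")
			r.lookup("Node")
		}(i)
	}
	wg.Wait()

	v, ok := r.lookup("Node")
	fmt.Println(v, ok)
}
```

An alternative that avoids locking altogether is to finish all writes (for example, all type registration) before any goroutine that reads the map is started.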

Labels: bug