Describe the bug
We started seeing random crashes caused by failing liveness probes in our helm-operator installations. After looking at a profile taken from a running instance, we saw that CPU and memory usage climb until the process stops responding.
Another behaviour we saw is that Helm releases end up in the pending-upgrade state, which we have to clean up manually. I guess that's due to the stale "starting sync run".
To Reproduce
Steps to reproduce the behaviour:
- Install helm-operator with a CPU limit of 500m and a memory limit of 256Mi
- Install some charts from the helm stable charts repository
- Watch the helm-operator get killed by the liveness probe, leaving Helm releases broken
Expected behavior
The operator should not crash and should not corrupt Helm releases.
Logs
helm-operator logs:
W0524 08:49:49.261320 6 client_config.go:543] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
ts=2020-05-24T08:49:49.29648543Z caller=operator.go:82 component=operator info="setting up event handlers"
ts=2020-05-24T08:49:49.29653343Z caller=operator.go:98 component=operator info="event handlers set up"
ts=2020-05-24T08:49:49.296856635Z caller=main.go:287 component=helm-operator info="waiting for informer caches to sync"
ts=2020-05-24T08:49:49.397055732Z caller=main.go:292 component=helm-operator info="informer caches synced"
ts=2020-05-24T08:49:49.397139534Z caller=git.go:104 component=gitchartsync info="starting sync of git chart sources"
ts=2020-05-24T08:49:49.397141434Z caller=operator.go:110 component=operator info="starting operator"
ts=2020-05-24T08:49:49.397198135Z caller=operator.go:112 component=operator info="starting workers"
ts=2020-05-24T08:49:49.398192549Z caller=server.go:42 component=daemonhttp info="starting HTTP server on :3030"
ts=2020-05-24T08:49:49.398508054Z caller=release.go:75 component=release release=minio targetNamespace=gitlab resource=gitlab:helmrelease/minio helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:49:49.398857159Z caller=release.go:75 component=release release=svcat targetNamespace=svcat resource=svcat:helmrelease/svcat helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:49:50.189199369Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=0.10.1
ts=2020-05-24T08:49:50.989017121Z caller=release.go:249 component=release release=svcat targetNamespace=svcat resource=svcat:helmrelease/svcat helmVersion=v3 info="running dry-run upgrade to compare with release version '5'" action=dry-run-compare
ts=2020-05-24T08:49:50.991875863Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for svcat" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:51.00369044Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:53.915386944Z caller=helm.go:69 component=helm version=v3 info="performing update for svcat" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:54.100451308Z caller=helm.go:69 component=helm version=v3 info="dry run for svcat" targetNamespace=svcat release=svcat
ts=2020-05-24T08:49:54.392106666Z caller=release.go:268 component=release release=svcat targetNamespace=svcat resource=svcat:helmrelease/svcat helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:49:54.392299469Z caller=release.go:75 component=release release=elasticsearch targetNamespace=graylog resource=graylog:helmrelease/elasticsearch helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:49:58.510576689Z caller=release.go:105 component=release release=elasticsearch targetNamespace=graylog resource=graylog:helmrelease/elasticsearch helmVersion=v3 error="failed to determine sync action for release: status 'pending-upgrade' of release does not allow a safe upgrade"
ts=2020-05-24T08:49:58.511182698Z caller=release.go:75 component=release release=gitlab-runner targetNamespace=gitlab resource=gitlab:helmrelease/gitlab-runner helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:50:04.356173295Z caller=release.go:249 component=release release=minio targetNamespace=gitlab resource=gitlab:helmrelease/minio helmVersion=v3 info="running dry-run upgrade to compare with release version '1'" action=dry-run-compare
ts=2020-05-24T08:50:04.356167495Z caller=release.go:249 component=release release=gitlab-runner targetNamespace=gitlab resource=gitlab:helmrelease/gitlab-runner helmVersion=v3 info="running dry-run upgrade to compare with release version '1'" action=dry-run-compare
ts=2020-05-24T08:50:04.499243431Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for gitlab-runner" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:04.49980734Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for minio" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:04.516823894Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:04.550429796Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:05.900198651Z caller=helm.go:69 component=helm version=v3 info="performing update for minio" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:05.906840951Z caller=helm.go:69 component=helm version=v3 info="performing update for gitlab-runner" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:05.909939397Z caller=helm.go:69 component=helm version=v3 info="dry run for minio" targetNamespace=gitlab release=minio
ts=2020-05-24T08:50:05.916438994Z caller=helm.go:69 component=helm version=v3 info="dry run for gitlab-runner" targetNamespace=gitlab release=gitlab-runner
ts=2020-05-24T08:50:06.29659257Z caller=release.go:268 component=release release=gitlab-runner targetNamespace=gitlab resource=gitlab:helmrelease/gitlab-runner helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:50:06.303724677Z caller=release.go:75 component=release release=osba targetNamespace=osba resource=osba:helmrelease/osba helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:50:06.503599862Z caller=release.go:268 component=release release=minio targetNamespace=gitlab resource=gitlab:helmrelease/minio helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:50:06.503798764Z caller=release.go:75 component=release release=chartmuseum targetNamespace=chartmuseum resource=chartmuseum:helmrelease/chartmuseum helmVersion=v3 info="starting sync run"
ts=2020-05-24T08:50:13.194666158Z caller=release.go:249 component=release release=osba targetNamespace=osba resource=osba:helmrelease/osba helmVersion=v3 info="running dry-run upgrade to compare with release version '3'" action=dry-run-compare
ts=2020-05-24T08:50:14.195036991Z caller=helm.go:69 component=helm version=v3 info="preparing upgrade for osba" targetNamespace=osba release=osba
ts=2020-05-24T08:50:14.488930578Z caller=helm.go:69 component=helm version=v3 info="resetting values to the chart's original version" targetNamespace=osba release=osba
ts=2020-05-24T08:50:17.392153912Z caller=helm.go:69 component=helm version=v3 info="performing update for osba" targetNamespace=osba release=osba
ts=2020-05-24T08:50:17.691649982Z caller=helm.go:69 component=helm version=v3 info="dry run for osba" targetNamespace=osba release=osba
ts=2020-05-24T08:50:17.991267853Z caller=release.go:268 component=release release=osba targetNamespace=osba resource=osba:helmrelease/osba helmVersion=v3 info="no changes" phase=dry-run-compare
ts=2020-05-24T08:50:17.993421085Z caller=release.go:75 component=release release=graylog targetNamespace=graylog resource=graylog:helmrelease/graylog helmVersion=v3 info="starting sync run"
pprof top 10:
Showing top 10 nodes out of 371
flat flat% sum% cum cum%
1610ms 15.32% 15.32% 1610ms 15.32% runtime.memclrNoHeapPointers
850ms 8.09% 23.41% 850ms 8.09% math/big.addMulVVW
420ms 4.00% 27.40% 800ms 7.61% runtime.scanobject
290ms 2.76% 30.16% 1160ms 11.04% math/big.nat.montgomery
280ms 2.66% 32.83% 850ms 8.09% gopkg.in/yaml%2ev2.yaml_parser_scan_plain_scalar
270ms 2.57% 35.39% 270ms 2.57% runtime.futex
270ms 2.57% 37.96% 270ms 2.57% syscall.Syscall
250ms 2.38% 40.34% 2950ms 28.07% runtime.mallocgc
220ms 2.09% 42.44% 220ms 2.09% runtime.memmove
170ms 1.62% 44.05% 230ms 2.19% encoding/json.checkValid
Additional context
Possibly related:
After some searching I found these:
yaml/libyaml#111
yaml/libyaml#115
These lead me to believe that there is a serious issue in the YAML parsing code which can bring the whole application down without any warning.
Current index.yaml from helm stable:
index.yaml.zip
Gitlab index.yaml
gitlab_index.yaml.zip