Skip to content

Conversation

@fmoehler
Copy link
Contributor

@fmoehler fmoehler commented Feb 3, 2025

What is this change about?

We noticed that if bosh workers fail to startup for some reason, they go to sleep forever. This is due to the @retry_interval not being set here.
In our case the workers were not able to reach the database for some time, but instead of retrying they just went to sleep forever. Due to that the worker processes were still reported as running when doing monit summary . However, tasks where not picked up anymore, resulting in a long task queue. We could temporarily fix it by stopping and starting the director process:


monit stop director
monit start director

What tests have you run against this PR?

unit tests

How should this change be described in bosh release notes?

Fix bug leading to forever sleeping workers

Does this PR introduce a breaking change?

no

Tag your pair, your PM, and/or team!

@DennisAhaus

Copy link
Contributor

@a-hassanin a-hassanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great finding ... Thanks!

@rkoster rkoster requested review from a team, bgandon, jpalermo and klakin-pivotal and removed request for a team February 3, 2025 12:48
@ramonskie
Copy link
Contributor

looks like a sane thing to do.
but i am wondering about the underlying issues. the database.

@fmoehler
Copy link
Contributor Author

fmoehler commented Feb 3, 2025

This was an external hyperscaler database on gcp probably doing some maintenance.

@jpalermo jpalermo merged commit 84560ea into cloudfoundry:main Feb 3, 2025
9 checks passed
fmoehler added a commit to fmoehler/community that referenced this pull request Mar 6, 2025
I would like to change myself from reviewer to approver. You can find my contributions so far listed below:

# Foundational Infrastructure: VM deployment lifecycle (BOSH) Contributions
### PRs Commented on/Reviewed:
- 2024-12-12T22:52:53Z: [Remove `UseIsolatedEnv` from `system.Command`](cloudfoundry/bosh-utils#110)
- 2025-01-28T10:33:14Z: [enable ipv6 in boshSysctl](cloudfoundry/bosh-agent#347)
- 2025-02-10T07:47:30Z: [Extend bosh tagging for cpi](cloudfoundry/bosh#2603)
- 2025-02-26T16:50:01Z: [When removing recursors for Noble usage, we accidentally also removed handlers](cloudfoundry/bosh-dns-release#107)
### Issues that may be relevant:
- 2022-06-21T14:28:19Z: [Fix regression: Deployment manifests can contain yaml anchors](cloudfoundry/bosh#2386)
- 2023-03-23T12:33:02Z: [Update aws.md](cloudfoundry/docs-bosh#788)
- 2023-04-04T18:17:29Z: [Update post-stop.md](cloudfoundry/docs-bosh#789)
- 2023-04-13T07:47:12Z: [Update dns.md](cloudfoundry/docs-bosh#790)
- 2023-05-04T13:33:37Z: [correct the diskID for device path resolution](cloudfoundry/bosh-agent#309)
- 2023-05-04T14:22:24Z: [Device path resolution times out for aws and ali](cloudfoundry/bosh-agent#310)
- 2023-07-14T14:46:10Z: [introduce -v flag to show version information](cloudfoundry/bosh-azure-storage-cli#8)
- 2023-07-18T13:19:16Z: [auto bump the azure storage cli](cloudfoundry/bosh#2455)
- 2023-07-25T09:36:07Z: [update docs for azure storage account](cloudfoundry/docs-bosh#798)
- 2023-10-26T10:00:07Z: [append global flags to deploy cmd](cloudfoundry/bosh-cli#633)
- 2023-10-26T16:42:17Z: [Update configs.md](cloudfoundry/docs-bosh#806)
- 2023-10-31T16:19:11Z: [enable global flags for deploy command](cloudfoundry/bosh-cli#634)
- 2023-11-03T09:17:19Z: [bumped dependencies for nats-pure and ruby-prof to the most recent versions](cloudfoundry/bosh#2476)
- 2023-11-07T16:55:08Z: [update documentation](cloudfoundry/docs-bosh#810)
- 2023-11-09T12:57:03Z: [update global cli flags documentation](cloudfoundry/docs-bosh#811)
- 2023-12-12T13:53:01Z: [Include deploy config documentation](cloudfoundry/docs-bosh#816)
- 2023-12-13T16:33:11Z: [enable metrics for deploy config](cloudfoundry/bosh#2482)
- 2024-01-11T13:37:42Z: [fix typo](cloudfoundry/docs-bosh#818)
- 2024-01-11T13:44:44Z: [bosh scp command failing if used according to documentation](cloudfoundry/docs-bosh#819)
- 2024-02-14T13:05:36Z: [use octavia endpoint instead of neutron for lb activites](cloudfoundry/bosh-openstack-cpi-release#268)
- 2024-05-15T06:59:59Z: [bump fog openstack gem to new version 1.1.1 & enable application credential authentication](cloudfoundry/bosh-openstack-cpi-release#272)
- 2024-07-05T15:11:06Z: [bump ruby-prof version from 1.6.3 to 1.7.0](cloudfoundry/bosh#2535)
- 2024-08-07T13:25:05Z: [update gosigar version](cloudfoundry/bosh-dns-release#105)
- 2024-08-09T07:11:03Z: [fix formatting for configure blobstore for gcp section](cloudfoundry/docs-bosh#848)
- 2024-08-13T14:12:56Z: [introduce support for ceph blobstore](cloudfoundry/bosh-s3cli#43)
- 2024-08-28T08:27:36Z: [Isolated environment is hard-coded for building bosh-releases](cloudfoundry/bosh-cli#660)
- 2024-10-18T13:08:21Z: [Enable CPI Release golang](cloudfoundry/bosh-openstack-cpi-release#290)
- 2024-11-11T12:11:36Z: [link should point to main branch and not master branch](cloudfoundry/docs-bosh#853)
- 2024-11-13T20:34:24Z: [ensure that cpi uses the correct gem path during runtime](cloudfoundry/bosh-azure-cpi-release#702)
- 2024-11-13T20:51:42Z: [revert fastjsonparser](cloudfoundry/bosh-azure-cpi-release#703)
- 2024-11-13T21:04:07Z: ["Illegal instruction" error](cloudfoundry/bosh-azure-cpi-release#704)
- 2024-11-19T15:33:32Z: [Remove isolated environment from cpi and compiler cmd](cloudfoundry/bosh-cli#673)
- 2024-11-21T13:40:36Z: [Update pre-start.sh.erb in harden_ssh for noble stemcell](cloudfoundry/os-conf-release#70)
- 2024-11-26T19:30:35Z: [enable dual stack support](cloudfoundry/bosh-aws-cpi-release#172)
- 2024-12-05T10:11:36Z: [Metrics missing after upgrading to 280.1.10 ](cloudfoundry/bosh#2593)
- 2024-12-19T09:57:45Z: [Disable error messages related to networks/network interface cards](cloudfoundry/bosh-agent#344)
- 2024-12-19T10:12:41Z: [sysctl job disables ipv6 implicitly on jammy stemcells](cloudfoundry/os-conf-release#71)
- 2025-01-09T15:42:13Z: [Enable dual stack with single nic](cloudfoundry/bosh-aws-cpi-release#174)
- 2025-01-15T15:49:28Z: [Update 60-bosh-sysctl.conf](cloudfoundry/bosh-linux-stemcell-builder#407)
- 2025-01-28T10:33:14Z: [enable ipv6 in boshSysctl](cloudfoundry/bosh-agent#347)
- 2025-02-03T12:30:20Z: [set retry interval to avoid bosh workers sleeping forever in case of error](cloudfoundry/bosh#2599)
- 2025-02-07T11:40:47Z: [Update kernel_ipv6.go](cloudfoundry/bosh-agent#348)
- 2025-02-10T07:47:30Z: [Extend bosh tagging for cpi](cloudfoundry/bosh#2603)
- 2025-02-10T08:48:50Z: [Azure Compute Gallery integration](cloudfoundry/bosh-azure-cpi-release#709)
- 2025-02-10T08:50:20Z: [add ipv6 prefix](cloudfoundry/bosh-aws-cpi-release#176)
### Code contributions:
- 2023-03-23T12:32:37Z: [Update aws.md](cloudfoundry/docs-bosh@14888a5)
- 2023-04-04T18:17:16Z: [Update post-stop.md](cloudfoundry/docs-bosh@f8e4db2)
- 2023-04-13T07:47:00Z: [Update dns.md](cloudfoundry/docs-bosh@a51e96f)
- 2023-07-14T14:39:55Z: [introduce -v flag to show version information](cloudfoundry/bosh-azure-storage-cli@4fbf437)
- 2023-07-18T13:15:27Z: [auto bump the azure storage cli](cloudfoundry/bosh@c3b1cd0)
- 2023-07-18T13:19:24Z: [auto bump the azure storage cli](cloudfoundry/bosh@c3b1cd0)
- 2023-07-19T07:56:16Z: [introduce -v flag to show version information](cloudfoundry/bosh-azure-storage-cli@4fbf437)
- 2023-07-20T09:45:02Z: [add integration test](cloudfoundry/bosh-azure-storage-cli@844083d)
- 2023-07-25T09:35:44Z: [update docs for azure storage account](cloudfoundry/docs-bosh@df2d802)
- 2023-10-26T16:41:26Z: [Update configs.md](cloudfoundry/docs-bosh@db6957e)
- 2023-11-07T16:51:19Z: [update documentation](cloudfoundry/docs-bosh@06ea871)
- 2023-11-08T10:13:38Z: [add global flags to bosh deploy command](cloudfoundry/bosh-cli@cfbabbb)
- 2023-11-09T12:25:46Z: [add excluding of deployments and tests](cloudfoundry/bosh-cli@55961e4)
- 2023-11-09T12:56:53Z: [update global cli flags documentation](cloudfoundry/docs-bosh@2261b7a)
- 2023-11-09T15:20:32Z: [add excluding of deployments and tests](cloudfoundry/bosh-cli@55961e4)
- 2023-11-23T09:17:38Z: [color the cli output in case deployment flags are used](cloudfoundry/bosh-cli@f80c9e4)
- 2023-11-23T09:19:37Z: [rename the search function](cloudfoundry/bosh-cli@d7ae88a)
- 2023-11-23T16:00:47Z: [fix issues](cloudfoundry/bosh-cli@958d4aa)
- 2023-11-23T16:31:41Z: [fix some more issues](cloudfoundry/bosh-cli@1a8d04d)
- 2023-11-30T13:03:33Z: [Update cli-v2.md](cloudfoundry/docs-bosh@2d43295)
- 2023-11-30T13:03:54Z: [Update deploy-config.md](cloudfoundry/docs-bosh@a378c26)
- 2023-12-12T13:51:48Z: [Include deploy config documentation](cloudfoundry/docs-bosh@4c3a536)
- 2023-12-18T14:04:27Z: [Introduce monitoring metrics for deploy config enablement](cloudfoundry/bosh@1874004)
- 2023-12-18T16:21:04Z: [Update config_manager_spec.rb](cloudfoundry/bosh@0613612)
- 2023-12-18T16:47:44Z: [Update config_manager.rb](cloudfoundry/bosh@21fd8e1)
- 2024-01-03T08:26:56Z: [Update config_manager.rb](cloudfoundry/bosh@e66fea9)
- 2024-01-11T13:30:21Z: [fix typo](cloudfoundry/docs-bosh@c5535da)
- 2024-01-11T13:39:44Z: [We used bosh scp today and faced the issue mentioned here:](cloudfoundry/docs-bosh@6d2df8f)
- 2024-05-15T06:57:25Z: [bump fog openstack gem to new version 1.1.1](cloudfoundry/bosh-openstack-cpi-release@367524f)
- 2024-05-15T07:58:04Z: [enable application credential authentication](cloudfoundry/bosh-openstack-cpi-release@d12b287)
- 2024-05-15T12:37:16Z: [Update cloud.rb](cloudfoundry/bosh-openstack-cpi-release@f4cee7a)
- 2024-05-15T12:38:54Z: [Update cloud_spec.rb](cloudfoundry/bosh-openstack-cpi-release@dbdf465)
- 2024-07-05T14:40:37Z: [bump ruby-prof version from 1.6.3 to 1.7.0](cloudfoundry/bosh@6365a5d)
- 2024-07-09T12:37:17Z: [bump ruby-prof version from 1.6.3 to 1.7.0](cloudfoundry/bosh@6365a5d)
- 2024-08-07T13:24:53Z: [update gosigar version](cloudfoundry/bosh-dns-release@c1be1b9)
- 2024-08-08T08:33:14Z: [update gosigar version](cloudfoundry/bosh-dns-release@c1be1b9)
- 2024-08-09T06:40:52Z: [fix formatting for configure blobstore for gcp section](cloudfoundry/docs-bosh@68d35c4)
- 2024-11-11T12:11:17Z: [link should point to main branch and not master branch](cloudfoundry/docs-bosh@e919e42)
- 2024-11-12T08:10:05Z: [ensure that cpi uses the correct gem path during runtime](cloudfoundry/bosh-azure-cpi-release@c49f723)
- 2024-11-13T20:44:53Z: [revert fastjsonparser](cloudfoundry/bosh-azure-cpi-release@ed1bbcc)
- 2024-11-13T20:49:18Z: [revert fastjsonparser](cloudfoundry/bosh-azure-cpi-release@ed1bbcc)
- 2024-11-19T15:14:44Z: [Merge branch 'master' into revert-fastjsonparser](cloudfoundry/bosh-azure-cpi-release@f7f412d)
- 2024-11-19T15:37:05Z: [Merge branch 'main' into optional_isolated_environment](cloudfoundry/bosh-cli@72c44e6)
- 2024-11-19T15:39:06Z: [Update compiler.go](cloudfoundry/bosh-cli@2e81740)
- 2024-11-19T19:35:33Z: [fix env_factory](cloudfoundry/bosh-cli@8867168)
- 2024-11-19T19:44:10Z: [fix compiler.go](cloudfoundry/bosh-cli@eacc737)
- 2024-11-21T10:25:08Z: [fix compiler.go](cloudfoundry/bosh-cli@eacc737)
- 2024-11-21T13:40:25Z: [Update pre-start.sh.erb for noble stemcell](cloudfoundry/os-conf-release@2aab03c)
- 2024-11-21T16:09:16Z: [add test for compiler.go](cloudfoundry/bosh-cli@8f84216)
- 2024-12-18T08:57:31Z: [remove nix definitions](cloudfoundry/bosh-cli@bdf1b53)
- 2024-12-19T09:54:34Z: [do not raise error in case more networks than network interface cards are applied](cloudfoundry/bosh-agent@3c63d2b)
- 2024-12-30T12:22:59Z: [enable dual stack support on aws](cloudfoundry/bosh-aws-cpi-release@5cbfb80)
- 2025-01-02T15:02:59Z: [enable dual stack support on aws](cloudfoundry/bosh-aws-cpi-release@5cbfb80)
- 2025-01-08T08:39:04Z: [fix unit tests](cloudfoundry/bosh-agent@4e1db30)
- 2025-01-08T12:32:44Z: [create tests](cloudfoundry/bosh-aws-cpi-release@463ff5e)
- 2025-01-29T13:02:31Z: [remove isolated env](cloudfoundry/bosh-cli@ba77137)
- 2025-01-29T13:11:25Z: [remove isolated env](cloudfoundry/bosh-cli@ba77137)
- 2025-01-30T15:47:00Z: [Delete .tool-versions](cloudfoundry/bosh-cli@799e6e5)
- 2025-01-31T09:49:58Z: [Update opts.go](cloudfoundry/bosh-cli@8bfb6a6)
- 2025-02-03T12:26:38Z: [set retry interval to avoid bosh workers sleeping forever in case of error](cloudfoundry/bosh@23d5899)
- 2025-02-06T15:54:40Z: [enable ipv6 in boshSysctl (cloudfoundry#347)](cloudfoundry/bosh-agent@a7d6a6b)
- 2025-02-07T11:40:08Z: [Update kernel_ipv6.go](cloudfoundry/bosh-agent@a79e5fc)

# Foundational Infrastructure: Ali Cloud VM deployment lifecycle (BOSH) Contributions
### PRs Commented on/Reviewed:
### Issues that may be relevant:
- 2023-08-14T11:48:57Z: [Changing Tags from uppercase to lowercase ends in ](cloudfoundry/bosh-alicloud-cpi-release#162)
- 2023-11-23T11:24:07Z: [update dependencies](cloudfoundry/bosh-alicloud-cpi-release#165)
- 2023-11-28T17:20:05Z: [Update Documentation](cloudfoundry/bosh-alicloud-cpi-release#166)
- 2024-11-28T12:56:51Z: [use bash from PATH](cloudfoundry/bosh-alicloud-cpi-release#174)
- 2024-11-28T12:57:06Z: [bosh create-env terminates with: "cannot execute: required file not found"](cloudfoundry/bosh-alicloud-cpi-release#175)
### Code contributions:
- 2023-11-23T11:23:20Z: [update dependencies](cloudfoundry/bosh-alicloud-cpi-release@7f50084)
- 2023-11-28T17:18:16Z: [Update Documentation](cloudfoundry/bosh-alicloud-cpi-release@a72e168)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

7 participants