This repository was archived by the owner on Oct 22, 2021. It is now read-only.

Wait for all drain scripts to finish #1302

Merged
manno merged 8 commits into master from drain-done-stamps-177254980 on Apr 9, 2021
Conversation

@manno
Member

manno commented Apr 6, 2021

Motivation and Context

This adds a loop that waits for all other bpm containers after the drain script has finished.

#177254980

This draft also adds the shared emptyDir to the init containers, even though they don't need it.

Fixes #1297

manno force-pushed the drain-done-stamps-177254980 branch from 83f6b33 to 4cff530 on April 7, 2021 12:10
manno added 3 commits April 7, 2021 14:13
Every container touches a file below /mnt/drain-stamps after its
drain script has finished. It then enters a loop to delay the actual
exit.
Once the required number of drain stamps (= number of bpm processes) has
been written, the drain script wrappers exit.

This runs in parallel with TerminationGracePeriod, which the user
has to set to an adequate value.
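The stamp-file mechanism described above can be sketched roughly as follows. This is a minimal, illustrative shell sketch, not the actual quarks-operator wrapper: STAMP_DIR, EXPECTED, and the simulated container names are assumptions for the demo.

```shell
#!/bin/sh
# Sketch of the drain-wrapper coordination (illustrative, not the real code).
STAMP_DIR="$(mktemp -d)"   # the real wrapper uses a shared emptyDir, e.g. /mnt/drain-stamps
EXPECTED=3                 # = number of bpm processes in the pod (assumed value)

# Each wrapper runs its job's drain script first, then touches a stamp file.
# Here we simulate three containers having finished their drain scripts:
for i in 1 2 3; do
  touch "$STAMP_DIR/container-$i"
done

# After touching its own stamp, each wrapper loops until every stamp exists,
# delaying the actual container exit until all drain scripts are done:
while [ "$(ls "$STAMP_DIR" | wc -l)" -lt "$EXPECTED" ]; do
  sleep 1
done
echo "all drain scripts finished"
```

Since the loop only polls for stamp files, it relies on TerminationGracePeriod as the upper bound: if a drain script hangs, the kubelet still kills the pod when the grace period expires.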
Still trying to figure out what is causing the failing integration test
suite, in which all individual tests pass.

Instead of updating these tests, they were removed:
* they used deprecated Ginkgo functionality
* it was unclear what was being tested
* they were no longer needed
manno force-pushed the drain-done-stamps-177254980 branch 4 times, most recently from cd9dcc5 to 47a3f87 on April 8, 2021 07:36
According to the debug logs, this test introduced the Ginkgo node
timeouts in CI.

GinkgoRecover is needed to collect failures from a goroutine.

After adding "defer GinkgoRecover()" to the drain test, it would sometimes
hang indefinitely. The "Eventually" was added to make sure the async
test exits.

Also removed the "lifecycle" test. The lifecycle part had already been
removed two years ago, and it was now a duplicate of tests in deploy_test.go.
manno force-pushed the drain-done-stamps-177254980 branch from 8eb08af to 913f569 on April 8, 2021 15:20
manno added 2 commits April 8, 2021 18:01
The test was flaky: drains-0 was already gone when we tried to check its logs.

Relying on pod stdout has proven to be flaky. The flakiness was
hidden by the goroutines, which didn't use GinkgoRecover.
manno force-pushed the drain-done-stamps-177254980 branch from d52a124 to fa80464 on April 9, 2021 08:00
manno changed the title from "Drain done stamps" to "Wait for all drain scripts to finish" on Apr 9, 2021
manno merged commit 2d6a8bd into master on Apr 9, 2021
manno deleted the drain-done-stamps-177254980 branch on April 9, 2021 08:46
@jandubois
Member

Hi @manno!

I see you have merged this PR; are you going to make a quarks release with this change, or are you waiting for it to be tested from a dev build before you commit to a new release?

I'm hoping to see some confirmation from @univ0298 that it works as expected.

@univ0298

I was waiting to see a release, but if one isn't coming, please let me know @manno. Thanks!

@manno
Member Author

manno commented Apr 13, 2021

Either way is fine for me :)
I'll create a release then.

manno added the "enhancement" (New feature or request) label on Apr 13, 2021
@manno
Member Author

manno commented Apr 13, 2021

Ok, interesting. It seems the Helm chart artifact from CI (from https://github.com/cloudfoundry-incubator/quarks-operator/actions/runs/732318599) is not public.

The only things visible are the Docker images: https://github.com/users/cfcontainerizationbot/packages/container/package/quarks-operator-dev

The release might take until tomorrow, so I'll attach the dev Helm chart here: helm chart.zip

@manno
Member Author

manno commented Apr 14, 2021

@univ0298

@manno I don't think this is working. What I'm seeing is: if I set the terminationGracePeriod high enough to allow the drains to complete, the drains do complete, but the pod keeps running, and only when the grace period is exhausted is the pod terminated. So instead of detecting that all drains are complete, we end up waiting forever (well, until the grace period limit). Is there any way I can debug this?

Here is what it looks like in the pod after all the drains have ended:

/:/var/vcap/jobs/garden# ls -latR /mnt/
/mnt/:
total 8
drwxr-xr-x 1 root root 4096 Apr 19 19:29 .
drwxr-xr-x 1 root root 4096 Apr 19 19:29 ..
drwxrwsrwt 2 root adm    40 Apr 19 19:25 drain-stamps

/mnt/drain-stamps:
total 4
drwxr-xr-x 1 root root 4096 Apr 19 19:29 ..
drwxrwsrwt 2 root adm    40 Apr 19 19:25 .

@univ0298

Discussed this with @manno yesterday. The most obvious issue is a mix-up in the current code: the stamps are written to /mnt/drain-done, but the volume is mounted at /tmp/drain-stamps.

There are other issues as well; I'm working through them with @manno.

