This repository was archived by the owner on Oct 22, 2021. It is now read-only.

Wait for all drain scripts to finish #1302

Merged
manno merged 8 commits into master from drain-done-stamps-177254980 on Apr 9, 2021
Conversation

@manno
Member

manno commented Apr 6, 2021

Motivation and Context

This adds a loop that waits for all other bpm containers after the drain script has finished.

#177254980

This draft also adds the shared emptyDir to the init containers, even though they don't need it.

Fixes #1297

manno force-pushed the drain-done-stamps-177254980 branch from 83f6b33 to 4cff530 on April 7, 2021 12:10
manno added 3 commits April 7, 2021 14:13
Every container touches a file below /mnt/drain-stamps after its
drain script has finished. It then enters a loop to delay the actual
exit.
Once the required number of drain stamps (= number of bpm processes) has
been written, the drain script wrappers exit.

This runs in parallel with TerminationGracePeriod, which the user
has to set to an adequate value.
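The stamp-file mechanism described above can be sketched roughly as follows. This is a minimal, illustrative shell sketch, not the actual quarks-operator wrapper: STAMP_DIR, EXPECTED, and the simulated container names are assumptions for the demo.

```shell
#!/bin/sh
# Sketch of the drain-wrapper coordination (illustrative, not the real code).
STAMP_DIR="$(mktemp -d)"   # the real wrapper uses a shared emptyDir, e.g. /mnt/drain-stamps
EXPECTED=3                 # = number of bpm processes in the pod (assumed value)

# Each wrapper runs its job's drain script first, then touches a stamp file.
# Here we simulate three containers having finished their drain scripts:
for i in 1 2 3; do
  touch "$STAMP_DIR/container-$i"
done

# After touching its own stamp, each wrapper loops until every stamp exists,
# delaying the actual container exit until all drain scripts are done:
while [ "$(ls "$STAMP_DIR" | wc -l)" -lt "$EXPECTED" ]; do
  sleep 1
done
echo "all drain scripts finished"
```

Since the loop only polls for stamp files, it relies on TerminationGracePeriod as the upper bound: if a drain script hangs, the kubelet still kills the pod when the grace period expires.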
Still trying to figure out what is causing the failing integration test
suite, in which all individual tests pass.

Instead of updating these tests, they were removed:
* they used deprecated Ginkgo functionality
* it was unclear what was being tested
* they were no longer needed
manno force-pushed the drain-done-stamps-177254980 branch 4 times, most recently from cd9dcc5 to 47a3f87 on April 8, 2021 07:36
According to the debug logs, this test introduced the Ginkgo node
timeouts in CI.

GinkgoRecover is needed to collect failures from a goroutine.

After adding "defer GinkgoRecover()" to the drain test, it would sometimes
hang indefinitely. The "Eventually" was added to make sure the async
test exits.

Also removed the "lifecycle" test. The lifecycle part had already been
removed two years ago, and it was now a duplicate of tests in deploy_test.go.
manno force-pushed the drain-done-stamps-177254980 branch from 8eb08af to 913f569 on April 8, 2021 15:20
manno added 2 commits April 8, 2021 18:01
The test was flaky: drains-0 was already gone when we tried to check its logs.

Relying on pod stdout has proven to be flaky. The flakiness was
hidden by the goroutines, which didn't use GinkgoRecover.
manno force-pushed the drain-done-stamps-177254980 branch from d52a124 to fa80464 on April 9, 2021 08:00
manno changed the title from "Drain done stamps" to "Wait for all drain scripts to finish" on Apr 9, 2021
manno merged commit 2d6a8bd into master on Apr 9, 2021
manno deleted the drain-done-stamps-177254980 branch on April 9, 2021 08:46
@jandubois
Member

Hi @manno!

I see you have merged this PR; are you going to make a quarks release with this change, or are you waiting for it to be tested from a dev build before you commit to a new release?

I'm hoping to see some confirmation from @univ0298 that it works as expected.

@univ0298

I was waiting to see a release, but if one isn't coming, please let me know @manno. Thanks!

@manno
Member Author

manno commented Apr 13, 2021

Either way is fine for me :)
I'll create a release then.

manno added the "enhancement" (New feature or request) label on Apr 13, 2021
@manno
Member Author

manno commented Apr 13, 2021

Ok, interesting. It seems the Helm chart artifact from CI (from https://github.com/cloudfoundry-incubator/quarks-operator/actions/runs/732318599) is not public.

The only things visible are the Docker images: https://github.com/users/cfcontainerizationbot/packages/container/package/quarks-operator-dev

The release might take until tomorrow, so I'll attach the dev Helm chart here: helm chart.zip

@manno
Member Author

manno commented Apr 14, 2021

@univ0298

@manno I don't think this is working. What I'm seeing is: if I set the terminationGracePeriod high enough to allow the drains to complete, the drains do complete, but the pod keeps running, and only when the grace period is exhausted is the pod terminated. So instead of detecting that all drains are complete, we end up waiting forever (well, until the grace period limit). Is there any way I can debug this?

Here is what it looks like in the pod after all the drains have ended:

/:/var/vcap/jobs/garden# ls -latR /mnt/
/mnt/:
total 8
drwxr-xr-x 1 root root 4096 Apr 19 19:29 .
drwxr-xr-x 1 root root 4096 Apr 19 19:29 ..
drwxrwsrwt 2 root adm    40 Apr 19 19:25 drain-stamps

/mnt/drain-stamps:
total 4
drwxr-xr-x 1 root root 4096 Apr 19 19:29 ..
drwxrwsrwt 2 root adm    40 Apr 19 19:25 .

@univ0298

Discussed this with @manno yesterday. The most obvious issue is a mix-up in the current code: the stamps are written to /mnt/drain-done, but the volume is mounted at /tmp/drain-stamps.

There are other issues as well; I'm working through them with @manno.

