Conversation
The number of methods was confusing.
Force-pushed from 83f6b33 to 4cff530
Every container will touch a file below /mnt/drain-stamps after its drain script has finished. It will then enter a loop to delay the actual exit. Once the required number of drain stamps (= number of bpm processes) has been written, the drain script wrappers exit. This runs in parallel with the TerminationGracePeriod, which the user has to set to an adequate value.
Still trying to figure out what is causing the failing integration test suite, even though all tests pass. Instead of updating these tests, they were removed:
* they used deprecated Ginkgo functionality
* it was unclear what they were testing
* they were no longer needed
Force-pushed from cd9dcc5 to 47a3f87
According to the debug logs, this test introduced the Ginkgo node timeouts in CI. GinkgoRecover is needed to collect failures from a goroutine. After adding "defer GinkgoRecover()" to the drain test, it would sometimes hang indefinitely. The "Eventually" was added to make sure the async test exits. Also removed the "lifecycle" test: the lifecycle part had already been removed two years ago, and it was now a duplicate of the tests in deploy_test.go.
Force-pushed from 8eb08af to 913f569
Test was flaky: drains-0 was already gone when trying to check its logs. Relying on pod stdout has repeatedly proven to be flaky. The flakiness was hidden by the goroutines, which didn't use GinkgoRecover.
Force-pushed from d52a124 to fa80464
I was waiting to see a release, but if that's not coming please let me know @manno, thanks!

Either way is fine for me :)

Ok, interesting. Seems like the helm chart artifact from CI (from https://github.com/cloudfoundry-incubator/quarks-operator/actions/runs/732318599) is not public. The only thing visible is the docker images: https://github.com/users/cfcontainerizationbot/packages/container/package/quarks-operator-dev The release might take till tomorrow, I'll attach the dev helm chart here: helm chart.zip

@manno I don't think this is working. What I'm seeing is that if I set the terminationGracePeriod high enough to allow the drains to complete, the drains complete but the pod keeps running, and only when the grace period is exhausted will it terminate the pod. So it seems we have ended up waiting forever (well, until the grace period limit) instead of detecting that all drains are complete. Is there any way I can try to debug things? Here is what it looks like in the pod after all the drains have ended:
Motivation and Context
This adds a loop to wait for all other bpm containers, after the drain script has finished.
#177254980
This draft also adds the shared emptyDir to the init containers, even though they don't need it.
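The shared emptyDir wiring might look roughly like this in the pod spec; all container and volume names below are illustrative assumptions, not taken from the generated manifest.

```yaml
# Hypothetical sketch of the shared emptyDir wiring; names are illustrative.
spec:
  volumes:
    - name: drain-stamps
      emptyDir: {}
  initContainers:
    - name: some-init
      volumeMounts:
        - name: drain-stamps          # mounted even though init containers don't use it
          mountPath: /mnt/drain-stamps
  containers:
    - name: some-bpm-process
      volumeMounts:
        - name: drain-stamps          # drain wrappers write their stamps here
          mountPath: /mnt/drain-stamps
```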
Fixes #1297