Adjust durations for PodLifecycleSleepAction e2e tests.#128642
Conversation
|
/hold Needs some rounds of tests. |
8580cb7 to
b8fec3e
Compare
|
/test all |
2 similar comments
|
/test all |
|
/test all |
| podClient.CreateSync(ctx, podWithHook) | ||
| ginkgo.By("delete the pod with lifecycle hook using sleep action") | ||
| start := time.Now() | ||
| podClient.DeleteSync(ctx, podWithHook.Name, metav1.DeleteOptions{}, e2epod.DefaultPodDeletionTimeout) |
There was a problem hiding this comment.
This test relies on sending a Delete request and wait for the pod not present in the namespace, this sounds very brittle to me, and the solution to consider very large windows of absorb any possible CI latency makes that we'll not be completely sure that the feature is working as expected.
@thockin don't we have any other method to validate this feature than relying on the overall time for the pod to disappear?
There was a problem hiding this comment.
The main logic of the sleep hook is in the container lifecycle management code within kubelet, where it waits for a certain period when a container terminates. This means we may not have a way to monitor this process other than by observing the pod status.
However, instead of waiting for the pod to disappear from the namespace after deletion, there might be an alternative approach. This approach doesn't require the full lifecycle of creating and deleting the pod, which could yield higher accuracy than the current method.
We create a pod with a restartPolicy of Always and add a livenessProbe that will fail quickly. When the livenessProbe fails, kubelet will send a TERM signal to the container and start executing the sleep hook.
After the sleep hook executes, kubelet will increment the container's restart count by one. We only need to check the time interval between when the restart count reaches 1 and when the container enters the Running state.
An additional benefit of this method is that we don't need to verify if the time we capture falls between the sleep hook execution and the TerminationGracePeriodSeconds(This greatly reduces the probability of test failure). We don’t need to worry about the value of TerminationGracePeriodSeconds; it can be an arbitrarily large value.
A possible pod is:
apiVersion: v1
kind: Pod
metadata:
name: auto-restart-pod
spec:
terminationGracePeriodSeconds: 90
containers:
- name: auto-restart-container
image: k8s.gcr.io/pause:3.2
livenessProbe:
tcpSocket:
port: 80
periodSeconds: 1
lifecycle:
preStop:
sleep:
seconds: 10
restartPolicy: Always
/cc @SergeyKanzhelev @kannon92 WDYT?
There was a problem hiding this comment.
Can we reuse the logic in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/container_probe.go?
I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.
There was a problem hiding this comment.
I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.
Yes, we can redesign this test based on this code.
I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.
10s is the time I set for local testing. We can make it longer in CI.
There was a problem hiding this comment.
I don't have a better answer, sadly. This feature is fundamentally about sleeping. That said, the implementation of this is pretty trivial, so I think it's low risk of false-pass.
We can set the sleep pretty short (5-10s), the grace period very long (2-5m) and assert that the observed time was between sleep (O(seconds) and grace (O(minutes)). It doesn't need to be super tight, we just don't want to waste time.
There was a problem hiding this comment.
Currently my implementation is same as @thockin suggest.
@HirazawaUi has given a great idea here, but I'm not sure if it's worth to implement this way here. We still need to check the duration in this way and can also show flaky?
Notice that although now we make a super long grace period, these tests generally don't last that long.
In the first test, we verify a normal sleepAction (15s), so deletion starts immediately after 15 seconds.
In the second test, we change the grace period to 15s, so deletion also starts immediately after 15 seconds.
In the third test, we test an exceptional case where sleepAction is ignored, and deletion proceeds immediately.
There was a problem hiding this comment.
adding a finalizer and checking pod conditions can help us here ?
the finalizer will block the deletion of the pod object IIRC, is there any condition in the pod status that can serve us to calculate the sleep time?
There was a problem hiding this comment.
is there any condition in the pod status that can serve us to calculate the sleep time?
I think only the .metadata.deletionTimestamp can help here?
|
/priority important-soon |
|
Please aim to address this in 1.33. We would like to promote this feature once these tests are stable. |
|
/retest |
|
/retest |
|
@aojea
We have another case left, and I don't think we can use finalizer on it: |
why do we need the deletion timestamp for this case if the trigger of the hook is the time where the container exits, no? |
| // TODO: reduce sleep_seconds and tgps after issues.k8s.io/132205 is solved | ||
| // we get deletionTimestamp before container become terminated here because of issues.k8s.io/132205 | ||
| deletionTS := p.DeletionTimestamp.Time | ||
| if err := e2epod.WaitForContainerTerminated(ctx, f.ClientSet, p.Namespace, p.Name, name, time.Second*40); err != nil { |
There was a problem hiding this comment.
This should be little longer than sleep_seconds?
Changed to sleep_seconds*2.
d93dde5 to
265d035
Compare
|
/approve final LGTM for Tim or Sergey for reviewing the subtlies of the sleep actions, they have a much deeper knowledge of this lifecycle hooks than me |
|
LGTM label has been added. DetailsGit tree hash: 92add94831673f4f5fc7282843c51e626c53c790 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aojea, AxeZhan, SergeyKanzhelev The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This pr adjusts the duration of the e2e tests of PodLifecycleSleepAction, to avoid flakes.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: