Adjust durations for PodLifecycleSleepAction e2e tests. by AxeZhan · Pull Request #128642 · kubernetes/kubernetes

AxeZhan · 2024-11-07T05:11:28Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This pr adjusts the duration of the e2e tests of PodLifecycleSleepAction, to avoid flakes.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

AxeZhan · 2024-11-07T05:11:43Z

/hold

Needs some rounds of tests.

AxeZhan · 2024-11-07T07:43:55Z

/test all

AxeZhan · 2024-11-07T09:18:39Z

/test all

AxeZhan · 2024-11-07T10:32:27Z

/test all

aojea · 2024-11-07T12:54:18Z

 			podClient.CreateSync(ctx, podWithHook)
 			ginkgo.By("delete the pod with lifecycle hook using sleep action")
 			start := time.Now()
 			podClient.DeleteSync(ctx, podWithHook.Name, metav1.DeleteOptions{}, e2epod.DefaultPodDeletionTimeout)


This test relies on sending a Delete request and wait for the pod not present in the namespace, this sounds very brittle to me, and the solution to consider very large windows of absorb any possible CI latency makes that we'll not be completely sure that the feature is working as expected.

@thockin don't we have any other method to validate this feature than relying on the overall time for the pod to disappear?

The main logic of the sleep hook is in the container lifecycle management code within kubelet, where it waits for a certain period when a container terminates. This means we may not have a way to monitor this process other than by observing the pod status.

However, instead of waiting for the pod to disappear from the namespace after deletion, there might be an alternative approach. This approach doesn't require the full lifecycle of creating and deleting the pod, which could yield higher accuracy than the current method.

We create a pod with a restartPolicy of Always and add a livenessProbe that will fail quickly. When the livenessProbe fails, kubelet will send a TERM signal to the container and start executing the sleep hook.

After the sleep hook executes, kubelet will increment the container's restart count by one. We only need to check the time interval between when the restart count reaches 1 and when the container enters the Running state.

An additional benefit of this method is that we don't need to verify if the time we capture falls between the sleep hook execution and the TerminationGracePeriodSeconds(This greatly reduces the probability of test failure). We don’t need to worry about the value of TerminationGracePeriodSeconds; it can be an arbitrarily large value.

A possible pod is:

apiVersion: v1 kind: Pod metadata: name: auto-restart-pod spec: terminationGracePeriodSeconds: 90 containers: - name: auto-restart-container image: k8s.gcr.io/pause:3.2 livenessProbe: tcpSocket: port: 80 periodSeconds: 1 lifecycle: preStop: sleep: seconds: 10 restartPolicy: Always

/cc @SergeyKanzhelev @kannon92 WDYT?

Can we reuse the logic in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/container_probe.go?

I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.

I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.

Yes, we can redesign this test based on this code.

I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.

10s is the time I set for local testing. We can make it longer in CI.

I don't have a better answer, sadly. This feature is fundamentally about sleeping. That said, the implementation of this is pretty trivial, so I think it's low risk of false-pass.

We can set the sleep pretty short (5-10s), the grace period very long (2-5m) and assert that the observed time was between sleep (O(seconds) and grace (O(minutes)). It doesn't need to be super tight, we just don't want to waste time.

Currently my implementation is same as @thockin suggest.
@HirazawaUi has given a great idea here, but I'm not sure if it's worth to implement this way here. We still need to check the duration in this way and can also show flaky?

Notice that although now we make a super long grace period, these tests generally don't last that long.
In the first test, we verify a normal sleepAction (15s), so deletion starts immediately after 15 seconds.
In the second test, we change the grace period to 15s, so deletion also starts immediately after 15 seconds.
In the third test, we test an exceptional case where sleepAction is ignored, and deletion proceeds immediately.

adding a finalizer and checking pod conditions can help us here ?
the finalizer will block the deletion of the pod object IIRC, is there any condition in the pod status that can serve us to calculate the sleep time?

is there any condition in the pod status that can serve us to calculate the sleep time?

I think only the .metadata.deletionTimestamp can help here?

commented on kubernetes/enhancements#3960 (comment)

kannon92 · 2024-11-20T18:57:02Z

/priority important-soon
/triage accepted

kannon92 · 2024-11-20T18:57:21Z

Please aim to address this in 1.33. We would like to promote this feature once these tests are stable.

AxeZhan · 2024-11-21T02:57:43Z

/retest

AxeZhan · 2025-06-11T04:52:07Z

/retest

AxeZhan · 2025-06-11T06:33:01Z

@aojea
I've updated the first two cases to use finalizer:

normal case
reduce tgps to less than sleep seconds when delete

We have another case left, and I don't think we can use finalizer on it:
3. Container is already terminated, and then delete the pod
The reason is in this case, deletionTimestamp will equal to delete_timestamp+tgps, but finishedAt will equal to start_timestamp instead of delete_timestamp+sleep_seconds

aojea · 2025-06-11T10:15:41Z

The reason is in this case, deletionTimestamp will equal to delete_timestamp+tgps, but finishedAt will equal to start_timestamp instead of delete_timestamp+sleep_seconds

why do we need the deletion timestamp for this case if the trigger of the hook is the time where the container exits, no?
I think that time is already in one condition, but we can also make the container fail and output the time and get that from the logs ... we do not need to delete the pod nor use a finalizer since the times we are interested are happening before deletion AFAIK

aojea · 2025-06-11T10:18:30Z

+			// TODO: reduce sleep_seconds and tgps after issues.k8s.io/132205 is solved
+			// we get deletionTimestamp before container become terminated here because of issues.k8s.io/132205
+			deletionTS := p.DeletionTimestamp.Time
+			if err := e2epod.WaitForContainerTerminated(ctx, f.ClientSet, p.Namespace, p.Name, name, time.Second*40); err != nil {


why is this 40?

This should be little longer than sleep_seconds?
Changed to sleep_seconds*2.

aojea · 2025-06-11T15:26:58Z

/approve
/hold cancel
/assign @thockin @SergeyKanzhelev

final LGTM for Tim or Sergey for reviewing the subtlies of the sleep actions, they have a much deeper knowledge of this lifecycle hooks than me

SergeyKanzhelev

/lgtm

k8s-ci-robot · 2025-06-11T23:39:55Z

LGTM label has been added.

Details

Git tree hash: 92add94831673f4f5fc7282843c51e626c53c790

k8s-ci-robot · 2025-06-11T23:40:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, AxeZhan, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~test/e2e/common/OWNERS~~ [SergeyKanzhelev,aojea]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from haircommander and kannon92 November 7, 2024 05:11

AxeZhan mentioned this pull request Nov 7, 2024

Graduate PodLifecycleSleepAction to GA #128046

Merged

AxeZhan force-pushed the test3960 branch 2 times, most recently from 8580cb7 to b8fec3e Compare November 7, 2024 06:31

aojea reviewed Nov 7, 2024

View reviewed changes

AxeZhan force-pushed the test3960 branch from b8fec3e to a6182e1 Compare November 9, 2024 06:04

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2025

aojea reviewed Jun 11, 2025

View reviewed changes

Comment thread test/e2e/common/node/lifecycle_hook.go Outdated

aojea reviewed Jun 11, 2025

View reviewed changes

Comment thread test/e2e/common/node/lifecycle_hook.go Outdated

aojea reviewed Jun 11, 2025

View reviewed changes

Comment thread test/e2e/common/node/lifecycle_hook.go Outdated

AxeZhan force-pushed the test3960 branch 4 times, most recently from d93dde5 to 265d035 Compare June 11, 2025 12:35

using finalizer

69eddde

AxeZhan force-pushed the test3960 branch from 265d035 to 69eddde Compare June 11, 2025 13:49

k8s-ci-robot assigned SergeyKanzhelev Jun 11, 2025

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2025

k8s-ci-robot assigned thockin Jun 11, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2025

SergeyKanzhelev approved these changes Jun 11, 2025

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 11, 2025

k8s-ci-robot merged commit 926e176 into kubernetes:master Jun 12, 2025
14 checks passed

k8s-ci-robot added this to the v1.34 milestone Jun 12, 2025

github-project-automation Bot moved this from PRs Waiting on Author to Done in SIG Node CI/Test Board Jun 12, 2025

github-project-automation Bot moved this from Work in progress to Done in SIG Node: code and documentation PRs Jun 12, 2025

AxeZhan mentioned this pull request Jun 12, 2025

Flaky test fix for KEP3960: Lifecycle Sleep Hook #132250

Merged

ngopalak-redhat mentioned this pull request Jun 27, 2025

OCPBUGS-34594: Include a working test case for PodLifecycleSleepAction openshift/kubernetes#2312

Closed

haircommander mentioned this pull request Jul 16, 2025

OCPBUGS-57666: Adjust durations for PodLifecycleSleepAction e2e tests openshift/kubernetes#2363

Merged

Conversation

AxeZhan commented Nov 7, 2024

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

AxeZhan commented Nov 7, 2024

Uh oh!

AxeZhan commented Nov 7, 2024

Uh oh!

AxeZhan commented Nov 7, 2024

Uh oh!

AxeZhan commented Nov 7, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HirazawaUi Nov 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kannon92 commented Nov 20, 2024

Uh oh!

kannon92 commented Nov 20, 2024

Uh oh!

AxeZhan commented Nov 21, 2024

Uh oh!

AxeZhan commented Jun 11, 2025

Uh oh!

AxeZhan commented Jun 11, 2025

Uh oh!

aojea commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AxeZhan Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aojea commented Jun 11, 2025

Uh oh!

SergeyKanzhelev left a comment

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented Jun 11, 2025

Uh oh!

k8s-ci-robot commented Jun 11, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

HirazawaUi Nov 7, 2024 •

edited

Loading

aojea commented Jun 11, 2025 •

edited

Loading

AxeZhan Jun 11, 2025 •

edited

Loading