Skip to content

Adjust durations for PodLifecycleSleepAction e2e tests.#128642

Merged
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
AxeZhan:test3960
Jun 12, 2025
Merged

Adjust durations for PodLifecycleSleepAction e2e tests.#128642
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
AxeZhan:test3960

Conversation

@AxeZhan

@AxeZhan AxeZhan commented Nov 7, 2024

Copy link
Copy Markdown
Member

What type of PR is this?

/kind feature

What this PR does / why we need it:

This pr adjusts the duration of the e2e tests of PodLifecycleSleepAction, to avoid flakes.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test labels Nov 7, 2024
@AxeZhan

AxeZhan commented Nov 7, 2024

Copy link
Copy Markdown
Member Author

/hold

Needs some rounds of tests.

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 7, 2024
@AxeZhan AxeZhan force-pushed the test3960 branch 2 times, most recently from 8580cb7 to b8fec3e Compare November 7, 2024 06:31
@AxeZhan

AxeZhan commented Nov 7, 2024

Copy link
Copy Markdown
Member Author

/test all

2 similar comments
@AxeZhan

AxeZhan commented Nov 7, 2024

Copy link
Copy Markdown
Member Author

/test all

@AxeZhan

AxeZhan commented Nov 7, 2024

Copy link
Copy Markdown
Member Author

/test all

Comment thread test/e2e/common/node/lifecycle_hook.go Outdated
podClient.CreateSync(ctx, podWithHook)
ginkgo.By("delete the pod with lifecycle hook using sleep action")
start := time.Now()
podClient.DeleteSync(ctx, podWithHook.Name, metav1.DeleteOptions{}, e2epod.DefaultPodDeletionTimeout)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test relies on sending a Delete request and wait for the pod not present in the namespace, this sounds very brittle to me, and the solution to consider very large windows of absorb any possible CI latency makes that we'll not be completely sure that the feature is working as expected.

@thockin don't we have any other method to validate this feature than relying on the overall time for the pod to disappear?

@HirazawaUi HirazawaUi Nov 7, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main logic of the sleep hook is in the container lifecycle management code within kubelet, where it waits for a certain period when a container terminates. This means we may not have a way to monitor this process other than by observing the pod status.

However, instead of waiting for the pod to disappear from the namespace after deletion, there might be an alternative approach. This approach doesn't require the full lifecycle of creating and deleting the pod, which could yield higher accuracy than the current method.

We create a pod with a restartPolicy of Always and add a livenessProbe that will fail quickly. When the livenessProbe fails, kubelet will send a TERM signal to the container and start executing the sleep hook.

After the sleep hook executes, kubelet will increment the container's restart count by one. We only need to check the time interval between when the restart count reaches 1 and when the container enters the Running state.

An additional benefit of this method is that we don't need to verify if the time we capture falls between the sleep hook execution and the TerminationGracePeriodSeconds(This greatly reduces the probability of test failure). We don’t need to worry about the value of TerminationGracePeriodSeconds; it can be an arbitrarily large value.

A possible pod is:

apiVersion: v1
kind: Pod
metadata:
  name: auto-restart-pod
spec:
  terminationGracePeriodSeconds: 90
  containers:
  - name: auto-restart-container
    image: k8s.gcr.io/pause:3.2
    livenessProbe:
      tcpSocket:
        port: 80
      periodSeconds: 1
    lifecycle:
      preStop:
        sleep:
          seconds: 10
  restartPolicy: Always

/cc @SergeyKanzhelev @kannon92 WDYT?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reuse the logic in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/container_probe.go?

I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.

Yes, we can redesign this test based on this code.

I think your sleep timelines are too short for a resource heavy CI and I think this will be brittle with 10s.

10s is the time I set for local testing. We can make it longer in CI.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a better answer, sadly. This feature is fundamentally about sleeping. That said, the implementation of this is pretty trivial, so I think it's low risk of false-pass.

We can set the sleep pretty short (5-10s), the grace period very long (2-5m) and assert that the observed time was between sleep (O(seconds) and grace (O(minutes)). It doesn't need to be super tight, we just don't want to waste time.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently my implementation is same as @thockin suggest.
@HirazawaUi has given a great idea here, but I'm not sure if it's worth to implement this way here. We still need to check the duration in this way and can also show flaky?

Notice that although now we make a super long grace period, these tests generally don't last that long.
In the first test, we verify a normal sleepAction (15s), so deletion starts immediately after 15 seconds.
In the second test, we change the grace period to 15s, so deletion also starts immediately after 15 seconds.
In the third test, we test an exceptional case where sleepAction is ignored, and deletion proceeds immediately.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding a finalizer and checking pod conditions can help us here ?
the finalizer will block the deletion of the pod object IIRC, is there any condition in the pod status that can serve us to calculate the sleep time?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any condition in the pod status that can serve us to calculate the sleep time?

I think only the .metadata.deletionTimestamp can help here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kannon92

Copy link
Copy Markdown
Contributor

/priority important-soon
/triage accepted

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 20, 2024
@kannon92

Copy link
Copy Markdown
Contributor

Please aim to address this in 1.33. We would like to promote this feature once these tests are stable.

@AxeZhan

AxeZhan commented Nov 21, 2024

Copy link
Copy Markdown
Member Author

/retest

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2025
@AxeZhan

AxeZhan commented Jun 11, 2025

Copy link
Copy Markdown
Member Author

/retest

@AxeZhan

AxeZhan commented Jun 11, 2025

Copy link
Copy Markdown
Member Author

@aojea
I've updated the first two cases to use finalizer:

  1. normal case
  2. reduce tgps to less than sleep seconds when delete

We have another case left, and I don't think we can use finalizer on it:
3. Container is already terminated, and then delete the pod
The reason is in this case, deletionTimestamp will equal to delete_timestamp+tgps, but finishedAt will equal to start_timestamp instead of delete_timestamp+sleep_seconds

@aojea

aojea commented Jun 11, 2025

Copy link
Copy Markdown
Member

The reason is in this case, deletionTimestamp will equal to delete_timestamp+tgps, but finishedAt will equal to start_timestamp instead of delete_timestamp+sleep_seconds

why do we need the deletion timestamp for this case if the trigger of the hook is the time where the container exits, no?
I think that time is already in one condition, but we can also make the container fail and output the time and get that from the logs ... we do not need to delete the pod nor use a finalizer since the times we are interested are happening before deletion AFAIK

Comment thread test/e2e/common/node/lifecycle_hook.go Outdated
// TODO: reduce sleep_seconds and tgps after issues.k8s.io/132205 is solved
// we get deletionTimestamp before container become terminated here because of issues.k8s.io/132205
deletionTS := p.DeletionTimestamp.Time
if err := e2epod.WaitForContainerTerminated(ctx, f.ClientSet, p.Namespace, p.Name, name, time.Second*40); err != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this 40?

@AxeZhan AxeZhan Jun 11, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be little longer than sleep_seconds?
Changed to sleep_seconds*2.

Comment thread test/e2e/common/node/lifecycle_hook.go Outdated
Comment thread test/e2e/common/node/lifecycle_hook.go Outdated
Comment thread test/e2e/common/node/lifecycle_hook.go Outdated
@AxeZhan AxeZhan force-pushed the test3960 branch 4 times, most recently from d93dde5 to 265d035 Compare June 11, 2025 12:35
@aojea

aojea commented Jun 11, 2025

Copy link
Copy Markdown
Member

/approve
/hold cancel
/assign @thockin @SergeyKanzhelev

final LGTM for Tim or Sergey for reviewing the subtlies of the sleep actions, they have a much deeper knowledge of this lifecycle hooks than me

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 11, 2025
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2025

@SergeyKanzhelev SergeyKanzhelev left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 11, 2025
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 92add94831673f4f5fc7282843c51e626c53c790

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, AxeZhan, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 926e176 into kubernetes:master Jun 12, 2025
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 12, 2025
@github-project-automation github-project-automation Bot moved this from PRs Waiting on Author to Done in SIG Node CI/Test Board Jun 12, 2025
@github-project-automation github-project-automation Bot moved this from Work in progress to Done in SIG Node: code and documentation PRs Jun 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

9 participants