replace test image to busybox #476

Merged
kangclzjc merged 4 commits into ai-dynamo:main from kangclzjc:steady-e2e-node on Mar 10, 2026
kangclzjc (Contributor) commented Mar 7, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

In some e2e tests, containers can be OOM-killed. After tracking this down, we found that some of our tests use nginx:alpine-slim with no start command in the workload YAML. By default, this image creates a number of nginx workers based on the CPU core count of the node. On my L20 machine, it creates nearly 130 nginx workers for each test pod. The pods keep crashing and the node repeatedly becomes NotReady. I see the same failure in our CI/CD e2e tests.
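
To illustrate the workload change, the fragment below sketches a test container that runs busybox with an explicit command instead of relying on an image's default entrypoint. It is a minimal sketch, not copied from the PR's manifests: the container name is hypothetical, and `sleep infinity` is the command mentioned later in this conversation. For context, the stock nginx image configuration typically sets `worker_processes auto;`, which would explain why the worker count follows the node's core count.

```yaml
# Illustrative container fragment; the name is an assumption,
# not taken from the PR's test manifests.
containers:
  - name: test-container          # hypothetical name
    image: busybox                # per the PR title, replaces the nginx image
    # With an explicit command, the container runs a single idle process,
    # so its footprint no longer scales with the node's CPU core count.
    command: ["sleep", "infinity"]
```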

Which issue(s) this PR fixes:

Fixes #475

Special notes for your reviewer:

Does this PR introduce an API change?


Additional documentation e.g., enhancement proposals, usage docs, etc.:


Signed-off-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>
@kangclzjc kangclzjc added run-e2e and removed run-e2e labels Mar 7, 2026
shayasoolin previously approved these changes Mar 8, 2026
@kangclzjc kangclzjc added run-e2e and removed run-e2e labels Mar 8, 2026
Ronkahn21 previously approved these changes Mar 8, 2026
renormalize previously approved these changes Mar 9, 2026
@kangclzjc kangclzjc dismissed stale reviews from renormalize, Ronkahn21, and shayasoolin via 526f671 March 9, 2026 12:04
Signed-off-by: kangclzjc <kangz@nvidia.com>
Signed-off-by: kangclzjc <kangz@nvidia.com>
kangclzjc (Contributor, Author) commented Mar 9, 2026

nginx:alpine-slim exits gracefully. busybox running sleep infinity should in theory exit as well, but when it runs as PID 1 or with different signal handling, termination can be slow or unreliable. We add terminationGracePeriodSeconds so that the e2e tests do not rely on the image's shutdown behavior.
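
The following pod spec is a hedged sketch of that setting. The pod name, container name, and the 5-second value are assumptions for illustration, not the values used in the PR. The background is that a process running as PID 1 in a container ignores SIGTERM unless it installs a handler, so the kubelet may wait out the full grace period (30 seconds by default) before sending SIGKILL. A short `terminationGracePeriodSeconds` bounds that wait regardless of how the image handles signals.

```yaml
# Illustrative pod spec; names and the grace period value are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: e2e-test-pod              # hypothetical name
spec:
  # Upper bound on the time between SIGTERM and SIGKILL at pod deletion.
  # Kubernetes defaults to 30 seconds when this field is unset.
  terminationGracePeriodSeconds: 5   # assumed value
  containers:
    - name: test-container        # hypothetical name
      image: busybox
      command: ["sleep", "infinity"]
```

A small value keeps test teardown fast, but it would be too aggressive for a workload that needs time to flush state on shutdown.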

@kangclzjc kangclzjc merged commit bf34fe9 into ai-dynamo:main Mar 10, 2026
11 checks passed
Ronkahn21 pushed a commit to Ronkahn21/grove that referenced this pull request Mar 10, 2026
* replace test image to busybox

Signed-off-by: kangclzjc <kangz@nvidia.com>

* add command parameters for creating pclq

Signed-off-by: kangclzjc <kangz@nvidia.com>

* add terminationGracePeriodSeconds

Signed-off-by: kangclzjc <kangz@nvidia.com>

* add back PatchSIGTERM

Signed-off-by: kangclzjc <kangz@nvidia.com>

---------

Signed-off-by: kangclzjc <kangz@nvidia.com>
enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Development

Successfully merging this pull request may close these issues.

E2E test random fail because container OOM
