fix(entrypoint): skip drain/uncordon on agent nodes #1648

Merged
iwilltry42 merged 1 commit into k3d-io:main from dpritchett:fix/agent-entrypoint-no-drain
Mar 24, 2026

Conversation

@dpritchett
Contributor

What

Skip kubectl uncordon and kubectl drain on agent nodes in k3d-entrypoint.sh. Only server nodes run drain/uncordon; agents get clean SIGTERM forwarding only.

Also captures $1 into a K3S_ROLE variable at script scope, since $1 inside a shell function refers to the function's arguments, not the script's. Without this, set -o nounset would crash when the trap fires on shutdown.
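
To illustrate the scoping issue, here is a minimal sketch (not the actual k3d-entrypoint.sh). The `set --` line simulates the container arguments, and `--example-flag` is a made-up placeholder. Inside `cleanup`, `$#` is 0 when the function is called without arguments, which is how a trap invokes it, so a bare `$1` there would be unset under `set -o nounset`.

```shell
#!/bin/sh
set -o nounset

# Simulate the container command line (assumption: "--example-flag" is a
# placeholder, not a real k3s flag).
set -- server --example-flag

# Script scope: $1 is the script's first argument, so capture it here.
K3S_ROLE="$1"

cleanup() {
  # Function scope: $1 would be cleanup's own first argument. A trap calls
  # cleanup with no arguments, so $# is 0 and a bare $1 would be unset.
  echo "args=$# role=$K3S_ROLE"
}

cleanup   # prints: args=0 role=server
```

Reading `K3S_ROLE` instead of `$1` inside the function keeps the handler independent of how it is invoked.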

Why

PR #1119 added graceful drain/uncordon to the entrypoint, but it runs unconditionally on all node types. Agent nodes don't have a kubeconfig at the default path (/etc/rancher/k3s/k3s.yaml), so kubectl falls back to localhost:8080 and retries forever, spamming agent logs from the moment the node starts.

Fixes #1420, #1535
May also help with #1526, #1452 (multi-server restart hangs)

Implications

This changes behavior for agent nodes only. Server nodes are unaffected and still drain on shutdown and uncordon on start, exactly as before.

The change is in pkg/types/fixes/assets/k3d-entrypoint.sh (embedded shell script). No Go code changes. No CLI changes.

We match $1 = "server" explicitly rather than excluding "agent", so any unexpected value (e.g. someone running the container image directly with arbitrary args) falls through to the safe default of SIGTERM forwarding only.
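
A rough sketch of the allow-list idea follows. `shutdown_action` is a hypothetical helper written for illustration and does not exist in the entrypoint script. Only the exact value `server` selects the drain path, and everything else, including an empty or unexpected value, gets SIGTERM forwarding only.

```shell
#!/bin/sh
set -o nounset

# Hypothetical helper: maps a role string to the shutdown behavior.
shutdown_action() {
  # Match "server" explicitly; any other value takes the safe default.
  if [ "${1:-}" = "server" ]; then
    echo "drain"
  else
    echo "sigterm-only"
  fi
}

# "bash" stands in for someone running the image with arbitrary args.
for role in server agent bash ""; do
  echo "role='$role' -> $(shutdown_action "$role")"
done
```

Had the check been written as "not agent", the `bash` and empty cases above would have taken the drain path and hit the same kubeconfig problem.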

Testing

Tested locally against k3s v1.34.3+k3s3 with a patched binary (make build re-embeds the script via //go:embed).

$1 validation: Confirmed via docker inspect --format '{{json .Config.Cmd}}' that server containers receive ["server", ...] and agent containers receive ["agent"].

1 server + 1 agent cluster:

  • Agent logs: zero localhost:8080 errors (was infinite loop before this fix)
  • Both nodes Ready, cluster functional

Graceful shutdown (cluster stop):

  • Server logs: Draining node... (drain ran as expected)
  • Agent logs: Sending SIGTERM to k3s... / Waiting for k3s to close... / Bye! (no drain, clean exit)
  • Cluster restarted cleanly, both nodes back to Ready
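
For readers unfamiliar with the agent shutdown path, the following is a simplified sketch of SIGTERM forwarding, assuming a background `sleep` as a stand-in for the k3s process. The variable name `child_pid` and the overall structure are illustrative and may differ from the real script; the log messages mirror the ones quoted above.

```shell
#!/bin/sh
set -o nounset

# Stand-in for the k3s process (assumption: the real entrypoint starts k3s).
sleep 300 &
child_pid=$!

cleanup() {
  echo "Sending SIGTERM to k3s..."
  # Forward the signal to the child; ignore errors if it already exited.
  kill -TERM "$child_pid" 2>/dev/null || true
  echo "Waiting for k3s to close..."
  # wait returns the child's non-zero status after SIGTERM, so tolerate it.
  wait "$child_pid" 2>/dev/null || true
  echo "Bye!"
}

trap cleanup TERM INT

# Invoke the handler directly to simulate a stop request in this sketch.
cleanup
```

No kubectl call appears anywhere in this path, which is why the agent can exit without needing a kubeconfig.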

3 servers + 2 agents:

  • All 5 nodes Ready on create
  • Both agents: zero localhost:8080 errors
  • Cluster stop: all 3 servers drained, both agents SIGTERM-only
  • Cluster restart: all 5 nodes back to Ready

Agent nodes don't have a kubeconfig at the default path, so the
kubectl uncordon/drain calls added in k3d-io#1119 fail in an infinite retry
loop, spamming logs with localhost:8080 connection refused errors.

Gate drain/uncordon on K3S_ROLE=server so agents get clean SIGTERM
forwarding only. Match server explicitly rather than excluding
agent so unknown values fall through to the safe default.

Also captures $1 into K3S_ROLE before defining cleanup(), since $1
inside a function refers to the function's args, not the script's
(would crash under set -o nounset on shutdown).

Fixes k3d-io#1420, k3d-io#1535

Copilot AI left a comment

Pull request overview

Updates the embedded k3d-entrypoint.sh to avoid running kubectl uncordon/kubectl drain on k3s agent nodes, preventing infinite kubectl retry spam when agents lack a usable kubeconfig.

Changes:

  • Capture the initial k3s subcommand ($1) into a script-scope K3S_ROLE to keep trap/cleanup logic working under set -o nounset.
  • Gate kubectl uncordon (startup) and kubectl drain (shutdown) so they only run when K3S_ROLE="server".
  • Keep agent shutdown behavior to SIGTERM forwarding + wait only.

@dpritchett
Contributor Author

@iwilltry42 anything in particular I can do to help triage this one? I appreciate that you're likely to be plenty busy on other projects already.

@daxmc99

daxmc99 commented Mar 23, 2026

👀
Also hit this today

@iwilltry42 iwilltry42 merged commit 2e015b3 into k3d-io:main Mar 24, 2026
10 checks passed
@iwilltry42
Member

Finally merged. Sorry for the long wait and thanks for your contribution @dpritchett !

@dpritchett dpritchett deleted the fix/agent-entrypoint-no-drain branch March 24, 2026 14:43

Development

Successfully merging this pull request may close these issues.

[BUG] The entrypoint script in an agent node continuously fails at the line until kubectl uncordon "$HOSTNAME"; do sleep 3; done as KUBECONFIG not set

4 participants