fix: Added check to prevent CP from joining with same name as worker node.#2263
Conversation
There was a problem hiding this comment.
Thanks for picking this up Ethan!
Could I ask you to create an integration test as well please?
This way we can test the actual hook.
I'm thinking 1) worker, cp and 2) cp, worker (for the later I'm not sure if the worker has permissions to get all nodes so just double check this case).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I've moved the node name duplication check into the get-join-token section instead. This won't cover the case where node A creates a join token, an identically-named node B creates another join token and joins the cluster, and then A attempts to join the cluster, as I think this will probably take some greater design reworking better suited for a follow-on task, i.e. creating an exposed endpoint for joining nodes to see who's in the cluster, and it's not a case that is likely to happen often. |
870e308 to
038719d
Compare
|
edit: this PR is kind of in a limbo state because of the k8sd migration. Once this PR canonical/k8sd#4 gets merged in, then we can update the version used here and this should be good to merge. |
016f271 to
c0be9cf
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
louiseschmidtgen
left a comment
There was a problem hiding this comment.
Two nits, apologies I didn't catch the one earlier:
| f'a node with this name is already part of the cluster: "{shared_node_name}"' | ||
| in error_output | ||
| ), f"Join error message should indicate duplicate node name. Got: {error_output}" |
There was a problem hiding this comment.
I'm sorry that I didn't catch it earlier: can we capitalize the errors in k8sd to export them. Then we can use those here instead of string messages that may become outdated.
There was a problem hiding this comment.
| try: | ||
| util.get_join_token( | ||
| cluster_node, joining_worker, "--worker", name=shared_node_name | ||
| ) | ||
| assert False, "get-join-token should have failed due to duplicate node name" | ||
| except tenacity.RetryError as e: | ||
| LOG.info("get-join-token failed as expected") | ||
| cause = e.last_attempt.exception() | ||
| if not isinstance(cause, subprocess.CalledProcessError): | ||
| raise e | ||
| error_output = ( | ||
| cause.stderr if cause.stderr else cause.stdout if cause.stdout else "" | ||
| ) | ||
| if isinstance(error_output, bytes): | ||
| error_output = error_output.decode() | ||
| assert ( | ||
| f'a node with this name is already part of the cluster: "{shared_node_name}"' | ||
| in error_output | ||
| ), f"Join error message should indicate duplicate node name. Got: {error_output}" |
There was a problem hiding this comment.
There's something neat we can use to clean this piece up a bit: https://docs.pytest.org/en/7.1.x/how-to/assert.html#assertions-about-expected-exceptions
pytest expected exceptions.
Currently, if a CP node attempts to join the cluster with the same name as an existing worker node, it will fail in Kubernetes but persist in the microcluster, creating a mismatch in data. This PR addresses this issue by adding a check in the join logic to kill a CP join if there's a worker with the same name.