Skip to content

[ci] First release test on GKE#53390

Merged
aslonnie merged 22 commits intomasterfrom
khluu/gke_1test
Jun 14, 2025
Merged

[ci] First release test on GKE#53390
aslonnie merged 22 commits intomasterfrom
khluu/gke_1test

Conversation

@khluu
Copy link
Copy Markdown
Contributor

@khluu khluu commented May 29, 2025

No description provided.

p
Signed-off-by: kevin <kevin@anyscale.com>
@khluu khluu requested a review from a team as a code owner May 29, 2025 00:26
@khluu khluu requested a review from aslonnie May 29, 2025 00:28
os.environ[
"GOOGLE_APPLICATION_CREDENTIALS"
] = f"/workdir/{config['aws2gce_credentials']}"
] = f"/workdir/{config['aws2gke_credentials']}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be hard coded here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed to use GCE credentials instead so no changes needed here

# non critical for some tests. So separate it from the general one.
fetch_result_exception = None

if test.get_name().endswith(".kuberay"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why kuberay tests need to be handled differently?

I thought when the right methods are implemented, we should be able to reuse most of the code in this function unchanged?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember we talked about this before.. I intended to do that at the beginning but running release tests with Kuberay doesn't require all the steps that we have for anyscale (like submitting a compute config beforehand) so I don't feel it's necessary to implement all the steps. This one does reuse a lot of existing functions though

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. ok, then could you implement another function with clearer input args, and intercept/branch out earlier?

there a set of vars that are initialized before this, and they are mostly not used.

essentially:

def _run_release_test_anyscale(...): # this is renamed from the existing function
   ...

def _run_release_test_kuberay(...): # this is your new function
   ...

def run_release_test(...):
   if test.get_name().endswith(".kuberay"):
      return _run_release_test_kuberay(...) # only send in required args
   
   return run_release_test_anyscale(...)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also, this should use test.get_env() == "kuberay", right? rather than checking based on name's suffix?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for job in response_json["jobs"]:
print(job)
print()
raise Exception(f"Multiple jobs found for {self.job_name}")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can happen when the job launching gets error in some way and got retried, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the service only returns one job.. i can check more

Comment on lines +255 to +256
import tempfile
import time
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import on file top?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

khluu added 4 commits May 30, 2025 22:31
p
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: kevin <kevin@anyscale.com>
p
Signed-off-by: kevin <kevin@anyscale.com>
@khluu khluu requested a review from aslonnie May 30, 2025 23:51
p
Signed-off-by: kevin <kevin@anyscale.com>
# non critical for some tests. So separate it from the general one.
fetch_result_exception = None

if test.get_name().endswith(".kuberay"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. ok, then could you implement another function with clearer input args, and intercept/branch out earlier?

there a set of vars that are initialized before this, and they are mostly not used.

essentially:

def _run_release_test_anyscale(...): # this is renamed from the existing function
   ...

def _run_release_test_kuberay(...): # this is your new function
   ...

def run_release_test(...):
   if test.get_name().endswith(".kuberay"):
      return _run_release_test_kuberay(...) # only send in required args
   
   return run_release_test_anyscale(...)

Comment on lines +108 to +113
# if no_terminate and run_type == "anyscale_job":
# logger.warning(
# "anyscale_job run type does not support --no-terminate. "
# "Switching to job (Ray Job) run type."
# )
# run_type = "job"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just remove these code? maybe even in a separate PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +84 to +86
response = requests.post(url, json=request, headers=headers)
print(f"Response: {response}")
response.raise_for_status()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have a helper class / function or something for sending request?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a helper if it's just 2 lines?

response = requests.post(url, json=request, headers=headers)
print(f"Response: {response}")
response.raise_for_status()
except Exception as e:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

catch a more specific exception type, rather than generic Exception ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(kind reminder)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed


def _wait_job(self, timeout: int = 7200) -> Tuple[int, float]:
start_time = time.time()
next_status = start_time + 10
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to next_status_timestamp? next_status feels like a status.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# )

maybe_fetch_api_token()
# maybe_fetch_api_token()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this commented out?

this will break existing release test, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# non critical for some tests. So separate it from the general one.
fetch_result_exception = None

if test.get_name().endswith(".kuberay"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also, this should use test.get_env() == "kuberay", right? rather than checking based on name's suffix?

return joined_path


def convert_cluster_compute_to_kuberay_compute_config(compute_config: dict) -> dict:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you separate this function into another PR, and add some unit tests?

also, could you move this function into a new file? maybe kuberay_util.py or something? rather than generic util.py?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done here #53681

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this function here can be removed now? since the other PR is merged?

--hash=sha256:11405fed67b68f406b3f3c7c5ae5104a79d2d309666d10d61b152e91d28fb968 \
--hash=sha256:843934ef8c09e3e858952887467f8256aac3910c55f077a359a65b2b3cde3e68
# via google-api-core
h11==0.14.0 \
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is already an h11==0.16.0 above, how is this getting here? seems to be an incorrect merge?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return config


def upload_working_dir(working_dir: str) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably consider also using this for anyscale, so that the working dir is unified.

khluu added 5 commits June 5, 2025 07:29
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: kevin <kevin@anyscale.com>
p
Signed-off-by: kevin <kevin@anyscale.com>
@khluu khluu requested a review from aslonnie June 9, 2025 22:55
khluu added 2 commits June 10, 2025 09:14
p
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: kevin <kevin@anyscale.com>
@khluu khluu changed the title [ci] Release test on GKE [ci] First release test on GKE Jun 10, 2025
Signed-off-by: kevin <kevin@anyscale.com>
@@ -0,0 +1 @@
GOOGLE_CLOUD_PROJECT="dhyey-dev"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove the mentioning of @dayshah 's name here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol +1

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

smoke_test=smoke_test,
test_definition_root=test_definition_root,
)
else:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: drop the else, and save an indentation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

)


def run_release_test_kuberay(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed offline, please add some unit test for this new function.

response = requests.post(url, json=request, headers=headers)
print(f"Response: {response}")
response.raise_for_status()
except Exception as e:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(kind reminder)

Comment on lines +119 to +139
while True:
now = time.time()
if now >= deadline_timestamp:
self._terminate_job()
if not job_running:
raise JobStartupTimeout(
"Cluster did not start within "
f"{self.cluster_startup_timeout} seconds."
)
raise CommandTimeout(f"Job timed out after {timeout_sec} seconds")

if now >= next_status_timestamp:
if job_running:
logger.info(
f"... job still running ... ({int(now - start_timestamp)} seconds, {int(deadline_timestamp - now)} seconds to timeout)"
)
else:
logger.info(
f"... job not yet running ... ({int(now - start_timestamp)} seconds, {int(deadline_timestamp - now)} seconds to timeout)"
)
next_status_timestamp += JOB_STATUS_CHECK_INTERVAL
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these waiting code copied from the other job manager?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes pretty much

Comment on lines +136 to +139
# if not anyscale_project:
# raise ReleaseTestCLIError(
# "You have to set the ANYSCALE_PROJECT environment variable!"
# )
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

condition this check on if it is running an anyscale job?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

test_definition_root: Optional[str] = None,
log_streaming_limit: int = LAST_LOGS_LENGTH,
) -> Result:
if test.get_env() == "kuberay":
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be using is_kuberay()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +461 to +465
def get_env(self) -> str:
"""
Returns the environment of the test.
"""
return self.get("env")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other than checking is_kuberay, this function does not seem to be used anywhere?

maybe just remove this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return joined_path


def convert_cluster_compute_to_kuberay_compute_config(compute_config: dict) -> dict:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this function here can be removed now? since the other PR is merged?

khluu added 5 commits June 11, 2025 21:03
p
Signed-off-by: kevin <kevin@anyscale.com>
p
Signed-off-by: kevin <kevin@anyscale.com>
p
Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: kevin <kevin@anyscale.com>
@aslonnie aslonnie self-requested a review June 11, 2025 23:47
@khluu
Copy link
Copy Markdown
Contributor Author

khluu commented Jun 13, 2025

Tune test is passing: https://buildkite.com/ray-project/release/builds/45415

@aslonnie aslonnie added the go add ONLY when ready to merge, run all tests label Jun 13, 2025
p
Signed-off-by: kevin <kevin@anyscale.com>
Comment on lines 103 to 104
# Workaround while Anyscale Jobs don't support leaving cluster alive
# after the job has finished.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you remove these comments too?

p
Signed-off-by: kevin <kevin@anyscale.com>
@aslonnie aslonnie self-requested a review June 14, 2025 03:27
Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
@aslonnie aslonnie merged commit af6b469 into master Jun 14, 2025
5 checks passed
@aslonnie aslonnie deleted the khluu/gke_1test branch June 14, 2025 11:51
@aslonnie
Copy link
Copy Markdown
Collaborator

@khluu helped you resolved the merge conflict and merged.

please remember to follow up with unit tests

aslonnie pushed a commit that referenced this pull request Jun 16, 2025
Release tests seem to be failing because of a merge conflict on master
after #53390 was merged where a
param is missing

Signed-off-by: kevin <kevin@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
starts running gke tests nightly

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jun 18, 2025
Release tests seem to be failing because of a merge conflict on master
after #53390 was merged where a
param is missing

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
Release tests seem to be failing because of a merge conflict on master
after ray-project#53390 was merged where a
param is missing

Signed-off-by: kevin <kevin@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
starts running gke tests nightly

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
Release tests seem to be failing because of a merge conflict on master
after #53390 was merged where a
param is missing

Signed-off-by: kevin <kevin@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants