Conversation
| os.environ[ | ||
| "GOOGLE_APPLICATION_CREDENTIALS" | ||
| ] = f"/workdir/{config['aws2gce_credentials']}" | ||
| ] = f"/workdir/{config['aws2gke_credentials']}" |
There was a problem hiding this comment.
this should not be hard coded here?
There was a problem hiding this comment.
fixed to use GCE credentials instead so no changes needed here
release/ray_release/glue.py
Outdated
| # non critical for some tests. So separate it from the general one. | ||
| fetch_result_exception = None | ||
|
|
||
| if test.get_name().endswith(".kuberay"): |
There was a problem hiding this comment.
why kuberay tests need to be handled differently?
I thought when the right methods are implemented, we should be able to reuse most of the code in this function unchanged?
There was a problem hiding this comment.
I remember we talked about this before.. I intended to do that at the beginning but running release tests with Kuberay doesn't require all the steps that we have for anyscale (like submitting a compute config beforehand) so I don't feel it's necessary to implement all the steps. This one does reuse a lot of existing functions though
There was a problem hiding this comment.
hmm.. ok, then could you implement another function with clearer input args, and intercept/branch out earlier?
there a set of vars that are initialized before this, and they are mostly not used.
essentially:
def _run_release_test_anyscale(...): # this is renamed from the existing function
...
def _run_release_test_kuberay(...): # this is your new function
...
def run_release_test(...):
if test.get_name().endswith(".kuberay"):
return _run_release_test_kuberay(...) # only send in required args
return run_release_test_anyscale(...)
There was a problem hiding this comment.
and also, this should use test.get_env() == "kuberay", right? rather than checking based on name's suffix?
| for job in response_json["jobs"]: | ||
| print(job) | ||
| print() | ||
| raise Exception(f"Multiple jobs found for {self.job_name}") |
There was a problem hiding this comment.
this can happen when the job launching gets error in some way and got retried, right?
There was a problem hiding this comment.
I think the service only returns one job.. i can check more
release/ray_release/util.py
Outdated
| import tempfile | ||
| import time |
release/ray_release/glue.py
Outdated
| # non critical for some tests. So separate it from the general one. | ||
| fetch_result_exception = None | ||
|
|
||
| if test.get_name().endswith(".kuberay"): |
There was a problem hiding this comment.
hmm.. ok, then could you implement another function with clearer input args, and intercept/branch out earlier?
there a set of vars that are initialized before this, and they are mostly not used.
essentially:
def _run_release_test_anyscale(...): # this is renamed from the existing function
...
def _run_release_test_kuberay(...): # this is your new function
...
def run_release_test(...):
if test.get_name().endswith(".kuberay"):
return _run_release_test_kuberay(...) # only send in required args
return run_release_test_anyscale(...)
release/ray_release/glue.py
Outdated
| # if no_terminate and run_type == "anyscale_job": | ||
| # logger.warning( | ||
| # "anyscale_job run type does not support --no-terminate. " | ||
| # "Switching to job (Ray Job) run type." | ||
| # ) | ||
| # run_type = "job" |
There was a problem hiding this comment.
just remove these code? maybe even in a separate PR?
| response = requests.post(url, json=request, headers=headers) | ||
| print(f"Response: {response}") | ||
| response.raise_for_status() |
There was a problem hiding this comment.
have a helper class / function or something for sending request?
There was a problem hiding this comment.
do we need a helper if it's just 2 lines?
| response = requests.post(url, json=request, headers=headers) | ||
| print(f"Response: {response}") | ||
| response.raise_for_status() | ||
| except Exception as e: |
There was a problem hiding this comment.
catch a more specific exception type, rather than generic Exception ?
|
|
||
| def _wait_job(self, timeout: int = 7200) -> Tuple[int, float]: | ||
| start_time = time.time() | ||
| next_status = start_time + 10 |
There was a problem hiding this comment.
rename to next_status_timestamp? next_status feels like a status.
| # ) | ||
|
|
||
| maybe_fetch_api_token() | ||
| # maybe_fetch_api_token() |
There was a problem hiding this comment.
why is this commented out?
this will break existing release test, right?
release/ray_release/glue.py
Outdated
| # non critical for some tests. So separate it from the general one. | ||
| fetch_result_exception = None | ||
|
|
||
| if test.get_name().endswith(".kuberay"): |
There was a problem hiding this comment.
and also, this should use test.get_env() == "kuberay", right? rather than checking based on name's suffix?
release/ray_release/util.py
Outdated
| return joined_path | ||
|
|
||
|
|
||
| def convert_cluster_compute_to_kuberay_compute_config(compute_config: dict) -> dict: |
There was a problem hiding this comment.
could you separate this function into another PR, and add some unit tests?
also, could you move this function into a new file? maybe kuberay_util.py or something? rather than generic util.py?
There was a problem hiding this comment.
so this function here can be removed now? since the other PR is merged?
release/requirements_buildkite.txt
Outdated
| --hash=sha256:11405fed67b68f406b3f3c7c5ae5104a79d2d309666d10d61b152e91d28fb968 \ | ||
| --hash=sha256:843934ef8c09e3e858952887467f8256aac3910c55f077a359a65b2b3cde3e68 | ||
| # via google-api-core | ||
| h11==0.14.0 \ |
There was a problem hiding this comment.
there is already an h11==0.16.0 above, how is this getting here? seems to be an incorrect merge?
| return config | ||
|
|
||
|
|
||
| def upload_working_dir(working_dir: str) -> str: |
There was a problem hiding this comment.
we should probably consider also using this for anyscale, so that the working dir is unified.
Signed-off-by: kevin <kevin@anyscale.com>
| @@ -0,0 +1 @@ | |||
| GOOGLE_CLOUD_PROJECT="dhyey-dev" | |||
There was a problem hiding this comment.
can we remove the mentioning of @dayshah 's name here?
release/ray_release/glue.py
Outdated
| smoke_test=smoke_test, | ||
| test_definition_root=test_definition_root, | ||
| ) | ||
| else: |
There was a problem hiding this comment.
nit: drop the else, and save an indentation?
| ) | ||
|
|
||
|
|
||
| def run_release_test_kuberay( |
There was a problem hiding this comment.
as discussed offline, please add some unit test for this new function.
| response = requests.post(url, json=request, headers=headers) | ||
| print(f"Response: {response}") | ||
| response.raise_for_status() | ||
| except Exception as e: |
| while True: | ||
| now = time.time() | ||
| if now >= deadline_timestamp: | ||
| self._terminate_job() | ||
| if not job_running: | ||
| raise JobStartupTimeout( | ||
| "Cluster did not start within " | ||
| f"{self.cluster_startup_timeout} seconds." | ||
| ) | ||
| raise CommandTimeout(f"Job timed out after {timeout_sec} seconds") | ||
|
|
||
| if now >= next_status_timestamp: | ||
| if job_running: | ||
| logger.info( | ||
| f"... job still running ... ({int(now - start_timestamp)} seconds, {int(deadline_timestamp - now)} seconds to timeout)" | ||
| ) | ||
| else: | ||
| logger.info( | ||
| f"... job not yet running ... ({int(now - start_timestamp)} seconds, {int(deadline_timestamp - now)} seconds to timeout)" | ||
| ) | ||
| next_status_timestamp += JOB_STATUS_CHECK_INTERVAL |
There was a problem hiding this comment.
are these waiting code copied from the other job manager?
| # if not anyscale_project: | ||
| # raise ReleaseTestCLIError( | ||
| # "You have to set the ANYSCALE_PROJECT environment variable!" | ||
| # ) |
There was a problem hiding this comment.
condition this check on if it is running an anyscale job?
release/ray_release/glue.py
Outdated
| test_definition_root: Optional[str] = None, | ||
| log_streaming_limit: int = LAST_LOGS_LENGTH, | ||
| ) -> Result: | ||
| if test.get_env() == "kuberay": |
There was a problem hiding this comment.
this should be using is_kuberay()?
release/ray_release/test.py
Outdated
| def get_env(self) -> str: | ||
| """ | ||
| Returns the environment of the test. | ||
| """ | ||
| return self.get("env") |
There was a problem hiding this comment.
other than checking is_kuberay, this function does not seem to be used anywhere?
maybe just remove this?
release/ray_release/util.py
Outdated
| return joined_path | ||
|
|
||
|
|
||
| def convert_cluster_compute_to_kuberay_compute_config(compute_config: dict) -> dict: |
There was a problem hiding this comment.
so this function here can be removed now? since the other PR is merged?
|
Tune test is passing: https://buildkite.com/ray-project/release/builds/45415 |
release/ray_release/glue.py
Outdated
| # Workaround while Anyscale Jobs don't support leaving cluster alive | ||
| # after the job has finished. |
There was a problem hiding this comment.
could you remove these comments too?
Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
cbd7f11 to
aff9290
Compare
|
@khluu helped you resolved the merge conflict and merged. please remember to follow up with unit tests |
Release tests seem to be failing because of a merge conflict on master after #53390 was merged where a param is missing Signed-off-by: kevin <kevin@anyscale.com>
starts running gke tests nightly Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Release tests seem to be failing because of a merge conflict on master after #53390 was merged where a param is missing Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Release tests seem to be failing because of a merge conflict on master after ray-project#53390 was merged where a param is missing Signed-off-by: kevin <kevin@anyscale.com>
starts running gke tests nightly Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Release tests seem to be failing because of a merge conflict on master after #53390 was merged where a param is missing Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
No description provided.