Skip to content

ray health-check#15429

Merged
wuisawesome merged 23 commits intoray-project:masterfrom
wuisawesome:healthcheck
Apr 22, 2021
Merged

ray health-check#15429
wuisawesome merged 23 commits intoray-project:masterfrom
wuisawesome:healthcheck

Conversation

@wuisawesome
Copy link
Copy Markdown
Contributor

Why are these changes needed?

(See RFC, this PR strictly follows it).

Related issue number

#15265

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@wuisawesome
Copy link
Copy Markdown
Contributor Author

cc @yiranwang52

print("Triggered gc.collect() on all workers.")


@cli.command(name="health-check", hidden=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for hidden=True

Comment on lines +1679 to +1683
if time_ok:
sys.exit(0)
else:
sys.exit(1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we sys exit or just throw an error/warning/print an appropriate message.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine. This is only meant for machine and we should think of this as a really annoying way of implementing a function that returns true or false depending on the status.

Concretely, printing in this will just generate lots of spam in the logs (since this will be polled all the time).

# client creation or ping fails, we will still exit with a non-zero
# exit code.
redis_client.ping()
sys.exit(0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this supposed to only return exist statuses?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it purely returns exit codes

@AmeerHajAli AmeerHajAli added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 21, 2021
@wuisawesome
Copy link
Copy Markdown
Contributor Author

Looks like there might be an edge case when starting with ray start --head --ray-client-server-port. i need to look into it before we can merge.

@DmitriGekhtman
Copy link
Copy Markdown
Contributor

Is it straightforward to test the edge-case fix?

@wuisawesome
Copy link
Copy Markdown
Contributor Author

I think I already did? The tests are looking good, i'll push a skip on windows, then let them run one more time.

@wuisawesome
Copy link
Copy Markdown
Contributor Author

serve, rllib, and multi tenancy (also passes locally) tests are all known to be flaky. merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants