[jobs] Monitor jobs in the background to avoid requiring clients to poll by edoakes · Pull Request #22180 · ray-project/ray

edoakes · 2022-02-07T17:33:47Z

Why are these changes needed?

As it stands, job supervisor actor failures are only detected when the client polls for status. This updates the logic to spawn an asyncio task to monitor each job actor so we can catch these in the background.

This simplifies the error handling logic & removes the need to use the private ref._on_completed interface as well.

Also adds the ability to recover from failure in case the dashboard is restarted.
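A rough sketch of the idea described above, not the code in this PR: one asyncio task per job pings its supervisor in a loop and records a failure as soon as a ping raises, so no client has to poll. On restart, monitoring resumes for every job whose stored status is not terminal. `JobMonitor`, `ping`, the status strings, and the dict-backed store are all hypothetical stand-ins chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Illustrative status strings; assumed for this sketch only.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


class JobMonitor:
    """Hypothetical stand-in for the job manager; not Ray's actual class."""

    def __init__(
        self,
        ping: Callable[[str], Awaitable[None]],
        status_store: Dict[str, str],
        interval_s: float = 0.01,
    ) -> None:
        self._ping = ping  # returns while the supervisor is alive, raises if it died
        self._store = status_store  # stands in for persisted job status
        self._interval_s = interval_s
        self.tasks: Dict[str, asyncio.Task] = {}

    def monitor(self, job_id: str) -> None:
        """Spawn a background task; must be called from a running event loop."""
        loop = asyncio.get_running_loop()
        self.tasks[job_id] = loop.create_task(self._watch(job_id))

    def recover(self) -> None:
        """After a restart, resume monitoring every job that was still in flight."""
        for job_id, status in self._store.items():
            if status not in TERMINAL_STATUSES:
                self.monitor(job_id)

    async def _watch(self, job_id: str) -> None:
        while True:
            try:
                await self._ping(job_id)
            except Exception:
                # The failure is recorded here, without any client polling.
                self._store[job_id] = "FAILED"
                return
            await asyncio.sleep(self._interval_s)


async def demo():
    # Pretend this state survived a dashboard restart.
    store = {"job-a": "RUNNING", "job-b": "SUCCEEDED"}

    async def dead_supervisor(job_id: str) -> None:
        raise RuntimeError(f"supervisor for {job_id} is gone")

    monitor = JobMonitor(dead_supervisor, store)
    monitor.recover()
    await asyncio.gather(*monitor.tasks.values())
    return store, sorted(monitor.tasks)


print(asyncio.run(demo()))
```

In the demo only `job-a` gets a monitoring task, because `job-b` was already in a terminal state when the store was read back.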

Related issue number

Closes #22181

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

edoakes · 2022-02-07T17:35:24Z

dashboard/modules/job/common.py

        else:
            return pickle.loads(pickled_status)

+    def get_all_jobs(self) -> Dict[str, JobStatusInfo]:


@architkulkarni @tchordia this is also a step towards being able to support listing jobs: we just need to add some more metadata and call this interface.
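For readers following along, one way an interface like `get_all_jobs` could work is to scan the key-value store for every key that starts with the job status prefix. The sketch below assumes a plain dict in place of the real internal KV store and plain strings in place of `JobStatusInfo`; `InMemoryStatusStore` and `put_status` are hypothetical names, not part of the PR.

```python
import pickle
from typing import Dict

JOB_STATUS_KEY_PREFIX = "_ray_internal_job_status"


class InMemoryStatusStore:
    """Hypothetical dict-backed stand-in for the internal KV store."""

    def __init__(self) -> None:
        self._kv: Dict[str, bytes] = {}

    def put_status(self, job_id: str, status: str) -> None:
        key = f"{JOB_STATUS_KEY_PREFIX}_{job_id}"
        self._kv[key] = pickle.dumps(status)

    def get_all_jobs(self) -> Dict[str, str]:
        """Return {job_id: status} for every key under the status prefix."""
        prefix = JOB_STATUS_KEY_PREFIX + "_"
        return {
            key[len(prefix):]: pickle.loads(value)
            for key, value in self._kv.items()
            if key.startswith(prefix)
        }


store = InMemoryStatusStore()
store.put_status("job-1", "RUNNING")
store.put_status("job-2", "SUCCEEDED")
print(store.get_all_jobs())
```

Listing jobs would then mostly be a matter of storing richer metadata under the same keys.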

dashboard/modules/job/tests/test_job_manager.py

architkulkarni

Looks good to me! Just had a few minor questions.

architkulkarni · 2022-02-07T18:05:52Z

dashboard/modules/job/common.py


-    JOB_STATUS_KEY = "_ray_internal_job_status_{job_id}"
+    JOB_STATUS_KEY_PREFIX = "_ray_internal_job_status"
+    JOB_STATUS_KEY = f"{JOB_STATUS_KEY_PREFIX}_{{job_id}}"


nested f string 🤯

haha yeah, it's a little ugly but the best I could come up with...
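To spell out what the doubled braces do: inside an f-string, `{{` and `}}` are escapes for literal braces, so the prefix is interpolated immediately while `{job_id}` survives as a placeholder for a later `str.format` call. The job ID below is made up for the example.

```python
JOB_STATUS_KEY_PREFIX = "_ray_internal_job_status"
# The prefix is filled in now; {{job_id}} becomes the literal text {job_id}.
JOB_STATUS_KEY = f"{JOB_STATUS_KEY_PREFIX}_{{job_id}}"

print(JOB_STATUS_KEY)
# _ray_internal_job_status_{job_id}

print(JOB_STATUS_KEY.format(job_id="abc123"))
# _ray_internal_job_status_abc123

# Plain concatenation yields the same template without the brace escaping.
assert JOB_STATUS_KEY == JOB_STATUS_KEY_PREFIX + "_{job_id}"
```

The resulting template is the same string as the old hard-coded constant, so existing `JOB_STATUS_KEY.format(job_id=...)` call sites keep working.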

architkulkarni · 2022-02-07T18:17:06Z

dashboard/modules/job/job_manager.py

+                    # exiting is expected.
+                    pass
+                elif isinstance(e, RuntimeEnvSetupError):
+                    logger.info(f"Failed to set up runtime_env for job {job_id}.")


Why info and not error here? (I know this isn't your change, but just curious)

ah, this was my code originally too :)

this is INFO because it is a user error, not an error in job submission, and these are system-level logs. from the standpoint of the job submission server, this is a standard thing that can happen
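A small sketch of the convention described in this reply: failures caused by user input are logged at INFO in the system logs, and anything else is treated as a system-level problem. The `RuntimeEnvSetupError` class here is a local stand-in for the Ray exception of the same name, and the ERROR branch and the `log_job_failure` helper are assumptions added for contrast, not code from the PR.

```python
import logging

logger = logging.getLogger("job_manager_sketch")


class RuntimeEnvSetupError(Exception):
    """Local stand-in for the Ray exception of the same name."""


def log_job_failure(job_id: str, error: Exception) -> int:
    """Log a job failure and return the level used (hypothetical helper)."""
    if isinstance(error, RuntimeEnvSetupError):
        # User error: an expected event from the server's point of view.
        level = logging.INFO
        message = f"Failed to set up runtime_env for job {job_id}."
    else:
        # Assumed for illustration: anything else is a system-level problem.
        level = logging.ERROR
        message = f"Unexpected error in job {job_id}: {error!r}"
    logger.log(level, message)
    return level
```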

dashboard/modules/job/job_manager.py

architkulkarni · 2022-02-07T18:31:09Z

python/ray/_private/test_utils.py

    raise RuntimeError(message)


+async def async_wait_for_condition(


Nice! Surprised we never needed a util function like this before

me too.... also, I really wish there was a good way to share code between sync & async code in python. it's sad I needed to copy-paste this
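For context, a helper with this name could look roughly like the sketch below. Only the function name and the `RuntimeError` on timeout come from the diff excerpt; the parameter names, defaults, and error message are assumptions. The one structural difference from a synchronous version is `await asyncio.sleep(...)` in place of `time.sleep(...)`, which is why the two are hard to share.

```python
import asyncio
import time
from typing import Any, Callable


async def async_wait_for_condition(
    condition: Callable[..., bool],
    timeout: float = 10,
    retry_interval_ms: float = 100,
    **kwargs: Any,
) -> None:
    """Poll `condition(**kwargs)` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() <= deadline:
        if condition(**kwargs):
            return
        # Yield to the event loop between checks instead of blocking it.
        await asyncio.sleep(retry_interval_ms / 1000.0)
    raise RuntimeError("The condition wasn't met before the timeout expired.")
```

A test would typically call it as `await async_wait_for_condition(check_fn, timeout=5)`, where `check_fn` is whatever synchronous predicate the test defines.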

…oll (ray-project#22180)

working

3e1d476

edoakes assigned architkulkarni, jiaodong and tchordia Feb 7, 2022

edoakes added 2 commits February 7, 2022 11:49

fix tests

a666b7a

fix

e74a2e9

architkulkarni approved these changes Feb 7, 2022

fix

b6e3268

edoakes merged commit 8806b2d into ray-project:master Feb 7, 2022

simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022

[jobs] Monitor jobs in the background to avoid requiring clients to p…

100cb0a

…oll (ray-project#22180)

edoakes merged 4 commits into ray-project:master from edoakes:job-background-monitor
