Skip to content

[jobs] Monitor jobs in the background to avoid requiring clients to poll#22180

Merged
edoakes merged 4 commits intoray-project:masterfrom
edoakes:job-background-monitor
Feb 7, 2022
Merged

[jobs] Monitor jobs in the background to avoid requiring clients to poll#22180
edoakes merged 4 commits intoray-project:masterfrom
edoakes:job-background-monitor

Conversation

@edoakes
Copy link
Copy Markdown
Collaborator

@edoakes edoakes commented Feb 7, 2022

Why are these changes needed?

As it stands, job supervisor actor failures are only detected when the client polls for status. This updates the logic to spawn an asyncio task to monitor each job actor so we can catch these in the background.

This simplifies the error handling logic & removes the need to use the private ref._on_completed interface as well.

Also adds ability to recover from failure in case the dashboard is restarted.

Related issue number

Closes #22181

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

else:
return pickle.loads(pickled_status)

def get_all_jobs(self) -> Dict[str, JobStatusInfo]:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@architkulkarni @tchordia this is also a step towards being able to support listing jobs: we just need to add some more metadata and call this interface.

Copy link
Copy Markdown
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just had a few minor questions.


JOB_STATUS_KEY = "_ray_internal_job_status_{job_id}"
JOB_STATUS_KEY_PREFIX = "_ray_internal_job_status"
JOB_STATUS_KEY = f"{JOB_STATUS_KEY_PREFIX}_{{job_id}}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nested f string 🤯

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha yeah, it's a little ugly but the best I could come up with...

# exiting is expected.
pass
elif isinstance(e, RuntimeEnvSetupError):
logger.info(f"Failed to set up runtime_env for job {job_id}.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why info and not error here? (I know this isn't your change, but just curious)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, this was my code originally too :)

this is INFO because it is a user error, not an error in job submission, and these are system-level logs. from the standpoint of the job submission server, this is a standard thing that can happen

raise RuntimeError(message)


async def async_wait_for_condition(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Surprised we never needed a util function like this before

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

me too.... also, I really wish there was a good way to share code between sync & async code in python. it's sad I needed to copy-paste this

@edoakes edoakes merged commit 8806b2d into ray-project:master Feb 7, 2022
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[jobs] Detect job supervisor failures without client polling

4 participants