[jobs] Monitor jobs in the background to avoid requiring clients to poll#22180
edoakes merged 4 commits into ray-project:master
        else:
            return pickle.loads(pickled_status)

    def get_all_jobs(self) -> Dict[str, JobStatusInfo]:
@architkulkarni @tchordia this is also a step towards being able to support listing jobs: we just need to add some more metadata and call this interface.
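For illustration, listing jobs on top of a key-value store could look roughly like the sketch below. This is not the PR's implementation: `InMemoryStatusStore`, `put_status`, the dict-backed store, and the use of plain strings for statuses are all assumptions made so the example runs on its own. Only the key prefix and the use of `pickle` come from the diff.

```python
import pickle
from typing import Dict

JOB_STATUS_KEY_PREFIX = "_ray_internal_job_status"


class InMemoryStatusStore:
    """Hypothetical stand-in for the internal KV store used by the job manager."""

    def __init__(self) -> None:
        self._kv: Dict[str, bytes] = {}

    def put_status(self, job_id: str, status: str) -> None:
        # Statuses are stored pickled under a prefixed key, one key per job.
        self._kv[f"{JOB_STATUS_KEY_PREFIX}_{job_id}"] = pickle.dumps(status)

    def get_all_jobs(self) -> Dict[str, str]:
        # Scan for keys that carry the job-status prefix and strip it
        # to recover the job ID.
        prefix = JOB_STATUS_KEY_PREFIX + "_"
        return {
            key[len(prefix):]: pickle.loads(value)
            for key, value in self._kv.items()
            if key.startswith(prefix)
        }


store = InMemoryStatusStore()
store.put_status("job_a", "RUNNING")
store.put_status("job_b", "SUCCEEDED")
all_jobs = store.get_all_jobs()
```

Supporting a real "list jobs" call would mostly be a matter of storing richer metadata in the pickled value, as the comment above suggests.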
architkulkarni left a comment
Looks good to me! Just had a few minor questions.
- JOB_STATUS_KEY = "_ray_internal_job_status_{job_id}"
+ JOB_STATUS_KEY_PREFIX = "_ray_internal_job_status"
+ JOB_STATUS_KEY = f"{JOB_STATUS_KEY_PREFIX}_{{job_id}}"
haha yeah, it's a little ugly but the best I could come up with...
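For readers puzzled by the doubled braces: inside an f-string, `{{` and `}}` are emitted as literal braces, so the new definition still yields a template with a `{job_id}` placeholder that can be filled in later with `str.format`. A quick check (the job ID used here is made up):

```python
JOB_STATUS_KEY_PREFIX = "_ray_internal_job_status"
# The f-string substitutes the prefix now; the doubled braces survive as a
# literal "{job_id}" placeholder for a later .format() call.
JOB_STATUS_KEY = f"{JOB_STATUS_KEY_PREFIX}_{{job_id}}"

# Same template string as the old hard-coded constant.
assert JOB_STATUS_KEY == "_ray_internal_job_status_{job_id}"

# Hypothetical job ID, for illustration only.
key = JOB_STATUS_KEY.format(job_id="example_job")
```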
        # exiting is expected.
        pass
    elif isinstance(e, RuntimeEnvSetupError):
        logger.info(f"Failed to set up runtime_env for job {job_id}.")
Why info and not error here? (I know this isn't your change, but just curious)
ah, this was my code originally too :)
this is INFO because it is a user error, not an error in job submission, and these are system-level logs. from the standpoint of the job submission server, this is a standard thing that can happen
    raise RuntimeError(message)


async def async_wait_for_condition(
Nice! Surprised we never needed a util function like this before
me too.... also, I really wish there was a good way to share code between sync & async code in python. it's sad I needed to copy-paste this
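Since only the first line of the helper is visible in the diff, here is a minimal sketch of what an async wait-for-condition utility can look like. The parameter names (`condition`, `timeout`, `retry_interval_ms`), their defaults, and the error message are assumptions, not the signature merged in this PR. The only real difference from a synchronous version is `await asyncio.sleep(...)` in place of `time.sleep(...)`, which is why the logic ends up copy-pasted.

```python
import asyncio
import time
from typing import Any, Callable


async def async_wait_for_condition(
    condition: Callable[..., bool],
    timeout: float = 10,
    retry_interval_ms: float = 100,
    **kwargs: Any,
) -> None:
    """Poll `condition(**kwargs)` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition(**kwargs):
            return
        # Yield to the event loop between checks instead of blocking it.
        await asyncio.sleep(retry_interval_ms / 1000.0)
    raise RuntimeError("The condition wasn't met before the timeout expired.")


# Usage: wait until a flag is flipped by another coroutine.
async def _demo() -> bool:
    state = {"done": False}

    async def flip() -> None:
        await asyncio.sleep(0.01)
        state["done"] = True

    flipper = asyncio.create_task(flip())
    await async_wait_for_condition(lambda: state["done"], timeout=1, retry_interval_ms=1)
    await flipper
    return state["done"]


demo_result = asyncio.run(_demo())
```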
Why are these changes needed?
As it stands, job supervisor actor failures are only detected when the client polls for status. This updates the logic to spawn an asyncio task to monitor each job actor so we can catch these in the background.
This simplifies the error handling logic & removes the need to use the private ref._on_completed interface as well.
Also adds the ability to recover from failure in case the dashboard is restarted.
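The background-monitoring idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `FakeSupervisor`, `JobMonitor`, `start_monitoring`, the `ping()` method, and the string statuses are all hypothetical stand-ins. The point is that one asyncio task per job notices a supervisor failure and records it without any client having to poll.

```python
import asyncio
from typing import Dict


class FakeSupervisor:
    """Hypothetical stand-in for a job supervisor actor; dies after a few pings."""

    def __init__(self, pings_before_failure: int) -> None:
        self._remaining = pings_before_failure

    async def ping(self) -> None:
        if self._remaining <= 0:
            raise RuntimeError("supervisor died")
        self._remaining -= 1


class JobMonitor:
    """Hypothetical manager that spawns one monitoring task per job."""

    def __init__(self, poll_interval_s: float = 0.01) -> None:
        self.statuses: Dict[str, str] = {}
        self.tasks: Dict[str, asyncio.Task] = {}
        self._poll_interval_s = poll_interval_s

    def start_monitoring(self, job_id: str, supervisor: FakeSupervisor) -> None:
        # Must be called from within a running event loop.
        self.statuses[job_id] = "RUNNING"
        self.tasks[job_id] = asyncio.create_task(self._monitor(job_id, supervisor))

    async def _monitor(self, job_id: str, supervisor: FakeSupervisor) -> None:
        while True:
            try:
                await supervisor.ping()
            except Exception:
                # The failure is recorded here, in the background,
                # rather than when a client next asks for the status.
                self.statuses[job_id] = "FAILED"
                return
            await asyncio.sleep(self._poll_interval_s)


async def main() -> Dict[str, str]:
    monitor = JobMonitor()
    monitor.start_monitoring("job_a", FakeSupervisor(pings_before_failure=2))
    await asyncio.gather(*monitor.tasks.values())
    return monitor.statuses


statuses = asyncio.run(main())
```

Under the same assumptions, recovery after a restart would amount to reading the persisted job statuses at startup and calling `start_monitoring` again for every job that is still marked as running.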
Related issue number
Closes #22181
Checks
scripts/format.sh to lint the changes in this PR.