@srafehi
Contributor

@srafehi srafehi commented Apr 21, 2019

This commit fixes a race condition which can lead to `WorkerLostError: Worker exited prematurely: exitcode 155` and `WorkerLostError: Worker exited prematurely: exitcode 0` errors.

The race condition occurs when a message sent by `Worker.workloop` is not consumed by `ResultHandler` before the worker is cleaned up by `Pool._join_exited_workers`.

The fix is to keep a count of the messages `ResultHandler` receives from a worker, and for the worker to return only when this count matches its own counter of completed tasks. This ensures that a worker is cleaned up only once `ResultHandler` and the `Worker` are in sync, or after 30 seconds of waiting for them to get in sync.
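
To illustrate the idea, here is a minimal sketch of the counting handshake. It is not the code in this PR: threads stand in for the worker process and the result handler, and `ConsumedCounter`, `ensure_messages_consumed`, `RETRY_LIMIT` and `RETRY_INTERVAL` are hypothetical names and values chosen so that the wait adds up to roughly 30 seconds.

```python
import queue
import threading
import time

# Hypothetical values: 300 retries x 0.1 s gives roughly a 30 second wait.
RETRY_LIMIT = 300
RETRY_INTERVAL = 0.1


class ConsumedCounter:
    """Thread-safe count of messages the result handler has consumed."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def ensure_messages_consumed(counter, completed,
                             retry_limit=RETRY_LIMIT,
                             retry_interval=RETRY_INTERVAL):
    """Poll until the handler has consumed `completed` messages.

    Returns True once both sides are in sync, False if the retries run out.
    """
    for _ in range(retry_limit):
        if counter.value >= completed:
            return True
        time.sleep(retry_interval)
    return False


def result_handler(outqueue, counter):
    """Consume result messages and count each one; None is the stop sentinel."""
    while True:
        message = outqueue.get()
        if message is None:
            break
        counter.increment()


def worker(tasks, outqueue, counter):
    """Run the tasks, then return only once the handler has caught up."""
    completed = 0
    for task in tasks:
        outqueue.put(("result", task()))
        completed += 1
    return ensure_messages_consumed(counter, completed)
```

In this sketch the worker never returns while a result it sent is still unread, unless the timeout expires, which is the property the description above relies on.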


Fixes #278

srafehi added 2 commits April 22, 2019 00:54
Member

@auvipy auvipy left a comment

Hey, thanks for tackling this!! The code looks good so far. Would you mind adding some tests for these changes? That would make the review easier.

@srafehi
Contributor Author

srafehi commented Apr 23, 2019

Yep, definitely. Planning to add some tests during the weekend.

@auvipy
Member

auvipy commented May 6, 2019

Did you get the time to add the tests?

@srafehi
Contributor Author

srafehi commented May 7, 2019

> got the time to add test?

Struggling to find the time to do so, but I might be able to free up a few hours this Sunday and get it done.

The previous change did not account for `wait_for_job` raising a `SystemExit` exception. This change wraps the main processing logic of `Worker.workloop` in a try/finally block, which ensures that `_ensure_messages_consumed` is always called.
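
A rough sketch of why the try/finally matters, under stated assumptions: `workloop` below is a simplified, hypothetical stand-in for `Worker.workloop`, and both `wait_for_job` and `ensure_messages_consumed` are passed in as plain callables rather than being billiard's real methods.

```python
def workloop(wait_for_job, ensure_messages_consumed):
    """Simplified stand-in for a worker loop (hypothetical, not billiard's code).

    `wait_for_job` returns a callable job, returns None to stop, or may raise
    SystemExit. The finally clause runs on every one of those exit paths.
    """
    completed = 0
    try:
        while True:
            job = wait_for_job()  # may raise SystemExit
            if job is None:
                break
            job()
            completed += 1
    finally:
        # Runs even when SystemExit propagates out of wait_for_job.
        ensure_messages_consumed(completed)
    return completed


if __name__ == "__main__":
    calls = []
    jobs = iter([lambda: None, lambda: None])

    def wait_for_job():
        try:
            return next(jobs)
        except StopIteration:
            raise SystemExit(0)

    try:
        workloop(wait_for_job, calls.append)
    except SystemExit:
        pass
    print(calls)  # the sync check ran once, with the completed-task count
```

Without the finally clause, the `SystemExit` would skip the sync check and the worker could again exit with unread messages.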
@auvipy auvipy merged commit 241c0d4 into celery:master Jun 7, 2019
sbuchnick added a commit to ofek-public/billiard that referenced this pull request Aug 22, 2019
Development

Successfully merging this pull request may close these issues.

Race-condition in billiard.pool leads to WorkerLostError: Worker exited prematurely: exitcode 155

2 participants