# Checklist

- I have verified that the issue exists against the `master` branch of Celery.
- This has already been asked to the discussions forum first.
- I have read the relevant section in the contribution guide on reporting bugs.
- I have checked the issues list for similar or identical bug reports.
- I have checked the pull requests list for existing proposed fixes.
- I have checked the commit log to find out if the bug was already fixed in the master branch.
- I have included all related issues and possible duplicate issues in this issue (If there are none, check this box anyway).
## Mandatory Debugging Information

- I have included the output of `celery -A proj report` in the issue (if you are not able to do this, then at least specify the Celery version affected).
- I have verified that the issue exists against the `master` branch of Celery.
- I have included the contents of `pip freeze` in the issue.
- I have included all the versions of all the external dependencies required to reproduce this bug.
## Optional Debugging Information

- I have tried reproducing the issue on more than one Python version and/or implementation.
- I have tried reproducing the issue on more than one message broker and/or result backend.
- I have tried reproducing the issue on more than one version of the message broker and/or result backend.
- I have tried reproducing the issue on more than one operating system.
- I have tried reproducing the issue on more than one workers pool.
- I have tried reproducing the issue with autoscaling, retries, ETA/Countdown & rate limits disabled.
- I have tried reproducing the issue after downgrading and/or upgrading Celery and its dependencies.
## Related Issues and Possible Duplicates

### Related Issues

- None

### Possible Duplicates

- None
# Environment & Settings

**Celery version**: 5.2.0 (dawn-chorus)

`celery report` Output:
# Steps to Reproduce

## Required Dependencies

- **Minimal Python Version**: N/A or Unknown
- **Minimal Celery Version**: N/A or Unknown
- **Minimal Kombu Version**: N/A or Unknown
- **Minimal Broker Version**: N/A or Unknown
- **Minimal Result Backend Version**: N/A or Unknown
- **Minimal OS and/or Kernel Version**: N/A or Unknown
- **Minimal Broker Client Version**: N/A or Unknown
- **Minimal Result Backend Client Version**: N/A or Unknown
### Python Packages

`pip freeze` Output:

### Other Dependencies

N/A
## Minimally Reproducible Test Case
# Expected Behavior

Tasks execute to completion without being killed by the autoscaler.

# Actual Behavior
```
Traceback (most recent call last):
  File "<...>/venv/lib/python3.8/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
    raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM) Job: 9.
```
There is a race condition in the autoscaler logic which causes the autoscaler to kill processes that are just about to execute a task, losing that task/job. The problem is here, in `billiard.pool.shrink`, which is called by the autoscaler in `scale_down()`:
```python
def shrink(self, n=1):
    for i, worker in enumerate(self._iterinactive()):
        self._processes -= 1
        if self._putlock:
            self._putlock.shrink()
        worker.terminate_controlled()
        self.on_shrink(1)
```

The code in `_iterinactive` checks for active workers like this:
```python
def _worker_active(self, worker):
    for job in values(self._cache):
        if worker.pid in job.worker_pids():
            return True
    return False
```

The problem is that `job.worker_pids()` is only properly set after the MainProcess receives and processes the ack from the pool worker. For example, this is `_ack()` in `billiard.pool.ApplyResult`:
```python
def _ack(self, i, time_accepted, pid, synqW_fd):
    with self._mutex:
        if self._cancelled and self._send_ack:
            self._accepted = True
            if synqW_fd:
                return self._send_ack(NACK, pid, self._job, synqW_fd)
            return
        self._accepted = True
        self._time_accepted = time_accepted
        self._worker_pid = pid
```

What we see happening is that jobs are received by the MainProcess and sent out to pool workers in `billiard.pool.apply_async()`, but while a worker is in the process of accepting a job (i.e. the ack has not yet been received by the MainProcess), the autoscaler kicks in and kills the process which picked up that job: since the job in the MainProcess hasn't set `self._worker_pid = pid` yet, `_worker_active()` returns `False` for that worker, despite the fact that the worker has already picked up the job.
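The window can be illustrated with a minimal, self-contained sketch. The `Job` and `Worker` classes below are hypothetical toy stand-ins for `ApplyResult` and a pool worker, not real billiard objects; only the activity-check logic mirrors `_worker_active()`:

```python
class Job:
    """Toy stand-in for billiard.pool.ApplyResult (hypothetical)."""
    def __init__(self):
        self._worker_pid = None  # only set once _ack() is processed

    def worker_pids(self):
        # Empty until the MainProcess has processed the worker's ack.
        return [self._worker_pid] if self._worker_pid else []


class Worker:
    """Toy stand-in for a pool worker process (hypothetical)."""
    def __init__(self, pid):
        self.pid = pid


def worker_active(worker, cache):
    # Same logic as billiard's Pool._worker_active().
    for job in cache.values():
        if worker.pid in job.worker_pids():
            return True
    return False


cache = {9: Job()}         # job 9 has been sent out via apply_async()
worker = Worker(pid=4242)  # ...and this worker has already picked it up

# The ack is still in flight, so the check misses the worker:
print(worker_active(worker, cache))  # False -> shrink() may terminate it

cache[9]._worker_pid = 4242          # ack processed by the MainProcess
print(worker_active(worker, cache))  # True -> worker now counted as busy
```

In the gap between the two checks above, `_iterinactive()` yields the worker and `terminate_controlled()` sends it SIGTERM, producing the `WorkerLostError` in the traceback.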
A possible fix is changing the code in `_worker_active()` to this:
```python
def _worker_active(self, worker):
    for job in values(self._cache):
        worker_pids = job.worker_pids()
        if not worker_pids or worker.pid in worker_pids:
            return True
    return False
```

This is rather crude, since it causes every worker to claim to be busy whenever any job has been sent out but not yet acked (`self._cache` is already populated at that point), just in case that particular worker is the one which picked up the job. But I don't know if there is a better way of preventing the autoscaler from killing jobs which are already accepted by a pool worker but for which the ack is still in flight. Thoughts?
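Using the same hypothetical toy model as above (stand-in classes, not billiard itself), a sketch of the patched check shows both the fix and its conservative side effect: an unacked job marks every worker as busy, while acked jobs still pin only their owning worker.

```python
class Job:
    """Toy stand-in for billiard.pool.ApplyResult (hypothetical)."""
    def __init__(self, worker_pid=None):
        self._worker_pid = worker_pid

    def worker_pids(self):
        return [self._worker_pid] if self._worker_pid else []


class Worker:
    """Toy stand-in for a pool worker process (hypothetical)."""
    def __init__(self, pid):
        self.pid = pid


def worker_active_patched(worker, cache):
    # Proposed fix: treat unacked jobs (empty worker_pids) as
    # potentially belonging to *any* worker.
    for job in cache.values():
        pids = job.worker_pids()
        if not pids or worker.pid in pids:
            return True
    return False


idle = Worker(pid=1)                        # genuinely idle worker

cache = {9: Job()}                          # one job out, ack in flight
print(worker_active_patched(idle, cache))   # True: conservatively busy

cache = {9: Job(worker_pid=2)}              # ack received, owned by pid 2
print(worker_active_patched(idle, cache))   # False: safe to shrink
```

The over-approximation only lasts for the ack round-trip of each job; once all in-flight jobs are acked, idle workers become eligible for shrinking again.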