Autoscaler kills workers with assigned tasks (race condition) #7266

@leomermelstein

Description

Checklist

  • I have verified that the issue exists against the master branch of Celery.
  • This has already been asked to the discussions forum first.
  • I have read the relevant section in the
    contribution guide
    on reporting bugs.
  • I have checked the issues list
    for similar or identical bug reports.
  • I have checked the pull requests list
    for existing proposed fixes.
  • I have checked the commit log
    to find out if the bug was already fixed in the master branch.
  • I have included all related issues and possible duplicate issues
    in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
  • I have verified that the issue exists against the master branch of Celery.
  • I have included the contents of pip freeze in the issue.
  • I have included all the versions of all the external dependencies required
    to reproduce this bug.

Optional Debugging Information

  • I have tried reproducing the issue on more than one Python version
    and/or implementation.
  • I have tried reproducing the issue on more than one message broker and/or
    result backend.
  • I have tried reproducing the issue on more than one version of the message
    broker and/or result backend.
  • I have tried reproducing the issue on more than one operating system.
  • I have tried reproducing the issue on more than one workers pool.
  • I have tried reproducing the issue with autoscaling, retries,
    ETA/Countdown & rate limits disabled.
  • I have tried reproducing the issue after downgrading
    and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

  • None

Possible Duplicates

  • None

Environment & Settings

Celery version: 5.2.0 (dawn-chorus)

celery report Output:

Steps to Reproduce

Required Dependencies

  • Minimal Python Version: N/A or Unknown
  • Minimal Celery Version: N/A or Unknown
  • Minimal Kombu Version: N/A or Unknown
  • Minimal Broker Version: N/A or Unknown
  • Minimal Result Backend Version: N/A or Unknown
  • Minimal OS and/or Kernel Version: N/A or Unknown
  • Minimal Broker Client Version: N/A or Unknown
  • Minimal Result Backend Client Version: N/A or Unknown

Python Packages

pip freeze Output:

Other Dependencies

Details

N/A

Minimally Reproducible Test Case

Details
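
A hypothetical setup that should make the window easy to hit (an untested sketch; the app name, broker URL, and task are illustrative): flood an autoscaling prefork worker with short tasks so the pool is constantly scaling down while acks are still in flight.

    # app.py -- illustrative sketch only.
    from celery import Celery

    app = Celery('repro', broker='redis://localhost:6379/0')

    @app.task
    def noop():
        pass

    # Run a worker with autoscaling enabled:
    #   celery -A app worker --autoscale=10,1
    # ...then enqueue bursts of tasks so scale_down races the acks:
    #   python -c "from app import noop; [noop.delay() for _ in range(1000)]"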

Expected Behavior

Tasks execute to completion; workers that have accepted a task are not terminated by the autoscaler.

Actual Behavior

    Traceback (most recent call last):
      File "<...>/venv/lib/python3.8/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
        raise WorkerLostError(
    billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM) Job: 9.

There is a race condition in the autoscaler logic that causes the autoscaler to kill processes that are just about to execute a task, losing that task/job. The problem is in billiard.pool.Pool.shrink, which the autoscaler calls from scale_down():

    def shrink(self, n=1):
        for i, worker in enumerate(self._iterinactive()):
            self._processes -= 1
            if self._putlock:
                self._putlock.shrink()
            worker.terminate_controlled()
            self.on_shrink(1)
            if i >= n - 1:
                break
        else:
            raise ValueError("Can't shrink pool. All processes busy!")
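
For context, _iterinactive (paraphrased here from billiard.pool.Pool; the exact code may differ slightly across versions) only yields workers that _worker_active considers idle, so shrink should never terminate a busy worker:

    # Paraphrased from billiard.pool.Pool -- exact code may vary by version.
    def _iterinactive(self):
        for worker in self._pool:
            if not self._worker_active(worker):
                yield worker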

The _worker_active check it relies on looks like this:

    def _worker_active(self, worker):
        for job in values(self._cache):
            if worker.pid in job.worker_pids():
                return True
        return False

The problem is that job.worker_pids() is only populated after the MainProcess receives and processes the ack from the pool worker. For example, this is _ack() in billiard.pool.ApplyResult:

    def _ack(self, i, time_accepted, pid, synqW_fd):
        with self._mutex:
            if self._cancelled and self._send_ack:
                self._accepted = True
                if synqW_fd:
                    return self._send_ack(NACK, pid, self._job, synqW_fd)
                return
            self._accepted = True
            self._time_accepted = time_accepted
            self._worker_pid = pid

What we see happening is that jobs are received by the MainProcess and handed out to PoolWorkers in billiard.pool.apply_async(). While a worker is still in the process of accepting a job, i.e. the ack has not yet been received by the MainProcess, the autoscaler kicks in and kills the very process that picked up the job: since the job in the MainProcess has not set self._worker_pid = pid yet, _worker_active() returns False for that worker, despite the fact that the worker has already picked up the job.
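
To make the window concrete, here is a minimal, self-contained simulation of the race (Job, Worker, and worker_active below are illustrative stand-ins, not billiard's actual classes): the ack that records the owning pid arrives asynchronously, so a shrink check that runs between dispatch and ack sees the worker as inactive and terminates it.

    # Illustrative simulation of the race; not billiard's real classes.
    import threading
    import time

    class Job:
        def __init__(self):
            self._worker_pid = None          # set only when the ack is processed

        def worker_pids(self):
            return [self._worker_pid] if self._worker_pid else []

    class Worker:
        def __init__(self, pid):
            self.pid = pid
            self.terminated = False

        def terminate_controlled(self):
            self.terminated = True

    def worker_active(worker, cache):
        # Same check as Pool._worker_active above.
        return any(worker.pid in job.worker_pids() for job in cache.values())

    cache = {9: Job()}                       # job dispatched, ack still in flight
    worker = Worker(pid=1234)

    def deliver_ack():                       # the ack arrives a little later
        time.sleep(0.05)
        cache[9]._worker_pid = worker.pid

    threading.Thread(target=deliver_ack).start()

    # The autoscaler shrinks *now*, before the ack lands: the worker looks idle.
    if not worker_active(worker, cache):
        worker.terminate_controlled()        # kills the worker that owns job 9

    time.sleep(0.1)
    print(worker.terminated, cache[9].worker_pids())   # True [1234] -> WorkerLostError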

A possible fix is changing the code in _worker_active() to this:

    def _worker_active(self, worker):
        for job in values(self._cache):
            worker_pids = job.worker_pids()
            if not worker_pids or worker.pid in worker_pids:
                return True
        return False

This is rather crude, since it will make every worker claim to be busy whenever there are jobs that have been sent out to workers (self._cache is populated at that point) but not yet acked, just in case any given worker has picked up one of those jobs while the ack is still in flight. But I don't know if there is a better way of preventing the autoscaler from killing jobs which are already accepted by a pool worker but for which the ack is still in flight. Thoughts?
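
For what it's worth, one way to make that intent explicit (a sketch only; behaviorally equivalent to the patch above) is to separate the "un-acked job exists" case from the ordinary pid-membership test:

    # Sketch: equivalent to the patch above, but the un-acked case is
    # handled explicitly instead of folded into the membership test.
    def _worker_active(self, worker):
        any_unacked = False
        for job in values(self._cache):
            pids = job.worker_pids()
            if not pids:
                any_unacked = True           # ack in flight: owner unknown
            elif worker.pid in pids:
                return True                  # worker definitely owns a job
        # While any job's owner is unknown, conservatively report every
        # worker as busy so shrink cannot terminate the unknown owner.
        return any_unacked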
