Autoscaler kills workers with assigned tasks (race condition) #7266

@leomermelstein

Description

Checklist

  • I have verified that the issue exists against the master branch of Celery.
  • This has already been asked to the discussions forum first.
  • I have read the relevant section in the
    contribution guide
    on reporting bugs.
  • I have checked the issues list
    for similar or identical bug reports.
  • I have checked the pull requests list
    for existing proposed fixes.
  • I have checked the commit log
    to find out if the bug was already fixed in the master branch.
  • I have included all related issues and possible duplicate issues
    in this issue (If there are none, check this box anyway).

Mandatory Debugging Information

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
  • I have verified that the issue exists against the master branch of Celery.
  • I have included the contents of pip freeze in the issue.
  • I have included all the versions of all the external dependencies required
    to reproduce this bug.

Optional Debugging Information

  • I have tried reproducing the issue on more than one Python version
    and/or implementation.
  • I have tried reproducing the issue on more than one message broker and/or
    result backend.
  • I have tried reproducing the issue on more than one version of the message
    broker and/or result backend.
  • I have tried reproducing the issue on more than one operating system.
  • I have tried reproducing the issue on more than one workers pool.
  • I have tried reproducing the issue with autoscaling, retries,
    ETA/Countdown & rate limits disabled.
  • I have tried reproducing the issue after downgrading
    and/or upgrading Celery and its dependencies.

Related Issues and Possible Duplicates

Related Issues

  • None

Possible Duplicates

  • None

Environment & Settings

Celery version: 5.2.0 (dawn-chorus)

celery report Output:

Steps to Reproduce

Required Dependencies

  • Minimal Python Version: N/A or Unknown
  • Minimal Celery Version: N/A or Unknown
  • Minimal Kombu Version: N/A or Unknown
  • Minimal Broker Version: N/A or Unknown
  • Minimal Result Backend Version: N/A or Unknown
  • Minimal OS and/or Kernel Version: N/A or Unknown
  • Minimal Broker Client Version: N/A or Unknown
  • Minimal Result Backend Client Version: N/A or Unknown

Python Packages

pip freeze Output:

Other Dependencies

Details

N/A

Minimally Reproducible Test Case

Details
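
A hypothetical setup that should make the window easy to hit (an untested sketch; the app name, broker URL, and task are illustrative): flood an autoscaling prefork worker with short tasks so the pool is constantly scaling down while acks are still in flight.

    # app.py -- illustrative sketch only.
    from celery import Celery

    app = Celery('repro', broker='redis://localhost:6379/0')

    @app.task
    def noop():
        pass

    # Run a worker with autoscaling enabled:
    #   celery -A app worker --autoscale=10,1
    # ...then enqueue bursts of tasks so scale_down races the acks:
    #   python -c "from app import noop; [noop.delay() for _ in range(1000)]"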

Expected Behavior

Tasks execute to completion; workers that have accepted a task are not terminated by the autoscaler.

Actual Behavior

    Traceback (most recent call last):
      File "<...>/venv/lib/python3.8/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
        raise WorkerLostError(
    billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM) Job: 9.

There is a race condition in the autoscaler logic that causes the autoscaler to kill processes that are just about to execute a task, losing that task/job. The problem is in billiard.pool.Pool.shrink, which the autoscaler calls from scale_down():

    def shrink(self, n=1):
        for i, worker in enumerate(self._iterinactive()):
            self._processes -= 1
            if self._putlock:
                self._putlock.shrink()
            worker.terminate_controlled()
            self.on_shrink(1)
            if i >= n - 1:
                break
        else:
            raise ValueError("Can't shrink pool. All processes busy!")
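
For context, _iterinactive (paraphrased here from billiard.pool.Pool; the exact code may differ slightly across versions) only yields workers that _worker_active considers idle, so shrink should never terminate a busy worker:

    # Paraphrased from billiard.pool.Pool -- exact code may vary by version.
    def _iterinactive(self):
        for worker in self._pool:
            if not self._worker_active(worker):
                yield worker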

The _worker_active check it relies on looks like this:

    def _worker_active(self, worker):
        for job in values(self._cache):
            if worker.pid in job.worker_pids():
                return True
        return False

The problem is that job.worker_pids() is only populated after the MainProcess receives and processes the ack from the pool worker. For example, this is _ack() in billiard.pool.ApplyResult:

    def _ack(self, i, time_accepted, pid, synqW_fd):
        with self._mutex:
            if self._cancelled and self._send_ack:
                self._accepted = True
                if synqW_fd:
                    return self._send_ack(NACK, pid, self._job, synqW_fd)
                return
            self._accepted = True
            self._time_accepted = time_accepted
            self._worker_pid = pid

What we see happening is that jobs are received by the MainProcess and handed out to PoolWorkers in billiard.pool.apply_async(). While a worker is still in the process of accepting a job, i.e. the ack has not yet been received by the MainProcess, the autoscaler kicks in and kills the very process that picked up the job: since the job in the MainProcess has not set self._worker_pid = pid yet, _worker_active() returns False for that worker, despite the fact that the worker has already picked up the job.
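
To make the window concrete, here is a minimal, self-contained simulation of the race (Job, Worker, and worker_active below are illustrative stand-ins, not billiard's actual classes): the ack that records the owning pid arrives asynchronously, so a shrink check that runs between dispatch and ack sees the worker as inactive and terminates it.

    # Illustrative simulation of the race; not billiard's real classes.
    import threading
    import time

    class Job:
        def __init__(self):
            self._worker_pid = None          # set only when the ack is processed

        def worker_pids(self):
            return [self._worker_pid] if self._worker_pid else []

    class Worker:
        def __init__(self, pid):
            self.pid = pid
            self.terminated = False

        def terminate_controlled(self):
            self.terminated = True

    def worker_active(worker, cache):
        # Same check as Pool._worker_active above.
        return any(worker.pid in job.worker_pids() for job in cache.values())

    cache = {9: Job()}                       # job dispatched, ack still in flight
    worker = Worker(pid=1234)

    def deliver_ack():                       # the ack arrives a little later
        time.sleep(0.05)
        cache[9]._worker_pid = worker.pid

    threading.Thread(target=deliver_ack).start()

    # The autoscaler shrinks *now*, before the ack lands: the worker looks idle.
    if not worker_active(worker, cache):
        worker.terminate_controlled()        # kills the worker that owns job 9

    time.sleep(0.1)
    print(worker.terminated, cache[9].worker_pids())   # True [1234] -> WorkerLostError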

A possible fix is changing the code in _worker_active() to this:

    def _worker_active(self, worker):
        for job in values(self._cache):
            worker_pids = job.worker_pids()
            if not worker_pids or worker.pid in worker_pids:
                return True
        return False

This is rather crude, since it will make every worker claim to be busy whenever there are jobs that have been sent out to workers (self._cache is populated at that point) but not yet acked, just in case any given worker has picked up one of those jobs while the ack is still in flight. But I don't know if there is a better way of preventing the autoscaler from killing jobs which are already accepted by a pool worker but for which the ack is still in flight. Thoughts?
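
For what it's worth, one way to make that intent explicit (a sketch only; behaviorally equivalent to the patch above) is to separate the "un-acked job exists" case from the ordinary pid-membership test:

    # Sketch: equivalent to the patch above, but the un-acked case is
    # handled explicitly instead of folded into the membership test.
    def _worker_active(self, worker):
        any_unacked = False
        for job in values(self._cache):
            pids = job.worker_pids()
            if not pids:
                any_unacked = True           # ack in flight: owner unknown
            elif worker.pid in pids:
                return True                  # worker definitely owns a job
        # While any job's owner is unknown, conservatively report every
        # worker as busy so shrink cannot terminate the unknown owner.
        return any_unacked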
