
Celery worker OOM due to unbounded prefetch on eta tasks #9849

@AyoubOm

Description


Summary

We are encountering out-of-memory (OOM) loop issues on our Celery workers when our Redis queue is suddenly loaded with either a lot of failed tasks or a lot of scheduled tasks.

The reason is that a Celery worker keeps fetching scheduled tasks that have an eta or countdown until the number of ready-to-run tasks it holds reaches the worker_prefetch_multiplier limit. When the queue contains a lot of scheduled tasks (in our case, thousands), the worker keeps fetching them until it fills up and goes out of memory. Those tasks then go back to the queue and get fetched by other workers, which in their turn go out of memory, and so on.
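
To make the scenario concrete, here is a minimal reproduction sketch (the app name, broker URL, and task are illustrative, not taken from our setup): enqueue a large number of tasks with a far-future countdown and start a single worker with the default prefetch settings.

```python
from celery import Celery

# Illustrative app; any Redis broker exhibits the behaviour described above.
app = Celery("repro", broker="redis://localhost:6379/0")


@app.task
def noop():
    pass


if __name__ == "__main__":
    # Flood the queue with ETA tasks. A worker started with the default
    # worker_prefetch_multiplier keeps pulling these into memory, because
    # tasks with an eta/countdown never count as "ready to run" and so never
    # satisfy the prefetch limit.
    for _ in range(100_000):
        noop.apply_async(countdown=3600)
```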

I checked the code and saw that there is currently no limit on prefetching tasks that have an eta. Here is the line in strategy.py

And here is where, in kombu, the prefetch count is incremented immediately, in the QoS class.
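
To illustrate the pattern, here is a simplified sketch of that flow (this is not the actual Celery/kombu source, just my understanding of it): every received ETA task is kept in worker memory and the prefetch count is raised, with no upper bound.

```python
class QoS:
    """Simplified sketch of kombu's QoS bookkeeping (not the real class)."""

    def __init__(self, initial_value):
        self.value = initial_value  # current prefetch count

    def increment_eventually(self, n=1):
        # No upper bound: each call raises the prefetch count, so the worker
        # allows itself to receive yet another message from the broker.
        if self.value:
            self.value += max(n, 0)
        return self.value


def on_task_received(task, qos, eta_tasks):
    """Hypothetical handler mirroring the flow described above."""
    if task.get("eta"):
        # The ETA task is held in the worker's memory until it is due...
        eta_tasks.append(task)
        # ...and the prefetch count is bumped immediately, with no limit,
        # which is what lets the worker keep fetching more ETA tasks.
        qos.increment_eventually()
```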

I was expecting a configuration property that could control this limit, but currently there is no such setting. Could we please add one?
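
For example, something along these lines would cover our case (the setting name below is purely hypothetical; no such option exists in Celery today):

```python
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

# Hypothetical setting, only to illustrate the kind of limit being requested;
# Celery currently has no such configuration option.
app.conf.worker_eta_task_limit = 1000  # cap on eta/countdown tasks held in a worker's memory
```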

Checklist

  • I have verified that the issue exists against the main branch of Celery.
  • This has already been asked on the discussions forum first.
  • I have read the relevant section in the
    contribution guide
    on reporting bugs.
  • I have checked the issues list
    for similar or identical bug reports.
  • I have checked the pull requests list
    for existing proposed fixes.
  • I have checked the commit log
    to find out if the bug was already fixed in the main branch.
  • I have included all related issues and possible duplicate issues
    in this issue (If there are none, check this box anyway).
  • I have tried to reproduce the issue with pytest-celery and added the reproduction script below.

Mandatory Debugging Information

  • I have included the output of celery -A proj report in the issue.
    (if you are not able to do this, then at least specify the Celery
    version affected).
  • I have verified that the issue exists against the main branch of Celery.
  • I have included the contents of pip freeze in the issue.
  • I have included all the versions of all the external dependencies required
    to reproduce this bug.

Optional Debugging Information

  • I have tried reproducing the issue on more than one Python version
    and/or implementation.
  • I have tried reproducing the issue on more than one message broker and/or
    result backend.
  • I have tried reproducing the issue on more than one version of the message
    broker and/or result backend.
  • I have tried reproducing the issue on more than one operating system.
  • I have tried reproducing the issue on more than one worker pool.
  • I have tried reproducing the issue with autoscaling, retries,
    ETA/Countdown & rate limits disabled.
  • I have tried reproducing the issue after downgrading
    and/or upgrading Celery and its dependencies.
