Worker stops consuming tasks after RabbitMQ reconnection on Celery 5 #9095
Replies: 9 comments 17 replies
-
Hi @bdoublet91,
Can you provide your package versions and configuration so I can try to reproduce your situation?
-
Hi, I don't know if this explains anything, but for now I'm going to split Celery so that there is one worker per queue (separating debug and perf).
-
Just adding a data point here: we're experiencing the same issue, where the worker stalls after a reconnect. We thought this was a RabbitMQ issue at first, with nodes going down, so we switched to classic mirrored queues (since quorum queues aren't supported by stable Celery yet). It seems we're still hitting the issue intermittently. Our setup is Celery 5.2.7, kombu 5.3.0, RabbitMQ 3.12. The worker is a single instance, prefork, with autoscale=0,5. Generally it goes like this:
One thing I've been meaning to try next time this happens is to check whether the connection is still there from the RabbitMQ side. I'm guessing it's not, or RabbitMQ would complain about trying to deliver messages, but maybe they are getting delivered/prefetched and not acked?
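The prefetched-but-not-acked check suggested above can be done from the broker side. A hypothetical diagnostic sketch, assuming shell access to the RabbitMQ node (queue and connection names will be specific to your deployment):

```shell
# Is the worker's consumer still registered on the broker?
rabbitmqctl list_consumers

# Are messages sitting unacknowledged (i.e. prefetched but never acked)?
rabbitmqctl list_queues name messages_ready messages_unacknowledged

# Is the worker's connection still open from RabbitMQ's point of view?
rabbitmqctl list_connections name state channels
```

If `messages_unacknowledged` is non-zero while the worker is idle, that would support the delivered-but-never-acked hypothesis.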
-
Any updates on this?
-
We are also experiencing this issue every week or two. Are there any workarounds?
-
Actually, it's this stack trace that causes Celery to restart and reconnect: the worker received a task and never executed it. It has been hanging for two days in production. @auvipy, do you have any idea what's causing this, given the stack trace? :) Thanks
-
Any update, please? Could I provide more debug information? I can also confirm that it depends on the code running on Celery: we made some modifications to our function and the consumer started to fail last week. But the main problem here is that Celery reconnects to RabbitMQ and receives a task but never executes it. The RabbitMQ management interface also shows 0 consumers until I restart the Celery container.
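Until the root cause is fixed, a stalled worker like the one described above can at least be detected and bounced automatically. A hypothetical liveness-check sketch using Celery's real `inspect ping` broadcast command; the app name `proj`, worker name `worker1@host`, and the restart command are placeholders for your setup:

```shell
# If the worker no longer answers a broadcast ping within 10 seconds,
# restart its container (restart command is deployment-specific).
celery -A proj inspect ping -d worker1@host --timeout 10 \
  || docker restart celery-worker
```

Run from cron or a sidecar, this papers over the hang without addressing why the consumer disappears.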
-
Same issue here. One task can run very long, longer than the consumer_timeout (and we're using acks_late), so we get a PRECONDITION_FAILED message and a broker disconnection. We also have KEDA autoscaling in our Kubernetes cluster, which spawns more workers based on CPU usage and the count of messages in the consumed queue. I've seen some people suggest using py-amqp instead of amqp in their broker URL; could it be more stable? Thank you
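The acks_late/consumer_timeout interaction described above is a known sharp edge: with late acks, RabbitMQ closes the channel if a delivery stays unacked longer than its consumer timeout. A minimal configuration sketch; the setting names are real (Celery 5 and rabbitmq.conf), but the values are illustrative assumptions, not recommendations:

```python
# celeryconfig.py -- sketch for long-running tasks with late acks.

task_acks_late = True             # ack only after the task finishes
task_reject_on_worker_lost = True # requeue if the worker process dies

# Keep the hard time limit (seconds) below RabbitMQ's consumer_timeout,
# or raise consumer_timeout in rabbitmq.conf instead, e.g.:
#   consumer_timeout = 7200000    # milliseconds (2 hours)
task_time_limit = 3600
```

Whichever side you adjust, the invariant is the same: the longest possible task runtime must stay under the broker's consumer timeout, or the PRECONDITION_FAILED disconnect will recur.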




-
Hi,
I started this discussion related to #7276 on Redis, but the issue also occurs with RabbitMQ.
I am experiencing an issue with Celery 5.3 after migrating from Celery 4.4 in order to use the new broker lost-channel retry setting.
I am using RabbitMQ (3.11) as the message broker. Sometimes the worker stops consuming tasks indefinitely after it restarts, for whatever reason. Once I force a restart of the worker, it starts consuming tasks again.
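For context on the reconnection behaviour involved here: Celery's broker reconnects go through kombu's `ensure_connection`, which retries with a linear backoff controlled by `interval_start`, `interval_step`, and `interval_max`. A minimal sketch of that schedule; the defaults shown (2s start/step, 30s cap, 100 retries) are my assumptions about the library's defaults, not taken from this thread:

```python
def retry_intervals(max_retries=100, interval_start=2.0,
                    interval_step=2.0, interval_max=30.0):
    """Yield the wait (in seconds) before each reconnection attempt.

    Mirrors linear backoff: start at interval_start, grow by
    interval_step per attempt, and cap at interval_max.
    """
    for n in range(max_retries):
        yield min(interval_start + n * interval_step, interval_max)

# First five waits between reconnection attempts:
print(list(retry_intervals(max_retries=5)))  # [2.0, 4.0, 6.0, 8.0, 10.0]
```

Note that this schedule only governs re-establishing the *connection*; the bug discussed here is that the worker reconnects successfully yet never resumes consuming, which the backoff settings cannot fix.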
This only stops happening when I run the Celery 5 worker without heartbeat/gossip/mingle; in that case I can restart RabbitMQ without the worker ceasing to consume tasks after it reconnects.
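The workaround above corresponds to three real flags on the Celery 5 worker CLI. A sketch of the invocation, assuming a hypothetical app module named `proj`:

```shell
# Start the worker with mingle, gossip, and remote heartbeats disabled,
# which (per the report above) avoids the stall after reconnection.
celery -A proj worker \
  --without-heartbeat \
  --without-gossip \
  --without-mingle \
  --loglevel=INFO
```

The trade-off is losing worker-to-worker synchronization (mingle), cluster event exchange (gossip), and AMQP heartbeat failure detection.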
I am running the worker with the following options:
Thanks