db connection closed and worker hangs with celery 4.2+ #4878

@craigds

Description

This may actually be two separate bugs, but they seem related.

Checklist

  • I have included the output of celery -A proj report in the issue.
  • I have verified that the issue exists against the master branch of Celery.

Steps to reproduce

Run multiple Celery workers with celery multi, each with its own queue, then dispatch tasks to them.
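
For reference, the shape of the reproduction is roughly this (a minimal sketch; the names and broker URL are placeholders, and it assumes Django is configured and django-celery-results is installed, as in our setup):

# Minimal sketch of the setup -- names and the broker URL are placeholders.
from celery import Celery

app = Celery('repro', broker='amqp://guest@localhost//')
app.conf.result_backend = 'django-db'  # needs django-celery-results and a configured Django

@app.task
def debug_task():
    return 'ok'

# Each `celery multi` node consumes its own queue; dispatching to every
# queue should run tasks on every worker.
for queue_name in ('celery', 'iris'):
    debug_task.apply_async(queue=queue_name)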

Expected behavior

tasks should run on all workers.

Actual behavior

Most workers explode; every task on those nodes produces the following:

[2018-07-04 02:52:21,280: WARNING/ForkPoolWorker-1] Request: <Context: {'origin': 'gen19542@vm', u'args': [], u'chain': None, 'root_id': '0c008f5a-507d-4b08-8082-0dcc0ab3495d', 'expires': None, u'is_eager': False, u'correlation_id': '0c008f5a-507d-4b08-8082-0dcc0ab3495d', u'chord': None, u'reply_to': '5e48e3d9-0b67-3122-b0c5-9b83efd9e5ca', 'id': '0c008f5a-507d-4b08-8082-0dcc0ab3495d', 'kwargsrepr': '{}', 'lang': 'py', 'retries': 0, 'task': 'dandelion.geostore.celery.debug_task', 'group': None, 'timelimit': [None, None], u'delivery_info': {u'priority': 0, u'redelivered': False, u'routing_key': u'celery', u'exchange': u''}, u'hostname': u'celery@vm', 'called_directly': False, 'parent_id': None, 'argsrepr': '()', u'errbacks': None, u'callbacks': None, u'kwargs': {}, 'eta': None, '_protected': 1}>
[2018-07-04 02:52:21,288: WARNING/ForkPoolWorker-1] /kx/dandelion/venv/local/lib/python2.7/site-packages/celery/app/trace.py:561: RuntimeWarning: Exception raised outside body: OperationalError('could not receive data from server: Bad file descriptor\n',):
Traceback (most recent call last):
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/celery/app/trace.py", line 442, in trace_task
    uuid, retval, task_request, publish_result,
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/celery/backends/base.py", line 144, in mark_as_done
    self.store_result(task_id, result, state, request=request)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/celery/backends/base.py", line 316, in store_result
    request=request, **kwargs)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django_celery_results/backends/database.py", line 29, in _store_result
    meta=meta,
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django_celery_results/managers.py", line 50, in _inner
    return fun(*args, **kwargs)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django_celery_results/managers.py", line 119, in store_result
    obj, created = self.get_or_create(task_id=task_id, defaults=fields)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/models/query.py", line 473, in get_or_create
    return self.get(**lookup), False
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/models/query.py", line 379, in get
    num = len(clone)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/models/query.py", line 238, in __len__
    self._fetch_all()
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/models/query.py", line 1087, in _fetch_all
    self._result_cache = list(self.iterator())
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/models/query.py", line 54, in __iter__
    results = compiler.execute_sql()
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 835, in execute_sql
    cursor.execute(sql, params)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 79, in execute
    return super(CursorDebugWrapper, self).execute(sql, params)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/raven/contrib/django/client.py", line 127, in execute
    return real_execute(self, sql, params)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/kx/dandelion/venv/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
OperationalError: could not receive data from server: Bad file descriptor

I'm guessing something is closing the Django DB connection before the result backend tries to use it?
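
If that's what is happening, a workaround along these lines might paper over it (a sketch only, not a confirmed fix): reset Django's connections after the pool forks and drop stale ones before each task, so the result backend always gets a usable connection.

# Workaround sketch (assumption, not a confirmed fix).
from celery.signals import task_prerun, worker_process_init
from django.db import close_old_connections, connections

@worker_process_init.connect
def reset_db_connections(**kwargs):
    # Don't reuse a connection inherited from the parent process after fork;
    # Django will open a fresh one on next use.
    connections.close_all()

@task_prerun.connect
def drop_stale_connections(**kwargs):
    # Discard connections the server may already have closed.
    close_old_connections()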

One worker succeeds but then hangs

In another worker (iris), one task actually succeeds without hitting the above error, but then the worker hangs forever. Telling it to stop with the stopwait command just hangs stopwait too. There is nothing in the log for this worker from the time the task completes; the task completion itself is never logged, although print statements inserted at the end of the task do show up in the log.
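
To find out where the hung worker is stuck, something like this might help (a debugging sketch, not something from the report): install a SIGUSR1 handler in a module every worker imports (e.g. the Celery app module), then send SIGUSR1 to the hung process to dump its Python stack to stderr.

# Debugging sketch: dump the current Python stack when the process gets SIGUSR1.
import signal
import sys
import traceback

def _dump_stack(signum, frame):
    traceback.print_stack(frame, file=sys.stderr)

signal.signal(signal.SIGUSR1, _dump_stack)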

Our init script to run all the workers is here: https://gist.github.com/craigds/42f3ad627d68e94aefa909c2b1a4a0c0

As far as I can tell, the tasks aren't doing anything particularly interesting. Putting pass as the task body doesn't change anything.

Tasks are defined like this:

from celery import shared_task

@shared_task(bind=True, queue='iris')
def queue_slow_rendered_tiles(self):
    ...

This works well with celery 4.1.1. What other information can I provide that might illuminate the issue?

Theory

The only difference I can think of with the iris worker is that it doesn't use the Django database in its tasks. Presumably the other workers, which do use the database, are exploding before they reach this hang, whereas the iris worker skips the DB connection close and hits a different bug. Maybe that's a clue?

celery report output

I've grepped out the relevant bits; the entirety of our Django settings is not relevant here:

software -> celery:4.2.0 (windowlicker) kombu:4.2.1 py:2.7.6
            billiard:3.5.0.3 py-amqp:2.3.2
platform -> system:Linux arch:64bit, ELF imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:amqp results:django-db


CELERY_BROKER_URL: u'amqp://appname:********@localhost:5672/appname'
CELERY_TASK_QUEUES:
    (<unbound Queue bcast.e39987e0-37c3-4eb7-b63a-dace642ba7da -> <unbound Exchange notification_broadcast(fanout)> -> >,)
CELERY_BEAT_SCHEDULER: 'django_celery_beat.schedulers:DatabaseScheduler'
CELERY_IMPORTS:
    ('appname', 'otherappname')
CELERY_RESULT_BACKEND: 'django-db'
CELERY_RESULT_EXPIRES: 3600

These seem relevant too:

celery==4.2.0
django-celery-beat==1.1.1
django-celery-results==1.0.1
kombu==4.2.1
amqp==2.3.2
