
Conversation

@dfm88
Contributor

@dfm88 dfm88 commented Jun 19, 2025

Description

(Fixes #9773)

When using chords with a group containing failing tasks and a chain body, error propagation fails with:

ValueError: task_id must not be empty. Got None instead.

This happens because the chord body can be frozen with None as its task ID, causing error handling to break when trying to propagate failures from the group to the callback.

Root Cause

In celery/canvas.py, the _chord.run() method was freezing the body with a potentially None task_id:

# BEFORE (problematic):
bodyres = body.freeze(task_id, root_id=root_id)  # task_id can be None!

The issue:

  • task_id can be None when chord is called without explicit ID
  • When a group task fails, error handling looks for callback.id to propagate the error
  • Since the body was frozen with None, callback.id ends up being None
  • This causes the "task_id must not be empty" error during error storage
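A minimal pure-Python model of the failure (a sketch, not Celery's actual backend code; `store_result` is a hypothetical stand-in for the backend's validation, which raises the same error):

```python
def store_result(task_id, result):
    # Hypothetical stand-in for the result backend's validation;
    # the real backend raises the same "task_id must not be empty"
    # ValueError when asked to store a result under a None id.
    if not task_id:
        raise ValueError(f"task_id must not be empty. Got {task_id} instead.")
    return {task_id: result}

# Before the fix: body.freeze(None, ...) left callback.id as None,
# so propagating the group failure to the callback hit the check above.
callback_id = None
try:
    store_result(callback_id, "ChordError(...)")
except ValueError as exc:
    print(exc)  # task_id must not be empty. Got None instead.
```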

Solution

Ensure the chord body always gets a valid, unique task ID while maintaining proper group relationships:

# AFTER (correct):
body_task_id = task_id or uuid()
bodyres = body.freeze(body_task_id, group_id=group_id, root_id=root_id)

Why this is correct:

  • Fixes the immediate issue: Body always has a valid ID for error handling (never None)
  • Maintains uniqueness: Each task gets its own Redis storage key (no collisions)
  • Preserves relationships: Body is properly linked to the group via group_id parameter
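The id-selection part of the fix can be sketched in isolation (`freeze_body_id` is a hypothetical helper for illustration, not Celery code):

```python
import uuid

def freeze_body_id(task_id=None):
    # Mirrors the fix's fallback: keep an explicit id if one was
    # given, otherwise generate a fresh UUID so the id is never None.
    return task_id or str(uuid.uuid4())

# An explicit id is preserved; a missing one gets a unique fallback.
assert freeze_body_id("explicit-id") == "explicit-id"
assert freeze_body_id() is not None
assert freeze_body_id() != freeze_body_id()  # unique per call
```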

Testing

The fix includes tests covering:

  • Error propagation with different chord body types (single task, chain, group, chord)
  • Proper task ID generation when none is provided
  • Verification that body tasks get unique IDs while maintaining group membership
  • Integration tests ensuring no regression in existing functionality

@codecov

codecov bot commented Jun 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.52%. Comparing base (7054a0e) to head (f9ec6e4).
⚠️ Report is 50 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9774   +/-   ##
=======================================
  Coverage   78.51%   78.52%           
=======================================
  Files         153      153           
  Lines       19129    19130    +1     
  Branches     2533     2533           
=======================================
+ Hits        15020    15021    +1     
  Misses       3821     3821           
  Partials      288      288           
Flag Coverage Δ
unittests 78.49% <100.00%> (+<0.01%) ⬆️


@auvipy auvipy requested review from Nusnus and Copilot June 21, 2025 06:32


@auvipy
Member

auvipy commented Jun 21, 2025

thanks for the extensive tests!

@auvipy
Member

auvipy commented Jun 21, 2025

the integration tests seem to be very slow

@dfm88
Contributor Author

dfm88 commented Jun 21, 2025

the integration tests seem to be very slow

Yes, it seems to be related to the redis and rabbitmq_redis configurations, as those are the workflows still running. I wasn't able to run all the integration tests locally for every matrix configuration, but I'll try to take a closer look.

@auvipy
Member

auvipy commented Jun 21, 2025

Seems the Redis-related integration test builds failed mostly... maybe due to timeout / network issues?

@dfm88
Contributor Author

dfm88 commented Jun 22, 2025

Seems the Redis-related integration test builds failed mostly... maybe due to timeout / network issues?

So I investigated the problem; the test that timed out is the one below. It runs only on the Redis result backend, which is why only the workflows with Redis as the backend were failing:

def test_chord_soft_timeout_recuperation(self, manager):
    """Test that if soft timeout happens in task but is managed by task,
    chord still get results normally
    """
    if not manager.app.conf.result_backend.startswith('redis'):
        raise pytest.skip('Requires redis result backend.')
    c = chord([
        # return 3
        add.s(1, 2),
        # return 0 after managing soft timeout
        delayed_sum_with_soft_guard.s(
            [100], pause_time=2
        ).set(
            soft_time_limit=1
        ),
    ])
    result = c(delayed_sum.s(pause_time=0)).get()
    assert result == 3

And this is related to the fact that my initial approach used group_id as the task ID for the chord body:

bodyres = body.freeze(group_id, root_id=root_id)

It created Redis key collisions, because the group coordination state and the body task result both tried to use the same Redis key. This caused the integration tests to time out waiting for results.
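The collision can be modelled with a hypothetical dict-backed result store (the `celery-task-meta-<task_id>` key format mirrors the Redis backend's; everything else is illustrative, not Celery code):

```python
import uuid

store = {}  # hypothetical stand-in for the Redis result store
group_id = "gid-123"

# Group coordination state lives under the group's id.
store[f"celery-task-meta-{group_id}"] = {"type": "group-coordination"}

# Initial approach: body frozen with group_id -> same key, so the
# body result clobbers the coordination entry (or vice versa).
store[f"celery-task-meta-{group_id}"] = {"type": "body-result"}
assert len(store) == 1  # collision: one key shared by two writers

# Pushed fix: body gets its own unique id, both entries coexist.
store[f"celery-task-meta-{group_id}"] = {"type": "group-coordination"}
body_id = str(uuid.uuid4())
store[f"celery-task-meta-{body_id}"] = {"type": "body-result"}
assert len(store) == 2  # no collision
```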

The correct fix I just pushed is:

body_task_id = task_id or uuid()
bodyres = body.freeze(body_task_id, group_id=group_id, root_id=root_id)

This ensures both that the task_id is never None and that the body is correctly linked to the header, by passing group_id as an explicit keyword argument.

@auvipy
Member

auvipy commented Jun 23, 2025

nice finding!

@auvipy auvipy requested a review from Copilot June 23, 2025 05:06


@dfm88 dfm88 force-pushed the fix/9773-chord-fails-taskid-none branch from cde1ef0 to f9ec6e4 Compare June 23, 2025 10:30
@auvipy
Member

auvipy commented Jun 23, 2025

You can also update the PR description, as the original fix has changed.

@auvipy auvipy requested a review from Copilot June 23, 2025 15:29


@auvipy auvipy modified the milestones: 5.5.4, 5.6.0 Jul 13, 2025
ANJAN672 added a commit to ANJAN672/augur that referenced this pull request Jan 13, 2026
Fixes chaoss#3458

Updates Celery from ~=5.5 to >=5.6.0 to fix a bug where task_id becomes
None during exception handling while executing a task chain within a
chord.

This was reported in celery/celery#9774 and fixed in Celery 5.6.0.

Signed-off-by: ANJAN672 <rockyanjan672@gmail.com>


Development

Successfully merging this pull request may close these issues.

task_id must not be empty with chain as body of a chord
