perf(clp-package): Reduce batch scheduling jitter in the query scheduler. by gibber9809 · Pull Request #1899 · y-scope/clp

gibber9809 · 2026-01-22T20:09:03Z

Description

The query scheduler has extreme jitter in the main task dispatch loop, making task dispatch frequency a direct function of this jitter as described in #1897.

This PR significantly reduces the jitter, bringing the pacing of the main scheduling loop much closer to the configured polling interval, and likewise pulling the maximum task dispatch rate closer to the polling interval. The following chart shows the jitter and breakdown of time spent over a small time slice of a search query as batches of search tasks are completed.

When compared to the chart in #1897, you can see that the main source of the jitter (long waits for retrieving celery results) has been significantly reduced, and pacing for the main scheduling loop is much closer to the configured rate (configured to once every 0.1 seconds in this case).

Note that while the pacing still isn't perfect, the part of this diff that attempts to explicitly pace the scheduling loop does help (without this part of the diff excursions from the configured rate are even larger). This means that there are likely other sources of jitter in the query scheduler that we could attempt to eliminate.

Also note that reducing jitter in this PR results directly in improved query performance in the package (since batches of tasks can now be dispatched more frequently). For some queries consisting of very short running tasks I measured a decrease in query time from ~1minute to ~18seconds in my single-machine docker-compose setup.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Ran searches with extra manually inserted instrumentation in the package, and directly measured decrease in jitter as a result of this change.

Summary by CodeRabbit

Bug Fixes

Enhanced job orchestration stability by refining task result retrieval and polling intervals, resulting in more consistent and reliable job status updates regardless of task execution time variations.

Performance

Optimized polling mechanism to maintain uniform check intervals, reducing delays and improving responsiveness in job scheduling operations.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2026-01-22T20:09:29Z

Walkthrough

The change optimizes polling behaviour in the query scheduler by introducing a timeout parameter to asynchronous task result retrieval and adjusting the polling loop to account for elapsed execution time, maintaining consistent polling intervals.

Changes

Cohort / File(s)	Summary
Query Scheduler Polling Logic `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`	Added `interval=0.005` parameter to `async_task_result.get()` call for non-blocking polling. Modified polling loop to track elapsed time and adjust sleep duration (`jobs_poll_delay - elapsed_time`) to maintain consistent overall poll interval.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related issues

#1897 — Directly addresses task-result polling interval optimization and polling loop sleep adjustment in the query scheduler, aligning with the same polling performance objectives.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'perf(clp-package): Reduce batch scheduling jitter in the query scheduler.' accurately reflects the main change—reducing scheduling jitter by adjusting polling intervals and adding timeout parameters to task result retrieval.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 1088-1096: The loop in handle_job_updates can compute a negative
sleep when handle_cancelling_search_jobs and check_job_status_and_update_db
exceed jobs_poll_delay; change the sleep calculation so the duration is clamped
to a non-negative value (e.g., compute remaining = jobs_poll_delay -
(interval_end_time - interval_start_time).total_seconds() and call
asyncio.sleep(max(0, remaining))). Update the sleep call in handle_job_updates
to use this guarded remaining value while keeping
interval_start_time/interval_end_time and jobs_poll_delay as the sources of
timing.

coderabbitai · 2026-01-22T20:12:51Z

 async def handle_job_updates(db_conn_pool, results_cache_uri: str, jobs_poll_delay: float):
    while True:
+        interval_start_time = datetime.datetime.now()
        await handle_cancelling_search_jobs(db_conn_pool)
        await check_job_status_and_update_db(db_conn_pool, results_cache_uri)
-        await asyncio.sleep(jobs_poll_delay)
+        interval_end_time = datetime.datetime.now()
+        await asyncio.sleep(
+            jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds()
+        )


🧹 Nitpick | 🔵 Trivial

Guard against negative sleep duration when work exceeds the polling interval.

If handle_cancelling_search_jobs and check_job_status_and_update_db take longer than jobs_poll_delay, the computed sleep duration becomes negative. While asyncio.sleep() treats negative values as 0, making this explicit improves clarity and intent.

💡 Suggested improvement

async def handle_job_updates(db_conn_pool, results_cache_uri: str, jobs_poll_delay: float): while True: interval_start_time = datetime.datetime.now() await handle_cancelling_search_jobs(db_conn_pool) await check_job_status_and_update_db(db_conn_pool, results_cache_uri) interval_end_time = datetime.datetime.now() + remaining_time = jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds() await asyncio.sleep( - jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds() + max(0, remaining_time) )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

async def handle_job_updates(db_conn_pool, results_cache_uri: str, jobs_poll_delay: float):

while True:

interval_start_time = datetime.datetime.now()

await handle_cancelling_search_jobs(db_conn_pool)

await check_job_status_and_update_db(db_conn_pool, results_cache_uri)

await asyncio.sleep(jobs_poll_delay)

interval_end_time = datetime.datetime.now()

await asyncio.sleep(

jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds()

)

async def handle_job_updates(db_conn_pool, results_cache_uri: str, jobs_poll_delay: float):

while True:

interval_start_time = datetime.datetime.now()

await handle_cancelling_search_jobs(db_conn_pool)

await check_job_status_and_update_db(db_conn_pool, results_cache_uri)

interval_end_time = datetime.datetime.now()

remaining_time = jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds()

await asyncio.sleep(

max(0, remaining_time)

)

🧰 Tools

🪛 Ruff (0.14.13)

1090-1090: datetime.datetime.now() called without a tz argument

(DTZ005)

1093-1093: datetime.datetime.now() called without a tz argument

(DTZ005)

🤖 Prompt for AI Agents

In `@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` around lines 1088 - 1096, The loop in handle_job_updates can compute a negative sleep when handle_cancelling_search_jobs and check_job_status_and_update_db exceed jobs_poll_delay; change the sleep calculation so the duration is clamped to a non-negative value (e.g., compute remaining = jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds() and call asyncio.sleep(max(0, remaining))). Update the sleep call in handle_job_updates to use this guarded remaining value while keeping interval_start_time/interval_end_time and jobs_poll_delay as the sources of timing.

The behaviour of asyncio.sleep for negative durations is well-documented, so handling this as a special case in our own code decreases clarity imo.

@gibber9809, that's a fair point. You're right that asyncio.sleep() with delay values of 0 or less is well-defined behavior, and since the function handles it consistently, keeping the code simpler without the explicit max() check is reasonable. Thanks for the clarification!

🧠 Learnings used

Learnt from: haiqi96 Repo: y-scope/clp PR: 569 File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392 Timestamp: 2024-11-15T20:07:22.256Z Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.

Learnt from: junhaoliao Repo: y-scope/clp PR: 1152 File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613 Timestamp: 2025-08-08T06:59:42.436Z Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR `#1152` discussion.

gibber9809 · 2026-01-22T21:50:53Z

Doing one more experiment to see if I can get the jitter even lower -- noticed on a second look that the task retiring is in the polling loop here, but the task dispatching is in our other polling loop. Going to see if there are any easy improvements to that polling loop that further reduce jitter.

gibber9809 · 2026-01-23T15:30:47Z

Doing one more experiment to see if I can get the jitter even lower -- noticed on a second look that the task retiring is in the polling loop here, but the task dispatching is in our other polling loop. Going to see if there are any easy improvements to that polling loop that further reduce jitter.

Yeah seems like there's potentially at least one or two more obvious improvements worth trying out. Going to temporarily put this PR in draft, then take it out once I've finished trying them.

gibber9809 · 2026-01-26T15:30:51Z

For reference, this is what the overheads look like when I move all of the code related to dispatching batches into the same flow so it's a bit easier to measure.

gibber9809 · 2026-01-26T17:38:35Z

Marking ready for review but removing "fixes #1897" since this PR doesn't completely remove the jitter, it just significantly reduces it.

sitaowang1998 · 2026-01-28T00:04:18Z

    if not async_task_result.ready():
        return None
-    return async_task_result.get()
+    return async_task_result.get(interval=0.005)


I am a little confused about the interval. If the result is already ready, why do we need an interval?

Because celery still polls for the result for some reason and never seems to actually retrieve it on the first request. This polling interval is 0.5 seconds by default, so what was happening previously was that this call would always wait for 0.5 seconds then retrieve the result -- you can see this in the breakdown shown in #1897.

As for why celery works this way, I'm not sure -- some sources online imply that task state (running, failed, successful) is tracked separately from task result payload, which could explain this behaviour.

LinZhihao-723

For the PR title, how about:

perf(clp-package): Reduce batch scheduling jitter in the query scheduler.

Otherwise, I think they all lgtm.

…ler. (y-scope#1899)

Reduce batch scheduling jitter in query_scheduler.

9532782

gibber9809 requested a review from a team as a code owner January 22, 2026 20:09

gibber9809 linked an issue Jan 22, 2026 that may be closed by this pull request

clp-package: Scheduling jitter is much larger than the scheduling interval, resulting in slow search. #1897

Open

gibber9809 changed the title ~~feat(clp-package) Reduce batch scheduling jitter in the query scheduler (fixes #1897).~~ feat(clp-package): Reduce batch scheduling jitter in the query scheduler (fixes #1897). Jan 22, 2026

coderabbitai Bot reviewed Jan 22, 2026

View reviewed changes

junhaoliao added this to the February 2026 milestone Jan 23, 2026

gibber9809 marked this pull request as draft January 23, 2026 15:31

Merge branch 'main' into reduce-scheduling-jitter

25ac8fa

gibber9809 removed a link to an issue Jan 26, 2026

clp-package: Scheduling jitter is much larger than the scheduling interval, resulting in slow search. #1897

Open

gibber9809 changed the title ~~feat(clp-package): Reduce batch scheduling jitter in the query scheduler (fixes #1897).~~ feat(clp-package): Reduce batch scheduling jitter in the query scheduler. Jan 26, 2026

gibber9809 marked this pull request as ready for review January 26, 2026 17:38

gibber9809 mentioned this pull request Jan 26, 2026

clp-package: Scheduling jitter is much larger than the scheduling interval, resulting in slow search. #1897

Open

junhaoliao requested a review from sitaowang1998 January 27, 2026 23:50

junhaoliao modified the milestones: February 2026, January 2026 Jan 27, 2026

junhaoliao assigned gibber9809 and sitaowang1998 Jan 27, 2026

sitaowang1998 reviewed Jan 28, 2026

View reviewed changes

gibber9809 requested a review from sitaowang1998 January 28, 2026 16:03

sitaowang1998 approved these changes Jan 28, 2026

View reviewed changes

gibber9809 added 3 commits January 28, 2026 11:58

Merge branch 'main' into reduce-scheduling-jitter

cb9ffc9

Merge branch 'main' into reduce-scheduling-jitter

34bd521

Merge branch 'main' into reduce-scheduling-jitter

bde9361

junhaoliao requested a review from LinZhihao-723 January 29, 2026 16:32

junhaoliao assigned LinZhihao-723 Jan 30, 2026

LinZhihao-723 approved these changes Jan 30, 2026

View reviewed changes

Merge branch 'main' into reduce-scheduling-jitter

f559bfa

gibber9809 changed the title ~~feat(clp-package): Reduce batch scheduling jitter in the query scheduler.~~ perf(clp-package): Reduce batch scheduling jitter in the query scheduler. Jan 30, 2026

LinZhihao-723 merged commit 6ac2904 into y-scope:main Jan 30, 2026
22 checks passed

junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026

perf(clp-package): Reduce batch scheduling jitter in the query schedu…

31f144c

…ler. (y-scope#1899)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(clp-package): Reduce batch scheduling jitter in the query scheduler.#1899

perf(clp-package): Reduce batch scheduling jitter in the query scheduler.#1899
LinZhihao-723 merged 6 commits into
y-scope:mainfrom
gibber9809:reduce-scheduling-jitter

gibber9809 commented Jan 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jan 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jan 22, 2026 •

edited

Loading

Uh oh!

gibber9809 Jan 22, 2026

Uh oh!

coderabbitai Bot Jan 22, 2026

Uh oh!

gibber9809 commented Jan 22, 2026

Uh oh!

gibber9809 commented Jan 23, 2026

Uh oh!

gibber9809 commented Jan 26, 2026

Uh oh!

gibber9809 commented Jan 26, 2026

Uh oh!

sitaowang1998 Jan 28, 2026

Uh oh!

gibber9809 Jan 28, 2026 •

edited

Loading

Uh oh!

LinZhihao-723 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

gibber9809 commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gibber9809 Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

gibber9809 commented Jan 22, 2026

Uh oh!

gibber9809 commented Jan 23, 2026

Uh oh!

gibber9809 commented Jan 26, 2026

Uh oh!

gibber9809 commented Jan 26, 2026

Uh oh!

sitaowang1998 Jan 28, 2026

Choose a reason for hiding this comment

Uh oh!

gibber9809 Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LinZhihao-723 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

gibber9809 commented Jan 22, 2026 •

edited

Loading

coderabbitai Bot commented Jan 22, 2026 •

edited

Loading

coderabbitai Bot Jan 22, 2026 •

edited

Loading

gibber9809 Jan 28, 2026 •

edited

Loading