Skip to content

perf(clp-package): Reduce batch scheduling jitter in the query scheduler.#1899

Merged
LinZhihao-723 merged 6 commits into
y-scope:mainfrom
gibber9809:reduce-scheduling-jitter
Jan 30, 2026
Merged

perf(clp-package): Reduce batch scheduling jitter in the query scheduler.#1899
LinZhihao-723 merged 6 commits into
y-scope:mainfrom
gibber9809:reduce-scheduling-jitter

Conversation

@gibber9809

@gibber9809 gibber9809 commented Jan 22, 2026

Copy link
Copy Markdown
Contributor

Description

The query scheduler has extreme jitter in the main task dispatch loop, making task dispatch frequency a direct function of this jitter as described in #1897.

This PR significantly reduces the jitter, bringing the pacing of the main scheduling loop much closer to the configured polling interval, and likewise pulling the maximum task dispatch rate closer to the polling interval. The following chart shows the jitter and breakdown of time spent over a small time slice of a search query as batches of search tasks are completed.

ReducedJitter

When compared to the chart in #1897, you can see that the main source of the jitter (long waits for retrieving celery results) has been significantly reduced, and pacing for the main scheduling loop is much closer to the configured rate (configured to once every 0.1 seconds in this case).

Note that while the pacing still isn't perfect, the part of this diff that attempts to explicitly pace the scheduling loop does help (without this part of the diff excursions from the configured rate are even larger). This means that there are likely other sources of jitter in the query scheduler that we could attempt to eliminate.

Also note that reducing jitter in this PR results directly in improved query performance in the package (since batches of tasks can now be dispatched more frequently). For some queries consisting of very short running tasks I measured a decrease in query time from ~1minute to ~18seconds in my single-machine docker-compose setup.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Ran searches with extra manually inserted instrumentation in the package, and directly measured decrease in jitter as a result of this change.

Summary by CodeRabbit

Bug Fixes

  • Enhanced job orchestration stability by refining task result retrieval and polling intervals, resulting in more consistent and reliable job status updates regardless of task execution time variations.

Performance

  • Optimized polling mechanism to maintain uniform check intervals, reducing delays and improving responsiveness in job scheduling operations.

✏️ Tip: You can customize this high-level summary in your review settings.

@gibber9809 gibber9809 requested a review from a team as a code owner January 22, 2026 20:09
@coderabbitai

coderabbitai Bot commented Jan 22, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

The change optimizes polling behaviour in the query scheduler by introducing a timeout parameter to asynchronous task result retrieval and adjusting the polling loop to account for elapsed execution time, maintaining consistent polling intervals.

Changes

Cohort / File(s) Summary
Query Scheduler Polling Logic
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
Added interval=0.005 parameter to async_task_result.get() call for non-blocking polling. Modified polling loop to track elapsed time and adjust sleep duration (jobs_poll_delay - elapsed_time) to maintain consistent overall poll interval.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related issues

  • #1897 — Directly addresses task-result polling interval optimization and polling loop sleep adjustment in the query scheduler, aligning with the same polling performance objectives.
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'perf(clp-package): Reduce batch scheduling jitter in the query scheduler.' accurately reflects the main change—reducing scheduling jitter by adjusting polling intervals and adding timeout parameters to task result retrieval.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gibber9809 gibber9809 changed the title feat(clp-package) Reduce batch scheduling jitter in the query scheduler (fixes #1897). feat(clp-package): Reduce batch scheduling jitter in the query scheduler (fixes #1897). Jan 22, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`:
- Around line 1088-1096: The loop in handle_job_updates can compute a negative
sleep when handle_cancelling_search_jobs and check_job_status_and_update_db
exceed jobs_poll_delay; change the sleep calculation so the duration is clamped
to a non-negative value (e.g., compute remaining = jobs_poll_delay -
(interval_end_time - interval_start_time).total_seconds() and call
asyncio.sleep(max(0, remaining))). Update the sleep call in handle_job_updates
to use this guarded remaining value while keeping
interval_start_time/interval_end_time and jobs_poll_delay as the sources of
timing.

Comment on lines 1088 to +1096
async def handle_job_updates(db_conn_pool, results_cache_uri: str, jobs_poll_delay: float):
while True:
interval_start_time = datetime.datetime.now()
await handle_cancelling_search_jobs(db_conn_pool)
await check_job_status_and_update_db(db_conn_pool, results_cache_uri)
await asyncio.sleep(jobs_poll_delay)
interval_end_time = datetime.datetime.now()
await asyncio.sleep(
jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds()
)

@coderabbitai coderabbitai Bot Jan 22, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Guard against negative sleep duration when work exceeds the polling interval.

If handle_cancelling_search_jobs and check_job_status_and_update_db take longer than jobs_poll_delay, the computed sleep duration becomes negative. While asyncio.sleep() treats negative values as 0, making this explicit improves clarity and intent.

💡 Suggested improvement
 async def handle_job_updates(db_conn_pool, results_cache_uri: str, jobs_poll_delay: float):
     while True:
         interval_start_time = datetime.datetime.now()
         await handle_cancelling_search_jobs(db_conn_pool)
         await check_job_status_and_update_db(db_conn_pool, results_cache_uri)
         interval_end_time = datetime.datetime.now()
+        remaining_time = jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds()
         await asyncio.sleep(
-            jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds()
+            max(0, remaining_time)
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def handle_job_updates(db_conn_pool, results_cache_uri: str, jobs_poll_delay: float):
while True:
interval_start_time = datetime.datetime.now()
await handle_cancelling_search_jobs(db_conn_pool)
await check_job_status_and_update_db(db_conn_pool, results_cache_uri)
await asyncio.sleep(jobs_poll_delay)
interval_end_time = datetime.datetime.now()
await asyncio.sleep(
jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds()
)
async def handle_job_updates(db_conn_pool, results_cache_uri: str, jobs_poll_delay: float):
while True:
interval_start_time = datetime.datetime.now()
await handle_cancelling_search_jobs(db_conn_pool)
await check_job_status_and_update_db(db_conn_pool, results_cache_uri)
interval_end_time = datetime.datetime.now()
remaining_time = jobs_poll_delay - (interval_end_time - interval_start_time).total_seconds()
await asyncio.sleep(
max(0, remaining_time)
)
🧰 Tools
🪛 Ruff (0.14.13)

1090-1090: datetime.datetime.now() called without a tz argument

(DTZ005)


1093-1093: datetime.datetime.now() called without a tz argument

(DTZ005)

🤖 Prompt for AI Agents
In
`@components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`
around lines 1088 - 1096, The loop in handle_job_updates can compute a negative
sleep when handle_cancelling_search_jobs and check_job_status_and_update_db
exceed jobs_poll_delay; change the sleep calculation so the duration is clamped
to a non-negative value (e.g., compute remaining = jobs_poll_delay -
(interval_end_time - interval_start_time).total_seconds() and call
asyncio.sleep(max(0, remaining))). Update the sleep call in handle_job_updates
to use this guarded remaining value while keeping
interval_start_time/interval_end_time and jobs_poll_delay as the sources of
timing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviour of asyncio.sleep for negative durations is well-documented, so handling this as a special case in our own code decreases clarity imo.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gibber9809, that's a fair point. You're right that asyncio.sleep() with delay values of 0 or less is well-defined behavior, and since the function handles it consistently, keeping the code simpler without the explicit max() check is reasonable. Thanks for the clarification!


🧠 Learnings used
Learnt from: haiqi96
Repo: y-scope/clp PR: 569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:380-392
Timestamp: 2024-11-15T20:07:22.256Z
Learning: The current implementation assumes single-threaded execution, so race conditions in functions like `is_target_extracted` in `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py` are not currently a concern.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1152
File: components/clp-package-utils/clp_package_utils/scripts/start_clp.py:613-613
Timestamp: 2025-08-08T06:59:42.436Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/start_clp.py, generic_start_scheduler sets CLP_LOGGING_LEVEL using clp_config.query_scheduler.logging_level for both schedulers; compression scheduler should use its own logging level. Tracking via an issue created from PR `#1152` discussion.

@gibber9809

Copy link
Copy Markdown
Contributor Author

Doing one more experiment to see if I can get the jitter even lower -- noticed on a second look that the task retiring is in the polling loop here, but the task dispatching is in our other polling loop. Going to see if there are any easy improvements to that polling loop that further reduce jitter.

@junhaoliao junhaoliao added this to the February 2026 milestone Jan 23, 2026
@gibber9809

Copy link
Copy Markdown
Contributor Author

Doing one more experiment to see if I can get the jitter even lower -- noticed on a second look that the task retiring is in the polling loop here, but the task dispatching is in our other polling loop. Going to see if there are any easy improvements to that polling loop that further reduce jitter.

Yeah seems like there's potentially at least one or two more obvious improvements worth trying out. Going to temporarily put this PR in draft, then take it out once I've finished trying them.

@gibber9809 gibber9809 marked this pull request as draft January 23, 2026 15:31
@gibber9809

Copy link
Copy Markdown
Contributor Author
ReducedJitterMoreDetail For reference, this is what the overheads look like when I move all of the code related to dispatching batches into the same flow so it's a bit easier to measure.

@gibber9809 gibber9809 changed the title feat(clp-package): Reduce batch scheduling jitter in the query scheduler (fixes #1897). feat(clp-package): Reduce batch scheduling jitter in the query scheduler. Jan 26, 2026
@gibber9809

Copy link
Copy Markdown
Contributor Author

Marking ready for review but removing "fixes #1897" since this PR doesn't completely remove the jitter, it just significantly reduces it.

if not async_task_result.ready():
return None
return async_task_result.get()
return async_task_result.get(interval=0.005)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little confused about the interval. If the result is already ready, why do we need an interval?

@gibber9809 gibber9809 Jan 28, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because celery still polls for the result for some reason and never seems to actually retrieve it on the first request. This polling interval is 0.5 seconds by default, so what was happening previously was that this call would always wait for 0.5 seconds then retrieve the result -- you can see this in the breakdown shown in #1897.

As for why celery works this way, I'm not sure -- some sources online imply that task state (running, failed, successful) is tracked separately from task result payload, which could explain this behaviour.

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

perf(clp-package): Reduce batch scheduling jitter in the query scheduler.

Otherwise, I think they all lgtm.

@gibber9809 gibber9809 changed the title feat(clp-package): Reduce batch scheduling jitter in the query scheduler. perf(clp-package): Reduce batch scheduling jitter in the query scheduler. Jan 30, 2026
@LinZhihao-723 LinZhihao-723 merged commit 6ac2904 into y-scope:main Jan 30, 2026
22 checks passed
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants