clp-package: Scheduling jitter is much larger than the scheduling interval, resulting in slow search. #1897

@gibber9809

Bug

Our query scheduler currently runs a scheduling loop on a fixed polling interval: on each iteration we gather finished batches of tasks and dispatch new batches. This design has some fundamental problems, but those will be addressed in a separate design doc covering a proper solution.

The immediate issue, then, is that this scheduling loop currently has so much jitter that the effective rate at which we can dispatch new batches of tasks for each job is governed by the jitter rather than by the configured polling interval.

The following chart breaks down how time is spent in the main part of the scheduling loop over a small, illustrative time slice of a longer search job as batches of tasks complete, and shows how task completion affects jitter. (Apologies for the ad-hoc chart.)

[Image: chart of time spent in each part of the scheduling loop]

As you can see from the chart, each time a batch of tasks is completed we spend a suspicious amount of time waiting for results from Celery.

As it turns out, this is because even after task.ready() is true, Celery still seems to go into a polling loop to retrieve results from Redis. Since the default polling interval is 0.5s, we seem to always experience this 0.5s delay when retrieving results. Reducing this polling interval directly reduces the time we spend in task.get(); experimentally, even when the interval is reduced significantly, we still seem to spend roughly one polling interval in get().
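To check this kind of behaviour, a small timing helper along the following lines may be useful. It is only a sketch: `timed_get` and `FakeResult` are hypothetical names, and `FakeResult` is a toy stand-in that mimics the observed symptom (roughly one polling interval spent in get() even when the result is ready). It is not a claim about how Celery is implemented internally. The helper assumes the result object accepts an `interval` keyword in `get()`, as Celery's `AsyncResult.get` does.

```python
import time


class FakeResult:
    """Hypothetical stand-in for a task result (NOT Celery's implementation)."""

    def ready(self):
        # The result is already available before get() is called.
        return True

    def get(self, interval=0.5):
        # Toy model of the symptom: wait one polling interval before
        # checking, even though the result is already ready.
        while True:
            time.sleep(interval)
            if self.ready():
                return "result"


def timed_get(result, interval):
    """Return (value, seconds spent inside result.get(interval=...))."""
    start = time.monotonic()
    value = result.get(interval=interval)
    return value, time.monotonic() - start


if __name__ == "__main__":
    for interval in (0.5, 0.05):
        _, elapsed = timed_get(FakeResult(), interval)
        print(f"interval={interval}s -> get() took {elapsed:.3f}s")
```

Pointing `timed_get` at a real task result instead of `FakeResult` should show whether the time in get() tracks the interval in the same way.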

Besides this issue with retrieving the results of celery tasks, we can also reduce jitter by changing how we sleep in our main scheduling loop.

Currently the loop looks something like

while True:
    # do stuff
    await asyncio.sleep(polling_interval)

but to reduce jitter we really want something like

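while True:
    start = time.monotonic()
    # do stuff
    time_spent_doing_stuff = time.monotonic() - start
    # clamp at zero in case the work overran the interval
    await asyncio.sleep(max(0.0, polling_interval - time_spent_doing_stuff))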
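A slightly fuller, runnable sketch of the same idea is shown here. It tracks an absolute deadline per iteration instead of subtracting a measured duration, so small timing errors do not accumulate across iterations. `run_fixed_rate`, `step`, and `fake_step` are hypothetical names for illustration and are not part of the scheduler; the bounded `iterations` argument exists only so the example terminates.

```python
import asyncio
import time


async def run_fixed_rate(step, polling_interval, iterations):
    """Run `step` about once per `polling_interval` seconds.

    Returns the start time of each iteration, relative to loop start.
    """
    loop_start = time.monotonic()
    next_deadline = loop_start
    starts = []
    for _ in range(iterations):
        starts.append(time.monotonic() - loop_start)
        await step()  # the "do stuff" part of the loop
        next_deadline += polling_interval
        # Sleep only for what is left of this interval; clamp at zero
        # if the step overran its deadline.
        await asyncio.sleep(max(0.0, next_deadline - time.monotonic()))
    return starts


async def fake_step():
    # Hypothetical workload standing in for gather/dispatch work.
    await asyncio.sleep(0.02)


if __name__ == "__main__":
    starts = asyncio.run(run_fixed_rate(fake_step, 0.05, 5))
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    print([round(g, 3) for g in gaps])
```

With a plain `asyncio.sleep(polling_interval)` after each step, the gap between iterations would be the interval plus the step duration; here it stays close to the interval as long as the step finishes within it. If a step overruns, later iterations run back to back until the loop catches up with its deadlines.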

CLP version

0.8.0

Environment

Package build started with docker-compose.

Reproduction steps

  1. Compress enough data to form at least a few archives
  2. Make sure the configured batch size is less than the total number of archives
  3. Dispatch any search across all archives (note that because of another issue, command line searches that don't invoke the reducer end up with all tasks in a single batch).

Labels

bug