
Workload scheduling: Memory reservations#82414

Draft
serxa wants to merge 129 commits into master from workload-memory-scheduling

Conversation


@serxa serxa commented Jun 23, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Introduce a memory reservation feature for workloads. More details: https://clickhouse.com/docs/operations/workload-scheduling

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Details

To enable memory reservations for workloads, create a MEMORY RESERVATION resource and set at least one limit for the total memory reserved using workload settings:

CREATE RESOURCE memory (MEMORY RESERVATION)
CREATE WORKLOAD all SETTINGS max_memory = '2Gi'

ClickHouse tracks memory allocations of all queries and background activities. The number of allocated bytes is aggregated through the scheduling hierarchy up to the root. Every query has an associated allocation in the leaf workload it belongs to. If a query has the reserve_memory setting greater than zero, the allocation is created in a pending state. A pending allocation reserves the requested amount of memory in the workload hierarchy. If there is not enough memory available, the allocation remains pending until enough memory is freed or other allocations are evicted (killed). When an allocation is admitted, it becomes running. A running allocation can increase or decrease its size dynamically according to the memory consumption of the query. The allocation life-cycle can be depicted with the following state diagram:

stateDiagram-v2
    [*] --> Pending: init [reserve_memory > 0]
    [*] --> Running: init [reserve_memory == 0]

    Pending --> Running: admit

    state Running {
        %% Region 1: increase flow
        NotIncreasing --> Increasing: request
        Increasing --> NotIncreasing: approve

        --

        %% Region 2: decrease flow
        NotDecreasing --> Decreasing: request
        Decreasing --> NotDecreasing: approve
    }


    Running --> Killed: evict
    Running --> Released: finish
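The life-cycle in the state diagram can be mirrored in a small sketch. This is a hypothetical Python model of the described transitions only; the class and method names are illustrative and not the actual ClickHouse API.

```python
# Hypothetical model of the allocation life-cycle described above.
class Allocation:
    def __init__(self, reserve_memory: int):
        # With reserve_memory > 0 the allocation starts Pending;
        # with zero it starts Running immediately.
        self.state = "Pending" if reserve_memory > 0 else "Running"
        self.size = reserve_memory

    def admit(self):
        # Scheduler admits a pending allocation once memory is available.
        assert self.state == "Pending"
        self.state = "Running"

    def resize(self, new_size: int):
        # Running allocations grow or shrink with actual query usage.
        assert self.state == "Running"
        self.size = new_size

    def evict(self):
        # Destructive eviction: the query is killed to free memory.
        assert self.state == "Running"
        self.state = "Killed"

    def finish(self):
        # Normal completion releases the reservation.
        assert self.state == "Running"
        self.state = "Released"
```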

Pending allocations of a leaf workload are admitted in FIFO order. When multiple workloads have pending allocations, they are admitted according to the precedence and weight settings. Higher precedence workloads are served first. Sibling workloads with the same precedence share memory according to weights in a max-min fair manner, which means that the workload with the lower normalized memory usage (current usage plus requested increase, divided by weight) is served first. The reverse logic is applied during eviction: when memory needs to be freed, workloads with lower precedence and higher normalized memory usage are evicted first.
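The admission order can be sketched in a few lines. This is an illustration of the rule described above, not the actual ClickHouse implementation; the workload names and sizes are examples.

```python
# Admission order sketch: lower `precedence` values are served first;
# among equals, the workload with the lowest normalized usage
# ((usage + requested increase) / weight) wins, which yields
# max-min fair sharing of memory between siblings.
def admission_order(workloads):
    return sorted(
        workloads,
        key=lambda w: (w["precedence"], (w["usage"] + w["request"]) / w["weight"]),
    )

GiB = 1 << 30
pending = [
    {"name": "production", "precedence": 1, "usage": 6 * GiB, "request": 1 * GiB, "weight": 3},
    {"name": "staging",    "precedence": 1, "usage": 3 * GiB, "request": 1 * GiB, "weight": 1},
    {"name": "testing",    "precedence": 2, "usage": 0,       "request": 1 * GiB, "weight": 1},
]
# production: (6+1)/3 ≈ 2.33 GiB per weight unit, staging: (3+1)/1 = 4 GiB,
# so production is served first; testing goes last due to lower precedence.
print([w["name"] for w in admission_order(pending)])  # ['production', 'staging', 'testing']
```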

Note that time-shared resources use priority, while space-shared resources use precedence. These are independent settings and can be set to different values. Higher priority implies non-destructive preemption (delay or throttling), while higher precedence may imply destructive eviction (the query stops with an error). A workload could have high priority for CPU scheduling but the same precedence for memory reservation, to avoid evicting other workloads and losing work they have already done.

Every workload with a max_memory limit ensures that the total memory allocated in its subtree does not exceed the limit. If a pending or increasing allocation would exceed the limit, an eviction procedure is initiated to free memory. The eviction procedure selects a victim to be killed. The least common ancestor workload of the killer and the victim prevents eviction in the following situations:

  • A pending allocation cannot evict running allocations in the same workload (the killer and victim workloads coincide).
  • A pending allocation of lower precedence never kills a workload of higher precedence.
  • A pending allocation cannot kill an allocation of the same precedence. Note that running allocations of the same precedence may evict each other based on normalized memory usage.

If eviction is prevented or does not free enough memory, the new allocation is blocked until enough memory is freed. These rules allow queueing of excessive queries based on memory pressure and provide a convenient way to avoid MEMORY_LIMIT_EXCEEDED errors.
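The eviction guards can be sketched as a predicate over two leaf workloads. This is a hypothetical illustration (function and variable names are not the real scheduler API): the killer may evict the victim only if, at the children of their least common ancestor, the killer's side has strictly higher precedence (a lower precedence value).

```python
# Eviction guard sketch for two leaf workloads (assumes both arguments
# are leaves of the hierarchy, so neither is an ancestor of the other).
def may_evict(killer, victim, parent, precedence):
    if killer == victim:
        return False  # pending allocations never evict within their own workload
    def path(w):  # root..leaf chain of workload names
        chain = []
        while w is not None:
            chain.append(w)
            w = parent[w]
        chain.reverse()
        return chain
    kp, vp = path(killer), path(victim)
    i = 0
    while kp[i] == vp[i]:  # leaves differ, so the paths diverge before either ends
        i += 1
    # kp[i] and vp[i] are the children of the least common ancestor.
    return precedence[kp[i]] < precedence[vp[i]]

# Hierarchy taken from the example configuration in this description.
parent = {"all": None, "system": "all", "user": "all",
          "production": "user", "staging": "user", "testing": "user"}
precedence = {"system": 0, "user": 0, "production": 1, "staging": 1, "testing": 2}
print(may_evict("production", "testing", parent, precedence))  # True
print(may_evict("production", "staging", parent, precedence))  # False: same precedence
print(may_evict("system", "production", parent, precedence))   # False: system vs user tie at `all`
```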

NOTE: Workload limits are independent of other ways to limit memory consumption, such as the max_memory_usage query setting. They can be used together to achieve better control over memory consumption. It is also possible to set independent memory limits based on users (not workloads), but this is less flexible and does not provide features like memory reservation and queueing of pending queries. See Memory overcommit.

The workload setting max_waiting_queries limits the number of pending allocations for the workload. When the limit is reached, the server returns a SERVER_OVERLOADED error.
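For instance, following the syntax of the other workload settings in this description (the limit value here is illustrative):

```sql
CREATE WORKLOAD production IN user SETTINGS max_waiting_queries = 100
```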

Memory reservation scheduling is not supported for merges and mutations yet.

Only queries with the reserve_memory setting greater than zero are subject to blocking while waiting for memory reservation. However, queries with zero reserve_memory are still accounted for in their workload's memory footprint, and they can be evicted if necessary to free memory for other pending or increasing allocations. Queries without proper workload markup are not subject to memory reservation scheduling and cannot be evicted by the scheduler.

To provide a non-elastic memory reservation for a query, set both the reserve_memory and max_memory_usage query settings to the same value. In this case, the query will reserve a fixed amount of memory and will not be able to increase its allocation dynamically.
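A hypothetical example (the table name and sizes are illustrative, and reserve_memory is assumed here to accept a byte count like max_memory_usage):

```sql
SELECT count()
FROM big_table
SETTINGS workload = 'production',
         reserve_memory = 4294967296,  -- 4 GiB
         max_memory_usage = 4294967296 -- same value: the reservation cannot grow
```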

Let's consider an example configuration:

CREATE RESOURCE memory (MEMORY RESERVATION)
CREATE WORKLOAD all SETTINGS max_memory = '10Gi'
CREATE WORKLOAD system IN all SETTINGS weight = 1
CREATE WORKLOAD user IN all SETTINGS weight = 9
CREATE WORKLOAD production IN user SETTINGS precedence = 1, weight = 3
CREATE WORKLOAD staging IN user SETTINGS precedence = 1, weight = 1
CREATE WORKLOAD testing IN user SETTINGS precedence = 2

In this example, the total memory reserved by all queries and background activities cannot exceed 10 GiB. The system workload has a guarantee of at least 1 GiB (10% of 10 GiB), while the user workload has a guarantee of at least 9 GiB (90% of 10 GiB). Inside the user workload, the production and staging workloads share memory according to their weights (3 to 1) at an equal precedence of 1. The testing workload has precedence 2, which is lower than that of production and staging, so it can only use memory that is not used by them.
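The guarantees above follow from weight-proportional sharing. A back-of-the-envelope check (the helper is illustrative, not part of ClickHouse):

```python
# Each sibling workload is guaranteed
# parent_limit * weight / sum_of_sibling_weights.
def guarantees(parent_limit, weights):
    total = sum(weights.values())
    return {name: parent_limit * w // total for name, w in weights.items()}

GiB = 1 << 30
g = guarantees(10 * GiB, {"system": 1, "user": 9})
print(g["system"] // GiB, g["user"] // GiB)  # 1 9
# Within `user`, production and staging split its share 3:1 at equal precedence.
```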

If memory pressure arises, testing workload allocations are evicted first. Then, if more memory needs to be freed, staging workload allocations are evicted before production workload allocations if they exceed their guarantees. Note that pending queries in production and staging can evict running allocations in the testing workload to free memory, but they cannot evict each other because they have the same precedence. Under memory pressure they will wait in queues instead, which allows the system to avoid MEMORY_LIMIT_EXCEEDED errors caused by too many concurrently executing queries.

Note that the system workload has precedence 0 (the default), which is higher than that of the production, staging and testing workloads, but they are not sibling workloads. Their least common ancestor is the workload all, both children of which (system and user) have equal precedence. So a pending system allocation cannot evict any of them, and vice versa. This ensures that system activities cannot easily be evicted.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 23, 2025

Workflow [PR], commit [3a2240a]

Summary:

job_name                 test_name                   status    info
Fast test                                            failure
                         02995_new_settings_history  FAIL      cidb
                         clickhouse-test             FAIL      cidb
Docs check                                           dropped
Build (amd_debug)                                    dropped
Build (amd_asan_ubsan)                               dropped
Build (amd_tsan)                                     dropped
Build (amd_msan)                                     dropped
Build (amd_binary)                                   dropped
Build (arm_asan_ubsan)                               dropped
Build (arm_binary)                                   dropped
Build (amd_release)                                  dropped

@serxa serxa marked this pull request as draft June 23, 2025 12:52
@serxa serxa changed the title [WIP → [WIP] Workload memory scheduling Jun 23, 2025
@serxa serxa changed the title [WIP] Workload memory scheduling → [WIP] Workload scheduling: memory reservations Jun 24, 2025
@serxa serxa changed the title [WIP] Workload scheduling: memory reservations → [WIP] Workload scheduling: Memory reservations Jun 24, 2025
@serxa serxa mentioned this pull request Aug 3, 2025
29 tasks

clickhouse-gh bot commented Aug 26, 2025

Dear @serxa, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.


clickhouse-gh bot commented Oct 28, 2025

Dear @serxa, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Nov 11, 2025
@azat azat self-assigned this Nov 16, 2025
serxa and others added 4 commits March 7, 2026 12:57
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@serxa serxa removed the submodule changed At least one submodule changed in this PR. label Mar 7, 2026
@azat azat left a comment


I did not look through all the scheduler code (maybe I will get back to it), but I've already spent quite some time (the first draft did not contain 7K LOC 😂) on this PR - and in general it looks good to me:

  • the interaction with PipelineExecutor is simple
  • the interface changes look clear, split into time/space shared
  • I have a few concerns about the new implementation for memory, mostly around locking (you will find them below), please take a look

Also I think:

  • we should enable workloads on CI to catch bugs
  • enable it for perf tests to measure the overhead (though maybe they will not be able to catch the difference)

serxa and others added 3 commits March 17, 2026 12:57
…ions

Previously `MemoryReservation::increaseApproved` and `decreaseApproved`
called `syncWithScheduler` which re-entered `AllocationQueue` during
hierarchy traversal. This caused lock order inversions and required
a `recursive_mutex` on `AllocationQueue`.

Key changes:
- Remove `syncWithScheduler` from `MemoryReservation` and `syncSize`
  from `TestAllocation`. Callbacks now just update state and notify via
  `cv.notify_all`.
- Add `IAllocationQueue::removeAllocation` to handle allocation removal
  on the scheduler thread (cancels pending increase, prepares decrease
  to zero). Both `MemoryReservation` and `TestAllocation` destructors
  use this instead of `decreaseAllocation`.
- Add serialization in `syncWithMemoryTracker`: block all threads while
  an increase is pending, ensuring at most one in flight at a time.
- Decouple decrease from removal: `decreaseAllocation` never removes an
  allocation (it may stay alive at zero). Only `removeAllocation` sets
  `removing_allocation=true`. Running sets at all hierarchy levels use
  `allocations` count instead of `allocated` amount.
- Change `AllocationQueue::mutex` from `recursive_mutex` to `std::mutex`.
- Remove `isInSchedulerOrStopped` guards from `increaseAllocation` and
  `decreaseAllocation` (no longer needed without re-entry).
- Fix `FairAllocation` and `PrecedenceAllocation` destructors to properly
  detach children (was `= default`, caused `!parent` assertion).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
serxa and others added 3 commits March 20, 2026 13:09
- Fix typo "reqeuested" -> "requested" in `ResourceAllocation.h`
- Fix brace style in `ITimeSharedNode.h`, `WorkloadSettings.cpp`,
  `gtest_workload_resource_manager.cpp` (Allman style)
- Return false for non-memory units in `WorkloadSettings::hasAllocationLimit`
- Catch `MEMORY_RESERVATION_KILLED` specifically in space-shared eviction test
- Replace `std::ostringstream` with `fmt::format` in `ISpaceSharedNode::Update::toString`
- Poll `system.processes` instead of `sleep` in integration test
- Fix `demand_increment` metric leak in `allocationFailed`: store the
  enqueued demand amount in a dedicated field so the exact value is
  subtracted on failure, instead of recomputing from potentially stale
  `actual_size - allocated_size`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change `ISchedulerNode::getTypeName` return type from `const String &`
  to `std::string_view`, eliminating static String objects in all 14
  overrides across time-shared and space-shared nodes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
return;
}
actual_size = 0;
}
Contributor


MemoryReservation::~MemoryReservation sets actual_size = 0 under mutex, unlocks, and only then calls queue.removeAllocation. In that unlocked window, another query thread can enter syncWithMemoryTracker, recompute new_actual_size from MemoryTracker, and write a non-zero actual_size again.

This creates a race in destruction/teardown and can enqueue a stale increase/decrease request while removal is in progress, which may corrupt reservation accounting.

Please guard syncWithMemoryTracker from running after destruction starts (for example by setting a being_destroyed flag under mutex in the destructor and returning early in syncWithMemoryTracker), or otherwise keep teardown and actual_size transitions serialized until removeAllocation is fully committed.

…ption

In `onCancelOrConnectionLoss` and `onException`, `releaseWorkloadResources`
was called before `resetPipeline`. This destroys the `MemoryReservation`
while pipeline threads still hold raw pointers to it (stored in
`PipelineExecutor::WorkloadResources`) and may call `syncWithMemoryTracker`
between processor executions, leading to use-after-free.

The normal finish path (`onFinish`) is not affected because the pipeline
has already completed execution by that point.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
allocation.allocationFailed(std::make_exception_ptr(
Exception(ErrorCodes::INVALID_SCHEDULER_NODE,
"Queue for pending allocation is about to be destructed")));

Contributor


AllocationQueue::purgeQueue fails only pending_allocations, then clears removing_allocations without notifying the allocation objects.

A MemoryReservation destructor that already called queue.removeAllocation waits on cv.wait(... removed || fail_reason ...), and when its node is dropped from removing_allocations here, neither removed nor fail_reason is guaranteed to be set. This can hang teardown indefinitely under queue purge/detach races.

Please fail (or complete) all entries in removing_allocations before clearing the container, similarly to pending_allocations, so waiting destructors are always released.

serxa and others added 2 commits March 20, 2026 19:02
Exercises the `onCancelOrConnectionLoss` / `onException` code path by
starting queries that allocate memory (triggering `syncWithMemoryTracker`
in pipeline threads) and killing them mid-execution. Repeated 10 times
to increase the chance of catching teardown ordering issues under TSan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>