Reclaimable memory management and dynamic spilling thresholds #41887
Description
Thresholds for spilling data to disk should be flexible and based on current memory consumption, to avoid exhausting memory when multiple queries each require a lot of memory but are capable of spilling data to disk. Unfortunately, it is not enough to just set a smart dynamic threshold. If the threshold is dynamic, it can be lowered (e.g. due to newly arrived queries) below the amount of memory already allocated, so we end up in a situation where the memory footprint of an already running query has to be lowered dynamically. Technically, it is possible to implement a memory reclaiming strategy for several kinds of queries. It would avoid situations in which a few queries monopolize all RAM available to the ClickHouse server, and would distribute the "memory resource" fairly between concurrent queries, no matter which query came first. To make this possible, a subsystem similar to the overcommit tracker should be implemented. Unlike the overcommit tracker, it will not kill a query, but will ask it to free memory so that its footprint becomes less than or equal to its Limit Before Spill. Let me describe this new subsystem in more detail.
Memory model
Every tracked memory region belongs to exactly one of the following classes:
- Unreclaimable Memory. In RAM, cannot be moved to disk or simply discarded on request.
- Reclaimable Memory. In RAM, can be either moved to disk or discarded.
- External Memory. Not in RAM, usually in a temporary file on local disk.
Unreclaimable and reclaimable memory constitute Main Memory. For simplicity we ignore the possibility of multiple external storages and concentrate on just a single local disk.
```mermaid
graph TD
  Memory --> MM[Main Memory] --> RM[Reclaimable Memory]
  MM --> URM[Unreclaimable Memory]
  Memory --> EM[External Memory]
```
Memory is allocated as unreclaimable and can later be marked as reclaimable. The reclaiming process is asynchronous: a query cannot get rid of reclaimable memory immediately. It may also require both additional unreclaimable memory and external memory to actually perform spilling to disk. Note that for caches, reclaiming can simply mean deallocation without spilling to disk (not sure if it is a good idea to reclaim memory from caches, though).
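To make the asynchronous protocol concrete, here is a hypothetical sketch (all names are illustrative, not actual ClickHouse APIs) of how a component might mark memory as reclaimable and later honor a reclaim demand during one of its periodic checks:

```javascript
// Hypothetical sketch: a component registers a region as reclaimable and
// periodically checks whether the tracker demands that memory be freed.
class ReclaimableRegion {
  constructor(tracker, bytes) {
    this.tracker = tracker;
    this.bytes = bytes;
    // Memory was allocated as unreclaimable; now mark it reclaimable.
    tracker.reclaimable += bytes;
  }

  // Called periodically/asynchronously by the owning component.
  maybeReclaim() {
    // Demand exists when the node's footprint exceeds its Limit Before Spill.
    const demand = this.tracker.used - this.tracker.limitBeforeSpill;
    if (demand <= 0) return 0;

    const toFree = Math.min(demand, this.bytes);
    // Spilling itself may need extra unreclaimable memory (buffers) and
    // external memory (a temporary file); that is omitted here.
    this.spillToDisk(toFree);
    this.bytes -= toFree;
    this.tracker.reclaimable -= toFree;
    this.tracker.used -= toFree;
    return toFree;
  }

  spillToDisk(bytes) { /* write `bytes` to a temporary file on disk */ }
}

const tracker = { used: 500, limitBeforeSpill: 300, reclaimable: 0 };
const region = new ReclaimableRegion(tracker, 250);
console.log(region.maybeReclaim()); // frees min(500 - 300, 250) = 200
console.log(tracker.used);          // 300 — footprint now equals the limit
```

The key point the sketch illustrates is that reclaiming is pull-based: the tracker never frees memory itself, it only makes the demand visible, and the component frees at most what it registered.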
Hierarchy
ClickHouse already tracks memory consumption in a hierarchical way with MemoryTracker and OvercommitTracker. A new ReclaimableTracker should be added to track the amount of memory that can be voluntarily freed by a query. Every "component" of a query can tell the query's ReclaimableTracker that it is able to free memory on demand. Such a "component" should periodically/asynchronously check whether there is an actual demand and act on it.
Reclaimable trackers are organized in the same hierarchy: global, users, queries, threads. On every tree node we hold two new numbers:
- Total amount of reclaimable memory. Summed over immediate children.
- Limit Before Spill. The main memory threshold for this node. In a mathematical sense it represents the max-min fair allocation of main memory based on current footprints (not all queries require their whole fair share of memory, so others can use more than their fair share). The value stored in a node also equals the sum over its immediate children.
```mermaid
graph TD
  Global --> User1 --> Query1 --> Thread1
  Global --> ...UserN
  User1 --> ...QueryM
  Query1 --> ...ThreadK
```
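A hypothetical sketch of a tracker node (illustrative names, not the actual ClickHouse classes) shows the two numbers and how an amount reported at a leaf propagates up the hierarchy so that each parent holds the sum over its immediate children:

```javascript
// Hypothetical sketch of a reclaimable-tracker node in the hierarchy
// global -> users -> queries -> threads.
class ReclaimableTrackerNode {
  constructor(parent = null) {
    this.parent = parent;
    this.children = [];
    this.reclaimable = 0;      // total reclaimable memory in this subtree
    this.limitBeforeSpill = 0; // max-min fair share of main memory
    if (parent) parent.children.push(this);
  }

  // A component reports that `bytes` of its memory became reclaimable
  // (negative `bytes` when the memory was reclaimed or freed).
  // Both stored numbers are sums over immediate children, so the change
  // must be propagated all the way to the root.
  addReclaimable(bytes) {
    for (let node = this; node !== null; node = node.parent)
      node.reclaimable += bytes;
  }
}

const root = new ReclaimableTrackerNode();       // global
const user = new ReclaimableTrackerNode(root);   // per-user
const query = new ReclaimableTrackerNode(user);  // per-query
const thread = new ReclaimableTrackerNode(query);// per-thread leaf

thread.addReclaimable(128);
console.log(root.reclaimable); // 128 — the sum reaches the root
```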
Consider memory allocations of a single thread (a hierarchy leaf). As usual, an allocation goes through CurrentMemoryTracker and is batched in a thread-local variable up to max_untracked_memory (4MB) to avoid too frequent, expensive interaction with the memory tracking system. When a thread has allocated (or freed) more than 4MB, this fact is passed to MemoryTracker. This is where all memory limit checks are executed for this node, including overcommit checks. A new check for spilling should also be added here. After the checks, control is passed to the parent memory tracker recursively.
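The batching scheme can be sketched as follows (a simplified model with illustrative names, not the real CurrentMemoryTracker code):

```javascript
// Hypothetical sketch of batched allocation reporting: allocations
// accumulate in a thread-local counter and are flushed to the tracker
// hierarchy only when the untracked amount exceeds max_untracked_memory.
const MAX_UNTRACKED_MEMORY = 4 * 1024 * 1024; // 4MB, the default mentioned above

const threadLocal = { untracked: 0 };
const trackerReports = []; // records each flush to the MemoryTracker chain

function reportToTrackers(bytes) {
  // In the real system this is where limit checks run on every node up
  // the hierarchy: hard limits, overcommit, and the proposed spill check.
  trackerReports.push(bytes);
}

function onAlloc(bytes) { // negative `bytes` for a deallocation
  threadLocal.untracked += bytes;
  if (Math.abs(threadLocal.untracked) > MAX_UNTRACKED_MEMORY) {
    reportToTrackers(threadLocal.untracked);
    threadLocal.untracked = 0;
  }
}

onAlloc(1024 * 1024);     // 1MB: stays in the thread-local batch
onAlloc(4 * 1024 * 1024); // total now exceeds 4MB: flushed to trackers
console.log(trackerReports.length); // 1 — a single batched report of 5MB
```

The design trade-off is that any per-node check, including the new spill check, observes footprints with up to max_untracked_memory of slack per thread; that is the price of keeping the hot allocation path cheap.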
Limit Before Spill
The new check compares the memory footprint of a node to its Limit Before Spill. If the limit is exceeded, either our node is over its fair share or a sibling node is over its share. To distinguish these cases we report the allocation to the parent. First, the parent updates its own limit (if the parent's limit is also exceeded) and then recalculates the spilling limits of all its children. A "progressive filling algorithm" is used to calculate the limits:
```js
let total_spill_limit = 700;
let nodes = [
    { name: "A", guaranteed: 100, used: 300 }, // fair allocation is 200
    { name: "B", guaranteed: 200, used: 400 }, // fair allocation is 400
    { name: "C", guaranteed: 100, used: 50 },  // fair allocation is 50
    { name: "D", guaranteed: 100, used: 50 }   // fair allocation is 50
];

let spill_limit_left = total_spill_limit;
let guaranteed_left = 0;
for (const node of nodes)
    guaranteed_left += node.guaranteed;

// Progressive filling algorithm provides max-min fair allocations:
//            _
//         __| |   width  = node.guaranteed (weight; aka soft limit in overcommit tracker)
//        |██|█|   height = node.used / node.guaranteed; so area = node.used
//  |▄|▄|██|█|     filled = node.limit
//   C D BB A
nodes.sort((a, b) => (a.used / a.guaranteed) - (b.used / b.guaranteed)); // sort by `OvercommitRatio`
for (let node of nodes) {
    let fair_share = spill_limit_left * node.guaranteed / guaranteed_left;
    node.limit = Math.min(fair_share, node.used);
    spill_limit_left -= node.limit;
    guaranteed_left -= node.guaranteed;
    console.log(`${node.name} limit before spill is ${node.limit}`);
}
```

There are important implications from this algorithm:
- Limit can be greater than the memory footprint of a node if and only if the overall ClickHouse server memory usage is lower than the global spilling limit.
- Limit equals the memory footprint of a node if everything is fine for the node: no spilling is actually needed, and even more memory can be allocated (the limit will also be increased if that is fair). Note that one cannot know in advance whether a new allocation would exceed the spilling limit, so it is natural to be optimistic and check the limit after the actual allocation.
- This algorithm does not distinguish whether main memory is reclaimable or not; it just computes fair memory shares. A node may or may not be able to lower its memory footprint by reclaiming memory (e.g. it may have no reclaimable memory at all). In the latter case, if the query continues to allocate more memory, it will eventually be killed by the overcommit tracker.
- The overcommit tracker must be used in conjunction with the reclaimable tracker, and `guaranteed` must equal `soft_limit`.
Limiting concurrency based on memory consumption
In some situations it can be advantageous to avoid even starting new queries and just let the current queries complete. This is especially true for data ingestion. It is proposed to address this problem by introducing a Memory Demand feature. Note that it is not part of this issue and should be implemented separately, but I'll shortly describe the main idea here, because it is related to the topic and important for understanding the big picture. Every query should estimate its Memory Demand: how much memory it will require during execution (a simple constant to begin with). Demands of concurrent queries are summed to a total, and the server will block arriving queries with some timeout if the sum exceeds the Guaranteed Memory Limit. This is similar to the max_concurrent_queries_for_all_users setting, but operates on memory demand instead of a simple query count. This new feature will help to avoid situations when too many queries run concurrently and are all slowed down by the memory pressure created by reclaimable trackers.
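The admission logic can be sketched as follows (a minimal model; the limit value and all names are illustrative assumptions, and the blocking/timeout behavior is reduced to a boolean return):

```javascript
// Hypothetical sketch of Memory Demand admission: a new query is admitted
// only if the sum of the demands of running queries stays within the limit.
const GUARANTEED_MEMORY_LIMIT = 100 * 2 ** 30; // assumed value, e.g. 100GiB

const running = new Map(); // queryId -> declared demand in bytes

function tryAdmit(queryId, demandBytes) {
  let total = 0;
  for (const d of running.values()) total += d;
  if (total + demandBytes > GUARANTEED_MEMORY_LIMIT)
    return false; // in the real feature the query would block with a timeout
  running.set(queryId, demandBytes);
  return true;
}

console.log(tryAdmit("q1", 60 * 2 ** 30)); // true:  60 <= 100
console.log(tryAdmit("q2", 30 * 2 ** 30)); // true:  60 + 30 <= 100
console.log(tryAdmit("q3", 20 * 2 ** 30)); // false: 60 + 30 + 20 > 100
running.delete("q1");                      // q1 finishes, releasing its demand
console.log(tryAdmit("q3", 20 * 2 ** 30)); // true:  30 + 20 <= 100
```

Unlike max_concurrent_queries_for_all_users, a single query with a huge declared demand here counts as heavily as many small ones, which is exactly the point of admitting by memory rather than by count.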