-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Performance issue with multiple levels of nesting of if statements #72721
Description
Hi,
We are having a performance issue with queries which contain complex columns (with if statements or any long expressions) that reference each other. Columns can be constructed from a number of if statements, and then we have dependent columns which re-use these columns. I believe the issue could be due to the way that alias names are constructed - it seems that the alias names are growing exponentially as we chain together more dependencies. If you take the example table / query below (sample generated by ChatGPT), and run in https://fiddle.clickhouse.com/, (current version 24.11.1.2557) the query takes 9 seconds even though the table contains only one row, and it is necessary to increase max_expanded_ast_elements to allow it to run. This is a heavily simplified example - we are looking at options such as materialising some columns but it is not always possible.
CREATE TABLE sample_table (
id UInt64,
value UInt64
)
ENGINE = MergeTree()
ORDER BY id;
-- Insert sample data
INSERT INTO sample_table VALUES (1, 10);
--explain plan actions = 1
WITH
-- Deeply nested IF statements for col1
if(value = 10, 'A',
if(value = 20, 'B',
if(value = 30, 'C',
if(value = 40, 'D',
if(value = 50, 'E', 'F')
)
)
)
) AS col1,
-- Another deeply nested IF statements relying on col1
if(col1 = 'A', 'Alpha',
if(col1 = 'B', 'Beta',
if(col1 = 'C', 'Gamma',
if(col1 = 'D', 'Delta',
if(col1 = 'E', 'Epsilon', 'Other')
)
)
)
) AS col2,
-- Deeply nested IF statements relying on col2
if(col2 = 'Alpha', 1,
if(col2 = 'Beta', 2,
if(col2 = 'Gamma', 3,
if(col2 = 'Delta', 4,
if(col2 = 'Epsilon', 5, 0)
)
)
)
) AS col3,
-- Deeply nested IF statements relying on col3
if(col3 = 1, 'One',
if(col3 = 2, 'Two',
if(col3 = 3, 'Three',
if(col3 = 4, 'Four',
if(col3 = 5, 'Five', 'Zero')
)
)
)
) AS col4,
-- Deeply nested IF statements relying on col4
if(col4 = 'One', 'Uno',
if(col4 = 'Two', 'Dos',
if(col4 = 'Three', 'Tres',
if(col4 = 'Four', 'Cuatro',
if(col4 = 'Five', 'Cinco', 'Cero')
)
)
)
) AS col5,
-- Deeply nested IF statements relying on col5
if(col5 = 'Uno', 'I',
if(col5 = 'Dos', 'II',
if(col5 = 'Tres', 'III',
if(col5 = 'Cuatro', 'IV',
if(col5 = 'Cinco', 'V', 'Other')
)
)
)
) AS col6,
-- Deeply nested IF statements relying on col6
if(col6 = 'I', 'Primero',
if(col6 = 'II', 'Segundo',
if(col6 = 'III', 'Tercero',
if(col6 = 'IV', 'Cuarto',
if(col6 = 'V', 'Quinto', 'Otro')
)
)
)
) AS col7,
-- Deeply nested IF statements relying on col7
if(col7 = 'Primero', 'First',
if(col7 = 'Segundo', 'Second',
if(col7 = 'Tercero', 'Third',
if(col7 = 'Cuarto', 'Fourth',
if(col7 = 'Quinto', 'Fifth', 'Other')
)
)
)
) AS col8
SELECT
id,
col1,
col2,
col3,
col4,
col5,
col6,
col7,
col8
FROM sample_table
settings max_expanded_ast_elements = 5000000
If we run the query, increasing the number of columns each time, we can see performance getting much worse with each new column.
5 cols - 194ms
6 cols - 463ms
7 cols - 1.854s
8 cols - 9.181s
If we run the query with explain plan actions = 1, the aliases at the bottom are absolutely huge - my guess is that this is the cause of the poor performance. My question is, is there a setting which can be applied to prevent this behaviour, or a way to re-write the query to avoid this excessive alias naming? ChatGPT has suggested one approach which does actually work, which involves chaining CTEs together. If there is no better solution we will consider this approach, but I am wondering if there is a better way to do it. The query below runs in 143ms with all 8 columns, but it is obviously a bit cumbersome to write it like this. Thanks for any help you can provide!
CREATE TABLE sample_table (
id UInt64,
value UInt64
)
ENGINE = MergeTree()
ORDER BY id;
-- Insert sample data
INSERT INTO sample_table VALUES (1, 10);
WITH
-- Step 1: Compute col1
col1_cte AS (
SELECT
id,
if(value = 10, 'A',
if(value = 20, 'B',
if(value = 30, 'C',
if(value = 40, 'D',
if(value = 50, 'E', 'F')
)
)
)
) AS col1
FROM sample_table
),
-- Step 2: Compute col2 based on col1
col2_cte AS (
SELECT
id,
col1,
if(col1 = 'A', 'Alpha',
if(col1 = 'B', 'Beta',
if(col1 = 'C', 'Gamma',
if(col1 = 'D', 'Delta',
if(col1 = 'E', 'Epsilon', 'Other')
)
)
)
) AS col2
FROM col1_cte
),
-- Step 3: Compute col3 based on col2
col3_cte AS (
SELECT
id,
col1,
col2,
if(col2 = 'Alpha', 1,
if(col2 = 'Beta', 2,
if(col2 = 'Gamma', 3,
if(col2 = 'Delta', 4,
if(col2 = 'Epsilon', 5, 0)
)
)
)
) AS col3
FROM col2_cte
),
-- Step 4: Compute col4 based on col3
col4_cte AS (
SELECT
id,
col1,
col2,
col3,
if(col3 = 1, 'One',
if(col3 = 2, 'Two',
if(col3 = 3, 'Three',
if(col3 = 4, 'Four',
if(col3 = 5, 'Five', 'Zero')
)
)
)
) AS col4
FROM col3_cte
),
-- Step 5: Compute col5 based on col4
col5_cte AS (
SELECT
id,
col1,
col2,
col3,
col4,
if(col4 = 'One', 'Uno',
if(col4 = 'Two', 'Dos',
if(col4 = 'Three', 'Tres',
if(col4 = 'Four', 'Cuatro',
if(col4 = 'Five', 'Cinco', 'Cero')
)
)
)
) AS col5
FROM col4_cte
),
-- Step 6: Compute col6 based on col5
col6_cte AS (
SELECT
id,
col1,
col2,
col3,
col4,
col5,
if(col5 = 'Uno', 'I',
if(col5 = 'Dos', 'II',
if(col5 = 'Tres', 'III',
if(col5 = 'Cuatro', 'IV',
if(col5 = 'Cinco', 'V', 'Other')
)
)
)
) AS col6
FROM col5_cte
),
-- Step 7: Compute col7 based on col6
col7_cte AS (
SELECT
id,
col1,
col2,
col3,
col4,
col5,
col6,
if(col6 = 'I', 'Primero',
if(col6 = 'II', 'Segundo',
if(col6 = 'III', 'Tercero',
if(col6 = 'IV', 'Cuarto',
if(col6 = 'V', 'Quinto', 'Otro')
)
)
)
) AS col7
FROM col6_cte
),
-- Step 8: Compute col8 based on col7
col8_cte AS (
SELECT
id,
col1,
col2,
col3,
col4,
col5,
col6,
col7,
if(col7 = 'Primero', 'First',
if(col7 = 'Segundo', 'Second',
if(col7 = 'Tercero', 'Third',
if(col7 = 'Cuarto', 'Fourth',
if(col7 = 'Quinto', 'Fifth', 'Other')
)
)
)
) AS col8
FROM col7_cte
)
-- Final Query to Get All Columns
SELECT
id,
col1,
col2,
col3,
col4,
col5,
col6,
col7,
col8
FROM col8_cte