Skip to content

Add arrayAutocorrelation function#94776

Merged
nihalzp merged 29 commits intoClickHouse:masterfrom
wenyuchen96:feat/array-autocorrelation
Apr 2, 2026
Merged

Add arrayAutocorrelation function#94776
nihalzp merged 29 commits intoClickHouse:masterfrom
wenyuchen96:feat/array-autocorrelation

Conversation

@wenyuchen96
Copy link
Copy Markdown
Contributor

@wenyuchen96 wenyuchen96 commented Jan 22, 2026

Relates to #87209

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add function arrayAutocorrelation(arr [, max_lag]) that computes the normalized autocorrelation of a numeric array for each lag. Supports integer, float, and decimal array types.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)
  • Motivation: ClickHouse's aggregation functions already support linear/logistic regression, their stochastic versions, and some statistical tests. However, this is not enough for frequent scenarios of building predictive models, which is why we have to build ETL processes. To improve performance, it would be beneficial to move the implementation of at least one complete use case directly into the DBMS (like ELT). The Auto Correlation Function (ACF) is essential for time-series analysis within ClickHouse. It allows users to detect seasonality, periodic patterns, and trends in sequential data (like stock prices, server metrics, or sensor readings) without needing to export data to Python or R.

  • Parameters: arr: An array of numeric values (Array(Int*) or Array(Float*)). The function automatically casts integers to Float64 for high-precision calculation.

  • Example use: SELECT arrayAutocorrelation([10, 20, 30, 10, 20, 30]);

  • Result: [1, -0.25, -0.5, 0.5, 0, -0.25]

  • (Index 0 is Lag 0, Index 1 is Lag 1, etc.)

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Jan 22, 2026

CLA assistant check
All committers have signed the CLA.

@nihalzp nihalzp added the can be tested Allows running workflows for external contributors label Jan 22, 2026
@nihalzp nihalzp self-assigned this Jan 22, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jan 22, 2026

Workflow [PR], commit [65bc03c]

Summary:

job_name test_name status info comment
Stress test (arm_msan) failure
Server died FAIL cidb IGNORED
MemorySanitizer: use-of-uninitialized-value (STID: 2410-47e0) FAIL cidb IGNORED

AI Review

Summary

This PR adds arrayAutocorrelation with optional max_lag, extends numeric type coverage (including decimals via conversion), handles const-array execution paths, and adds broad stateless test coverage for type dispatch and edge cases. I did not find any new high-confidence correctness, safety, concurrency, performance, or compatibility issues on the current PR head.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Jan 22, 2026
@wenyuchen96
Copy link
Copy Markdown
Contributor Author

@nihalzp I've updated the code for the checks. The current failed checks (BuzzHouse and Stateless tests) seems unrelated flaky tests. Ready for a final look when you have a chance!

@nihalzp
Copy link
Copy Markdown
Member

nihalzp commented Jan 24, 2026

@wenyuchen96 Thanks! I will have a look next week.

Copy link
Copy Markdown
Member

@nihalzp nihalzp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on the feature! Most of it looks good!

I noticed that the time complexity of the function per-row is O(n^2) where n is the size of the array. We should mention that in the documentation. Additionally, the other issue is that if the array contains millions of elements, which is not very uncommon, the function will never finish executing.

There are some advanced O(nlogn) algorithms but they may be bit complicated which we can add later as an improvement. However, since normally users are only interested in the result of first few lags, we can offer them the signature arrayAutocorrelation(array, max_lag); then the time complexity will become O(n * max_lag). We can keep the max_lag optional and if it's not given then we compute result for all lags. In that way, if max_lag is small, our function will be still be useful even if array contains millions of elements.

@wenyuchen96
Copy link
Copy Markdown
Contributor Author

Thanks for the review @nihalzp! The suggestions make sense and I'll keep you posted for the update in the next few days.

@wenyuchen96
Copy link
Copy Markdown
Contributor Author

@nihalzp updates pushed! I've addressed the comments as well. Apologies it's been taking some time. All relevant CI checks are passing now. Ready for another look when you have a moment.

Copy link
Copy Markdown
Member

@nihalzp nihalzp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Previous issues are mostly resolved. I found some queries that do not work as expected:

If we mix const column (first argument) and non-const column for second argument, it cannot handle it. You can have a look at src/Functions/array/arrayDotProduct.cpp for an example of how to handle when one of the argument is constant.

SELECT arrayAutocorrelation([1,2,3], number) FROM numbers(3);

Gives:

Code: 44. DB::Exception: Argument must be an array: while executing function arrayAutocorrelation on arguments _CAST([1, 2, 3]_Array(UInt8), 'Array(UInt8)'_String) Array(UInt8) Const(size = 0, Array(size = 1, UInt64(size = 1), UInt8(size = 3))), __table1.number UInt64 UInt64(size = 0). (ILLEGAL_COLUMN)

Since we read max_lag via getInt(i), if we pass a big UInt64, we will wrap around and produce incorrect result silently.

SELECT arrayAutocorrelation([1, 2, 3], toUInt64('9223372036854775808'))

Gives incorrectly []. You can look at src/Functions/repeat.cpp for an example of how we can get the numbers column properly.

For the following query, I was expecting it to return [] because the rule seems to be output array length = min(array_size, max_lag) . But returns [nan].

SELECT arrayAutocorrelation([1], 0);

@nihalzp nihalzp changed the title feat(Functions): add arrayAutocorrelation function Add arrayAutocorrelation function Apr 2, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Apr 2, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% (739,409/879,098) 84.00% (739,111/879,421) -0.10% (-298)
Functions 90.90% (798,326/878,459) 90.90% (798,325/878,521) +0.00% (-1)
Branches 76.60% (239,647/312,678) 76.60% (239,572/312,834) +0.00% (-75)

Changed lines: 81.73% (264/323) · Uncovered code

Full report · Diff report

@nihalzp nihalzp added this pull request to the merge queue Apr 2, 2026
Merged via the queue into ClickHouse:master with commit 514997a Apr 2, 2026
161 of 163 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants