Add arrayAutocorrelation function#94776
Conversation
|
Workflow [PR], commit [65bc03c] Summary: ❌
AI ReviewSummaryThis PR adds ClickHouse Rules
Final Verdict
|
|
@nihalzp I've updated the code for the checks. The current failed checks (BuzzHouse and Stateless tests) seems unrelated flaky tests. Ready for a final look when you have a chance! |
|
@wenyuchen96 Thanks! I will have a look next week. |
nihalzp
left a comment
There was a problem hiding this comment.
Thanks for working on the feature! Most of it looks good!
I noticed that the time complexity of the function per-row is O(n^2) where n is the size of the array. We should mention that in the documentation. Additionally, the other issue is that if the array contains millions of elements, which is not very uncommon, the function will never finish executing.
There are some advanced O(nlogn) algorithms but they may be bit complicated which we can add later as an improvement. However, since normally users are only interested in the result of first few lags, we can offer them the signature arrayAutocorrelation(array, max_lag); then the time complexity will become O(n * max_lag). We can keep the max_lag optional and if it's not given then we compute result for all lags. In that way, if max_lag is small, our function will be still be useful even if array contains millions of elements.
|
Thanks for the review @nihalzp! The suggestions make sense and I'll keep you posted for the update in the next few days. |
|
@nihalzp updates pushed! I've addressed the comments as well. Apologies it's been taking some time. All relevant CI checks are passing now. Ready for another look when you have a moment. |
nihalzp
left a comment
There was a problem hiding this comment.
Looks good. Previous issues are mostly resolved. I found some queries that do not work as expected:
If we mix const column (first argument) and non-const column for second argument, it cannot handle it. You can have a look at src/Functions/array/arrayDotProduct.cpp for an example of how to handle when one of the argument is constant.
SELECT arrayAutocorrelation([1,2,3], number) FROM numbers(3);Gives:
Code: 44. DB::Exception: Argument must be an array: while executing function arrayAutocorrelation on arguments _CAST([1, 2, 3]_Array(UInt8), 'Array(UInt8)'_String) Array(UInt8) Const(size = 0, Array(size = 1, UInt64(size = 1), UInt8(size = 3))), __table1.number UInt64 UInt64(size = 0). (ILLEGAL_COLUMN)
Since we read max_lag via getInt(i), if we pass a big UInt64, we will wrap around and produce incorrect result silently.
SELECT arrayAutocorrelation([1, 2, 3], toUInt64('9223372036854775808'))Gives incorrectly []. You can look at src/Functions/repeat.cpp for an example of how we can get the numbers column properly.
For the following query, I was expecting it to return [] because the rule seems to be output array length = min(array_size, max_lag) . But returns [nan].
SELECT arrayAutocorrelation([1], 0);arrayAutocorrelation function
LLVM Coverage Report
Changed lines: 81.73% (264/323) · Uncovered code |
514997a
Relates to #87209
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add function
arrayAutocorrelation(arr [, max_lag])that computes the normalized autocorrelation of a numeric array for each lag. Supports integer, float, and decimal array types.Documentation entry for user-facing changes
Motivation: ClickHouse's aggregation functions already support linear/logistic regression, their stochastic versions, and some statistical tests. However, this is not enough for frequent scenarios of building predictive models, which is why we have to build ETL processes. To improve performance, it would be beneficial to move the implementation of at least one complete use case directly into the DBMS (like ELT). The Auto Correlation Function (ACF) is essential for time-series analysis within ClickHouse. It allows users to detect seasonality, periodic patterns, and trends in sequential data (like stock prices, server metrics, or sensor readings) without needing to export data to Python or R.
Parameters: arr: An array of numeric values (Array(Int*) or Array(Float*)). The function automatically casts integers to Float64 for high-precision calculation.
Example use: SELECT arrayAutocorrelation([10, 20, 30, 10, 20, 30]);
Result: [1, -0.25, -0.5, 0.5, 0, -0.25]
(Index 0 is Lag 0, Index 1 is Lag 1, etc.)