Add experimental AI functions llmClassify/Extract/GenerateSQL/Translate/GenerateContent and generateEmbedding[OrNull] #99579
Conversation
Workflow [PR], commit [e35f8d2] Summary: ❌
AI Review

Summary: This PR introduces experimental LLM/embedding SQL functions, provider integrations, quota/caching logic, and docs. The feature scope is valuable, but there are multiple correctness/safety issues (including out-of-bounds argument access, quota/accounting inconsistencies, potential metadata leakage, and one setting wiring bug) that make current behavior unreliable in production-like environments. I cannot approve until the blockers/majors below are addressed.

Findings

❌ Blockers

Tests

ClickHouse Rules

Performance & Safety: Quota/call accounting under concurrency and retry paths is currently inconsistent, allowing quota overshoot and API-call cap bypass in realistic workloads. Caching currently lacks resource identity in keys, which can return cross-resource results and violate isolation assumptions.

User-Lens: Several argument-handling paths can silently ignore user inputs (row-0 reuse, default-resource fallback), producing plausible but incorrect results without an explicit exception, which is particularly risky for AI-assisted SQL/content generation workflows.

Final Verdict
Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Force-pushed from 1bd3b70 to 2fa75e6
src/Functions/LLM/LLMGenerateSQL.cpp
Outdated
    DataTypePtr getReturnTypeImpl(const ColumnsWithTypeAndName & arguments) const override
    {
        if (arguments.empty() || arguments.size() > 5)
getReturnTypeImpl accepts one argument, but when that argument matches an existing named collection, hasNamedCollectionArg makes getFirstDataArgIndex return 1. Later executeImpl reads arguments[1] (for the prompt), which is out-of-bounds for a single-argument call.
That means LLMGenerateSQL('existing_collection') can hit undefined behavior instead of returning a clean arity exception. Please add an explicit guard for this ambiguous case (e.g. if first arg resolves to a named collection, require at least 2 arguments).
src/Functions/LLM/LLMGenerateSQL.cpp
Outdated
    {
        factory.registerFunction<FunctionLLMGenerateSQL>(FunctionDocumentation{
            .description = "Generates a ClickHouse SQL query from a natural language description using an LLM.",
            .syntax = "LLMGenerateSQL([collection,] query[, tables][, database][, temperature])",
The function signature/docs expose optional tables and database arguments, but this implementation never reads them: buildSystemPrompt always introspects the catalog and buildUserMessage only uses the first data argument.
This creates a user-visible contract mismatch (queries pass extra arguments that are silently ignored). Please either wire these arguments into prompt construction or remove them from the signature and documentation.
    DataTypePtr getReturnTypeImpl(const ColumnsWithTypeAndName & arguments) const override
    {
        if (arguments.empty() || arguments.size() > 4)
Same arity bug as in LLMGenerateSQL: a single argument is accepted here, but if that argument matches an existing named collection, getFirstDataArgIndex becomes 1 and executeImpl reads arguments[1] for the prompt.
LLMGenerateContent('existing_collection') should fail with a deterministic arity error, not access out-of-bounds memory. Please add an explicit check that when first arg is a named collection, at least one prompt argument is present.
    for (size_t i = 0; i < batch_items.size(); ++i)
    {
        auto & embedding = (i < resp.embeddings.size()) ? resp.embeddings[i] : resp.embeddings.back();
When resp.embeddings is empty, this fallback dereferences resp.embeddings.back(), which is undefined behavior.
If the provider returns HTTP 200 with malformed payload (or no data entries), resp.embeddings.size() is 0, so the first iteration goes to the fallback and can throw an exception or corrupt memory.
Please guard the empty-response case before calling .back() (for example, fail the batch with a clear exception or map all rows in the batch to empty embeddings in or_null mode).
Force-pushed from 64aa872 to 27b8bec
src/Functions/LLM/OpenAIProvider.cpp
Outdated
    {
        throw Exception(
            ErrorCodes::RECEIVED_ERROR_FROM_REMOTE_IO_SERVER,
            "LLM provider returned HTTP {}: {}", static_cast<int>(status), response_body);
This exception includes the full remote response_body in the query error text. Provider error payloads often echo request details (prompt content and sometimes credential/context details), so this can leak sensitive data into user-visible exceptions and logs.
Please avoid returning the raw body verbatim. A safer pattern is to include only HTTP status + provider error code/message extracted from JSON, or a truncated/sanitized snippet.
    text_arg_idx = 1;
    dim_arg_idx = 2;

    const auto * first_const = checkAndGetColumn<ColumnConst>(arguments[0].column.get());
When the first argument is not constant, checkAndGetColumn<ColumnConst> returns null and first_arg becomes empty, so the function silently falls back to default_llm_resource instead of using (or rejecting) the provided argument.
That makes calls like generateEmbedding(concat('my_', 'nc'), text, 256) behave incorrectly and unexpectedly. Since endpoint/model selection is query-level, this argument should be required to be a constant string.
Please throw BAD_ARGUMENTS when arguments[0] is not constant in the 3-argument form.
src/Core/Settings.cpp
Outdated
    Maximum number of texts to include in a single HTTP request for embedding functions (generateEmbedding, generateEmbeddingOrNull). Texts are grouped into batches of this size to reduce API call overhead. For example, 500 unique texts with batch size 100 results in 5 HTTP requests.
    )", EXPERIMENTAL) \
    DECLARE(String, llm_on_quota_exceeded, "throw", R"(
    Behavior when an AI function quota limit (llm_max_rows_per_query, llm_max_input_tokens_per_query, llm_max_output_tokens_per_query, or llm_max_api_calls_per_query) is exceeded. Possible values: 'throw' (default) aborts the query with an exception; 'null' stops making API calls and returns NULL (for LLM functions) or an empty array (for embedding functions) for remaining rows. Ignored by generateEmbeddingOrNull which always uses 'null'.
The documented allowed values for llm_on_quota_exceeded here are 'throw' and 'null', but runtime behavior in LLMQuotaTracker::isGracefulQuotaMode also accepts 'break' and 'partial', and quota exceptions explicitly instruct users to set 'break'.
This mismatch makes troubleshooting hard because users get guidance that contradicts the setting docs.
Please align these surfaces (setting docs, validation/accepted values, and exception text) to one canonical set of values.
    for (auto & [dispatch_key, representative_row] : to_dispatch)
    {
        UInt64 rows_for_key = dedup_map[dispatch_key].size();
        if (!quota->checkBeforeDispatch(0, rows_for_key))
llm_max_input_tokens_per_query is currently not enforced in practice. checkBeforeDispatch only checks input-token quota when estimated_input_tokens > 0, but all call sites pass 0 (here and in LLMGenerateEmbedding). recordResponse updates input_tokens after the response but never checks the max, so queries can exceed the configured input-token budget without exception.
Please either estimate and pass non-zero input tokens before dispatch, or add an input-token limit check inside recordResponse (with consistent throw/null behavior).
    for (size_t i = 0; i < batch_items.size(); ++i)
    {
        const auto & embedding = (i < resp.embeddings.size()) ? resp.embeddings[i] : resp.embeddings.back();
When the provider returns fewer embeddings than requested, this fallback silently reuses the last embedding for all missing rows. That can produce incorrect results by assigning unrelated vectors to inputs without any error.
It is safer to treat size mismatch as an exception (or, for generateEmbeddingOrNull, return empty arrays for affected rows) instead of duplicating resp.embeddings.back().
    String user_message = buildUserMessage(arguments, i);
    std::vector<String> cache_args = {user_message, system_prompt};
    UInt128 key = LLMResultCache::buildKey(functionName(), config.model, temperature, cache_args);
LLMResultCache::buildKey currently excludes the selected provider/resource identity (e.g. collection name, provider, endpoint).
That means two queries using different named collections can collide in cache when functionName, model, temperature, and prompt text are equal, and then one query can reuse another resource's response. This is incorrect and can leak cross-resource results.
Please include resource identity in the cache key (for example provider + normalized endpoint, or collection identifier) for all LLM text functions.
    }

    std::vector<String> cache_args = {text, std::to_string(dimensions)};
    UInt128 key = LLMResultCache::buildKey(name, model, 0, cache_args);
Same cache-key issue in embeddings: LLMResultCache::buildKey is computed from function name, model, dimensions, and text, but not from selected resource identity (provider/endpoint/collection).
If two named collections use the same model and input text but different endpoints/accounts, cached vectors can be reused across resources. For embedding workloads this is a correctness and isolation issue.
Please include resource identity in the embedding cache key as well.
    {
        if (attempt < max_retries && e.code() == ErrorCodes::RECEIVED_ERROR_FROM_REMOTE_IO_SERVER)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(retry_delay_ms * (1ULL << attempt)));
❌ retry_delay_ms * (1ULL << attempt) can trigger undefined behavior when attempt >= 64 (left-shift past bit width), and the multiplication can overflow before conversion to std::chrono::milliseconds.
Because llm_max_retries is user-configurable and currently unbounded, this path is reachable and can lead to incorrect sleep durations or undefined behavior.
Please clamp the exponent / delay before shifting (or use saturating arithmetic), e.g. cap attempt to a safe max and bound the computed delay to a reasonable upper limit.
LLM functions require external HTTP access and named collection configuration that is not available in the unit test environment.
Force-pushed from 3e7acf5 to 82cecf1
LLVM Coverage Report
PR changed-lines coverage: 37.35% (583/1561, 0 noise lines excluded)
    extern const int RECEIVED_ERROR_FROM_REMOTE_IO_SERVER;
    }

    String extractProviderError(const std::string & response_body, int status_code)
❌ extractProviderError is defined with external linkage in both OpenAIProvider.cpp and AnthropicProvider.cpp in namespace DB. This creates duplicate non-inline definitions of the same symbol and can fail at link time (or violate ODR depending on toolchain behavior).
Please move this helper into an anonymous namespace / mark it static, or share a single implementation in one .cpp file and reuse it.
    for (const auto * item : batch_items)
        req.inputs.push_back(item->text);

    auto resp = provider->embed(req, timeouts);
❌ generateEmbedding does not call quota->recordResponse after provider->embed, so embedding requests never contribute to llm_max_input_tokens_per_query / llm_max_output_tokens_per_query enforcement.
As a result, token quotas can be bypassed via embedding functions even when limits are configured. Please record response token usage here (similar to LLMFunctionBase::executeImpl) so quota behavior is consistent across all AI functions.
Force-pushed from d66040d to 4191ae6
    WHERE database = currentDatabase() AND table = '_03300_ret_embedding';
    DROP TABLE IF EXISTS _03300_ret_embedding;

    -- generateEmbeddingOrNull returns Nullable(Array(Float32))
Typo in this comment: generateEmbeddingOrNull returns Array(Float32), not Nullable(Array(Float32)).
generateEmbedding and generateEmbeddingOrNull share GenerateEmbeddingFunction::getReturnTypeImpl, which returns Array(Float32), and the .reference file also expects Array(Float32).
Please update this comment to avoid confusion.
… rest of ClickHouse functions
    req.temperature = temperature;
    req.max_tokens = config.max_tokens;

    auto resp = provider->call(req, timeouts);
llm_max_api_calls_per_query is effectively bypassed by retries here.
checkBeforeDispatch increments api_calls once before entering the retry loop, but each retry iteration still sends a new HTTP request via provider->call without any additional quota check. With llm_max_api_calls_per_query = 1 and llm_max_retries = 3, a single row can still emit up to 4 remote calls.
Please account API-call quota per actual HTTP attempt (including retries), e.g. by moving/adding the API-call reservation check inside the retry loop before provider->call.
Force-pushed from a17b70d to e35f8d2
llmClassify/Extract/GenerateSQL/Translate/GenerateContent and generateEmbedding[OrNull]
pjhampton
left a comment
I have a couple of initial comments on this, skimming through the PR, though it looks okay directionally.
- llm<Func> is probably erroneous naming. ai<Func> probably makes more sense, as most/all of these functions shouldn't actually need an LLM. We will be using SLMs, Encoders, and possibly even simpler classifiers/transformers in the cloud to support these. If you are creating these functions with the sole purpose of calling OpenAI / Anthropic, then that is fine - but we will likely need a separate set of functions. I'm not a big fan of the generateEmbedding naming either - seems inconsistent.
- The functions need to be able to take into consideration the underlying model, which should be able to change from the function's arity - i.e. embedFunc(model, text) - we could be providing 3+ models in some cases for users to use.
- Not all of these functions will we want to support, I don't think - looking at GenerateSQL - but maybe I'm missing context.
- I'm confused how the caching works here, and how it benefits the query performance - especially around embeddings. It would be good if you could talk me through the code at some point.
- The system prompts you hardcoded would be good to override from a user's PoV and could be developed further. In the case of using this in our AI cloud we will likely own the prompts on our side.
- These calls can get very expensive when hooked up to an external provider - we noticed recently that Snowflake provided a way for users to be granted permission to use similar functions - it is probably a good time to implement this here. We might be able to do something in the control plane but ideally in the core.
It works in a similar fashion to the query cache.
Agree that's a good idea.
Yeah, I've added a lot of settings to limit how many calls a single query can make to a provider, but we should have a longer-term plan for it. I have a few ideas. Let's discuss all of that on Wednesday and define what goes in a V1 and V2.
Somehow, I can't merge master again to this branch. Let's continue in #100735
Introduces a bunch of experimental AI functions that call external LLM/embedding providers (OpenAI, Anthropic) directly from SQL queries without leaving ClickHouse.
Uses Named Collections for storing provider credentials and query-level settings for operational parameters (like temperature and error handling).
All functions are gated behind the allow_experimental_ai_functions setting.

Changelog category (leave one):
Changelog entry (a user-readable short description):
Introduces experimental AI functions llmClassify, llmExtract, llmGenerateSQL, llmTranslate, llmGenerateContent, generateEmbedding, generateEmbeddingOrNull that call external LLM and embedding providers (OpenAI, Anthropic) directly from SQL queries without leaving ClickHouse.

Documentation entry for user-facing changes
Documentation is in progress.
The functions work in the following ways:
LLMClassify
LLMTranslate
LLMExtract
LLMGenerateSQL
LLMGenerateContent
generateEmbedding / generateEmbeddingOrNull