Add experimental AI functions llmClassify/Extract/GenerateSQL/Translate/GenerateContent and generateEmbedding[OrNull] #99579
Conversation
Workflow [PR], commit [e35f8d2] Summary: ❌
AI Review

Summary: This PR introduces experimental LLM/embedding SQL functions, provider integrations, quota/caching logic, and docs. The feature scope is valuable, but there are multiple correctness/safety issues (including out-of-bounds argument access, quota/accounting inconsistencies, potential metadata leakage, and one setting wiring bug) that make current behavior unreliable in production-like environments. I cannot approve until the blockers/majors below are addressed.

Findings

❌ Blockers

Tests

ClickHouse Rules

Performance & Safety: Quota/call accounting under concurrency and retry paths is currently inconsistent, allowing quota overshoot and API-call cap bypass in realistic workloads. Caching currently lacks resource identity in keys, which can return cross-resource results and violate isolation assumptions.

User-Lens: Several argument-handling paths can silently ignore user inputs (row-0 reuse, default-resource fallback), producing plausible but incorrect results without an explicit exception, which is particularly risky for AI-assisted SQL/content generation workflows.

Final Verdict
Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Force-pushed from 1bd3b70 to 2fa75e6
src/Functions/LLM/LLMGenerateSQL.cpp
Outdated
    DataTypePtr getReturnTypeImpl(const ColumnsWithTypeAndName & arguments) const override
    {
        if (arguments.empty() || arguments.size() > 5)
getReturnTypeImpl accepts one argument, but when that argument matches an existing named collection, hasNamedCollectionArg makes getFirstDataArgIndex return 1. Later executeImpl reads arguments[1] (for the prompt), which is out-of-bounds for a single-argument call.
That means LLMGenerateSQL('existing_collection') can hit undefined behavior instead of returning a clean arity exception. Please add an explicit guard for this ambiguous case (e.g. if first arg resolves to a named collection, require at least 2 arguments).
src/Functions/LLM/LLMGenerateSQL.cpp
Outdated
    {
        factory.registerFunction<FunctionLLMGenerateSQL>(FunctionDocumentation{
            .description = "Generates a ClickHouse SQL query from a natural language description using an LLM.",
            .syntax = "LLMGenerateSQL([collection,] query[, tables][, database][, temperature])",
The function signature/docs expose optional tables and database arguments, but this implementation never reads them: buildSystemPrompt always introspects the catalog and buildUserMessage only uses the first data argument.
This creates a user-visible contract mismatch (queries pass extra arguments that are silently ignored). Please either wire these arguments into prompt construction or remove them from the signature and documentation.
    DataTypePtr getReturnTypeImpl(const ColumnsWithTypeAndName & arguments) const override
    {
        if (arguments.empty() || arguments.size() > 4)
Same arity bug as in LLMGenerateSQL: a single argument is accepted here, but if that argument matches an existing named collection, getFirstDataArgIndex becomes 1 and executeImpl reads arguments[1] for the prompt.
LLMGenerateContent('existing_collection') should fail with a deterministic arity error, not access out-of-bounds memory. Please add an explicit check that when first arg is a named collection, at least one prompt argument is present.
    for (size_t i = 0; i < batch_items.size(); ++i)
    {
        auto & embedding = (i < resp.embeddings.size()) ? resp.embeddings[i] : resp.embeddings.back();
When resp.embeddings is empty, this fallback dereferences resp.embeddings.back(), which is undefined behavior.
If the provider returns HTTP 200 with malformed payload (or no data entries), resp.embeddings.size() is 0, so the first iteration goes to the fallback and can throw an exception or corrupt memory.
Please guard the empty-response case before calling .back() (for example, fail the batch with a clear exception or map all rows in the batch to empty embeddings in or_null mode).
Force-pushed from 64aa872 to 27b8bec
src/Functions/LLM/OpenAIProvider.cpp
Outdated
    {
        throw Exception(
            ErrorCodes::RECEIVED_ERROR_FROM_REMOTE_IO_SERVER,
            "LLM provider returned HTTP {}: {}", static_cast<int>(status), response_body);
This exception includes the full remote response_body in the query error text. Provider error payloads often echo request details (prompt content and sometimes credential/context details), so this can leak sensitive data into user-visible exceptions and logs.
Please avoid returning the raw body verbatim. A safer pattern is to include only HTTP status + provider error code/message extracted from JSON, or a truncated/sanitized snippet.
    text_arg_idx = 1;
    dim_arg_idx = 2;

    const auto * first_const = checkAndGetColumn<ColumnConst>(arguments[0].column.get());
When the first argument is not constant, checkAndGetColumn<ColumnConst> returns null and first_arg becomes empty, so the function silently falls back to default_llm_resource instead of using (or rejecting) the provided argument.
That makes calls like generateEmbedding(concat('my_', 'nc'), text, 256) behave incorrectly and unexpectedly. Since endpoint/model selection is query-level, this argument should be required to be a constant string.
Please throw BAD_ARGUMENTS when arguments[0] is not constant in the 3-argument form.
src/Core/Settings.cpp
Outdated
    Maximum number of texts to include in a single HTTP request for embedding functions (generateEmbedding, generateEmbeddingOrNull). Texts are grouped into batches of this size to reduce API call overhead. For example, 500 unique texts with batch size 100 results in 5 HTTP requests.
    )", EXPERIMENTAL) \
    DECLARE(String, llm_on_quota_exceeded, "throw", R"(
    Behavior when an AI function quota limit (llm_max_rows_per_query, llm_max_input_tokens_per_query, llm_max_output_tokens_per_query, or llm_max_api_calls_per_query) is exceeded. Possible values: 'throw' (default) aborts the query with an exception; 'null' stops making API calls and returns NULL (for LLM functions) or an empty array (for embedding functions) for remaining rows. Ignored by generateEmbeddingOrNull which always uses 'null'.
The documented allowed values for llm_on_quota_exceeded here are 'throw' and 'null', but runtime behavior in LLMQuotaTracker::isGracefulQuotaMode also accepts 'break' and 'partial', and quota exceptions explicitly instruct users to set 'break'.
This mismatch makes troubleshooting hard because users get guidance that contradicts the setting docs.
Please align these surfaces (setting docs, validation/accepted values, and exception text) to one canonical set of values.
    for (auto & [dispatch_key, representative_row] : to_dispatch)
    {
        UInt64 rows_for_key = dedup_map[dispatch_key].size();
        if (!quota->checkBeforeDispatch(0, rows_for_key))
llm_max_input_tokens_per_query is currently not enforced in practice. checkBeforeDispatch only checks input-token quota when estimated_input_tokens > 0, but all call sites pass 0 (here and in LLMGenerateEmbedding). recordResponse updates input_tokens after the response but never checks the max, so queries can exceed the configured input-token budget without exception.
Please either estimate and pass non-zero input tokens before dispatch, or add an input-token limit check inside recordResponse (with consistent throw/null behavior).
    for (size_t i = 0; i < batch_items.size(); ++i)
    {
        const auto & embedding = (i < resp.embeddings.size()) ? resp.embeddings[i] : resp.embeddings.back();
When the provider returns fewer embeddings than requested, this fallback silently reuses the last embedding for all missing rows. That can produce incorrect results by assigning unrelated vectors to inputs without any error.
It is safer to treat size mismatch as an exception (or, for generateEmbeddingOrNull, return empty arrays for affected rows) instead of duplicating resp.embeddings.back().
    String user_message = buildUserMessage(arguments, i);
    std::vector<String> cache_args = {user_message, system_prompt};
    UInt128 key = LLMResultCache::buildKey(functionName(), config.model, temperature, cache_args);
LLMResultCache::buildKey currently excludes the selected provider/resource identity (e.g. collection name, provider, endpoint).
That means two queries using different named collections can collide in cache when functionName, model, temperature, and prompt text are equal, and then one query can reuse another resource's response. This is incorrect and can leak cross-resource results.
Please include resource identity in the cache key (for example provider + normalized endpoint, or collection identifier) for all LLM text functions.
    }

    std::vector<String> cache_args = {text, std::to_string(dimensions)};
    UInt128 key = LLMResultCache::buildKey(name, model, 0, cache_args);
Same cache-key issue in embeddings: LLMResultCache::buildKey is computed from function name, model, dimensions, and text, but not from selected resource identity (provider/endpoint/collection).
If two named collections use the same model and input text but different endpoints/accounts, cached vectors can be reused across resources. For embedding workloads this is a correctness and isolation issue.
Please include resource identity in the embedding cache key as well.
    {
        if (attempt < max_retries && e.code() == ErrorCodes::RECEIVED_ERROR_FROM_REMOTE_IO_SERVER)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(retry_delay_ms * (1ULL << attempt)));
❌ retry_delay_ms * (1ULL << attempt) can trigger undefined behavior when attempt >= 64 (left-shift past bit width), and the multiplication can overflow before conversion to std::chrono::milliseconds.
Because llm_max_retries is user-configurable and currently unbounded, this path is reachable and can lead to incorrect sleep durations or undefined behavior.
Please clamp the exponent / delay before shifting (or use saturating arithmetic), e.g. cap attempt to a safe max and bound the computed delay to a reasonable upper limit.
LLM functions require external HTTP access and named collection configuration that is not available in the unit test environment.
Force-pushed from 3e7acf5 to 82cecf1
LLVM Coverage Report
PR changed-lines coverage: 37.35% (583/1561, 0 noise lines excluded)
    extern const int RECEIVED_ERROR_FROM_REMOTE_IO_SERVER;
    }

    String extractProviderError(const std::string & response_body, int status_code)
❌ extractProviderError is defined with external linkage in both OpenAIProvider.cpp and AnthropicProvider.cpp in namespace DB. This creates duplicate non-inline definitions of the same symbol and can fail at link time (or violate ODR depending on toolchain behavior).
Please move this helper into an anonymous namespace / mark it static, or share a single implementation in one .cpp file and reuse it.
    for (const auto * item : batch_items)
        req.inputs.push_back(item->text);

    auto resp = provider->embed(req, timeouts);
❌ generateEmbedding does not call quota->recordResponse after provider->embed, so embedding requests never contribute to llm_max_input_tokens_per_query / llm_max_output_tokens_per_query enforcement.
As a result, token quotas can be bypassed via embedding functions even when limits are configured. Please record response token usage here (similar to LLMFunctionBase::executeImpl) so quota behavior is consistent across all AI functions.
Force-pushed from d66040d to 4191ae6
    WHERE database = currentDatabase() AND table = '_03300_ret_embedding';
    DROP TABLE IF EXISTS _03300_ret_embedding;

    -- generateEmbeddingOrNull returns Nullable(Array(Float32))
Typo in this comment: generateEmbeddingOrNull returns Array(Float32), not Nullable(Array(Float32)).
generateEmbedding and generateEmbeddingOrNull share GenerateEmbeddingFunction::getReturnTypeImpl, which returns Array(Float32), and the .reference file also expects Array(Float32).
Please update this comment to avoid confusion.
… rest of ClickHouse functions
    req.temperature = temperature;
    req.max_tokens = config.max_tokens;

    auto resp = provider->call(req, timeouts);
llm_max_api_calls_per_query is effectively bypassed by retries here.
checkBeforeDispatch increments api_calls once before entering the retry loop, but each retry iteration still sends a new HTTP request via provider->call without any additional quota check. With llm_max_api_calls_per_query = 1 and llm_max_retries = 3, a single row can still emit up to 4 remote calls.
Please account API-call quota per actual HTTP attempt (including retries), e.g. by moving/adding the API-call reservation check inside the retry loop before provider->call.
Force-pushed from a17b70d to e35f8d2
llmClassify/Extract/GenerateSQL/Translate/GenerateContent and generateEmbedding[OrNull]
pjhampton
left a comment
I have a couple of initial comments on this, skimming through the PR, though it looks okay directionally.
- llm<Func> is probably erroneous naming. ai<Func> probably makes more sense, as most/all of these functions shouldn't actually need an LLM. We will be using SLMs, Encoders, and possibly even simpler classifiers/transformers in the cloud to support these. If you are creating these functions with the sole purpose of calling OpenAI / Anthropic, then that is fine - but we will likely need a separate set of functions. I'm not a big fan of the generateEmbedding naming either - seems inconsistent.
- The functions need to be able to take into consideration the underlying model, which should be able to change from the function's arity - i.e. embedFunc(model, text) - we could be providing 3+ models in some cases for users to use.
- Not all of these functions will we want to support, I don't think - looking at GenerateSQL - but maybe I'm missing context.
- I'm confused how the caching works here, and how it benefits the query performance - especially around embeddings. It would be good if you could talk me through the code at some point.
- The system prompts you hardcoded would be good to override from a user's PoV and could be developed further. In the case of using this in our AI cloud we will likely own the prompts on our side.
- These calls can get very expensive when hooked up to an external provider - we noticed recently that Snowflake provided a way for users to be granted permission to use similar functions - it is probably a good time to implement this here. We might be able to do something in the control plane but ideally in the core.
It works in a similar fashion to the query cache.
Agree that's a good idea.
Yeah, I've added a lot of settings to limit how many calls a single query can make to a provider, but we should have a longer-term plan for it. I have a few ideas. Let's discuss all of that on Wednesday and define what goes in a V1 and V2.
Somehow, I can't merge master again to this branch. Let's continue in #100735
Introduces a bunch of experimental AI functions that call external LLM/embedding providers (OpenAI, Anthropic) directly from SQL queries without leaving ClickHouse.
Uses Named Collections for storing provider credentials and query-level settings for operational parameters (like temperature and error handling).
All functions are gated behind the allow_experimental_ai_functions setting.

Changelog category (leave one):
Changelog entry (a user-readable short description):
Introduces experimental AI functions llmClassify, llmExtract, llmGenerateSQL, llmTranslate, llmGenerateContent, generateEmbedding, generateEmbeddingOrNull that call external LLM and embedding providers (OpenAI, Anthropic) directly from SQL queries without leaving ClickHouse.

Documentation entry for user-facing changes
Documentation is in progress.
The functions work in the following ways:
LLMClassify
LLMTranslate
LLMExtract
LLMGenerateSQL
LLMGenerateContent
generateEmbedding / generateEmbeddingOrNull