# SQLite-AI: API Reference

This document provides reference-level documentation for all public SQLite-AI functions, virtual tables, and metadata properties exposed to SQL. These functions enable loading and interacting with LLMs, configuring samplers, generating embeddings and text, and managing chat sessions.

---

## `ai_version()`

**Returns:** `TEXT`

**Description:** Returns the current version of the SQLite-AI extension.

**Example:**

```sql
SELECT ai_version();
-- e.g., '1.0.0'
```

---

## `ai_log_info(extended_enable BOOLEAN)`

**Returns:** `NULL`

**Description:** Enables or disables extended logging information. Use `1` to enable, `0` to disable.

**Example:**

```sql
SELECT ai_log_info(1);
```

---

## `llm_model_load(path TEXT, options TEXT)`

**Returns:** `NULL`

**Description:** Loads a GGUF model from the specified file path, with optional configuration as comma-separated `key=value` pairs. If no options are provided, the default `gpu_layers=99` is used.

The following keys are available:

```
gpu_layers=N      (N is the number of layers to store in VRAM)
main_gpu=K        (K is the GPU used for the entire model when split_mode is 0)
split_mode=N      (how to split the model across multiple GPUs: 0 = none, 1 = layer, 2 = rows)
vocab_only=1/0    (only load the vocabulary, no weights)
use_mmap=1/0      (use mmap if possible)
use_mlock=1/0     (force the system to keep the model in RAM)
check_tensors=1/0 (validate model tensor data)
log_info=1/0      (enable/disable info logging)
```

**Example:**

```sql
SELECT llm_model_load('./models/llama.gguf', 'gpu_layers=99');
```

---

## `llm_model_free()`

**Returns:** `NULL`

**Description:** Unloads the current model and frees associated memory.

**Example:**

```sql
SELECT llm_model_free();
```

---

## `llm_context_create(context_settings TEXT)`

**Parameters:** `context_settings`: comma-separated `key=value` pairs (see [context settings](#context_settings)).
**Returns:** `NULL`

**Description:** Creates a new inference context with comma-separated `key=value` configuration. **A context must be explicitly created before performing any AI operation!**

## context_settings

The following keys are available in `context_settings`:

### General

| Key | Type | Meaning |
| --- | --- | --- |
| `generate_embedding` | `1 or 0` | Force the model to generate embeddings. |
| `normalize_embedding` | `1 or 0` | Force normalization during embedding generation (defaults to 1). |
| `json_output` | `1 or 0` | Force JSON output in embedding generation (defaults to 0). |
| `max_tokens` | `number` | Maximum number of input tokens. If the input is too large, an error is returned. |
| `n_predict` | `number` | Maximum number of tokens generated during text generation. |
| `embedding_type` | `FLOAT32, FLOAT16, BFLOAT16, UINT8, INT8` | The model's native type; mandatory during embedding generation. |

### Core sizing & threading

| Key | Type | Meaning |
| --- | --- | --- |
| `context_size` | `number` | Equivalent to `n_ctx = N` and `n_batch = N`. |
| `n_ctx` | `number` | Text context length (tokens). `0` = from model. |
| `n_batch` | `number` | **Logical** max batch size submitted to `llama_decode`. |
| `n_ubatch` | `number` | **Physical** max micro-batch size. |
| `n_seq_max` | `number` | Max concurrent sequences (parallel states for recurrent models). |
| `n_threads` | `number` | Threads for generation. |
| `n_threads_batch` | `number` | Threads for batch processing. |

### Attention, pooling & flash-attention

| Key | Type | Meaning |
| --- | --- | --- |
| `pooling_type` | `none, unspecified, mean, cls, last or rank` | How to aggregate token embeddings (e.g., `mean`). |
| `attention_type` | `unspecified, causal, non_causal` | Attention algorithm for embeddings. |
| `flash_attn_type` | `auto, disabled, enabled` | Controls when/if Flash-Attention is used. |

### RoPE & YaRN (positional scaling)

| Key | Type | Meaning |
| --- | --- | --- |
| `rope_scaling_type` | `unspecified, none, linear, yarn, longrope` | RoPE scaling strategy. |
| `rope_freq_base` | `float number` | RoPE base frequency. `0` = from model. |
| `rope_freq_scale` | `float number` | RoPE frequency scaling factor. `0` = from model. |
| `yarn_ext_factor` | `float number` | YaRN extrapolation mix factor. `<0` = from model. |
| `yarn_attn_factor` | `float number` | YaRN magnitude scaling. |
| `yarn_beta_fast` | `float number` | YaRN low correction dimension. |
| `yarn_beta_slow` | `float number` | YaRN high correction dimension. |
| `yarn_orig_ctx` | `number` | YaRN original context size. |

### KV cache types (experimental)

| Key | Type | Meaning |
| --- | --- | --- |
| `type_k` | [ggml_type](https://github.com/ggml-org/llama.cpp/blob/00681dfc16ba4cebb9c7fbd2cf2656e06a0692a4/ggml/include/ggml.h#L377) | Data type for K cache. |
| `type_v` | [ggml_type](https://github.com/ggml-org/llama.cpp/blob/00681dfc16ba4cebb9c7fbd2cf2656e06a0692a4/ggml/include/ggml.h#L377) | Data type for V cache. |

### Flags

> Place booleans at the end of your option string if you’re copy-by-value mirroring a struct; otherwise order doesn’t matter.

| Key | Type | Meaning |
| --- | --- | --- |
| `embeddings` | `1 or 0` | If `1`, extract embeddings (with logits). Used by the embedding preset. |
| `offload_kqv` | `1 or 0` | Offload KQV ops (incl. KV cache) to GPU. |
| `no_perf` | `1 or 0` | Disable performance timing. |
| `op_offload` | `1 or 0` | Offload host tensor ops to device. |
| `swa_full` | `1 or 0` | Use full-size SWA cache. When `false` and `n_seq_max > 1`, performance may degrade. |
| `kv_unified` | `1 or 0` | Use a unified buffer across input sequences during attention. Try disabling when `n_seq_max > 1` and sequences do not share a long prefix. |
| `defrag_thold` | `float number` | **Deprecated.** Defragment KV cache if `holes/size > thold`. `<= 0` disables. |

---

**Example:**

```sql
SELECT llm_context_create('n_ctx=2048,n_threads=6,n_batch=256');
```

---

## `llm_context_create_embedding(context_settings TEXT)`

**Parameters:** **`context_settings` (optional):** Comma-separated `key=value` pairs to override or extend default settings (see [context settings](#context_settings) in `llm_context_create`).

**Returns:** `NULL`

**Description:** Creates a new inference context specifically set up for embedding generation. It is equivalent to `SELECT llm_context_create('generate_embedding=1,normalize_embedding=1,pooling_type=mean');` **A context must be explicitly created before performing any AI operation!**

**Example:**

```sql
SELECT llm_context_create_embedding();
```

---

## `llm_context_create_chat(context_settings TEXT)`

**Parameters:** **`context_settings` (optional):** Comma-separated `key=value` pairs to override or extend default settings (see [context settings](#context_settings) in `llm_context_create`).

**Returns:** `NULL`

**Description:** Creates a new inference context specifically set up for chat conversation. It is equivalent to `SELECT llm_context_create('context_size=4096');` A context must be explicitly created before performing any AI operation!

**Example:**

```sql
SELECT llm_context_create_chat();
```

---

## `llm_context_create_textgen(context_settings TEXT)`

**Parameters:** **`context_settings` (optional):** Comma-separated `key=value` pairs to override or extend default settings (see [context settings](#context_settings) in `llm_context_create`).
**Returns:** `NULL`

**Description:** Creates a new inference context specifically set up for text generation. It is equivalent to `SELECT llm_context_create('context_size=4096');` A context must be explicitly created before performing any AI operation!

**Example:**

```sql
SELECT llm_context_create_textgen();
```

---

## `llm_context_free()`

**Returns:** `NULL`

**Description:** Frees the current inference context.

**Example:**

```sql
SELECT llm_context_free();
```

---

## `llm_context_size()`

**Returns:** `INTEGER`

**Description:** Returns the total token capacity (context window) of the current llama context. Use this after `llm_context_create` to confirm the configured `context_size`. Raises an error if no context is active.

**Example:**

```sql
SELECT llm_context_size();
-- 4096
```

---

## `llm_context_used()`

**Returns:** `INTEGER`

**Description:** Returns how many tokens of the current llama context have already been consumed. Combine this with `llm_context_size()` to monitor usage. Raises an error if no context is active.

**Example:**

```sql
SELECT llm_context_used();
-- 1024
```

---

## `llm_sampler_create()`

**Returns:** `NULL`

**Description:** Initializes a new sampling strategy for text generation. A sampler is the mechanism that determines how the model selects the next token (word or subword) during text generation. If no sampler is explicitly created, one will be created automatically when needed.

**Example:**

```sql
SELECT llm_sampler_create();
```

---

## `llm_sampler_free()`

**Returns:** `NULL`

**Description:** Frees resources associated with the current sampler.

**Example:**

```sql
SELECT llm_sampler_free();
```

---

## `llm_lora_load(path TEXT, scale REAL)`

**Returns:** `NULL`

**Description:** Loads a LoRA adapter from the given file path with a mandatory scale value. LoRA (Low-Rank Adaptation) is a technique to inject trainable, low-rank layers into a pre-trained model.
**Example:**

```sql
SELECT llm_lora_load('./adapters/adapter.lora', 1.0);
```

---

## `llm_lora_free()`

**Returns:** `NULL`

**Description:** Unloads any currently loaded LoRA adapter.

**Example:**

```sql
SELECT llm_lora_free();
```

---

## `llm_sampler_init_greedy()`

**Returns:** `NULL`

**Description:** Configures the sampler to use greedy decoding (always pick the most probable token).

**Example:**

```sql
SELECT llm_sampler_init_greedy();
```

---

## `llm_sampler_init_dist(seed INT)`

**Returns:** `NULL`

**Description:** Initializes a random distribution-based sampler with the given seed. If a seed value is not specified, a default value of `0xFFFFFFFF` is used.

**Example:**

```sql
SELECT llm_sampler_init_dist(42);
```

---

## `llm_sampler_init_top_k(k INT)`

**Returns:** `NULL`

**Description:** Limits sampling to the top `k` most likely tokens. Top-k sampling as described in the academic paper "The Curious Case of Neural Text Degeneration" (https://arxiv.org/abs/1904.09751).

**Example:**

```sql
SELECT llm_sampler_init_top_k(40);
```

---

## `llm_sampler_init_top_p(p REAL, min_keep INT)`

**Returns:** `NULL`

**Description:** Top-p sampling retains tokens with cumulative probability >= `p`. Always keeps at least `min_keep` tokens. Nucleus sampling as described in the academic paper "The Curious Case of Neural Text Degeneration" (https://arxiv.org/abs/1904.09751).

**Example:**

```sql
SELECT llm_sampler_init_top_p(0.9, 1);
```

---

## `llm_sampler_init_min_p(p REAL, min_keep INT)`

**Returns:** `NULL`

**Description:** Like top-p, but with a minimum token probability threshold `p`. Min-p sampling as described in https://github.com/ggml-org/llama.cpp/pull/3841.

**Example:**

```sql
SELECT llm_sampler_init_min_p(0.05, 1);
```

---

## `llm_sampler_init_typical(p REAL, min_keep INT)`

**Returns:** `NULL`

**Description:** Typical sampling prefers tokens near the expected entropy level.
Locally typical sampling implementation as described in the paper https://arxiv.org/abs/2202.00666.

**Example:**

```sql
SELECT llm_sampler_init_typical(0.95, 1);
```

---

## `llm_sampler_init_temp(t REAL)`

**Returns:** `NULL`

**Description:** Adjusts the sampling temperature to control randomness.

**Example:**

```sql
SELECT llm_sampler_init_temp(0.8);
```

---

## `llm_sampler_init_temp_ext(t REAL, delta REAL, exponent REAL)`

**Returns:** `NULL`

**Description:** Advanced temperature control using exponential scaling. Dynamic temperature (a.k.a. entropy-based) implementation as described in the paper https://arxiv.org/abs/2309.02772.

**Example:**

```sql
SELECT llm_sampler_init_temp_ext(0.8, 0.1, 2.0);
```

---

## `llm_sampler_init_xtc(p REAL, t REAL, min_keep INT, seed INT)`

**Returns:** `NULL`

**Description:** Combines top-p, temperature, and seed-based sampling with a minimum token count. XTC sampler as described in https://github.com/oobabooga/text-generation-webui/pull/6335.

**Example:**

```sql
SELECT llm_sampler_init_xtc(0.9, 0.8, 1, 42);
```

---

## `llm_sampler_init_top_n_sigma(n REAL)`

**Returns:** `NULL`

**Description:** Limits sampling to tokens within `n` standard deviations. Top-nσ sampling as described in the academic paper "Top-nσ: Not All Logits Are You Need" (https://arxiv.org/pdf/2411.07641).

**Example:**

```sql
SELECT llm_sampler_init_top_n_sigma(1.5);
```

---

## `llm_sampler_init_mirostat(seed INT, tau REAL, eta REAL, m INT)`

**Returns:** `NULL`

**Description:** Initializes Mirostat sampling with entropy control. Mirostat 1.0 algorithm as described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.

**Example:**

```sql
SELECT llm_sampler_init_mirostat(42, 5.0, 0.1, 100);
```

---

## `llm_sampler_init_mirostat_v2(seed INT, tau REAL, eta REAL)`

**Returns:** `NULL`

**Description:** Mirostat v2 entropy-based sampling. Mirostat 2.0 algorithm as described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
**Example:**

```sql
SELECT llm_sampler_init_mirostat_v2(42, 5.0, 0.1);
```

---

## `llm_sampler_init_grammar(grammar_str TEXT, grammar_root TEXT)`

**Returns:** `NULL`

**Description:** Constrains output to match a specified grammar. Grammar syntax is described in https://github.com/ggml-org/llama.cpp/tree/master/grammars.

**Example:**

```sql
SELECT llm_sampler_init_grammar('...BNF...', 'root');
```

---

## `llm_sampler_init_infill()`

**Returns:** `NULL`

**Description:** Enables infill (prefix-suffix) mode for completions.

**Example:**

```sql
SELECT llm_sampler_init_infill();
```

---

## `llm_sampler_init_penalties(n INT, repeat REAL, freq REAL, present REAL)`

**Returns:** `NULL`

**Description:** Applies repetition, frequency, and presence penalties.

**Example:**

```sql
SELECT llm_sampler_init_penalties(64, 1.2, 0.5, 0.8);
```

---

## `llm_token_count(text TEXT)`

**Returns:** `INTEGER`

**Description:** Returns how many tokens the current model would consume for the supplied `text`, using the active context’s vocabulary. Requires a context created via `llm_context_create`.

**Example:**

```sql
SELECT llm_token_count('Hello world!');
-- 5
```

---

## `llm_embed_generate(text TEXT, options TEXT)`

**Returns:** `BLOB` or `TEXT`

**Description:** Generates a text embedding as a BLOB vector, with optional configuration provided as a comma-separated list of `key=value` pairs. By default, the embedding is normalized unless `normalize_embedding=0` is specified. If `json_output=1` is set, the function returns a JSON object instead of a BLOB.

**Example:**

```sql
SELECT llm_embed_generate('hello world', 'json_output=1');
```

---

## `llm_text_generate(text TEXT, [image1, image2, ...], options TEXT)`

**Returns:** `TEXT`

**Description:** Generates a full-text completion based on the input, with optional configuration provided as a comma-separated list of `key=value` pairs. When a vision model is loaded via `llm_vision_load()`, you can pass one or more images as additional arguments.
Images can be file paths (TEXT) or raw image data (BLOB). Supported image formats: JPG, PNG, BMP, GIF.

**Examples:**

```sql
-- Text-only generation
SELECT llm_text_generate('Once upon a time', 'n_predict=1024');

-- Vision: describe an image
SELECT llm_text_generate('Describe this image', './photos/cat.jpg');

-- Vision: compare multiple images
SELECT llm_text_generate('What is different between these images?', './img1.jpg', './img2.jpg');

-- Vision: image from BLOB column
SELECT llm_text_generate('What do you see?', image_data) FROM photos WHERE id = 1;
```

---

## `llm_chat(prompt TEXT)`

**Returns:** `VIRTUAL TABLE`

**Description:** Streams a chat-style reply, one token per row.

**Example:**

```sql
SELECT reply FROM llm_chat('Tell me a joke.');
```

---

## `llm_chat_create()`

**Returns:** `TEXT`

**Description:** Starts a new in-memory chat session and returns a unique UUIDv7 value. If no chat is explicitly created, one will be created automatically when needed.

**Example:**

```sql
SELECT llm_chat_create();
```

---

## `llm_chat_free()`

**Returns:** `NULL`

**Description:** Ends the current chat session.

**Example:**

```sql
SELECT llm_chat_free();
```

---

## `llm_chat_save(title TEXT, meta TEXT)`

**Returns:** `TEXT`

**Description:** Saves the current chat session, with optional title and metadata, into the `ai_chat_history` and `ai_chat_messages` tables and returns a UUID.

**Example:**

```sql
SELECT llm_chat_save('Support Chat', '{"user": "Marco"}');
```

---

## `llm_chat_restore(uuid TEXT)`

**Returns:** `NULL`

**Description:** Restores a previously saved chat session by UUID.

**Example:**

```sql
SELECT llm_chat_restore('b59e...');
```

---

## `llm_chat_respond(text TEXT, [image1, image2, ...])`

**Returns:** `TEXT`

**Description:** Generates a context-aware reply using chat memory, returned as a single, complete response. For a streaming model reply, use the `llm_chat` virtual table.
When a vision model is loaded via `llm_vision_load()`, you can pass one or more images as additional arguments. Images can be file paths (TEXT) or raw image data (BLOB). Supported image formats: JPG, PNG, BMP, GIF.

**Examples:**

```sql
-- Text-only chat
SELECT llm_chat_respond('What are the most visited cities in Italy?');

-- Vision: ask about an image
SELECT llm_chat_respond('What is in this photo?', './photos/landscape.jpg');

-- Vision: multiple images
SELECT llm_chat_respond('Compare these two charts', './chart1.png', './chart2.png');
```

---

## `llm_chat_system_prompt(text TEXT)`

**Returns:** `TEXT` or `NULL`

**Description:** Gets or sets the system prompt for chat sessions. When called without arguments, returns the current system prompt (or `NULL` if none is set). When called with a text argument, sets the system prompt and returns `NULL`. The system prompt is automatically prepended as a system-role message at the beginning of chat conversations.

**Example:**

```sql
-- Set a system prompt
SELECT llm_chat_system_prompt('You are a helpful assistant that speaks Italian.');

-- Get the current system prompt
SELECT llm_chat_system_prompt();
```

---

## Vision Functions

### `llm_vision_load(path TEXT, options TEXT)`

**Returns:** `NULL`

**Description:** Loads a multimodal projector (mmproj) model for vision capabilities. This requires a text model to already be loaded via `llm_model_load()`. The mmproj file is a separate GGUF file that contains the vision encoder and projector weights. Once loaded, vision capabilities are available through `llm_text_generate()` and `llm_chat_respond()` by passing image arguments.

The following option keys are available:

| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| `use_gpu` | `1 or 0` | `1` | Use GPU for vision encoding. |
| `n_threads` | `number` | `4` | Number of threads for vision processing. |
| `warmup` | `1 or 0` | `1` | Run a warmup pass on load for faster first use. |
| `flash_attn_type` | `auto, disabled, enabled` | `auto` | Controls Flash Attention for the vision encoder. |
| `image_min_tokens` | `number` | `0` | Minimum image tokens for dynamic resolution models (0 = from model). |
| `image_max_tokens` | `number` | `0` | Maximum image tokens for dynamic resolution models (0 = from model). |

**Example:**

```sql
-- Load text model first
SELECT llm_model_load('./models/Gemma-3-4B-IT-Q4_K_M.gguf', 'gpu_layers=99');
SELECT llm_context_create_textgen();

-- Load vision projector
SELECT llm_vision_load('./models/mmproj-Gemma-3-4B-IT-f16.gguf');

-- Now use vision with llm_text_generate or llm_chat_respond
SELECT llm_text_generate('Describe this image', './photos/cat.jpg');
```

---

### `llm_vision_free()`

**Returns:** `NULL`

**Description:** Unloads the current vision (mmproj) model and frees associated memory. The text model remains loaded.

**Example:**

```sql
SELECT llm_vision_free();
```

---

## Audio Functions

### `audio_model_load(path TEXT, options TEXT)`

**Returns:** `NULL`

**Description:** Loads a Whisper model from the specified file path with optional comma-separated `key=value` configuration. The model is used for audio transcription via `audio_model_transcribe`. Only one Whisper model can be loaded at a time per connection.

**Example:**

```sql
-- Load with defaults
SELECT audio_model_load('./models/ggml-tiny.bin');

-- Load with options
SELECT audio_model_load('./models/ggml-base.bin', 'gpu_layers=0');
```

---

### `audio_model_free()`

**Returns:** `NULL`

**Description:** Unloads the current Whisper model and frees associated memory.

**Example:**

```sql
SELECT audio_model_free();
```

---

### `audio_model_transcribe(input TEXT/BLOB, options TEXT)`

**Returns:** `TEXT`

**Description:** Transcribes audio to text using the loaded Whisper model.
The input can be either:

- **TEXT**: A file path to an audio file (WAV, MP3, or FLAC)
- **BLOB**: Raw audio data (format auto-detected from magic bytes)

An optional second parameter accepts comma-separated `key=value` pairs to configure transcription behavior.

Supported audio formats: WAV, MP3, FLAC. Audio is automatically converted to mono 16 kHz PCM as required by Whisper.

**Transcription options:**

| Key | Type | Default | Meaning |
| --- | --- | --- | --- |
| `language` | `text` | `en` | Language code (e.g., `en`, `it`, `fr`, `de`). |
| `translate` | `1 or 0` | `0` | Translate to English. |
| `n_threads` | `number` | `4` | Number of threads for decoding. |
| `offset_ms` | `number` | `0` | Start transcription at this offset (milliseconds). |
| `duration_ms` | `number` | `0` | Transcribe only this duration (0 = full audio). |
| `no_timestamps` | `1 or 0` | `0` | Suppress timestamps in output. |
| `single_segment` | `1 or 0` | `0` | Force single-segment output. |
| `token_timestamps` | `1 or 0` | `0` | Enable token-level timestamps. |
| `initial_prompt` | `text` | | Initial prompt to guide the model. |
| `temperature` | `float` | `0.0` | Sampling temperature. |
| `beam_size` | `number` | `-1` | Beam search size (-1 = use default). |
| `audio_ctx` | `number` | `0` | Audio context size (0 = use default). |
| `suppress_regex` | `text` | | Regex pattern for suppressing tokens. |
| `max_len` | `number` | `0` | Maximum segment length in characters (0 = no limit). |
| `print_timestamps` | `1 or 0` | `0` | Include timestamps in transcribed text. |

**Examples:**

```sql
-- Transcribe from a file path
SELECT audio_model_transcribe('./audio/speech.wav');

-- Transcribe from a BLOB column
SELECT audio_model_transcribe(audio_data) FROM recordings WHERE id = 1;

-- Transcribe with options
SELECT audio_model_transcribe('./audio/speech.mp3', 'language=it,translate=1');

-- Transcribe a single segment with no timestamps
SELECT audio_model_transcribe('./audio/clip.flac', 'single_segment=1,no_timestamps=1');
```

---

## Model Metadata

These functions return internal model properties:

```sql
SELECT
  llm_model_n_params(),
  llm_model_size(),
  llm_model_n_ctx_train(),
  llm_model_n_embd(),
  llm_model_n_layer(),
  llm_model_n_head(),
  llm_model_n_head_kv(),
  llm_model_n_swa(),
  llm_model_rope_freq_scale_train(),
  llm_model_n_cls_out(),
  llm_model_cls_label(),
  llm_model_desc(),
  llm_model_has_encoder(),
  llm_model_has_decoder(),
  llm_model_is_recurrent(),
  llm_model_chat_template();
```

All return `INTEGER`, `REAL`, or `TEXT` values depending on the property.

---
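## Putting It Together

The functions above compose into a typical session: load a model, create a context, optionally tune sampling, run inference, and free resources when done. The sketch below uses only functions documented in this reference; the model path is a placeholder for any local GGUF file, and the exact options shown are illustrative rather than required.

```sql
-- Placeholder model path; substitute any local GGUF model.
SELECT llm_model_load('./models/llama.gguf', 'gpu_layers=99');
SELECT llm_context_create_chat('context_size=4096,n_threads=6');

-- Optional: create a sampler and adjust the temperature.
SELECT llm_sampler_create();
SELECT llm_sampler_init_temp(0.8);

-- Chat, then check how much of the context window is consumed.
SELECT llm_chat_respond('Summarize the plot of Hamlet in one sentence.');
SELECT llm_context_used() || ' / ' || llm_context_size() AS context_usage;

-- Persist the session, then release resources.
SELECT llm_chat_save('Hamlet summary', '{"topic": "literature"}');
SELECT llm_chat_free();
SELECT llm_sampler_free();
SELECT llm_context_free();
SELECT llm_model_free();
```

Freeing resources in roughly the reverse order of creation (chat, sampler, context, model) mirrors the setup calls; note that the sampler and chat session are created automatically if you skip the explicit `llm_sampler_create()` and `llm_chat_create()` steps.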