QuestDB Releases

Latest updates and improvements to QuestDB Open Source and Enterprise editions.

This release introduces per-column bloom filter configuration, allowing fine-grained control over bloom filter acceleration through CREATE TABLE and ALTER TABLE statements. It also includes bug fixes for a rare backup failure caused by empty partitions and a rare replica crash during graceful shutdown.
QuestDB 9.3.5 introduces lateral joins, SQL-standard UNNEST, and statistical window functions, and corrects SAMPLE BY timezone handling during DST transitions. It also delivers multiple join performance improvements and important Parquet export fixes.
Breaking Changes
This fix introduces a new timestamp_floor_utc function that floors timestamps in local time and converts back to UTC internally, replacing the previous approach of wrapping the query in an extra to_utc() conversion model. For sub-day strides, it uses the standard (non-DST) timezone offset to keep bucket widths uniform in UTC space, avoiding ambiguity during fall-back transitions. DST fall-back (clocks go back) now produces two rows instead of one for the repeated local hour; previously both passes through the repeated hour were merged into a single bucket. Output timestamps can be non-monotonic in local time during fall-back, though the underlying UTC timestamps remain monotonic. Sub-day bucket boundaries are now uniform in UTC rather than in local time. These changes do not affect queries without TIME ZONE, queries with fixed-offset timezones, or super-day strides (day, week, month, year). The fix also corrects an offset-in-floor-anchor bug where DST-aware and fixed-offset paths produced different bucket assignments, fixes the materialized view refresh iterator for sub-day DST timezones with offset, and resolves a native memory leak in FillRangeRecordCursorFactory.
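For illustration, a minimal sketch of the new fall-back behavior, assuming a trades table with designated timestamp ts (America/New_York falls back on 2025-11-02, so the 01:00 local hour occurs twice):

```sql
SELECT ts, count()
FROM trades
WHERE ts IN '2025-11-02'
SAMPLE BY 1h ALIGN TO CALENDAR TIME ZONE 'America/New_York';
-- Previously both passes through the repeated 01:00 local hour were merged
-- into a single bucket; this release emits two rows, one per UTC hour.
```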
New Features
Users can now declare bloom filters as part of column metadata via the existing PARQUET() clause in CREATE TABLE and ALTER TABLE statements. Previously, bloom filter columns had to be specified each time a partition was converted to Parquet. The PARQUET() clause accepts an optional trailing BLOOM_FILTER keyword, which can be used as the sole argument, combined with encoding, or with both encoding and compression (e.g., PARQUET(DELTA_BINARY_PACKED, ZSTD(3), BLOOM_FILTER)). The bloom filter flag is stored in bit 25 of the existing 32-bit parquetEncodingConfig field, requiring no on-disk format change. When convertPartitionNativeToParquet runs without an explicit bloom_filter_columns override, the TableWriter scans per-column metadata for the bloom filter flag automatically. An explicit bloom_filter_columns in CONVERT PARTITION WITH(...) still overrides metadata flags. SHOW CREATE TABLE renders the flag in lowercase (bloom_filter). SET PARQUET(...) replaces the entire parquet config for the column, so users must re-specify BLOOM_FILTER when changing only the encoding. FPP (false positive probability) remains a global setting (partition.encoder.parquet.bloom.filter.fpp) and is not configurable per column. The vendored parquet-format-safe crate is patched to include bloom_filter_length on ColumnMetaData, improving compatibility with readers that rely on this field to locate bloom filter boundaries.
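A short sketch of the resulting DDL surface; table and column names are illustrative:

```sql
-- Declare bloom filters at table creation, alone or alongside encoding and compression.
CREATE TABLE trades (
    ts TIMESTAMP,
    order_id LONG PARQUET(DELTA_BINARY_PACKED, ZSTD(3), BLOOM_FILTER),
    sym SYMBOL PARQUET(BLOOM_FILTER)
) TIMESTAMP(ts) PARTITION BY DAY;

-- SET PARQUET(...) replaces the column's whole parquet config, so BLOOM_FILTER
-- must be re-specified when changing only the encoding.
ALTER TABLE trades ALTER COLUMN order_id SET PARQUET(PLAIN, BLOOM_FILTER);
```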
This feature introduces a full suite of statistical aggregate window functions. STDDEV_POP(), STDDEV_SAMP(), and STDDEV() compute standard deviation, while VAR_POP(), VAR_SAMP(), and VARIANCE() compute variance, reusing the standard deviation base via an isSqrt flag. COVAR_POP(), COVAR_SAMP(), and CORR() provide bivariate statistical analysis through a new bivariate abstract base class. All functions support all frame modes, including ROWS, RANGE, PARTITION BY, and unbounded/bounded frames, as well as two-pass whole-partition variants. Non-removable frames use Welford's online algorithm for numerical stability. STDDEV() is an alias for STDDEV_SAMP(), and VARIANCE() is an alias for VAR_SAMP(). A fix to FunctionParser was also included to correctly resolve group-by vs. window factory when an argument requires an implicit cast.
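A brief usage sketch of the new functions in window form, against a hypothetical trades table:

```sql
SELECT
    ts,
    price,
    stddev_samp(price) OVER (
        PARTITION BY symbol ORDER BY ts
        ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS rolling_stddev,
    corr(price, amount) OVER (
        PARTITION BY symbol ORDER BY ts
        ROWS BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS rolling_corr
FROM trades;
```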
This feature introduces a new server property, cairo.metadata.cache.snapshot.ordered (default: false), that, when enabled, causes all table-listing functions to return rows sorted alphabetically by table name. The sort is maintained incrementally inside a new CharSequenceObjSortedHashMap data structure backed by a CharSequenceSortedList, which inserts at the correct sorted position using binary search. This means lookup remains O(1) via hash and ordered iteration is O(n), with no post-query sort step required. A new CharSequenceObjMap interface allows callers to hold either the sorted or unsorted implementation transparently. The feature applies to SHOW TABLES, all_tables(), and related table-listing functions, and correctly handles drop-and-recreate as well as rename scenarios.
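Enabling the ordered snapshot is a one-line server configuration change:

```
# server.conf: keep the metadata cache snapshot sorted by table name
cairo.metadata.cache.snapshot.ordered=true
```

With the property set, SHOW TABLES and all_tables() return rows in alphabetical order.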
Tick expressions like [2024-01, 2024-02]T09:30@America/New_York#workday;6h29m now work without requiring explicit day ranges ([2024-01-[01..31]]). The compiler detects month-level and year-level date elements with time override suffixes and expands them to individual days before applying the time. This also works with bracket ranges (2024-[01..02]T09:30), time list brackets (T[09:00,14:00]), bare expressions without brackets (2024-01T09:30), year-level dates ([2024]T09:30), and duration plus day filter combinations ([2024-01]#workday;6h29m). Month validation was added before getDaysPerMonth to prevent ArrayIndexOutOfBoundsException on invalid month values. A parseMonthLevelDate helper was extracted to deduplicate YYYY-MM parsing and validation. Heap allocation of long[] dayStarts was eliminated by compacting day starts to the front of the existing LongList and reusing positions after them for per-day parsing.
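A hedged usage sketch: as with the '2026-01;-3d' example later in these notes, tick expressions are used inside interval filters, and the table name here is illustrative:

```sql
-- January and February 2024, workdays only, 09:30 New York time plus 6h29m,
-- with no explicit day range required.
SELECT * FROM trades
WHERE ts IN '[2024-01, 2024-02]T09:30@America/New_York#workday;6h29m';
```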
This feature introduces UNNEST as a FROM-clause operator supporting all typed array column types, multiple arrays with NULL padding, and WITH ORDINALITY. For JSON arrays, a COLUMNS(name TYPE, ...) syntax declares output columns from VARCHAR-stored JSON arrays. Supported JSON column types include DOUBLE, LONG, INT, SHORT, BOOLEAN, VARCHAR, and TIMESTAMP. UnnestRecordCursorFactory wraps the base factory and emits one output row per array or JSON element. The UnnestSource interface abstracts the difference between typed arrays (ArrayUnnestSource) and JSON arrays (JsonUnnestSource), keeping cursor and record logic shared. JSON field extraction uses simdjson via JSON Pointer queries, and a native truncated flag detects values exceeding the 4KB extraction limit, throwing an error instead of silently truncating. Example usage: SELECT u.price, u.name FROM events e, UNNEST(e.payload COLUMNS(price DOUBLE, name VARCHAR)) u. Typed array and JSON sources can be mixed in the same UNNEST call.
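A sketch of the typed-array form; the table is hypothetical and the exact output column naming is an assumption:

```sql
-- One output row per array element; WITH ORDINALITY adds the element's position.
SELECT s.ts, u.*
FROM sensors s, UNNEST(s.readings) WITH ORDINALITY u;
```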
This feature adds lateral join support, allowing subqueries in the FROM clause to reference columns from preceding tables. Correlated lateral subqueries are decorrelated in the optimizer into standard joins, enabling set-based execution instead of per-row nested-loop evaluation. The decorrelation technique is based on the Neumann and Kemper "Unnesting Arbitrary Queries" approach. The rewriter (LateralJoinRewriter) runs three passes during SQL optimization: correlation analysis that tags literals referencing outer tables, decorrelation that builds deduplicated outer-reference subqueries and rewrites correlated references, and an elimination pass that attempts to remove outer-reference join models when all correlations resolve to equalities. Operator-specific compensation preserves semantics for GROUP BY, SAMPLE BY, window functions, DISTINCT, LATEST BY, LIMIT, and set operations. Supported syntax includes JOIN LATERAL for inner lateral joins, LEFT JOIN LATERAL for left lateral joins with NULL fill, and standalone LATERAL for implicit cross joins. Example: SELECT o.id, t.total FROM orders o JOIN LATERAL (SELECT sum(qty) AS total FROM trades WHERE order_id = o.id) t.

ALTER TABLE ADD COLUMN and ALTER TABLE DROP COLUMN now work correctly on tables whose partitions have been converted to Parquet format. Previously, schema changes on Parquet-stored partitions could produce corrupted files or query errors because the Parquet column layout was assumed to match the current table schema. Added columns read as all-NULL at read time, while dropped columns' data remains in the file until an O3 merge rewrites the affected row groups. During O3 merge, rewritten row groups reflect the current schema: added columns get all-NULL chunks, and dropped columns are omitted. Bitmap indexes are rebuilt when converting between native and Parquet formats if the index files are missing or stale. The WAL writer correctly marks new symbol columns as nullable when uncommitted rows exist at the time of ADD COLUMN. A sketch of the now-supported sequence appears below.
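The sketch uses an illustrative table and the WHERE-based partition conversion syntax:

```sql
ALTER TABLE trades CONVERT PARTITION TO PARQUET WHERE ts < '2026-01-01';
-- The new column reads as all-NULL in Parquet partitions until data arrives.
ALTER TABLE trades ADD COLUMN venue SYMBOL;
-- The dropped column's data stays in the file until an O3 merge rewrites the row groups.
ALTER TABLE trades DROP COLUMN side;
```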
This feature extends HORIZON JOIN to accept multiple right-hand-side (slave) tables in a single query, enabling users to aggregate columns from several time-series sources against a common master table and offset grid in one statement. Both single-threaded and parallel (page-frame-based) execution paths are supported for multi-slave HORIZON JOIN, including keyed (ON symbol) and non-keyed (timestamp-only ASOF) variants, as well as mixed keyed/non-keyed slaves within the same query. The last HORIZON JOIN in the chain carries the RANGE/LIST and AS clauses; preceding HORIZON JOIN clauses omit them. For example: SELECT avg(b.bid) AS avg_bid, avg(a.ask) AS avg_ask FROM trades AS t HORIZON JOIN bids AS b ON (t.sym = b.sym) HORIZON JOIN asks AS a ON (t.sym = a.sym) LIST (-2s, 0, 2s) AS h GROUP BY h.offset. Internal refactoring replaces LongList offsets with long[] to avoid on-heap allocations on the hot path, and replaces the always-forward scan strategy with an adaptive backward/forward approach controlled by configurable thresholds (cairo.sql.horizon.join.bwd.scan.* properties).
This feature adds RLE dictionary decoding for STRING columns in the Parquet reader by wiring the existing BaseVarDictDecoder, RleDictionarySlicer, and StringColumnSink together in decode_byte_array_dispatch. The default writer encoding for VARCHAR columns has been changed from RleDictionary to DeltaLengthByteArray, aligning it with the STRING and Binary defaults.
This feature introduces the arg_max(varchar, key) aggregate function for key types timestamp, double, long, and int. The function returns the varchar value corresponding to the row where the key column reaches its maximum. Each variant supports parallel group-by execution (Async Group By) with per-worker function cloning and pointer-based merge, using StableAwareUtf8StringHolder for efficient off-heap varchar storage that avoids copying on every new max. NULL key rows are skipped and do not affect the result, while a NULL varchar value at the max key is correctly returned as NULL.
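A usage sketch in the same shape as the arg_max/arg_min examples later in these notes; names are illustrative:

```sql
-- For each device, the status string from the row with the greatest created_at.
SELECT device_id, arg_max(status, created_at) AS latest_status
FROM events
GROUP BY device_id;
```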
Improvements
This improvement extends the convertSymbolJoinKeysToInt optimization, previously used only by AsOf/LT joins, to hash joins. Static SYMBOL-to-SYMBOL key pairs are now compared as integers rather than strings, with a SymbolTranslatingRecord wrapping the build-side record to translate build-side symbol IDs into the probe-side encoding. Translation runs only on the build side, so the probe side stays on a straight getInt() path with zero added overhead. All six hash join variants (inner/left/right/full outer × light/full-fat) are covered.
Queries on wide tables that reference only a subset of columns now open partitions faster. TableReader tracks which columns the current query needs via a BitSet and skips memory-mapping inactive columns when opening partitions, reducing mmap/munmap syscall overhead proportionally to the number of unreferenced columns. For a query touching 1 column out of 100, roughly 99% of mmap calls are avoided per partition open. The active column set is deduplicated and falls back to mapping all columns when the list is null, empty, or covers every column. When the active set broadens, already-open partitions map any newly-needed columns. On pool return, goPassive() clears the active column state so subsequent opens map all columns. When no active columns are set (Parquet export, direct callers), the reader maps all columns as before.
This improvement eliminates eager opening of all partitions at query start for ASOF JOIN, HORIZON JOIN, and WINDOW JOIN queries on large slave tables. Time frame cursors now pre-compute exact page frame boundaries from table metadata (column tops, row counts, partition formats) without opening partitions. Actual partition opening happens lazily on first access via ensurePartitionOpened(), so queries targeting a narrow time range skip I/O for untouched partitions. Both single-threaded (TimeFrameCursorImpl) and concurrent (ConcurrentTimeFrameState + ConcurrentTimeFrameCursorImpl) cursors follow a two-phase approach: an upfront phase that pre-computes page frame boundaries from partition metadata stored as UninitializedPageFrame entries, and a lazy phase that opens partitions and patches zero-address entries with real mmap addresses. Fallback paths that bypass lazy opening include partitions already open in the table reader, cursors with interval filters, and Parquet partitions. The concurrent path uses double-checked locking with AtomicIntegerArray for safe partition opening across worker cursors. Queries accessing a subset of partitions avoid opening irrelevant partitions entirely, with zero GC on the hot path.
Previously, calculateInsertTransactionBlock() excluded the last loaded INSERT from the transaction block, forcing it to be processed alone. This improvement includes it in the block, allowing consecutive INSERT operations to be committed together via processWalCommitBlock() instead of one-by-one via processWalCommit(). The previous behavior caused unnecessary LAG usage on empty tables (e.g., post-TRUNCATE with few rows), creating artificial 0-row partitions that could race with backup compression. The root cause was a backward loop that assigned FORCE_FULL_COMMIT to the last loaded INSERT, while the block calculation loop broke before incrementing blockSize, excluding the transaction. The fix increments blockSize before the break, which is safe because the backward loop propagates FORCE_FULL_COMMIT from structural changes (ALTER, TRUNCATE) to the INSERT before them, so the block loop never reaches a non-data transaction.

ASOF JOIN and LT JOIN factories with multi-key symbol columns now compare join keys as integers instead of converting symbol IDs to strings. SymbolTranslatingRecord translates master symbol IDs to slave symbol IDs via a cached IntIntHashMap, enabling integer-based map lookups and memeq() comparisons instead of variable-length string hashing and comparison. This optimization applies to AsOfJoinFastRecordCursorFactory, AsOfJoinDenseRecordCursorFactory, AsOfJoinLightRecordCursorFactory, FilteredAsOfJoinFastRecordCursorFactory, and LtJoinLightRecordCursorFactory. Single-symbol joins retain their existing specialized paths. SymbolTranslatingRecord gains a hadNonExistentKey() flag set during getInt() calls, eliminating a separate hasNonExistentKey() pre-scan. Benchmarks on a 100M-row master / 50M-row slave dataset show multi-key ASOF JOIN improving from 8.11s to 5.2s.
This improvement speeds up constant-index element access on 1D and 2D double array columns (e.g., arr[1], arr[3], arr[2,1]) by bypassing full ArrayView construction. The hot path now reads directly from AUX/data pages via Unsafe, skipping the IntList operations, stride computation, and BorrowedFlatArrayView setup that BorrowedArray.of() performs per row. A new Record.getArrayDouble1d2d() method with an optimized override in PageFrameMemoryRecord takes two zero-based indices and dispatches internally between the 1D and 2D paths with a single Unsafe.getDouble return. DoubleArrayAccessFunctionFactory detects constant positive indices on 1D/2D column functions at compile time and routes through the fast path. Negative indices and 3D+ arrays fall through to the existing getArray path. Benchmarks on a 10M-row table show approximately 2.4x speedup for queries like SELECT sum(arr[1]) FROM t.

This improvement adds a table selector to the table details drawer header when there is no table details history, reusing the same selector as the metrics table selector. It also includes styling updates for the metrics header (table name, title, actions) and layout updates to the metrics dashboard.
Bug Fixes
After DROP COLUMN, a column's reader index (dense position in the live column list) diverges from its writer index (permanent ID stored as field_id in parquet files). Several code paths confused the two, producing type mismatch errors, corrupt bitmap indexes, or suspended WAL tables. This fix addresses five distinct issues: bitmap index rebuild using the wrong parquet column during checkpoint/backup recovery; a doubled parquet file path that silently skipped bitmap index rebuilds; a race condition where parallel parquet bitmap rebuild tasks read stale metadata when recovering multiple tables; COPY TO parquet export flagging the wrong column as designated timestamp after DROP COLUMN; and ALTER TABLE ALTER COLUMN TYPE crashing on tables with parquet partitions, which is resolved by converting parquet partitions back to native before the column type conversion starts.
This fix allows signed duration segments when parsing and compiling tick expressions (e.g., timestamp in '2026-01;-3d'). Intervals are now normalized when durations move backwards so that static and compiled paths produce matching results. The compiled path in emitSingleVar previously produced inverted intervals for non-day-level variables like $now;-1h. Additionally, addYears day-of-month clamping was corrected in Micros, Nanos, and Dates so that Feb 29 + 1y now correctly produces Feb 28 instead of Mar 1. Calendar drift in repeating month/year intervals is fixed by computing each interval from the base timestamp instead of iterating. Non-positive counts in repeating interval syntax are now rejected with a clear error message.
The projection self-reference check in doReplaceLiteral0() only inspected the first join model, missing columns from other join models. This caused "Invalid column" errors when a function wraps a column from a joined table (e.g., coalesce(c, 0) where c comes from a secondary join model) and table-prefixed columns make the translating model non-redundant. This fix replaces the single-model check with columnNotExistsInJoinModels(), which scans all join models.
This fix addresses three related issues. First, CopyExportRequestTask incorrectly treated var-size columns as fully NULL when their data was entirely inlined into the aux vector, because the colTop heuristic checked the data page address (which is legitimately 0 for inlined varchars). It now checks the aux page address for var-size columns. Second, the streaming Parquet reader now supports multiple dictionary pages per column chunk by allocating a fresh buffer per varchar-slice dict page, preventing aux entries decoded against earlier dict pages from being corrupted when later ones are decompressed. Third, the ASCII flag is now normalized when serializing VARCHAR keys in SingleRecordSink and OrderedMap, so two equal varchars whose ASCII-flag provenance differs hash and compare as the same key. Without this normalization, GROUP BY, DISTINCT, and hash-join operations could treat visually identical values as distinct groups.
This fix corrects a bug in convertSymbolJoinKeysToInt() where writeSymbolAsString bits were unset while iterating over join key columns. When master and slave symbol columns had crossing indices (e.g., master.symbol1 = col2, slave.symbol2 = col2), the first pair's unset operation cleared the bits needed by the second pair, leaving keyTypes as STRING while the RecordSink generated getInt() code. This type mismatch produced garbled map keys and missed matches.
When DirectLongLongHashMap.restoreInitialCapacity() or rehash() encountered an allocation failure (OOM or RSS memory limit exceeded), the map was left in an inconsistent state where capacity > 0 but ptr = 0. A subsequent FastGroupByAllocator._close() call would iterate over the capacity and dereference the null pointer, causing a SIGSEGV. This fix moves the capacity, mask, and ptr updates to after the allocation succeeds, so a failed malloc leaves the map in its prior consistent state.
The ALTER TABLE ALTER COLUMN TYPE statement now supports converting between DECIMAL and VARCHAR/STRING types. The type conversion validation matrix in SqlCompilerImpl was updated to include these entries. Four new conversion methods handle the data transformation: DECIMAL to VARCHAR/STRING reads each value through a loader, sets the scale on a thread-local Decimal256, and writes the string representation; VARCHAR/STRING to DECIMAL parses each string through Decimal256.ofString() and stores the result. NULL values round-trip correctly in both directions. All six DECIMAL storage sizes (DECIMAL8 through DECIMAL256) are supported for both conversion directions. The fix also corrects VARCHAR aux-vector sizing in the shared fixed-to-VARCHAR conversion path.
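A sketch of the newly allowed conversions; the table name is illustrative, and the DECIMAL precision/scale shown is an assumption:

```sql
-- DECIMAL -> VARCHAR, then back again; NULLs round-trip in both directions.
ALTER TABLE trades ALTER COLUMN price TYPE VARCHAR;
ALTER TABLE trades ALTER COLUMN price TYPE DECIMAL(18, 2);
```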
This fix resolves an issue where clients using the PostgreSQL extended query protocol (such as postgres.js) to iterate cursors in batches via unnamed portals would receive a "spurious execute message" error on the second Execute. The root cause was that after a Flush, the suspended cursor was freed for unnamed portals, the factory was moved into the cache, and the pipeline entry was released to the object pool, leaving pipelineCurrentEntry null when the next Execute arrived. The fix retains suspended unnamed portal pipeline entries across Flush/Sync so the next Execute can resume iteration, adds a stateSuspended flag to track portal suspension state accurately, and frees suspended cursors on abandonment (new Parse/Bind) or explicit Close. Named portals (used by JDBC's setFetchSize) are unaffected, as they already survived via the namedPortals map.
Go's JSON marshaller omits the fractional part entirely when a time value has zero sub-second precision, producing values like 2026-03-31T09:02:28Z instead of 2026-03-31T09:02:28.000000000Z. This broke QuestDB timestamp patterns such as .U+ and .N+, which previously required a dot and at least one digit. This fix normalizes the fractional component into optional fraction opcodes in the micros and nanos timestamp compilers, so .U+ and .N+ accept either fractional digits or no fractional part at all, while still rejecting a bare dot. Both the generic and ASM parser paths share the relaxed behavior.
This fix resolves an issue where copyOrRebuildColumnIndexes() skipped columns only when colTop == -1, but rebuildPartitionIndexFiles() also skips when colTop >= partitionRowCount (column has no data). When a symbol column was added after a Parquet partition existed, O3 merge set colTop == partitionRowCount. The Parquet-to-native conversion correctly skipped building index files, but the subsequent native-to-Parquet conversion tried to hard-link the non-existent .k/.v files, suspending the table. The skip condition in copyOrRebuildColumnIndexes() was aligned with rebuildPartitionIndexFiles() by also skipping when colTop >= partitionRowCount.
When a backup checkpoint was in progress, the backup read partition data and column files from the live database directory. If partition directories or column version files were deleted while the checkpoint was snapshotting metadata, the backup failed because it tried to read files that no longer existed. This fix adds checkpoint guards to all synchronous and asynchronous file removal paths to prevent deletion while a checkpoint is active. TableWriter.processPartitionRemoveCandidates0 now checks isInProgress() and defers old partition directory removal to async purge when a checkpoint is active, while rollback orphans bypass the guard since they are not in any committed snapshot. TableWriter.finishColumnPurge forces async-only column version purge during checkpoint to prevent synchronous deletion of .d/.i files. The same checkpoint guard is applied for column purge triggered by ALTER COLUMN CONVERT and UPDATE operations. The async purge job (O3PartitionPurgeJob and ColumnPurgeOperator) also skips file deletion during checkpoint, closing the race where the purge job runs before the checkpoint pins a table's scoreboard.
This fix resolves an issue where CopyExportFactory.getCursor() enqueues export work onto a ring queue that a worker thread picks up, but BaseParquetExporter.of() constructed the worker's SqlExecutionContext with a null BindVariableService. This caused a "bind variable service is not provided" error when bind variables were used in the COPY subquery (e.g., COPY (SELECT * FROM t WHERE ts <= $1) TO ...). The fix threads the caller's BindVariableService through CopyExportRequestTask so the worker's execution context receives it. Bind variable values are deep-copied via BindVariableServiceImpl.snapshot() before enqueueing the task, since the PostgreSQL Wire Protocol thread may clear or repopulate the BindVariableService for the next query before the worker starts. The HTTP export path keeps the direct reference since it runs synchronously on the same thread.
This fix resolves a NullPointerException that occurred when WalPurgeJob attempted to fetch metadata for a table that was concurrently dropped. A concurrent drop could close the metadata pool tenant's txFile during refresh, causing a NullPointerException rather than a CairoException. The fix updates WalPurgeJob.fetchSequencerPairs() to catch Throwable (not just CairoException) when getTableMetadata() fails, and similarly updates AbstractMultiTenantPool.get0() in the newTenant path so the pool slot is always released on failure.
This fix corrects an off-by-one error in O3PartitionJob.mergeRowGroup() that could produce row groups exceeding the 1.5x maxRowGroupSize limit by one row. The numChunks formula used 3 * maxRowGroupSize as the divisor, representing 2 * (1.5 * maxRowGroupSize) in real arithmetic. When maxRowGroupSize is odd, integer truncation causes a mismatch: 3 * 551 = 1653 vs. 2 * (551 + 551/2) = 2 * 826 = 1652. This difference meant that for certain merge totals, the formula computed fewer chunks than needed, and the even-distribution logic gave the first chunk more rows than the 1.5x limit allowed. The fix computes numChunks from the integer-exact maxChunkTarget (maxRowGroupSize + maxRowGroupSize / 2), ensuring no chunk exceeds the 1.5x bound after remainder distribution.
The validateWindowJoins function only recursed into nestedModel and unionModel, missing WINDOW JOIN definitions inside join model subqueries. This left WindowJoinContext uninitialized (lo/hi not computed), causing aggregates to return erroneous results when a WINDOW JOIN appeared under another join, such as inside a subquery used as a JOIN source.
The propagateTopDownColumns0() method emits join-condition columns to each join model, but the expression nodes carried the outer query's table aliases. When the join model was a view, those aliases could not be resolved within the view's own scope, so addTopDownColumn silently dropped them. The optimizer then pruned the column from the view's output, causing an InvalidColumnException at code-generation time when the join tried to look up the missing column. This fix switches to using plain column names without table-alias prefixes, which can be resolved in the join model's alias map.

This fix replaces header click-to-insert with per-column copy buttons and tracks grid selection state to disable "move column to front" when nothing is selected. It resolves the Monaco context menu being clipped by overflow containers, consolidates DDL queries to strip redundant blank lines, and adds truncation for long DDL in the table details drawer. Format shortcuts are rebound to Alt+Shift+F / Alt+F, toolbar and result bar spacing is tightened, long materialized view DDLs are truncated, and the LiteEditor icon now switches based on diff vs. regular mode.
This release introduces new SQL capabilities including the COPY PERMISSIONS statement for cloning access profiles between principals, per-column Parquet encoding and compression controls, and dynamic window ranges in WINDOW JOIN queries. Performance improvements span faster ORDER BY on TCA queries, expanded vectorized GROUP BY coverage, and improved Parquet I/O throughput. The release also includes several critical bug fixes, addressing data corruption in DECIMAL columns, JVM crashes during backup coordination, and WAL replication race conditions.
New Features
This feature introduces three new permission constants: CONVERT PARTITION TO PARQUET, CONVERT PARTITION TO NATIVE, and SET PARQUET ENCODING. Authorization checks are wired into all enterprise SecurityContext implementations, including EntSecurityContextBase, AdminSecurityContext, and AbstractReplicaSecurityContext. The PermissionParser was updated to handle multi-word permission names containing SQL keywords (such as TO in CONVERT PARTITION TO PARQUET) by adding isPermissionPrefix() to disambiguate keywords from permission name continuations. All three operations are denied on replica security contexts. Previously, these ALTER TABLE operations piggybacked on authorizeAlterTableAlterColumnType and could not be granted or revoked independently.
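A sketch of granting the new permissions independently, following the Enterprise GRANT syntax; the table and user names are illustrative:

```sql
GRANT CONVERT PARTITION TO PARQUET ON trades TO analyst;
GRANT SET PARQUET ENCODING ON trades TO analyst;
```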
Bug Fixes
The WAL uploader could enter an infinite retry loop when it encountered a segment whose upload.pending marker file was never created. This occurred when WalWriter.openNewSegment() advanced minSegmentLocked even when createSegmentDir() threw due to a transient disk error. When the distressed WalWriter was later closed, a segmentClosed event fired for a segment that never had its upload.pending file created, causing the Rust uploader to retry deleting the non-existent file indefinitely. This fix makes clear_segment_pending_file in the Rust uploader treat ENOENT as success, since the desired end state (upload.pending absent) is already achieved. This is safe with respect to the WAL cleaner because phantom segments are harmless: object stores return success when deleting non-existent keys, and the cleaner's compaction watermark has gap protection that prevents it from skipping real segments.
In QuestDB Enterprise, exporting to Parquet via COPY could produce spurious "cancelled by user" errors. The SecurityCheckFactory was inserted as a RecordCursorFactory wrapper for all SELECT queries, even though it only mattered for UPDATE queries, where it revalidated column-level permissions. This broke an assumption that the outermost factory for a SELECT is always QueryProgress, causing the code to fail to unwrap the factory chain correctly. QueryProgress then interacted badly with the COPY job, triggering the spurious errors. This fix moves authorization revalidation directly into UpdateOperation, eliminating the need for the SecurityCheckFactory wrapper.
After a server restart, collect_missed_segments walks WAL directories to find segments with upload.pending files left over from a previous run. Because the walk is processed asynchronously, it could encounter WAL directories created after the bounce and queue them as "closed" with a stale last_txn set to the sequencer's startup transaction. This caused the upload.pending clearing loop to remove the file prematurely, after which WalPurgeJob deleted the unprotected segment. The uploader, still needing the segment, entered an infinite retry loop and the table's replication became permanently stuck. This fix passes current_wal_id into collect_missed_segments and skips any segment where wal_id > current_wal_id. Since current_wal_id is read from _wal_index.d before ingestion starts, it reliably represents the last pre-bounce WAL ID. Post-bounce segments are handled through the normal SegmentClosed event flow and do not need recovery via collect_missed_segments.
Without this fix, revoking a user's ALTER or UPDATE permission did not take effect for already-cached prepared statements in the PostgreSQL Wire Protocol, allowing the user to continue executing those operations until the connection was closed. Previously, ALTER statements were not cached before the PostgreSQL Wire Protocol layer rewrite, so authorizing them at compile time only was sufficient. Since the PostgreSQL Wire Protocol layer can now cache these commands, this fix ensures that authorization happens at execution time as well.
QuestDB 9.3.4 delivers dynamic windows in WINDOW JOIN, Parquet row group pruning with bloom filters, new array functions, and significant performance improvements across ORDER BY, joins, and Parquet I/O.
Breaking Changes
This change improves query execution performance by folding constant sub-expressions at compile time rather than re-evaluating them on every row. For example, x < '2026-01-01'::timestamp previously kept the cast as an unevaluated function that executed per row, but now evaluates it once at compile time. Additionally, constant reassociation regroups constants of the same associative operator into a single subtree (e.g., col + 1 + 4 becomes col + 5). As a breaking change, float and double Infinity and -Infinity values are now treated as NULL at compile time. Previously, expressions like cast('Infinity' as float) remained unevaluated at runtime and returned Float.POSITIVE_INFINITY, but with constant folding these are now evaluated through FloatConstant.newInstance() / DoubleConstant.newInstance(), which collapse Infinity, -Infinity, and NaN to NULL per QuestDB's convention. CASE expressions can no longer branch on Infinity or -Infinity as distinct float/double values.

This change aligns STRING and SYMBOL ordering with the SQL-standard byte-order collation and enables range filter pushdown for string columns in Parquet row group pruning. Any code relying on the previous UTF-16 code unit comparison order for string comparisons may observe different ordering results.
This change improves varchar decoding performance from Parquet files by introducing a new internal column type: varchar slice. VarcharSlice aux entries store (length, pointer) pairs pointing directly into mmapped Parquet pages or per-page decompression buffers, eliminating byte copies on the read path. The default encoding for Varchar has been changed from Delta Length Byte Array to RLE Dictionary. Benchmarks show decode-only performance improving from ~4.25ms to ~0.75ms for short strings (8 bytes) at 500K rows with cardinality 256.
New Features
This feature adds minTimestamp and maxTimestamp TIMESTAMP columns to sys.telemetry_wal to capture the data timestamp range per WAL transaction event. WAL telemetry is now enabled by default regardless of the main telemetry setting, reducing the dependency on logs to investigate data writing shape. WalWriter commit log messages have been downgraded from info to debug level unless the commit has a replace range. Schema migration support was also added: when the column count mismatches the expected schema, the table is dropped and recreated, which is safe given the 1-week TTL.
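A quick way to inspect the new columns (LIMIT -10 returns the last ten rows):

```sql
-- Data timestamp range carried by the most recent WAL transaction events.
SELECT minTimestamp, maxTimestamp FROM sys.telemetry_wal LIMIT -10;
```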
This feature introduces array_sort(DOUBLE[]) and array_reverse(DOUBLE[]) scalar functions that operate on double arrays of any dimensionality. array_sort sorts each innermost-dimension slice independently, preserving the array's shape, and accepts optional boolean arguments for descending order and nulls-first placement. array_reverse reverses each innermost-dimension slice. Both functions handle NULL arrays, empty arrays, NaN values, and multidimensional inputs, and support both contiguous unit-stride and non-vanilla array layouts via separate code paths. The internal sort buffer grows on demand and stays at peak size for the cursor's lifetime to avoid allocation churn on the hot path.
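A usage sketch; the second argument to array_sort is assumed to be the descending-order flag mentioned above:

```sql
SELECT
    array_sort(ARRAY[3.0, 1.0, 2.0])       AS asc_sorted,  -- [1.0, 2.0, 3.0]
    array_sort(ARRAY[3.0, 1.0, 2.0], true) AS desc_sorted, -- [3.0, 2.0, 1.0]
    array_reverse(ARRAY[3.0, 1.0, 2.0])    AS reversed;    -- [2.0, 1.0, 3.0]
```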
This feature introduces four new SQL functions that operate element-wise across DOUBLE[] arrays. Each function works in two modes: variadic (two or more array arguments, per row) and aggregate (single array column, GROUP BY/SAMPLE BY). The functions support full N-dimensional arrays with automatic shape broadcasting, where the output shape is the per-dimension maximum of all inputs. NULL arrays and NaN elements are skipped, and positions that receive no finite values yield null. array_elem_sum() and array_elem_avg() use Kahan compensated summation for floating-point accuracy. The GROUP BY average variant uses a uniform/variable dual-mode count tracker to avoid per-element count allocation in the common case. Parallel GROUP BY is supported via merge().
Queries on Parquet partitions can now skip entire row groups that contain no matching rows. Row group pruning uses three strategies: min/max statistics (row groups whose per-column value range does not overlap the filter values are skipped), bloom filters (row groups whose bloom filter reports no match are skipped, opt-in via the bloom_filter_columns option), and null count statistics (for IS NULL and IS NOT NULL filters). Pruning applies to all Parquet read paths, including forward scan, backward scan, read_parquet(), and parallel page frame execution. Supported filter operations include equality, IN list, comparison operators (<, <=, >, >=), BETWEEN, IS NULL, IS NOT NULL, and OR-connected equalities on the same column. Supported column types include BYTE, SHORT, CHAR, INT, LONG, FLOAT, DOUBLE, TIMESTAMP, DATE, IPv4, UUID, LONG128, STRING, SYMBOL, VARCHAR, and all DECIMAL widths. Bloom filter columns and false positive probability (FPP) can be specified via ALTER TABLE ... CONVERT PARTITION TO PARQUET WITH (bloom_filter_columns = 'col1,col2', fpp = 0.01), COPY TO export, or the /exp HTTP endpoint. New configuration properties include cairo.sql.parquet.row.group.pruning.enabled (default true), cairo.partition.encoder.parquet.bloom.filter.fpp (default 0.01), cairo.parquet.export.bloom.filter.fpp (default 0.01), and cairo.parquet.export.statistics.enabled (default true).
This feature adds arg_max() and arg_min() function variants that accept a CHAR type for the value argument. Supported signatures include arg_max(char, timestamp), arg_max(char, long), arg_max(char, double), and the corresponding arg_min() variants. Null keys are ignored during aggregation, while null values are returned when the corresponding key is the max/min. All variants support parallel execution. Example usage: SELECT arg_max(status, created_at), arg_min(status, created_at) FROM events;
This feature allows WINDOW JOIN to accept column references and expressions as RANGE BETWEEN boundaries, in addition to static constants, enabling each left-hand-side row to define its own window size based on its data. For example: SELECT t.ts, sum(d.val) AS agg FROM fx_trades t WINDOW JOIN market_data d RANGE BETWEEN t.lookback minutes PRECEDING AND t.lookahead minutes FOLLOWING. Either or both of the lo/hi boundaries can be dynamic while the other remains a static constant. Boundary expressions must evaluate to an integer and must only reference left-hand-side table columns. Negative values are clamped to zero (equivalent to CURRENT ROW), and NULL values produce NULL aggregates where the row is skipped. Each boundary can optionally include a time unit suffix (seconds, minutes, hours, etc.), and when present the value is scaled to the table's timestamp resolution at runtime. Dynamic windows disable the fast symbol-keyed and vectorized execution paths; queries with an ON key equality clause fall back to the general path with a join filter instead. This feature also makes parallel HORIZON JOIN and WINDOW JOIN queries more responsive to query cancellation by adding circuit breaker checks at the top of every master-row iteration loop.
O3 commits into Parquet partitions previously replaced row groups in place, leaving orphaned bytes in the file. Repeated merges caused file sizes to grow to 2-3x their useful data. This improvement adds a rewrite mode that periodically writes all data to a fresh file, eliminating dead space. An unused_bytes counter tracked in Parquet metadata drives the rewrite decision. A rewrite triggers when the file has a single row group, when unused_bytes / file_size exceeds a configurable ratio (default 0.5), or when absolute unused bytes exceed a threshold (default 1 GB). In rewrite mode, untouched row groups are raw-copied with adjusted thrift offsets, without decode/re-encode. A new O3ParquetMergeStrategy class computes merge/copy/split actions up front using min/max timestamp overlap detection, replacing the old iterative merge loop. Row groups that exceed the configured size are split into multiple output groups. Small row groups (< 4096 rows) adjacent to a gap absorb the gap's O3 data to avoid a proliferation of tiny row groups. WAL tables can now convert their last (active) partition to Parquet, with the TableWriter routing all WAL data through the O3 merge path. New configuration properties: cairo.partition.encoder.parquet.o3.rewrite.unused.ratio (default 0.5) and cairo.partition.encoder.parquet.o3.rewrite.unused.max.bytes (default 1g).
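A server.conf sketch making the rewrite thresholds explicit (the values shown are the documented defaults):

```
# Rewrite the Parquet file once half of its bytes are dead, or 1 GB is wasted.
cairo.partition.encoder.parquet.o3.rewrite.unused.ratio=0.5
cairo.partition.encoder.parquet.o3.rewrite.unused.max.bytes=1g
```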
Users can now specify Parquet encoding and compression on a per-column basis using CREATE TABLE and ALTER TABLE SQL syntax. The syntax is PARQUET(encoding [, compression[(level)]]), where both encoding and compression are optional; use default for the encoding when specifying compression only. When omitted entirely, the column uses the global defaults. For example: CREATE TABLE sensors (ts TIMESTAMP, temperature DOUBLE PARQUET(rle_dictionary, zstd(3)), device_id VARCHAR PARQUET(default, lz4_raw)) TIMESTAMP(ts) PARTITION BY DAY;. Existing tables can be modified with ALTER TABLE sensors ALTER COLUMN temperature SET PARQUET(rle_dictionary, zstd(3)); or reset with ALTER TABLE sensors ALTER COLUMN temperature DROP PARQUET;. Per-column config appears in SHOW CREATE TABLE output. Supported encodings include plain, rle_dictionary, delta_length_byte_array, and delta_binary_packed, with type-specific restrictions. Supported compression codecs include uncompressed, snappy, gzip (0-9), brotli (0-11), zstd (1-22), and lz4_raw. RLE dictionary encoding is now supported for all column types except Boolean and Array. A new cairo.partition.encoder.parquet.min.compression.ratio configuration property (default 1.2) controls whether compressed pages are worth keeping: when a compressed column chunk fails to meet the ratio threshold, the encoder discards the compressed output and stores it uncompressed instead. Varchar dictionary encoding performance was significantly improved by switching from the default hasher to RapidHashMap and storing indices directly in a Vec<u32>, achieving ~87 Melem/s throughput compared to the original ~13 Melem/s.
Improvements
WalWriter previously hardcoded POSIX_MADV_RANDOM for memory-mapped column files, which hurts most workloads. This improvement makes the madvise hint configurable via the cairo.wal.writer.madvise.mode property, with valid values none (default, no hint), sequential, and random. The random mode is beneficial when ingesting into many tables with many columns, as it prevents the OS from speculatively reading adjacent pages under memory pressure.
This improvement replaces linear scanning with binary search for initial frame positioning in ASOF JOIN, LT JOIN, and WINDOW JOIN operations. The openSlaveFrame() method in AbstractAsOfJoinFastRecordCursor and the findRowLo()/findRowLoWithPrevailing() methods in WindowJoinTimeFrameHelper now call seekEstimate() on the first lookup to binary-search directly to the target partition instead of linearly scanning all preceding frames. This reduces the initial positioning cost from O(N) in the number of frames to O(log P), where P is the number of partitions, mirroring the optimization already present in HORIZON JOIN.
This improvement brings significant performance gains to Parquet decoding, with a median improvement of approximately 87% across column type and encoding combinations. Key optimizations include removing unnecessary allocation and zeroing in the decompression path, skipping definition decoding when null_count = 0, improved Plain, DeltaBinaryPacked, Rle, and RleDictionary decoder implementations, a lookup table for boolean unpacking, specialized UUID byte swapping, batch decoding for DeltaBinaryPacked, batched nullable bitmap processing, and bulk copying for PlainPrimitiveDecoder. This improvement also fixes a bug with incorrect null sentinel values.
Vectorized execution is now applied to more non-keyed GROUP BY queries, such as SELECT first(price), last(price) FROM trades. Internally, the GroupByFunction#computeBatch() API is used for all queries without a filter. This API was previously used only in WINDOW JOIN.
This improvement introduces an encoded sort path for ORDER BY queries. SortKeyEncoder encodes column values into fixed-width, order-preserving binary keys (8/16/24/32 bytes), enabling comparisons via native uint64 operations instead of per-column type dispatch. The native sort is a three-layer hybrid: vergesort for detecting natural runs (O(n) for pre-sorted time-series data), MSD radix sort (American Flag Sort) with parallel bucket sorting across available cores via atomic work-stealing for large arrays, and pdqsort for small partitions with a heapsort fallback guaranteeing O(n log n) worst case. The entire native sort path is allocation-free, with all data structures stack-allocated. It replaces the R-B tree (RecordTreeChain) with flat-array collect plus radix sort for the non-random-access path, and guarantees stable sort by falling back to rowId ordering when all sort key columns are equal. Key size is capped at 32 bytes; ORDER BY clauses exceeding this fall back to the existing R-B tree sort. Benchmarks show a 5-column ORDER BY over 3M rows improving from 6.98s to 1.26s, and a 2-column ORDER BY over 1B rows improving from over 15 minutes to 63.6s with parallel sorting.
This improvement replaces string comparison (getSym() + Chars.compare) with rank-based comparison (getInt() + rank map lookup) when sorting SYMBOL columns that have static symbol tables. At cursor open, a rank map maps each symbol key to its alphabetical rank, and the comparator then compares ranks instead of strings. It also eliminates the temporary DirectIntList used when building symbol rank maps by replacing it with in-place permutation inversion after quicksort, reducing extra space from O(N) to O(1), where N is the symbol count, mainly benefiting high-cardinality symbol columns. This optimization applies to all comparator-based sort paths: tree-based sort cursors, top-K cursors, window function tree comparators, and window function internal comparators (rank(), percent_rank()). Non-static symbol columns (e.g., CAST(str AS SYMBOL)) fall back to the original string comparison path.
This improvement introduces an adaptive scan strategy for HORIZON JOIN's keyed ASOF lookup. The lookup now starts in backward-only mode, which is cheap for low-cardinality key spaces, and adaptively switches to forward scan mode within a frame when backward scan cost becomes excessive, for example with high-cardinality symbols or rare/infrequent keys that cause deep backward scans. The switch uses two criteria: a relative threshold where backward scan cost at a position must exceed 8x the gap to trigger (with a minimum gap of 1,024 rows), and an absolute threshold of 131,072 rows to handle cross-partition boundaries where the relative check cannot trigger. Once switched, the frame stays in forward mode for its remainder. This also re-enables the ASOF row cache in HorizonJoinTimeFrameHelper and resets bookmarks on toTop() to eliminate jitter from out-of-order frame processing. Performance improvements range from 13-24% for dense and equity scenarios to 96% for sparse data distributions.
Bug Fixes
A LATEST BY ALL query on a large table could fill an OrderedMap until keyCapacity reached 2^30. When rehash() doubled to newKeyCapacity = 1L << 31, the overflow guard allowed it through because MAX_SAFE_INT_POW_2 was incorrectly set to 1L << 31. The subsequent (int) cast produced Integer.MIN_VALUE, and clear() computed ~18.4 EB for the native memset, causing a SIGSEGV. This fix corrects Numbers.MAX_SAFE_INT_POW_2 from 1L << 31 to 1L << 30, so the guard rejects the overflow and throws a clean CairoException ("map capacity overflow") instead of crashing the JVM. The constant was also deduplicated from Unordered4Map, Unordered8Map, and UnorderedVarcharMap, each of which had a private copy with the same bug.
The cairo.partition.encoder.parquet.statistics.enabled configuration allows users to disable Parquet statistics, but the read path (ParquetTimestampFinder, TableWriter) and the O3 merge path (O3PartitionJob.processParquetPartition) hard-depended on timestamp column statistics. When statistics were absent, getMinValueLong would hit an assertion crash with -ea, or read garbage memory causing silent data corruption without -ea. This fix adds rowGroupMinTimestamp and rowGroupMaxTimestamp methods to PartitionDecoder that try Parquet column statistics first at zero cost, then fall back to decoding the first/last row from actual data pages when statistics are absent. findRowGroupByTimestamp also falls back to decoding instead of reading garbage memory. O3PartitionJob, ParquetTimestampFinder, and TableWriter have been migrated to use the new methods.
This fix resolves a crash that occurred when using SAMPLE BY with FILL(NULL), FILL(value), or FILL(PREV) in queries containing array column aggregates such as last(arr). The fill record types did not implement getArray(), and the null/value fill factories did not handle array column types when constructing fill constants. The fix adds getArray() overrides to SampleByFillRecord and FillRangeRecordCursorFactory.FillRangeRecord, and updates SampleByFillNullRecordCursorFactory and SampleByFillValueRecordCursorFactory to handle array types by yielding NullConstant.NULL, since arrays cannot be filled with scalar values.
When a JOIN query had a WHERE clause containing both a column-referencing condition and a non-column constant expression (e.g., NOW() = NOW()), QuestDB crashed with an internal AssertionError. The SQL optimizer splits the WHERE clause into separate buckets, one for column-referencing conditions and one for constant expressions. The code generator then applied each bucket as a separate FilteredRecordCursorFactory wrapper, but nesting these factories violated an internal assertion. This fix detects when the base factory is already a FilteredRecordCursorFactory, extracts the existing filter, and combines both filters with AND into a single factory instead of nesting them.

ALTER TABLE ... ALTER COLUMN and ALTER MATERIALIZED VIEW ... ALTER COLUMN failed when the column name was quoted (e.g., "MY_COL"). The token from the lexer was used as-is for the metadata lookup, so the quoted form was looked up instead of the unquoted name, resulting in a "column does not exist" error. This fix wraps the token with unquote() before the column name lookup in both compileAlterTable and compileAlterMatView.
The non-vectorized fast cursor (WindowJoinWithPrevailingFastRecordCursor) compared the first window match against the window end instead of the window start, causing the prevailing row to effectively never be included when the window already contained matches. The vectorized variant already used the correct comparison. This fix aligns the non-vectorized path to compare against the window start timestamp. The bug only affected queries using non-vectorizable aggregates (e.g., max(concat(...))) with symbol-keyed WINDOW JOIN and INCLUDE PREVAILING.
The read_parquet() function crashed with SIGSEGV when reading parquet files containing SYMBOL columns encoded by QuestDB's PartitionEncoder. The canProjectMetadata() method passed the actual column type (SYMBOL) to the Rust decoder instead of the expected type (VARCHAR), causing the Rust decoder to write INT32 symbol keys that Java then read as VARCHAR pointers. This fix passes the expected type (VARCHAR) for SYMBOL-to-VARCHAR conversions so the Rust decoder resolves dictionary entries to UTF-8 strings.
This fix addresses two bugs in WINDOW JOIN with INCLUDE PREVAILING. The first bug affected four sync cursor variants that call findRowLo(lo, hi, true): WindowJoinWithPrevailingAndJoinFilterRecordCursor, WindowJoinWithPrevailingAndJoinFilterFastRecordCursor, WindowJoinWithPrevailingFastRecordCursor, and WindowJoinFastVectRecordCursor. When the bookmarked-frame optimization kicked in and the prevailing candidate sat in a prior partition, the bookmarked path skipped initialization of prevailingFrameIndex/prevailingRowIndex, leaving them at -1/Long.MIN_VALUE. Downstream backward scans saw -1 and returned immediately, silently omitting the prevailing row from results. The fix seeds prevailingFrameIndex/prevailingRowIndex when entering via the bookmark path. The second bug was a comparison target error in WindowJoinWithPrevailingFastRecordCursor, where the condition that triggers prevailing inclusion checked whether the first row's timestamp exceeded the window high boundary (masterTimestampHi) instead of the window low boundary (slaveTimestampLo).
This fix addresses several correctness, resource leak, and crash issues in the SQL engine. A missing break after the DECIMAL inner switch in generateCastFunctions caused fall-through to the BINARY case, adding a spurious BinColumn to the cast function list and shifting subsequent column indices in multi-column UNION queries, resulting in UnsupportedOperationException at runtime. The moveClauses method never incremented its position counter, so when swapJoinOrder0 needed to move multiple join clauses, only the first was moved; the rest stayed in the wrong join context, creating circular dependencies that broke topological sort on 3+ table cross joins with multi-column WHERE conditions. The hasGroupByFunc/hasOrderedGroupByFunc methods skipped children of non-aggregate functions, making nested aggregates like abs(sum(x)) invisible and causing PIVOT to reject valid aggregate expressions. Resource leaks were fixed in generateJoins, where const filters were leaked on error paths, and in generateSampleBy, where timezoneNameFunc, offsetFunc, sampleFromFunc, and sampleToFunc were never freed on error paths. A NullPointerException in CREATE MATERIALIZED VIEW was fixed: when WITH BASE specifies a nonexistent table not referenced in the query, getTableTokenIfExists() returned null and .isView() threw an NPE instead of a descriptive SqlException. Additional latent bugs were fixed in generateFill (always reading the first fill value instead of iterating) and in GeoHash-to-VARCHAR cast functions (using the wrong column type parameter).
This fix corrects a data corruption issue where set_memory_vanilla_vec() delegated to run_vec_bulk, which used TVec::size() as the loop increment. For Vec8uq this returned 8 (the number of uint64_t lanes), but one 512-bit store only covers 4 long_128bit elements or 2 long_256bit elements. The mismatch caused the bulk path to advance past unwritten elements, leaving holes filled with stale data. The fix replaces the broken run_vec_bulk call with a dedicated loop that computes the correct per-store element count from sizeof(TVec) / sizeof(T), guarded by a static_assert.
This fix defaults cairo.mat.view.parallel.sql.enabled to false on machines with fewer than 4 available processors. All other parallel SQL features (filter, group by, top-k, horizon join, window join, Parquet read) remain enabled regardless of core count. Users can still explicitly enable materialized view parallel SQL via configuration on low-core machines.
This fix resolves an issue where read_parquet failed with "encoding not supported" errors when reading Parquet files whose embedded QuestDB metadata schema had a different number of columns than the actual Parquet data. This happens when an external tool (DuckDB, Spark, PyArrow) rewrites a QuestDB-exported Parquet file, for example dropping partition columns for hive-style directory layouts, but preserves the original key-value metadata. The decoder now compares the QuestDB metadata schema length against the Parquet column count and discards the metadata when they differ, falling back to physical type inference.
This release introduces automatic object store WAL cleanup on the primary node, TLS certificate expiration metrics for Prometheus alerting, and new array_sort() and array_reverse() functions. Performance improvements include faster ASOF and WINDOW JOINs through binary search–based frame positioning, along with a configurable WAL writer madvise mode. Several critical bug fixes address crashes in LATEST BY ALL queries, Parquet reads with missing statistics, and backup restore edge cases, while the ACL permission system has been expanded to support up to 256 permissions.
New Features
This feature introduces a WAL cleaner that runs on the primary node and automatically deletes replicated WAL data from object storage once it is no longer needed by any replica or backup. It determines what is safe to delete by consulting two sources of cleanup history, enterprise backup manifests and checkpoint history records, and always retains enough data to support the most recent N backups or checkpoints. The cleaner is conservative by default: it won't delete anything until sufficient history exists, and it picks the most conservative boundary when multiple sources or cluster nodes are involved. Key components include a checkpoint history tracker that records per-table transaction state to the shared replication object store on each CHECKPOINT RELEASE, a backup instance name registry for coordinating cleanup boundaries across multiple nodes, rate limiting and throttling for object store delete operations with auto-tuned defaults per cloud provider (S3, GCS, Azure Blob, R2, etc.), and crash recovery with periodic progress persistence so cleanup resumes where it left off after a restart. Dropped tables are cleaned up after a cooloff period (default 1h) to guard against clock skew. Key configuration properties include replication.primary.cleaner.enabled (default true), replication.primary.cleaner.interval (default 10m), replication.primary.cleaner.backup.window.count (default 5), replication.primary.cleaner.delete.concurrency (auto-tuned 4-12), replication.primary.cleaner.max.requests.per.second (service-dependent), and checkpoint.history.enabled (default true when replication is enabled).
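A server.conf sketch of the cleaner knobs listed above, with the documented defaults made explicit:

```
replication.primary.cleaner.enabled=true
replication.primary.cleaner.interval=10m
# Retain enough WAL data to restore the five most recent backups/checkpoints.
replication.primary.cleaner.backup.window.count=5
checkpoint.history.enabled=true
```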
This feature adds Prometheus gauge metrics for TLS certificate time-to-live (TTL) across all four TLS-enabled endpoints: questdb_tls_cert_ttl_seconds_http, questdb_tls_cert_ttl_seconds_http_min, questdb_tls_cert_ttl_seconds_line, and questdb_tls_cert_ttl_seconds_pg. Each gauge reports seconds until the active certificate expires. Values greater than 0 indicate seconds remaining, 0 means expired, and -1 means the certificate has not been loaded or could not be parsed. Gauges are only registered for endpoints where TLS is enabled. The TTL is computed from the certificate's notAfter field, which is extracted via a JNI call into a minimal DER/X.509 parser on the Rust side. The expiration epoch is cached and updated on reload_tls(), so the metric always reflects the active in-memory certificate, not the one on disk.
This feature adds
array_sort(DOUBLE[]) and array_reverse(DOUBLE[]) scalar functions that operate on double arrays of any dimensionality. array_sort() sorts each innermost-dimension slice independently, preserving the array's shape, and accepts optional boolean arguments for descending order and nulls-first placement. array_reverse() reverses each innermost-dimension slice. Both functions handle NULL arrays, empty arrays, NaN values, and multidimensional inputs. They support both contiguous unit-stride and non-vanilla array layouts via separate code paths. The sort buffer grows on demand and stays at peak size for the cursor's lifetime to avoid allocation churn on the hot path.
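A minimal sketch of both functions, assuming QuestDB's ARRAY literal syntax; the positional boolean flag for descending order follows the description above:
SELECT array_sort(ARRAY[3.0, 1.0, 2.0]) AS sorted_asc,
       array_sort(ARRAY[3.0, 1.0, 2.0], true) AS sorted_desc,
       array_reverse(ARRAY[3.0, 1.0, 2.0]) AS reversed;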
This feature adds minTimestamp and maxTimestamp TIMESTAMP columns to sys.telemetry_wal to capture the data timestamp range per WAL transaction event. WAL telemetry is now enabled by default regardless of the main telemetry setting, reducing the need to rely on logs when investigating the shape of incoming writes. The WalWriter commit log message has been downgraded from info to debug level unless the commit has a replace range. Schema migration support has been added so that when the column count mismatches the expected schema, the table is dropped and recreated, which is safe given the 1-week TTL.
Improvements
This improvement mirrors the optimization already present in
HORIZON JOIN. Without it, the first lookup linearly scans through all slave time frames preceding the master's first timestamp, which is O(N) in the number of frames. With the seekEstimate() optimization, the initial positioning is O(log P), where P is the number of partitions. Specifically, AbstractAsOfJoinFastRecordCursor.openSlaveFrame() now calls seekEstimate() on the first slave frame lookup to binary-search directly to the target partition instead of linearly scanning all preceding frames, benefiting all ASOF JOIN and LT JOIN fast-path factories. WindowJoinTimeFrameHelper.findRowLo() and findRowLoWithPrevailing() also now call seekEstimate() on the first lookup with the same partition-skipping behavior, benefiting both sync and async WINDOW JOIN factories.
WalWriter previously hardcoded POSIX_MADV_RANDOM for memory-mapped column files, which hurts most workloads. This improvement makes the madvise hint opt-in via a new configuration property cairo.wal.writer.madvise.mode with valid values: none (default, no hint), sequential, and random. The random mode is beneficial when ingesting into many tables with many columns, as it prevents the OS from speculatively reading adjacent pages under memory pressure. Example configuration:
cairo.wal.writer.madvise.mode=random
This improvement refactors the ACL permission system to support more than 64 permissions by migrating from 64-bit bitmasks to an exponent-based representation with 256-bit aggregate masks. Permission constants changed from long bitmasks to int exponents, and a new PermissionMask class provides 256-bit storage (4 longs) for aggregate permission sets. The permission column type in the database schema changed from long to short (storing exponents instead of bitmasks), reducing storage overhead while supporting up to 256 distinct permissions. PermissionMask.ZERO is now immutable and throws on mutation attempts, and the sentinel value handling for ALL_PERMISSIONS is properly supported across has, set, and clear operations.
Bug Fixes
This fix corrects
Numbers.MAX_SAFE_INT_POW_2 from 1L << 31 to 1L << 30. The old value (2^31) does not fit in a signed 32-bit int, so the rehash overflow guard let exactly 2^31 through. The subsequent (int) cast produced Integer.MIN_VALUE, and clear() fed approximately 18 EB to native memset, causing a SIGSEGV. The crash chain occurred when a LATEST BY ALL query on a large table filled an OrderedMap until keyCapacity reached 2^30, then rehash() doubled to newKeyCapacity = 1L << 31, which truncated to a negative value and passed an enormous size to native memset. The fix makes the guard reject newKeyCapacity = 2^31, throwing a clean CairoException("map capacity overflow") instead of crashing the JVM. The constant was also deduplicated from Unordered4Map, Unordered8Map, and UnorderedVarcharMap, each of which had a private copy with the same bug.
The cairo.partition.encoder.parquet.statistics.enabled configuration allows users to disable Parquet statistics, but the read path (ParquetTimestampFinder, TableWriter) and the O3 merge path (O3PartitionJob.processParquetPartition) depended unconditionally on timestamp column statistics. When statistics were absent, getMinValueLong would hit an assertion crash with -ea, or read garbage memory and silently corrupt data without -ea. This fix removes that hard dependency by adding rowGroupMinTimestamp and rowGroupMaxTimestamp methods to PartitionDecoder that try Parquet column statistics first at zero cost, then fall back to decoding the first/last row from actual data pages when statistics are absent. The findRowGroupByTimestamp method also falls back to decoding instead of reading garbage memory, and O3PartitionJob, ParquetTimestampFinder, and TableWriter have been migrated to use the new methods.
During backup, partitions with row_count=0 may not produce meta.msgpack. The restore process previously always attempted to download this file when hash verification was enabled, causing restores to fail with "no partition metadata found" for empty partitions. This fix skips downloading meta.msgpack for empty partitions during restore and skips hash verification in that case, while still requiring metadata for non-empty partitions.
A primary instance restored from a backup could encounter a race condition between the WalPurgeJob and dropped-table request processing, where the uploader could not open the _txnlog because it had already been deleted. This fix ensures that the WalPurgeJob does not delete state that the uploader still needs. The bug was unique to backups, since backups do not restore the .pending files used to control this workflow. At start-up, the adjusted WalUploader replication logic now recreates the appropriate .pending file when it is missing, which also patches instances that were already restored with older versions of QuestDB Enterprise.
This fix introduces three new ACL permissions: ALTER SYMBOL CAPACITY (column-level), SET REFRESH LIMIT (table-level), and SET REFRESH TYPE (table-level). ALTER SYMBOL CAPACITY replaces the incorrect reuse of ALTER COLUMN TYPE for symbol capacity changes. A startup migration automatically grants ALTER SYMBOL CAPACITY to every entity that previously had ALTER COLUMN TYPE, at the same scope (database, table, or column level), preserving grant options. SET REFRESH LIMIT and SET REFRESH TYPE gate the previously unprotected ALTER MATERIALIZED VIEW ... SET REFRESH LIMIT/IMMEDIATE/MANUAL/EVERY/PERIOD operations. This fix also wires in the previously commented-out authorization check for ALTER TABLE SET PARAM and adds a retry loop with recompile in access list reloading to handle table-reference-out-of-date exceptions during ACL reload.
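As an illustration, granting the new column-level permission might look like this (hypothetical trades table, sym column, and analyst user; the scope syntax follows the existing GRANT conventions):
GRANT ALTER SYMBOL CAPACITY ON trades(sym) TO analyst;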
QuestDB 9.3.3 is a feature-rich release introducing HORIZON JOIN for markout analysis, a new twap() aggregate, SQL-standard WINDOW definitions, JIT compilation on ARM64, and file-based secrets for Kubernetes deployments. It also brings significant performance improvements across Parquet I/O, parallel GROUP BY, UNION queries, and ORDER BY on computed expressions.
New Features
Sensitive configuration options can now be loaded from files using the
_FILE suffix convention, enabling seamless integration with Kubernetes Secrets, Docker Secrets, HashiCorp Vault, and other file-based secrets management systems. For example, setting QDB_PG_PASSWORD_FILE=/run/secrets/pg_password as an environment variable or pg.password.file=/run/secrets/pg_password in server.conf will load the password from the specified file. This feature works with all sensitive properties (pg.password, pg.readonly.password, http.password). File contents are automatically trimmed of whitespace. SHOW PARAMETERS displays value_source = 'file' for secrets loaded from files, and secrets are reloaded when file contents change via SELECT reload_config().
This feature allows underscores as number separators in TICK date expressions (e.g.,
$now-10_000T) and duration suffixes (e.g., ;1_500T), consistent with Numbers.parseInt(), which already supports this. Error position reporting in date expression evaluation has also been improved to point at the specific offending location instead of the start of the expression. Invalid underscore placement (leading, trailing, consecutive) is properly detected and reported.
This feature enables QuestDB's JIT code generation to run natively on ARM64 systems in addition to x86. The changes introduce ARM64-specific implementations for register handling, caches, and code generation, while maintaining full compatibility with the existing x86 logic. Conditional compilation is used throughout to select the appropriate architecture at build time. ARM64-specific versions of the Function and CountOnlyFunction structs were implemented, including methods for code generation, register setup, and loop logic that mirror the x86 implementations but use ARM64 instructions and register types.
This feature introduces the WINDOW clause, allowing users to define named window specifications that can be reused across multiple window functions within a single query. Window inheritance following the SQL standard is also supported, where a named window can reference another named window as its base, inheriting PARTITION BY, ORDER BY, and frame clauses. Chained inheritance is supported (e.g., w3 references w2, which references w1). Merge rules follow the SQL standard: PARTITION BY is always inherited from the base (the child cannot specify its own), ORDER BY in the child takes precedence if specified, and frame clauses in the child take precedence if non-default. Validation ensures that base windows are defined earlier in the WINDOW clause, that PARTITION BY in a child window referencing a base is rejected, and that circular and self-references are prevented.
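A short sketch of a named window being reused (hypothetical trades table):
SELECT sym,
       avg(price) OVER w AS avg_price,
       sum(amount) OVER w AS total_amount
FROM trades
WINDOW w AS (PARTITION BY sym ORDER BY timestamp);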
This feature introduces the twap(price, timestamp) aggregate function that computes the time-weighted average price using step-function integration, where each price is held constant until the next observation and the TWAP is the area under this step function divided by the total time span. The function supports parallel GROUP BY execution via per-worker native buffers that are merge-sorted during the merge phase, since workers process non-adjacent page frames via work-stealing and an incremental weighted-sum approach would incorrectly bridge gaps between frames. It falls back to a simple arithmetic mean when all observations share the same timestamp. SAMPLE BY with FILL modes is also supported.
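For example, an hourly TWAP per symbol (hypothetical trades table with a designated timestamp):
SELECT timestamp, sym, twap(price, timestamp) AS hourly_twap
FROM trades
SAMPLE BY 1h;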
This feature adds support for all six decimal types (DECIMAL8, DECIMAL16, DECIMAL32, DECIMAL64, DECIMAL128, and DECIMAL256) in the Parquet format. Write support stores decimals as fixed-length byte arrays in big-endian format per the Parquet specification. Read support includes a WordSwapDecimalColumnSink for DECIMAL128 and DECIMAL256 that correctly converts from Parquet's big-endian format by reversing each 8-byte word independently. The supported Parquet physical types are INT32 for small decimals (DECIMAL8/16/32), INT64 for DECIMAL64, and FixedLenByteArray for all decimal types.
This feature introduces array_build(nDims, size, filler1, ...), which creates DOUBLE[] or DOUBLE[][] arrays with controlled shape and fill values. The size parameter accepts a scalar integer or a DOUBLE[] (using its cardinality). Each filler can be a scalar (repeated for all elements) or a DOUBLE[] (copied element-by-element, NaN-padded if shorter, truncated if longer).
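A sketch of the call shape, following the signature above (output values are assumed from the description, not verified):
-- one dimension, three elements, all filled with 0.0
SELECT array_build(1, 3, 0.0);
-- filler copied element-by-element from an array
SELECT array_build(1, 3, ARRAY[1.0, 2.0, 3.0]);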
This feature introduces HORIZON JOIN, a specialized time-series join designed for markout analysis, a common financial analytics pattern where you need to analyze how prices or metrics evolve at specific time offsets relative to events such as trades or orders. For each row in the left-hand table and each offset in the horizon, the join computes left_timestamp + offset and performs an ASOF match against the right-hand table. When join keys are provided via ON, only right-hand rows matching the keys are considered. The horizon is defined using either a RANGE FROM <from> TO <to> STEP <step> clause for uniform offsets, or a LIST (<offset>, ...) clause for explicit non-uniform offsets, both aliased with AS. The pseudo-table exposes .offset (LONG) and .timestamp (TIMESTAMP) columns. Both RANGE and LIST use the same interval expression syntax as SAMPLE BY with supported units: U (microseconds), T (milliseconds), s (seconds), m (minutes), h (hours), d (days). Use cases include markout P&L analysis, event impact studies, and time-series correlation at different lags. Current limitations include no combination with other joins at the same query level, no right-hand-side WHERE filters, both tables requiring a designated timestamp, a positive STEP with FROM ≤ TO for RANGE, and monotonically increasing offsets for LIST.
SELECT h.offset / 1000000 AS horizon_sec, t.sym, avg(m.mid) AS avg_mid
FROM trades AS t
HORIZON JOIN mid_prices AS m ON (t.sym = m.sym)
RANGE FROM 1s TO 60s STEP 1s AS h
ORDER BY t.sym, horizon_sec
This feature adds a Table Details Drawer to the Web Console for tracking the health and details of a table or materialized view. It simplifies the table schema explanation flow by removing flow-specific schema and adding streaming support, along with a new flow for table health issues. An adaptive polling mechanism checks request latency proactively and sets the interval dynamically. WAL tables polling has been removed except for the suspension dialog. The feature also includes monitoring support in the AI Assistant docs, updated tooltip behavior and styling throughout the application, query truncation support for LiteEditor, history and navigation among right-hand-side bar drawers, and a mechanism to copy table and column names using Ctrl/Cmd+C.
This feature adds import and export support for editor tabs, makes tabs scrollable with an increased minimum width, adds a rename button that appears on tab hover, and updates the editor tab styling.
Improvements
This improvement introduces three optimizations for Parquet partition reads. Late materialization decodes only filter columns first to identify matching rows, then decodes remaining columns only for those rows, significantly reducing decoding overhead for low-selectivity queries. Zero-copy mmap page reading uses a new
SlicePageReader that reads Parquet pages directly from the memory-mapped byte slice, bypassing the previous approach of copying page data into an intermediate buffer. Raw array encoding is now enabled by default for partition-to-Parquet conversion, avoiding the overhead of Parquet's nested LIST decoding. In benchmarks, an OHLC aggregation query on an 8-day Parquet partition improved from 600ms to 250ms.
This improvement speeds up Parquet export through two optimizations. SIMD-accelerated encoding uses portable SIMD intrinsics for nullable INT/LONG columns, processing 64 values per iteration for vectorized definition-level encoding, yielding up to 16.5% faster throughput and 20% less CPU usage with LZ4_RAW compression. A new streaming mode for
TableReader applies MADV_SEQUENTIAL on mmap for aggressive read-ahead prefetching and MADV_DONTNEED before munmap for immediate page cache release. Streaming mode bypasses the mmap cache so each partition mapping is independent and fully releasable. Parquet exporters automatically enable streaming mode. Under memory pressure, the combination of both madvise hints recovers 94% of baseline performance (387 MB/s vs 82 MB/s without hints).
This improvement pushes designated-timestamp filters from outside a UNION / UNION ALL / EXCEPT / INTERSECT subquery into each branch of the set operation, enabling per-branch partition pruning on time-series tables. The optimizer only pushes filters that exclusively reference the designated timestamp column, deliberately excluding non-timestamp filters to avoid type mismatches and column resolution issues across branches with different schemas. The outer filter always stays on the wrapper model, so correctness is unchanged. For composite filters like WHERE ts IN '2025-12-01T01;2h' AND x > 5, the optimizer splits the top-level AND into separate conjuncts and processes each independently: it pushes the timestamp conjunct into branches, while x > 5 stays at the parent level only. Safety mechanisms include per-branch semantic guards that skip branches with LATEST BY, LIMIT, or non-pushable SAMPLE BY, column existence checks, and deep cloning to prevent cross-branch mutation.
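As an illustration (hypothetical tables t1 and t2 sharing a designated timestamp ts), the timestamp conjunct below is pushed into both branches for partition pruning, while x > 5 stays on the wrapper:
SELECT * FROM (
    SELECT ts, x FROM t1
    UNION ALL
    SELECT ts, x FROM t2
) WHERE ts IN '2025-12-01T01;2h' AND x > 5;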
This improvement adds an unordered mode to PageFrameSequence that uses SOUnboundedCountDownLatch instead of ordered collection, eliminating head-of-line blocking for reducers that don't need ordered results. Keyed GROUP BY, non-keyed GROUP BY, and Top K factories now use this unordered mode, replacing boilerplate ordered collect loops with a single dispatchAllAndAwait() call. Error field copying in storeError() was also fixed to use a dedicated CairoException instance, preventing thread-local exception recycling from corrupting stored error data. At concurrency 8, per-iteration spread (a direct proxy for head-of-line blocking) was reduced by 53–75% across query types, tail latency (p99) improved by 30–44%, and latency predictability (p99/p50 ratio) improved from 2.5–3.6x down to 1.4–1.8x. Throughput at concurrency 8 improved by 6–9% for most query types, and throughput now scales monotonically from concurrency 1 to 2 instead of regressing as in the previous implementation.
CASE expressions on symbol columns with static symbol tables now resolve string constants to integer symbol keys at initialization time and compare by integer at runtime, avoiding per-row string comparisons. Three picker specializations are used based on branch count: single-branch (one equality check), dual-branch (two equality comparisons), and multi-branch (integer-object hash map lookup). When a WHEN value is not found in the symbol table, the branch is silently skipped. Non-static symbol tables (e.g., from casts) fall back to the existing character-sequence-keyed comparison.
This improvement replaces the Dragon4 double-to-string algorithm with Ryu (Ulf Adams, 2018), which produces minimal-length, correctly rounded decimal representations with better performance. A direct double-to-decimal conversion path (Numbers.doubleToDecimal) bypasses string formatting entirely, removing the double→string→decimal roundtrip in CastDoubleToDecimalFunctionFactory. The dependency on jdk.internal.math.FDBigInteger has been eliminated, removing the --add-exports java.base/jdk.internal.math compiler flag.
When sorting by computed expressions (e.g., ORDER BY a + b * c), QuestDB's SortedLightRecordCursor previously re-evaluated the expression on every comparison during tree insertion and output, meaning sort key functions were called O(N log N) times during sorting plus O(N) during output. This improvement adds a sort key materialization layer that pre-computes expensive sort key values into off-heap buffers before sorting, reducing function evaluations from O(N log N) to O(N). Each function gets a complexity score that propagates through the expression tree. If any sort key column exceeds the configured complexity threshold (default 3, configurable via cairo.sql.sort.key.materialization.threshold), that column's values are pre-computed and stored in per-column memory buffers. A materializing record intercepts reads for materialized columns from the buffer and delegates to the base record for everything else. The feature supports two-pass operation where subsequent passes after toTop() look up existing ordinals without re-computing values. This optimization applies to fixed-size types only (no string/varchar/binary materialization) and operates on the general sort path only (no Top-K modification).
This improvement replaces the digit-by-digit loop in doubleToDecimal() with a single wide multiply via a new ofDigitsAndPower(long digits, int power) method on the Decimal interface. The old code extracted up to 17 individual digits from the Ryu significand via long division, then performed 17 wide additions (128-bit or 256-bit). The new code multiplies the significand by a single power of 10. Benchmarks show Decimal64 is 1.8x faster, Decimal128 is 2.2x faster, and Decimal256 is 2.8x faster. The three types converge to roughly the same cost, since the single multiply dominates and the word-width difference matters less for one operation than for 17.
This improvement introduces a hybrid export mode for Parquet export (COPY ... TO and HTTP /exp) that passes through raw page-frame-backed columns zero-copy and materializes only computed columns into native buffers, row by row per frame. Queries with no page-frame backing (e.g., cross joins) now use a cursor-based export that materializes all columns row by row, also avoiding the temporary-table detour. Only queries requiring re-partitioning (PARTITION BY override) or containing a computed BINARY column still use the temporary table path. The export mode is determined by inspecting the compiled RecordCursorFactory before constructing any temporary table: DIRECT_PAGE_FRAME for factories supporting PageFrameCursor directly (zero-copy), PAGE_FRAME_BACKED for virtual record cursor factories whose base supports page frames with no computed BINARY columns, CURSOR_BASED for no page-frame backing or descending ORDER BY with computed columns, and TEMP_TABLE for PARTITION BY overrides or computed BINARY columns. The HybridColumnMaterializer handles both PAGE_FRAME_BACKED and CURSOR_BASED modes, converting computed SYMBOL columns to STRING in adjusted metadata while dispatching via source type for correct record accessor usage. A buffer pool recycles computed-column native buffers rather than freeing them after each row-group flush. This improvement also fixes several resource lifecycle issues on error paths, including page-frame cursor leaks, temp directory leaks on moveFile() failure, and use-after-free during error-path cleanup. The exporter hierarchy was refactored so that BaseParquetExporter holds shared state while HTTPSerialParquetExporter and SQLSerialParquetExporter extend it independently, removing the prior dependency where the SQL exporter extended the HTTP exporter.
Bug Fixes
When QuestDB fails to start due to configuration errors, a race condition between the JVM shutdown hook and the startup script could produce confusing
cat and rm error messages for the hello.txt file. This fix redirects stderr to /dev/null for those commands and adds proper quoting around file paths for robustness with paths containing spaces. The actual startup failure details remain available in the stdout-*.txt log files.
Concurrent Parquet exports could produce corrupted or truncated files when multiple clients exported simultaneously through the same HTTP worker. The issue occurred because
ExportQueryProcessor stored per-connection state in per-processor fields. When a worker parked one connection via PeerIsSlowToReadException and began serving another, these fields were overwritten, causing the onWrite() callback to write Parquet chunks to the wrong response on resume. This fix moves the affected fields to the per-connection ExportQueryProcessorState.
This fix corrects greedy nanosecond parsing (a single
N followed by a non-digit) to be consistent with milliseconds and microseconds. Previously, .SSSUUUN with input .1234567 produced .123456007Z (7ns) instead of the correct .123456700Z (700ns).
This fix improves the reliability of memory-mapped file handling for columns on Windows platforms. A fallback mechanism for mapping files on Windows has been introduced, which gracefully handles transient permission errors ("Access Denied" even after a successful file open) by falling back to an anonymous memory map populated by reading the file contents directly. On non-Windows platforms, the original mapping logic is retained.
This fix addresses an issue where
DynamicPropServerConfiguration.reload() accumulated stale entries in changedKeys across reload cycles when only a secret file changed while the properties file remained unchanged. This caused watchers to be spuriously notified about properties that did not actually change. The root cause was that changedKeys.clear() only lived inside updateSupportedProperties(), which is skipped when the properties file hasn't changed. The fix moves changedKeys.clear() to the top of reload(), inside the synchronized block, so every reload cycle starts fresh.
This fix resolves an issue where the
ALTER TABLE t ADD COLUMN IF NOT EXISTS col <type> path did not fully resolve parameterized or compound column types before comparing with existing metadata. For DECIMAL and GEOHASH, the path used ColumnType.typeOf(), which returns the base type constant, but metadata stores the fully encoded type (with precision/scale/bits). For array types (e.g., DOUBLE[]), the path did not parse array dimensionality brackets, causing type mismatch errors and unconsumed tokens. Unsupported array element types (e.g., INT[]) now correctly report "unsupported array element type" instead of falling through to a misleading error, and unmatched ] brackets after a type name are now detected with a helpful error pointing at the bracket.
This fix adds post-compilation validation in the SQL compiler to reject unconsumed tokens after a valid statement. Previously, trailing content was silently ignored, which could mask user errors. PostgreSQL Wire Protocol compatibility no-op handlers (
RESET, CLOSE, UNLISTEN, DISCARD) now require their expected arguments instead of silently accepting bare keywords. BEGIN, COMMIT, and ROLLBACK now properly consume the optional TRANSACTION keyword. The SET statement syntax is now validated to conform to SET [SESSION | LOCAL] name { = | TO } value [, value]*, rejecting malformed forms such as a missing name, invalid operator, missing value, or dangling comma.
This fix resolves two related bugs in
SqlOptimiser.propagateTopDownColumns0() that caused AssertionError in SqlCodeGenerator.checkIfSetCastIsRequired when UNION sibling models ended up with different topDownColumns counts. The first bug occurred when WHERE clause and timestamp column literals were emitted to union model branches by name resolution; since UNION matches columns by position rather than name, the same alias could resolve to different column indices in different branches, causing one branch to receive extra top-down columns. The fix removes the name-based emission loops in favor of the existing index-based propagation. The second bug occurred when a GROUP BY or SAMPLE BY model as the first UNION member added non-aggregate key columns to its own topDownColumns without propagating them to union siblings. The fix ensures these columns are propagated to all union siblings whose topDownColumns are already populated.
This fix addresses performance degradation when exporting tables with 1M+ distinct symbols to Parquet. Previously, the large default batch size of 1M rows from the general
CREATE TABLE AS SELECT setting prevented frequent batch commits, causing symbol index re-scaling to be deferred and degrading performance as the symbol table grew without capacity adjustments. A new configuration property cairo.parquet.export.batch.size with a default of 100K rows is now used specifically for Parquet exports.
When
CREATE TABLE ... AS (SELECT ...) fails during data population with a non-Cairo exception, the partially created table and its name lock were not cleaned up. This also affects COPY Parquet export, which uses temp tables internally. This fix adds fallback cleanup in COPY Parquet export to match the existing HTTP export pattern.
This fix reads the
len byte in DecimalBinaryFormatParser as unsigned (& 0xFF) instead of signed. A malformed Influx Line Protocol message with a high len byte (e.g. 0x80) was interpreted as negative, which skipped the VALUES parsing state and left unscaledValues empty. The subsequent load() call then hit an ArrayIndexOutOfBoundsException accessing index 0 of the empty list.
When a view name is quoted (single or double quotes),
SqlParser passed the quoted token directly to getTableTokenIfExists(), which failed to find the view. This affected three code paths: SELECT ... FROM 'my_view', SELECT ... JOIN 'my_view' ON ..., and COMPILE VIEW 'my_view'. This fix adds unquote() to all three getTableTokenIfExists() calls so the view is recognized regardless of quoting.
When a
CASE/WHEN expression contains a window function (e.g., lag() OVER (...)), column references outside the window function (in THEN/ELSE branches) were not emitted to the translating model. This caused "invalid column" errors when using the original column name alongside a SELECT alias (e.g., price AS p together with THEN price). This fix introduces replaceWindowFunctionOrLiteral(), which chains window function replacement with literal emission in a single tree traversal pass, so all column references are propagated through the model chain.
This fix introduces
CompiledTickExpression, which pre-parses tick expressions containing date variables ($now, $today, $yesterday, $tomorrow) into a single long[] intermediate representation at compile time. Runtime evaluation performs only long arithmetic with no string parsing or allocations. A new DateVariableExpr encodes and evaluates $variable ± offset expressions supporting all time units plus business days. Expressions containing date variables now produce a dynamic CompiledTickExpression instead of re-parsing the string on every query execution. The evaluation algorithm walks elements to emit intervals, applies day-filter bitmasks in local time, handles timezone conversion (numeric subtraction or DST-aware toUTC), applies exchange schedule filtering with optional duration, and sorts and merges overlapping intervals. Additionally, Character.isWhitespace and Character.isDigit calls were replaced with Chars.isAsciiWhitespace and Chars.isAsciiDigit throughout IntervalUtils to avoid locale-dependent behavior.
This fix moves the circuit breaker check in
DatabaseCheckpointAgent.checkpointCreate() from after ff.sync() to before it. The POSIX sync() system call flushes all dirty filesystem buffers system-wide, not just QuestDB's files. On busy hosts with heavy I/O from other processes, sync() can block for well over the 60-second query timeout. Previously, the circuit breaker was only checked after sync() returned, so a completed checkpoint was discarded when the timeout had been exceeded during the blocking call. With this fix, if the timeout is already exceeded from the table loop or lock acquisition, the operation fails fast without entering a potentially long-blocking sync(). If the timeout has not been exceeded, sync() runs to completion and the checkpoint succeeds regardless of how long sync() takes.
FILL(LINEAR) was disregarding FROM, meaning it would return calendar-aligned results instead of FROM-aligned results. This fix ensures FROM anchors the initial timestamp and TO applies as a bound. There is no intent for this to support fill expansion (i.e., filling before and after the dataset); it is unclear at this stage how pre- and post-filling should operate with linear fills.
Tick expressions containing date variables (
$now, $today, etc.) used as filter predicates, e.g. WHERE now() IN '$now-100s..$now', resolved $now once at compile time and cached the resulting interval as static timestamps. On repeated execution of a cached SQL statement, the interval never updated, causing the filter to return zero rows once now() moved past the originally computed range. The interval model path was already fixed to compile such expressions into IR via CompiledTickExpression, but the filter function path in InTimestampTimestampFunctionFactory was not updated and still used the static function. This fix detects date variables in constant tick expression strings and routes them through CompiledTickExpression, which re-evaluates the interval from IR on each execution via init().
This fix addresses a vulnerability where
ChunkedContentParser.parseChunkLength() had no overflow guard on the hex chunk-size parsing loop. A crafted HTTP request with 16+ hex digits in the chunk size would overflow the long chunkSize to a negative value, corrupting the internal buffer pointer. On the next loop iteration, isEol() would dereference the corrupted pointer, causing the JVM to crash with SIGSEGV. This was reachable over the network with an unauthenticated POST request using Transfer-Encoding: chunked. The fix adds an overflow guard before each chunkSize * 16 accumulation step: if chunkSize > Long.MAX_VALUE >>> 4, the next multiply would overflow a positive long, so the input is rejected as a protocol violation. The server logs the violation and disconnects the client cleanly.
This fix resolves a crash where
WINDOW JOIN and HORIZON JOIN would throw a NullPointerException when the right-hand side (slave) is a subquery that applies a timestamp interval filter or reorders columns. The root cause was that SelectedRecordCursorFactory and ExtraNullColumnCursorFactory did not implement newTimeFrameCursor(), which is required by the concurrent time frame cursor infrastructure used in these joins. Additionally, SelectedPageFrameCursor did not implement TablePageFrameCursor, causing a ClassCastException. After fixing those, a subtler double column remapping bug remained: both SelectedConcurrentTimeFrameCursor and SelectedPageFrameCursor were independently remapping column indices through columnCrossIndex, causing symbol table lookups to hit wrong columns and the timestamp column to be read from the wrong position in the address cache. This fix extracts ConcurrentTimeFrameCursor into an interface, adds newTimeFrameCursor() to the affected factories, and simplifies SelectedConcurrentTimeFrameCursor to delegate column remapping entirely to SelectedPageFrameCursor, avoiding double remapping.
This release introduces Kubernetes secrets file support for enhanced security in containerized deployments, adds TICK exchange calendar support for financial market applications, and delivers significant Parquet read performance improvements through advanced optimization techniques. The update also addresses several critical bug fixes including a rare WAL file error on Windows replicas, backup log message accumulation issues, and Parquet export corruption under concurrent connections.
New Features
This feature enables reading sensitive configuration values from files using the
_FILE suffix convention, allowing native Kubernetes secret file mounts without requiring shell scripts or init containers. It works with both environment variables (QDB_ACL_ADMIN_PASSWORD_FILE) and properties (acl.admin.password.file). All properties marked as sensitive=true are supported, including acl.admin.password, acl.oidc.tls.keystore.password, replication.object.store, cold.storage.object.store, backup.object.store*, and authentication passwords. The _FILE variant takes precedence over direct values, with fallback to direct values when no _FILE variant exists.
This feature adds enterprise exchange calendar support, enabling TICK expressions to filter time intervals by stock exchange trading schedules using syntax like
2025-01-24#XNYS for NYSE trading hours. The implementation includes an EntExchangeCalendarService that loads trading schedules for 67 global exchanges from a bundled Parquet file, supports custom schedule overrides via a _exchange_calendars_custom table, and provides lazy loading with concurrent access protection. It includes a reload_exchange_calendars() function for managing custom schedules and an exchange_calendars() table function for querying effective calendar data. The feature handles multi-session trading days, lunch breaks, and case-insensitive exchange codes.
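For example, restricting a query to NYSE trading hours on a given day (hypothetical trades table):
SELECT * FROM trades WHERE timestamp IN '2025-01-24#XNYS';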
This feature implements late materialization optimization for Parquet partitions in parallel query execution with selective filters. Instead of decoding all columns upfront, it first decodes only filter columns to identify matching rows, then decodes other columns only for rows that passed the filter, significantly reducing decoding overhead for low-selectivity queries. The feature introduces SlicePageReader for zero-copy memory-mapped page reading, bypassing intermediate Vec<u8> copies. Additionally, it defaults partitionEncoderParquetRawArrayEncoding to true for better performance by using raw array encoding instead of nested LIST decoding.
Bug Fixes
This fix resolves several race condition issues related to replication. Previously, WAL downloader tasks could leak when cancelled, causing conflicts with
WalPurgeJob over file locks. The fix replaces JoinSet::shutdown with a CancellationToken to properly notify and wait for all tasks to complete, preventing file handle leaks on Windows. Additionally, this fix replaces the file-based lock mechanism with a centralized semaphore-based system using WALSegmentLockManager to coordinate access between WalPurgeJob, WalWriter, and the WAL downloader.
This fix addresses a race condition where QuestDB startup failures (such as invalid configuration) would cause confusing
cat and rm error messages when the hello.txt file is removed by deleteOnExit() before print-hello.sh can process it. The fix redirects stderr to /dev/null for these commands and adds quotes around $HELLO_FILE for robustness with paths containing spaces. The actual startup failure details remain available in stdout-*.txt logs.
This fix resolves an issue where concurrent Parquet exports could produce corrupted or truncated files when multiple clients exported simultaneously through the same HTTP worker. The problem occurred because the
ExportQueryProcessor stored per-connection state in per-processor fields, causing connection contexts to be overwritten when workers switched between connections. This resulted in Parquet chunks being written to the wrong response, producing invalid data that failed to open in downstream tools.
QuestDB Enterprise 3.2.1 brings the long-awaited scheduled backups feature. It is easy to configure: provide an object storage configuration, a schedule, and the number of backups to retain, and the database will be backed up incrementally for you. Enterprise 3.2.1 also brings improvements from the latest OSS 9.3.2 release, improving query performance and latency, adding new aggregate and window functions, and introducing our new TICK syntax, a convenient DSL for expressing complex time intervals.
New Features
This feature introduces comprehensive backup functionality backed by object stores with support for incremental backups, automatic scheduling via cron expressions, and SQL commands for backup management. The implementation includes
BACKUP START and BACKUP ABORT commands, and backups() and backup_schedule() functions for monitoring backup status and progress. Backups can be configured to run automatically on a schedule with configurable retention policies and garbage collection. The feature supports both WAL and non-WAL tables, handles partition-level incremental updates, and includes data integrity verification using Blake3 checksums. Backup operations run asynchronously and can be monitored through SQL functions that report progress as a percentage of uploaded partitions.
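For example, starting a backup and then checking its progress:
BACKUP START;
SELECT * FROM backups();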
This release introduces significant SQL enhancements including new aggregate functions (arg_min, arg_max, bool_and, bool_or, bit operations, geomean), window functions (VWEMA, EMA, percent_rank), and geospatial functions (within_box, within_radius). The update also features the new TICK (Temporal Interval Calendar Kit) for interval literals and improved timestamp predicate handling. Additionally, the release includes important bug fixes for ILP writes, window joins, parquet operations, and performance optimizations for concurrent queries and parquet decoding.
New Features
This feature adds Volume-Weighted Exponential Moving Average (VWEMA) as a new window function with the signature
avg(price, kind, param, volume). It supports three smoothing modes: 'alpha' for a direct smoothing factor (0 < α ≤ 1), 'period' for an EMA-style period where α = 2/(period+1), and time units ('second', 'minute', 'hour', 'day') for time-weighted decay with a tau parameter. The VWEMA recurrence is numerator = α × price × volume + (1-α) × prev_numerator, denominator = α × volume + (1-α) × prev_denominator, and VWEMA = numerator / denominator. For time-weighted mode, α = 1 - exp(-Δt / τ).
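A sketch of the call shape as a window function (hypothetical trades table; 0.2 is an arbitrary smoothing factor):
SELECT timestamp, sym,
       avg(price, 'alpha', 0.2, volume)
           OVER (PARTITION BY sym ORDER BY timestamp) AS vwema
FROM trades;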
This feature implements arg_min(value, key) and arg_max(value, key) aggregate functions that return the value of the first argument at the minimum/maximum value of the second argument. These functions support 18 type combinations, including double, timestamp, long, and uuid types for both value and key parameters. The functions include full parallel execution support with proper merge logic, correct null handling for both keys and values, and UUID comparison using unsigned long comparison. Null keys are ignored, and the functions return null if the value at the min/max key is null.
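For example, fetching each symbol's price as of its latest timestamp (hypothetical trades table):
SELECT sym, arg_max(price, timestamp) AS last_price
FROM trades;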
This feature implements Exponential Moving Average (EMA) as a window function accessible via avg(value, kind, param) syntax. It supports three modes: 'alpha' for a direct smoothing factor (0 < alpha ≤ 1), 'period' for an N-period EMA where alpha = 2 / (N + 1), and time units ('second', 'minute', 'hour', 'day', 'week') for time-weighted decay using alpha = 1 - exp(-Δt / τ). The function works with both microsecond and nanosecond timestamp precision via TimestampDriver and supports PARTITION BY for computing separate EMAs per group. NULL values are skipped while preserving the previous EMA value.
This feature adds
bool_and(T), an aggregate function that returns true if all values are true, and bool_or(T), an aggregate function that returns true if any value is true. Both functions support parallel execution via a merge method, work with GROUP BY and SAMPLE BY clauses, and accept boolean expressions as arguments. The functions include constant folding optimization that returns results directly without row-by-row computation when the argument is constant.
These functions perform bitwise operations on all non-null values in a column, supporting
byte, short, int, and long data types. The BIT_AND() function returns the bitwise AND of all values, BIT_OR() returns the bitwise OR, and BIT_XOR() returns the bitwise XOR. All functions include proper null handling (nulls are skipped), constant folding optimization, and parallel execution support. They return null for empty tables or all-null inputs and can be used with GROUP BY clauses.
This function computes the geometric mean of positive numbers using the formula
exp(avg(ln(x))) to avoid overflow with large products. It accepts double values (other numeric types convert implicitly) and includes full parallel execution support. The function returns null for negative values, zero values, or empty groups, following DuckDB semantics. Constant folding optimization is included, where geomean(c) = c for any positive constant, avoiding aggregate machinery overhead.
This feature enables the query optimizer to recognize
DATEADD() calls in time intrinsics and transform them for better performance. When a query contains WHERE dateadd('m', 15, timestamp) = '2022-03-08T18:30:00.000Z', the optimizer transforms it to WHERE timestamp = dateadd('m', -15, '2022-03-08T18:30:00.000Z'), allowing existing interval-based partition pruning to work effectively. The implementation adds AST rewriting to WhereClauseParser and handles comparison operators and BETWEEN expressions.
This feature removes the restriction that prevented window functions from being used as arguments to other functions, enabling queries like
SELECT abs(row_number() OVER ()) FROM t and nested window functions such as SELECT sum(row_number() OVER ()) OVER () FROM t. The implementation includes O(1) hash-based deduplication that computes identical window functions only once, significantly improving performance for queries with duplicate window function calls. This feature adds a nesting depth limit of 8 levels to prevent excessive recursion and provides clear error messages when window functions are incorrectly used in PARTITION BY or ORDER BY clauses of window specifications.
This feature adds four new SQL functions optimized for spatial filtering, particularly useful for indexing lidar scans and local area queries. The
within_box() function checks whether a point is within a rectangular bounding box using Cartesian coordinates, while within_radius() performs circular radius checks. For geographic coordinates, geo_within_radius_latlon() and geo_distance_meters() use equirectangular projection for fast local distance calculations with ~0.1% accuracy for distances under 100km. The implementation includes branchless bit manipulation for optimal vectorized execution, constant folding optimization for compile-time precomputation when reference points are constants, and comprehensive coordinate validation for latitude (-90 to 90) and longitude (-180 to 180) ranges.
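A sketch of a bounding-box filter (hypothetical points table with x and y columns; the parameter order shown is an assumption, not confirmed by this note):
SELECT * FROM points
WHERE within_box(x, y, 0.0, 0.0, 100.0, 100.0);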
This feature enables the query optimizer to recognize OR combinations of timestamp IN predicates as intrinsic interval filters, allowing efficient interval forward scans instead of falling back to row-by-row JIT filtering. Previously, queries like WHERE timestamp IN '2018-01-01' OR timestamp IN '2018-01-02' would use slow JIT filters, but now they utilize the same efficient interval scanning as single IN predicates. The implementation extends the WhereClauseParser to extract timestamp intrinsics from OR trees and adds interval union functionality to combine multiple date ranges into optimized scan operations.
This feature adds the
LENGTH_BYTES() SQL function that returns the number of bytes in a varchar argument. The implementation includes performance optimizations for MIN(varchar) and MAX(varchar) group-by functions by using prefix-first comparison with 6-byte prefixes stored in aux memory, and 8-bytes-at-a-time comparison using longAt() instead of byte-by-byte access when prefixes match.
This feature introduces TICK, a powerful syntax for expressing complex temporal intervals in QuestDB. It enables concise specification of multiple disjoint time intervals with timezone awareness in a single expression. Key capabilities include bracket expansion for dates and times, range expansion with inclusive ranges, timezone support with DST awareness, day-of-week filtering, multi-unit duration specifications, ISO week date format support, dynamic date variables with arithmetic, and Cartesian product combinations. TICK supports features like workday filtering, per-element timezones, business day arithmetic, and complex scheduling patterns.
This feature enables pushing timestamp predicates through virtual models where the timestamp column is derived from a
dateadd() function, allowing partition pruning to work even when queries transform the timestamp using dateadd. The implementation detects dateadd(unit, constant, timestamp) patterns in SELECT clauses and annotates models with offset info, wraps pushed predicates in and_offset(predicate, unit, offset) during optimization, and supports chained dateadd through multiple nested subquery levels. Auto-detection of timestamp columns from dateadd expressions works without requiring explicit timestamp(ts) clauses.
This feature implements the SQL standard
PERCENT_RANK() window function that returns the relative rank of the current row, calculated as (rank - 1) / (total_rows - 1). The function returns 0 if there is only one row in the partition. The implementation includes three specialized classes to handle different use cases: no ORDER BY clause (returns 0 for all rows), ORDER BY without PARTITION BY (uses two passes to compute ranks), and both PARTITION BY and ORDER BY (tracks per-partition row counts).
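For example, ranking prices within each symbol (hypothetical trades table):
SELECT sym, price,
       percent_rank() OVER (PARTITION BY sym ORDER BY price) AS pct
FROM trades;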
This feature extends the TICK (Temporal Interval Calendar Kit) date variable arithmetic to support additional time units beyond days and business days, including years (y), months (M), weeks (w), hours (h), minutes (m), seconds (s), milliseconds (T), microseconds (u), and nanoseconds (n). Date variable expressions can now be written with or without brackets, making the syntax more flexible. Calendar-aware units properly handle varying month lengths and leap years, while sub-day units preserve full microsecond/nanosecond precision when used with $now. This enables precise timestamp arithmetic in interval expressions for high-frequency data analysis.
Improvements
This improvement reduces contention across concurrent queries by avoiding unnecessary partition closures when partitions are in standby for lazy open and by performing
mmap operations outside of synchronized blocks in MmapCache. These changes eliminate bottlenecks that could impact query performance under high-concurrency scenarios.
This improvement optimizes queries like
SELECT dateadd('m', -15, max(ts)) FROM t, where the MAX() aggregate function is used as a function argument. The optimization moves rewriteSingleFirstLastGroupBy after rewriteSelectClause to enable better query execution plans for timestamp aggregation functions.
This improvement optimizes Parquet partition read performance through four key enhancements: it reduces memory allocations during page decoding by eliminating intermediate copies for fixed-width columns and batching memory operations; switches the default compression from ZSTD(level=9) to LZ4_RAW for faster decompression at the cost of slightly larger files; uses REQUIRED repetition for non-null Symbol columns to skip definition-level decoding; and skips redundant decode operations on ParquetBuffers cache hits, which particularly benefits ASOF JOIN scenarios. Combined, these changes deliver approximately a 6x read performance improvement compared to the previous implementation, with query times improving from ~17 seconds to ~3.56 seconds in benchmark tests.
This feature adds streaming outputs support and unifies generate/fix/chat/schema flows in a single interface. The enhancement includes a retry mechanism for failed requests, fixes rerendering issues in the chat window, adds abort signals to model calls, and provides a
get_table_details tool for investigating tables() query results. It also upgrades Monaco Editor components and resolves content display issues in LiteEditor.
Bug Fixes
This fix resolves an issue where Influx Line Protocol data would be written to the wrong table when a table was renamed while keeping the connection open. The problem occurred because table references were cached in the state of each HTTP connection. The fix introduces a global version counter for renames that triggers cache flushing, WAL rollback, and error return for retry when a rename is detected, preserving request atomicity.
This fix resolves a crash when using window joins with large slave tables. The issue occurred because
AsyncWindowJoinRecordCursor.close() was freeing slaveTimeFrameAddressCache before waiting for worker threads to finish, causing workers to access freed memory and trigger segmentation faults. The fix reorders operations to await workers before freeing shared resources and wraps cleanup in try-finally blocks to ensure resources are always freed.
This fix prevents segmentation faults that could occur when a timeout happened during vectorized parallel
GROUP BY queries. The issue was caused by frame memory pools used by worker threads being released without properly waiting for all tasks to be processed, creating a race condition where workers could access freed memory concurrently with the aggregation process.
This fix resolves an issue where SQL line comments containing unbalanced single quotes (e.g.,
-- magic ') would cause the lexer to incorrectly enter string-parsing mode and consume tokens past the newline, resulting in subsequent SQL tokens being lost from the query. The fix improves the line comment parsing logic to properly detect newlines within tokens that span multiple lines and repositions the lexer correctly to reset its state.
This fix resolves failures in
GROUP BY queries with many aggregate functions and includes significant performance optimizations for SQL compilation. It adds SimpleGroupByFunctionUpdater as a loop-based fallback that avoids JVM bytecode limits when the aggregate function count exceeds 32, optimizes alias generation from O(n²) to O(1) amortized complexity using LowerCaseCharSequenceIntHashMap to track sequence numbers, and improves duplicate aggregate detection from O(n²) to O(n) using hash-based comparison with ExpressionNode.deepHashCode(). Query compilation time for queries with 6K aggregate functions improved from 4 seconds to around 300ms.
This fix resolves an issue where symbols defined with
DECLARE statements were not properly recognized when used as function arguments during SQL parsing.
This fix resolves Parquet read failures that occurred in window chain join queries. The issue was caused by chained window joins incorrectly using
GenericRecordMetadata.copyOf() instead of the required GenericRecordMetadata.deepCopyOf() method.
This fix resolves a bug where
WINDOW JOIN queries using aliased columns in expressions with aggregates would fail with an 'Invalid column' error. The issue occurred when the same column was both aliased and used in an expression with aggregates. The fix ensures proper column resolution by passing the correct model to aggregate processing and handling the special case where the translating and group-by models are the same in window joins.
This fix resolves an issue where executing multiple PostgreSQL Wire Protocol commands with the same SQL but different parameters caused the second command to return corrupted data. The problem occurred because
msgBindCopySelectFormatCodes() incorrectly used factory != null to determine whether format codes should be set, but when a prepared statement was reused via copyIfExecuted(), the new entry had factory=null, causing format codes to be skipped and clients to misinterpret binary/text format. The fix replaces the factory check with hasResultSet(), which uses sqlType for proper determination.
This fix resolves a bug where
ORDER BY <position> failed with an "Invalid column" error when used with a CTE containing aggregation and a window function expression. The issue occurred because the position-to-name resolution logic in SqlOptimiser.rewriteOrderByPosition() only checked for GROUP_BY models when determining which model's columns to use, but with window functions the model structure includes a WINDOW model between the VIRTUAL and GROUP_BY models. The fix extends the condition to also check for the SELECT_MODEL_WINDOW type.
This fix resolves Parquet export compatibility issues with strict readers like PyArrow and Trino for SYMBOL columns spanning multiple row groups. The fix generates a single dictionary page per column chunk instead of incorrectly writing multiple pages, and corrects RLE/bitpack encoding padding to ensure proper format compliance.
QuestDB Enterprise 3.2.0 brings the new Views feature, as well as a host of performance upgrades, small features, and bug fixes in the QuestDB 9.3.x release family.
New Features
This feature provides the QuestDB Enterprise component for views functionality, enabling users to create virtual tables based on SQL queries.
Bug Fixes
This fix automatically detects native libraries in jlink-ed runtime images by using java.home to locate libs in $JAVA_HOME/lib/, eliminating the need for
-Dquestdb.libs.dir in standard distributions. Additionally, this fix provides actionable error messages when library loading fails, explaining solutions such as using the official distribution, setting -Dquestdb.libs.dir, or fixing noexec tmp issues.
QuestDB 9.3.1 follows the major 9.3.0 release, focusing on stability, correctness, and performance refinements based on early feedback and production usage. This release delivers important fixes across joins, views, and checkpointing, alongside continued performance improvements on hot SQL execution paths.
New Features
This feature implements
KSUM() as a window function using the Kahan summation algorithm for improved floating-point precision. This complements the existing KSUM() aggregate function by enabling its use in window contexts, and supports all standard window frame types, including ROWS, RANGE, partitioned, unbounded, and sliding windows. The function can be used for whole partitions, cumulative sums, sliding windows, and range-based windows.
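For instance, a Kahan-compensated running sum (hypothetical table t with designated timestamp ts and double column x):
SELECT ts, ksum(x) OVER (ORDER BY ts ROWS UNBOUNDED PRECEDING) AS running_sum
FROM t;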
This feature enables window functions to participate in arithmetic expressions and other operations. Previously, window functions could only appear as standalone SELECT columns. Window function parsing has been moved from SqlParser to ExpressionParser, allowing window functions to be parsed as part of the expression tree. The implementation maintains zero-GC compliance and includes proper nested window function detection to reject invalid contexts while allowing valid use cases like CASE expressions.
This feature exposes the actual data timestamp range in each table through new columns in the
This feature exposes the actual data timestamp range in each table through new columns in the tables() function. The table_min_timestamp and table_max_timestamp values are updated when WAL transactions are merged into tables and during startup hydration. Additionally, table_max_timestamp was renamed to table_last_write_timestamp for clarity, and dedup_row_count_since_start was renamed to wal_dedup_row_count_since_start for consistency.
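A quick way to inspect the new columns; the timestamp columns come from the note above, while table_name is an assumed column name:

```sql
-- Data timestamp range per table
SELECT table_name, table_min_timestamp, table_max_timestamp
FROM tables();
```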
Improvements
This improvement implements streaming Parquet export for HTTP queries, eliminating temporary tables/files and improving performance. The feature exports directly from PageFrames when supportsPageFrameCursor() is true, falling back to creating a temp table and then streaming the export when PageFrameCursor is not supported. Currently available for the HTTP /exp endpoint only, this change provides modest performance improvements and enables future parallelization opportunities for row group generation.
This improvement significantly reduces GC pressure and jitter in parallel queries through several key optimizations. The PageFrameAddressCache now uses flat DirectLongList structures instead of nested objects, reducing ~20K object allocations per 1000 frames to just 4 contiguous off-heap lists. Decimal flyweights in map values are lazily initialized to avoid allocations when unused. The GroupByAllocator replaces LongLongHashMap with an off-heap DirectLongLongHashMap to prevent per-worker allocations. Page frame size calculations are improved to avoid tiny trailing frames, and the work stealing strategy is tuned with a new cairo.sql.parallel.work.stealing.spin.timeout configuration (default 50µs). Benchmarks show 67% fewer GC events, 57% less memory usage, and significantly reduced GC pause times.
This improvement materializes functions within the same-level queryModel that are referenced by other projection columns, eliminating redundant expression evaluations and improving query performance when columns depend on other computed columns.
Bug Fixes
This fix resolves the "left side of time series join doesn't have ASC timestamp order" error that occurred when ORDER BY timestamp DESC was used in a WINDOW JOIN query.
This fix properly handles tables created after a checkpoint was taken during checkpoint restore. Previously, table directories created after the checkpoint would remain on disk, potentially causing inconsistencies or orphaned data. The solution iterates over all table directories in the database root after recovering checkpoint contents, compares each directory against the checkpoint contents, and removes any table directories that do not exist in the checkpoint.
This fix resolves two race conditions in view state management. The first issue occurred when multiple ViewCompilerJob tasks ran concurrently, allowing a task with an older timestamp to overwrite newer, correct view state. A timestamp check was added to updateViewState() to skip updates when the update timestamp is older than the current view state timestamp. The second issue involved views being created with empty metadata that was hydrated before asynchronous compilation completed, causing queries to see an empty designatedTimestamp in tables(). This fix passes the compiled RecordMetadata directly to createViewState() so proper column information is available immediately.
This fix resolves an issue where views became suspended after being altered due to missing metadata. The fix ensures metadata is properly restored after view alterations, preventing the suspended state that would make views unusable.
This fix addresses an edge case in materialized view column merging where fixed columns could be corrupted during merge operations. The fix ensures data is copied to the correct memory location (end of old data + column tops) rather than to the end of newly calculated data size, preventing potential data corruption in materialized views.
This fix resolves an issue where the dedup logic would leave behind unused partition directories from O3 merge preparation when appending data directly to existing partitions. The Partition Purge job would then incorrectly count these orphaned directories as the next valid partition version, leading to incorrect partition version tracking and subsequent file not found errors.
This fix resolves UnsupportedOperationException crashes in ASOF JOIN queries when using the Light join factory with multiple join keys including a symbol column. The single-symbol optimization was incorrectly applied when there were multiple join keys, creating a map that only supported INT keys while the record copier attempted to write all column types. In some cases this could also silently return incorrect results.
This fix resolves a race condition where jemalloc background threads were failing to start reliably (~20% failure rate) when QuestDB was launched via questdb.sh, preventing memory from being released back to the OS properly. The solution switches to QuestDB's own jemalloc fork and updates to the latest dev branch commit. Additionally, LD_PRELOAD is now used to load jemalloc only in the Java process rather than the bash script.
This fix removes unwanted scrolling behavior on wheel events from code blocks and assistant suggestions, adds an "Open in editor" action to all overflown LiteEditor instances, prevents unnecessary rerenders in the assistant response markdown preview, and resolves a bug with context badge clicks where highlighting would fail on removed or modified context queries.
This fix resolves a flickering issue when hovering over news thumbnail images in the right panel. The zoom magnification was rapidly appearing and disappearing due to hover state issues caused by the hoverTimeout variable resetting on every re-render. The fix uses useRef for hover timeout persistence across re-renders and tracks imageToZoom state with a ref to prevent dismissal when zoom is already visible. Additionally, click-to-dismiss functionality was added on both the overlay and the zoomed image for better user experience.
This fix addresses a regression where HTML entities (such as those for &, <, and >) displayed as literal escaped text instead of their corresponding characters in grid cells. An unescapeHtml function was added to decode HTML entities to their actual characters before setting textContent, restoring proper entity rendering while maintaining the XSS protection that was implemented in the previous security fix.
QuestDB 9.3.0 is now available, bringing a new wave of query expressiveness and usability improvements across the engine and the Web Console. This release introduces window joins for precise time-based analytics, database views for cleaner query composition, AI-assisted workflows in the web console, and the new PIVOT keyword for effortless wide-schema aggregations.
New Features
This feature introduces a new WINDOW JOIN syntax that enables time-based aggregations by joining each row from the left table with matching rows from the right table within a specified time window. The syntax supports both INCLUDE PREVAILING (default) and EXCLUDE PREVAILING clauses, where the former includes the latest right-hand table row before the interval. The implementation uses SIMD instructions for symbol-based joins and supports multi-threaded execution for large datasets. Window joins can be chained in the same query but cannot be mixed with other join types at the same query level.
This feature refactors the glob() function to use native glob matching instead of Java regex, providing compatibility with DuckDB's glob syntax. The implementation supports standard glob wildcards including * for any number of characters, ** for subdirectories, ? for single characters, [abc] for character sets, and [a-z] for character ranges.
This feature improves table drop performance by closing idle WAL writers faster, which enables quicker deletion of WAL and table files from disk. Previously, dropped tables would generate repeated logging about pending WAL segments, but now the cleanup process completes more efficiently.
This feature enables using VARCHAR array bind variables with symbol, varchar, and string columns in IN expressions through PostgreSQL Wire Protocol. This allows reusing prepared statements with different value lists, reducing query compilation overhead for applications that frequently execute similar queries with varying parameter sets.
This feature pushes column projection down to the Parquet decoder, reading only required columns instead of all columns. The optimization also shares Parquet metadata across threads for multi-threaded reads, parsing metadata once per file. The performance improvements are dramatic: ClickBench Q1 execution time improved from 712 seconds to 787ms with multithreading enabled.
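With projection pushdown, a query that touches a single column of a Parquet file now decodes only that column. A sketch (the file name is illustrative; read_parquet() appears elsewhere in these notes):

```sql
-- Only the AdvEngineID column is decoded; the rest of the file is skipped
SELECT count(*) FROM read_parquet('hits.parquet') WHERE AdvEngineID <> 0;
```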
This feature adds real-time table write statistics tracking through new columns in the tables() SQL function. New columns include rowCount (approximate row count), pendingRowCount (WAL rows not yet applied), dedupeRowCount (cumulative deduplicated rows), lastWriteTimestamp (last commit timestamp), writerTxn and sequencerTxn (transaction numbers), memoryPressureLevel (0-2 pressure indicator), and various histogram statistics for transaction sizes, write amplification, and merge throughput. The feature uses a new RecentWriteTracker with lock-free reads and bounded memory, supporting both WAL and non-WAL tables with different column semantics. Statistics are approximations updated when writers return to the pool, with WAL tracking being real-time and LRU eviction maintaining memory bounds.
This feature introduces the PIVOT keyword, which is a specialized GROUP BY query that helps group a selection of rows into columns, essentially pivoting from a narrow schema to a wide schema. The implementation supports single and multiple columns, per-column aliasing, FOR-IN expressions with single or multiple expressions, GROUP BY, ORDER BY, LIMIT, and dynamic IN lists using subqueries. The feature also includes UNPIVOT functionality to reverse simple pivot operations.
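A minimal sketch of the keyword (table and column names are illustrative, and exact clause shapes may vary):

```sql
-- Turn one row per (timestamp, side) into one column per side
SELECT * FROM trades
PIVOT (sum(amount) FOR side IN ('buy', 'sell'));
```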
This feature introduces support for database views, which are virtual tables defined by a SELECT statement. Views do not store data themselves; instead, their defining query is executed as a sub-query whenever the view is referenced. The implementation includes CREATE VIEW, DROP VIEW, ALTER VIEW, CREATE OR REPLACE VIEW, COMPILE VIEW, SHOW CREATE VIEW, and views() commands, as sketched below. Views automatically recompile when operations occur on their dependencies and support metadata persistence, state management, and integration with the Web Console.
This feature introduces a comprehensive AI Assistant with a conversational interface in the right sidebar, enabling users to interact with AI models (Anthropic Claude & OpenAI GPT) for SQL assistance. The assistant provides query explanations, automatic error fixing, schema explanations, and SQL generation from natural language. It includes Monaco Editor integration with glyph margin icons, an inline diff view for reviewing AI suggestions, and persistent chat history stored in IndexedDB. The feature supports multiple AI providers with configurable models and includes token usage tracking and conversation compaction for context management.
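A minimal sketch of the view lifecycle described in the views note above, using the commands it lists (names are illustrative):

```sql
-- Define, inspect, and drop a view over a hypothetical trades table
CREATE VIEW btc_trades AS (SELECT * FROM trades WHERE symbol = 'BTC-USDT');
SHOW CREATE VIEW btc_trades;
DROP VIEW btc_trades;
```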
Improvements
This improvement prevents LimitRecordCursorFactory.toPlan() from calling the expensive calculateSize() method during EXPLAIN queries, which previously executed the complete query. The optimization also avoids unnecessary calculateSize() calls in hasNext() when LIMIT bounds are non-negative and in size() when the size isn't cheaply available. Additionally, this improvement clarifies LIMIT semantics to align with documentation and fixes related bugs in cursor classes.
This improvement introduces comprehensive optimizations to JIT-compiled filter execution including short-circuit evaluation for scalar AND/OR predicate chains, automatic predicate reordering by estimated selectivity, and multiple caching mechanisms to reduce redundant memory loads. The optimization includes SIMD scatter short-circuiting that skips expensive operations when no matches are found in a batch. Performance improvements range from 23-69% faster execution depending on the query pattern, with the biggest gains seen in highly selective filters.
This improvement implements a fast-path for count-only queries that avoids materializing matching rows by incrementing a counter instead. The optimization eliminates row ID scatter logic, memory writes, and conditional branches in the hot loop, reducing loop instructions by ~70%. Performance improvements range from 60-92% for queries with many matching rows, with the biggest gains for NEQ predicates that match large result sets.
This improvement enables parallel processing of the top-K selection phase when GROUP BY results are sharded due to high cardinality. The optimization processes each shard in parallel using worker threads, then merges results. New configuration properties cairo.sql.parallel.groupby.topk.threshold (default 5M) and cairo.sql.parallel.groupby.topk.queue.capacity control when parallel execution is enabled. ClickBench queries show speedups ranging from 1.3x to 11.2x for the ORDER BY + LIMIT phase.
This improvement optimizes non-keyed GROUP BY queries with MIN() or MAX() functions over designated timestamp columns by using O(1) operations instead of SIMD native functions. The optimization takes the first/last value directly, resulting in significant performance gains for queries like SELECT MIN(EventTime), MAX(EventTime) FROM table.
Bug Fixes
This fix addresses a crash that occurred when calling touch() on a table with newly added columns. When a new column is added to a table, it is initially empty and uninitialised, causing frame.getPageAddress to return a null pointer address. The touch() function did not check for this possibility and would attempt to use the null pointer, resulting in a segmentation fault. This fix adds proper null pointer checks for both columns and indexes to prevent crashes.
This fix adds support for the SHOW default_transaction_read_only metadata query, which is required by npgsql when building data sources with multiple hosts. Without this query support, connection sources would fail to initialise when using PostgreSQL Wire Protocol clients that depend on this metadata.
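The query itself is trivial and can be issued directly:

```sql
SHOW default_transaction_read_only;
```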
This fix resolves an issue where the tables() function would display an incorrect designated timestamp column when a column type change occurred on any column positioned before the designated timestamp column in the table schema. The function now correctly identifies and displays the designated timestamp column regardless of schema modifications.
This fix resolves a limitation that prevented copying data between tables with thousands of columns (6000+) due to Java bytecode method size restrictions. The solution introduces multiple copy strategies: a looping variant for very wide tables, and a chunked variant that splits large copy operations into smaller sub-methods to maintain optimal performance. The chunked approach allows each sub-method to be individually optimized by the C2 compiler, significantly improving performance for wide table operations while maintaining compatibility with extremely wide schemas.
This fix resolves an issue where nested SAMPLE BY queries with an ORDER BY clause incorrectly failed with "ASC order over TIMESTAMP column is required but not provided". The problem occurred when SAMPLE BY pushed timestampRequired=true onto a context stack, which propagated to nested models even when an ORDER BY clause would re-sort the data anyway. This fix ensures that when an ORDER BY clause is present, timestampRequired=false is pushed to signal that nested models don't need to maintain timestamp ordering.
This fix resolves an issue where EXPLAIN UPDATE on WAL tables was throwing UnsupportedOperationException because the anonymous RecordCursorFactory created during WAL serialization did not implement getCursor(). The fix replaces the anonymous factory with EmptyTableRecordCursorFactory and improves the explain plan output to show table names when available.
This fix resolves issues when using quoted column names containing dots in JOIN operations, ensuring proper metadata handling for such column references.
This fix resolves a logic bug that caused each row inserted during Parquet export to be committed individually as a single-row commit, significantly impacting performance during data copying operations.
This fix improves native library loading by auto-detecting native libraries in jlink-ed runtime images. The improvement uses java.home to locate native library directories and only activates when running from the jrt: protocol (jlink runtime), eliminating the need for -Dquestdb.libs.dir in start scripts.
This fix addresses an overflow issue when incrementing quotient digits during decimal division. The increment loop in DecimalKnuthDivider.java was updated to use the actual length of the q array, preventing possible misses when incrementing quotient digits during rounding and improving the reliability and correctness of the decimal division logic.
This fix resolves an issue where VARCHAR columns larger than the PostgreSQL Wire Protocol send buffer would fail, making behavior consistent with array columns, which already supported fragmented sending. Additionally, this fix addresses a latent bug where the partial-send array offset wasn't being reset when a row couldn't fit in the buffer and had to be abandoned via resetIncompleteRecord(), which would create malformed packets on retry.
This fix corrects query parsing in the Web Console. Query keys are of the form ${queryText}@${startOffset}-${endOffset}; with DECLARE statements, the Web Console was parsing queries incorrectly, causing the AI chat window to open with wrong information and the editor to shift query keys incorrectly.
This fix prevents execution of foreign code inside the Web Console through malicious queries containing HTML iframe elements with embedded scripts.
This fix adds functionality to abort queries in progress while the query body is being consumed. When running a query, or running all queries in a tab, a new operation now aborts the ongoing process and starts the new one if it arrives before the old one is completed. It also prevents the editor from inferring '\t' as a valid query start character by treating it the same as ' '.
This fix adds exponential backoff retry with 5 attempts and delays of 1s, 2s, 4s, 8s, 16s (capped at 30s) when POSTing telemetry to fara.questdb.io. The hourly telemetry loop now continues even after max retries are exhausted, and if the /config endpoint fails during startup, the loop still starts and retries the next hour. Previously, a single failed POST would permanently stop the telemetry loop until the user refreshed the page, causing telemetry gaps during transient network issues or brief outages.
This release focuses primarily on bug fixes and performance improvements for replication functionality. Key fixes include resolving suspended table issues on replicas, optimizing segment collection at startup, and addressing metadata file corruption after materialized view recreation. Additional improvements were made to SQL query processing and session management stability.
Improvements
This improvement enhances startup performance by modifying the directory walking logic to skip partition traversal and moving the missed segment collection to run asynchronously as part of each table uploader's logic. The change eliminates startup blocking and enables parallel directory traversal across multiple table uploaders.
Bug Fixes
This fix resolves a race condition where the _event file was copied to the segment directory before the _event.i file during replication. When WalEventReader attempted to map the _event.i file using the max transactions value from the already-updated _event file, it could read beyond the file size via mmap, causing a SIGBUS signal and a subsequent java.lang.InternalError. The fix ensures that the _event.i file is copied before the _event file to maintain proper synchronization.
This fix ensures that outer LIMIT clauses are properly applied even when the subquery contains both a filter condition and its own LIMIT clause, preventing incorrect query result truncation.
This fix resolves use-after-free bugs in HttpSessionStore caused by OidcUserInfoHolder containing CharSequences backed by parser buffers with limited lifetime. The fix uses immutable String copies from CachedUserInfo instead of passing parser buffer references directly, preventing memory corruption when parser state is reused.
QuestDB 9.2.3 is a patch release bringing the usual swathe of fixes, performance enhancements, and a few new features. You'll see higher QPS, faster GROUP BY queries, and better performance for high-contention materialized views. There will be one more release before Christmas, bringing the new WINDOW JOIN syntax.
New Features
This feature provides users with actionable hints when histogram out-of-bounds errors occur during APPROX_MEDIAN() queries. The error messages now suggest using a lower precision value or normalizing values to fit within the allowed range, helping users resolve issues when inserted values exceed the histogram's default precision limits.
This feature prevents accidental data loss when future timestamps are inserted by using min(maxTimestamp, wallClockTime) for TTL enforcement instead of just maxTimestamp. A new configuration option, cairo.ttl.use.wall.clock (default: true), controls this behavior. The feature additionally optimizes WAL commit performance by reducing microsecond clock syscalls from 2-3 to 1 per commit.
Improvements
This improvement restricts Unordered4Map and Unordered8Map usage to single-column cases to avoid false sharing issues with jemalloc. The change removes Unordered2Map entirely as it showed no performance benefit. This optimization significantly reduces execution times for certain ClickBench queries when jemalloc is in use.
This improvement enhances materialized view performance by intelligently skipping transactions that are fully replaced by future commits or full refresh operations, reducing unnecessary write operations.
This improvement enables jemalloc by default in Linux x86_64 binary distributions, providing significant performance boosts for allocation-heavy queries such as parallel SAMPLE BY and GROUP BY operations. The improvement also fixes the built-in jemalloc binary, which was previously built with the je_ prefix in exported symbols, preventing proper memory allocator replacement. The built-in jemalloc binary is now limited to Linux only, and questdb.sh will return an error when QDB_JEMALLOC=true is set on other operating systems.
This improvement significantly reduces SQL expression parsing latency across various expression types. Performance improvements range from approximately 3x faster for simple expressions like a + b (from 2967ns to 991ns) to over 3x faster for complex expressions like window functions with OVER clauses (from 3467ns to 1232ns). Complex case statements, casting operations, and conditional expressions also see substantial performance gains.
This improvement removes redundant division operations from the AVG() function in parallel non-keyed GROUP BY queries. The native avg*Acc() functions were using avg += (v - avg) / c in each iteration while the outer Avg*VectorAggregateFunction Java function was multiplying the average and count before storing them in partial results. Benchmarks show performance improvements of approximately 18% for simple averaging queries.
Bug Fixes
This fix resolves a race condition in PartitionPurgeJob that could cause transient query failures with partition directory errors. The issue occurred when the purge job detected partition versions that were temporarily created by TableWriter for copy-on-write operations but then removed before commit when deduplication determined the operation wasn't needed. The fix ensures that the next partition version exists before deleting overwritten partition versions, preventing phantom partitions from being counted as valid.
This fix resolves segmentation faults in parallel ORDER BY with LIMIT queries that involve JIT-compiled filters. The crash was caused by missing bind variable memory initialization and occurred both with explicit bind variables and when filters involved non-existing symbols. As a workaround, users can set cairo.sql.parallel.topk.enabled=false to disable the problematic execution plan.
This fix removes the unix line-ending configuration that was incorrectly changing byte sequences in the Web Console package downloaded from NPM, which was causing broken assets. The exclusion workaround is no longer needed.
This fix resolves an issue where Parquet exports would fail when the SQL query contained the NOW() function.
This fix resolves an issue where queries with LIMIT clauses would incorrectly return empty result sets.
This fix resolves a crash that occurs when ordering by a symbol column that was added via ALTER TABLE ADD COLUMN.
This fix resolves inefficient refresh queries when materialized views use small sampling intervals (e.g., microsecond) on tables with sparse data distribution. The improved estimation ensures optimal batch sizes for incremental refresh.
This fix resolves a data corruption issue that could occur during range replace transactions on partitions with column tops. The bug affected the O3OpenColumnJob#merge*Column() methods, which could overwrite data in the source partition, corrupting it. The issue was specific to range replace operations combined with column tops, which only occurs in materialized view scenarios.
This fix resolves incorrect content length headers in HTTP JSON responses that could cause client parsing issues or connection problems.
This fix prevents connection contexts from being closed and pooled multiple times when TLS session initialization fails, which could cause issues with subsequent connections.
This fix prevents unexpected exceptions that could occur when the schema of a Parquet file changes between reads, ensuring more robust file processing.
This fix resolves an issue where using SAMPLE BY ... FILL(NULL) on tables containing decimal columns and then filtering for NULL values in the outer query would throw UnsupportedOperationException: Record.getDecimal128(). The issue occurred because the FillRangeRecordCursorFactory.FillRangeRecord and SelectedRecord classes were missing implementations for the getDecimal*() methods. This fix adds support for all decimal types in both classes, enabling proper gap filling with null values for decimal columns and correct delegation of decimal column access to the base record.
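A sketch of the previously failing shape, assuming a hypothetical table with a DECIMAL column and QuestDB's null-comparison syntax:

```sql
-- Gap-fill hourly buckets, then filter the filled rows in the outer query
SELECT * FROM (
  SELECT timestamp, first(px) AS px FROM prices SAMPLE BY 1h FILL(NULL)
) WHERE px = null;
```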
This fix resolves an issue where queries with multiple LIMIT clauses would ignore the outer LIMIT when one of them was in a subquery. The problem was tracked by a boolean flag on the model that could not handle multiple LIMIT clauses properly. This fix moves the tracking to the node expression itself, allowing each LIMIT to be tracked independently by the node that represents it in the model tree. The redundant followedLimitAdvice() method was also removed from RecordCursorFactory as it served the same purpose as implementsLimit().
This fix ensures that when adding a symbol column via ALTER TABLE ADD COLUMN to a table with existing data, the symbol table's nullValue flag is properly set to true, since all existing rows will have null values for the new column.
This fix resolves an issue where attempting to update a TIMESTAMP column using bind variables would fail with the error "bind variable cannot be used [contextType=unknown, index=1]". The UPDATE statement now correctly handles TIMESTAMP columns when bind variables are used.
This fix resolves an issue where queries with aggregate functions inside CASE expressions were failing with "Aggregate function cannot be passed as an argument". The root cause was that checkForChildAggregates() in SqlOptimiser only traversed the lhs and rhs children of expression nodes, missing aggregate functions in CASE expressions, which store arguments in node.args. The optimizer now properly handles aggregate functions nested inside expressions with 3 or more arguments.
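A sketch of the previously failing shape (table and threshold are illustrative):

```sql
-- Aggregate nested inside a CASE expression, which stores its arguments in node.args
SELECT symbol,
       CASE WHEN count() > 1000 THEN 'active' ELSE 'quiet' END AS activity
FROM trades;
```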
QuestDB Enterprise 3.1.1 combines the latest advancements from OSS releases 9.1 and 9.2, bringing powerful new data types, nanosecond-precision time-series support, and significant query performance improvements. This release introduces native DECIMAL support with up to 76-digit precision, nanosecond timestamps, and delivers substantial performance gains through optimized ASOF JOINs.
Breaking Changes
This change enables multi-threaded execution for SAMPLE BY queries that combine timestamp and aggregate functions in their selected expressions. The improvement also optimizes execution plan selection for SAMPLE BY and GROUP BY queries by avoiding duplicate group keys and reusing existing ones. However, this introduces a breaking change in how designated timestamps are handled when selected multiple times under different aliases: the query engine now picks the first ordered column, with preference given to columns used in ORDER BY clauses or unaliased columns matching the designated timestamp name.
This fix resolves data inconsistency issues in chained materialized views with period refresh, where dependent views could contain rows no longer present in their base views. The change includes a breaking modification where MANUAL PERIOD views are no longer automatically refreshed when periods end; they must now be refreshed explicitly using REFRESH MATERIALIZED VIEW SQL, consistent with other MANUAL views.
This change removes the transaction scoreboard v1 implementation. Deployments previously configured with cairo.txn.scoreboard.format=1 will continue to function, but the configuration option is now ignored and the server will automatically use scoreboard v2 instead.
New Features
This feature integrates the DECIMAL type into QuestDB, making it available through SQL, Influx Line Protocol, and PostgreSQL Wire Protocol. The DECIMAL type supports precision up to 76 digits using the syntax DECIMAL(precision, scale), where precision defines the maximum count of digits and scale defines the number of fractional digits. This feature includes basic arithmetic operations (+, -, /, *, %), comparison operators, casting functions, aggregate functions (min, max, avg, sum, count), and support for GROUP BY, ORDER BY, SAMPLE BY, and CSV import/export. Unlike traditional numeric types, operations that cannot provide accurate values will throw an error, except for division, which returns a HALF_DOWN rounded quotient.
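A minimal declaration sketch using the syntax described above (table and column names are illustrative):

```sql
-- 18 total digits, 4 of them fractional
CREATE TABLE prices (
  ts TIMESTAMP,
  px DECIMAL(18, 4)
) TIMESTAMP(ts) PARTITION BY DAY;
```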
This feature introduces nanosecond precision timestamp support through the TIMESTAMP_NS type. The implementation includes core nanosecond arithmetic and parser/formatter functionality, with the timestamp resolution carried by the data type, similar to GEOHASH types. This feature supports SAMPLE BY, ASOF JOIN, materialized views, PARTITION BY, O3 operations, Parquet file writing, CSV text import, and integration with Influx Line Protocol clients for Java, C, and Python. The partition metadata includes timestamp type information, with legacy code defaulting to microsecond precision.
This function generates high-cardinality symbols with skewed distribution patterns, making it useful for testing algorithms that work with realistic symbol distributions where some values appear much more frequently than others.
This function provides array aggregation capabilities by returning the last array value in a group. This feature utilizes the specialized allocator for aggregate functions introduced in previous versions.
This function returns the most frequent value in each group, eliminating the need for complex subqueries involving count() operations and sorting. The implementation supports BOOLEAN, LONG, VARCHAR, STRING, SYMBOL, and DOUBLE data types with parallel execution capabilities. It uses a map-of-maps structure where outer maps store pointers to inner maps that aggregate per-key counts, which are then merged and scanned for the most frequent value. This feature is useful for identifying patterns like the most frequently traded symbol by a trader or the most common sensor state.
This feature adds weighted_avg(), weighted_stddev_rel(), and weighted_stddev_freq() functions, with an alias weighted_stddev() for weighted_stddev_rel(). According to Wikipedia, there are two variants for the unbiased estimator of population variance: one for frequency weights and another for reliability weights. A frequency weight represents the number of occurrences of the sample in the dataset, while a reliability weight represents the "importance" or "trustworthiness" of the given sample. The implementation provides both functions, with weighted_stddev_rel for reliability weights and weighted_stddev_freq for frequency weights.
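A hedged sketch, assuming a (value, weight) argument order, with illustrative table and column names:

```sql
-- qty serves as a frequency weight for each observed price
SELECT weighted_avg(price, qty) AS wavg,
       weighted_stddev_freq(price, qty) AS wstd
FROM trades;
```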
This feature introduces Right Outer Join and Full Outer Join functionality, expanding the range of join operations supported. For HashLightInnerJoin and HashLightFullOuterJoin, the implementation now supports limited runtime reordering of the left and right cursors, allowing the join to dynamically choose the more efficient build/probe table combination in certain scenarios. Right Outer Join provides users with an alternative when Left Outer Join performance is unsatisfactory.
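A minimal sketch with hypothetical tables:

```sql
-- Rows without a match on either side are kept, with NULLs on the missing side
SELECT o.id, f.fill_ts
FROM orders o
FULL OUTER JOIN fills f ON o.id = f.order_id;
```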
Improvements
This improvement introduces the Dense ASOF JOIN algorithm, a variant of the Light algorithm that uses binary search to quickly jump to the timestamp matching the first left-hand row instead of starting the scan at the top. This difference is highly important when the right-hand table has history that predates the first left-hand row, as the entire history will be skipped except for the recent history needed to find matches for the initial few left-hand rows. The optimization also enhances symbol-to-symbol joins for the existing Light and Fast algorithms by copying just the symbol key instead of the symbol string through the RecordSink.
This improvement leverages the symbol index on the right-hand side of ASOF JOIN, avoiding the need to visit any rows that don't have the matching symbol. The major use case is markout analysis, where dealing with illiquid assets that posted their last price far in the past would previously require scanning all rows backwards until finding a match. The optimization makes only one symbol index lookup per partition instead of scanning the full historical data for all symbols.
This improvement optimizes keyed ASOF JOIN operations on non-indexed symbol columns by implementing a memoization strategy. When searching for matching rows in the right-hand side table, the algorithm now remembers the location of each symbol it encounters and reuses that knowledge for future searches. It also remembers where symbols were not found, allowing it to skip already-scanned regions. This optimization is particularly effective for scenarios with low-frequency symbols where the same symbol appears rarely in the right-hand side table, reducing redundant linear scans from taking indefinite time to completing in seconds.
This improvement refines the memoized scan algorithm for ASOF JOIN on symbol columns by making drive-by caching optional through a query hint. While match caching (remembering found symbol locations) remains active by default with minimal overhead, drive-by caching (storing all symbols encountered during searches) is now behind a query hint due to its hash lookup overhead. This change improves performance in typical cases where drive-by caching doesn't provide sufficient benefit to offset its computational cost.
This improvement introduces deferred merge in parallel COUNT_DISTINCT() functions for int, ipv4, long, uuid, and long256 types. Instead of merging hash sets immediately when merging GROUP BY hash tables, the merge is deferred until the getLong() call. This avoids redundant hash set merges in queries with ORDER BY and LIMIT clauses where not all groups need their distinct counts calculated. Additional optimizations include inlining single keys in MapValue for COUNT_DISTINCT(int) and COUNT_DISTINCT(long), and implementing a faster FxHasher-based hash function with a reduced load factor.
This improvement extends the fast-path optimization for ORDER BY with LIMIT queries to work with virtual columns, expressions, and subqueries that previously used VirtualRecordCursorFactory and SelectedRecordCursorFactory. The optimization now supports both LONG and TIMESTAMP columns, works with vectorized GROUP BY, and improves CROSS JOIN performance for small right-hand side tables. This allows queries with expressions, constants, and complex select clauses to benefit from the same performance optimizations as simple column-based ordering.
This improvement includes several changes to enhance query engine scalability on multi-core machines. It fixes MapFragment#totalFunctionCardinality not being updated in all cases in AsyncGroupAtom, which caused sharding stats to fluctuate between enabled and disabled states for queries with high-cardinality COUNT_DISTINCT(). Cardinality statistics in AsyncGroupAtom are now calculated as a sum of recorded group-by function cardinalities instead of a maximum value, providing better estimates of the merge work required. The configuration properties cairo.page.frame.reduce.queue.capacity and cairo.sql.parallel.groupby.merge.shard.queue.capacity are now set to 4 * cpu_cores with a cap of 256 to improve worker thread utilization for parallel GROUP BY and filter queries. A new cairo.sql.parallel.filter.dispatch.limit config option (set to max(cpu_cores, 32) by default) limits in-flight page frame tasks for LIMIT queries to improve latency. Additionally, padding was added to PerWorkerLocks to avoid false sharing.
This improvement disables column pre-touch in parallel filters by default, as it had negative effects on query execution time in many scenarios despite improving performance in certain cases. The cairo.sql.parallel.filter.pretouch.enabled configuration property is now deprecated and ignored by the server. Column pre-touch can now be enabled per-query using a new SQL hint: SELECT /*+ ENABLE_PRE_TOUCH(trades) */ symbol, side, timestamp FROM trades WHERE symbol = 'BTC-USDT' and timestamp in today();
This improvement introduces a new hash map implementation that maps CharSequence to int without relying on heap allocation or garbage collection. The new implementation replaces the existing CharSequenceIntHashMap used in WalWriter, which previously allocated heap-based String objects, causing GC churn during WAL transaction commits. Benchmarks show approximately 12% performance improvement with almost no memory allocations compared to the previous implementation, along with reduced jitter.
This improvement deprecates the useFastAsOfJoin configuration property and the existing SQL hints avoid_asof_binary_search and avoid_lt_binary_search. It introduces new positive selection hints: asof_linear_search for the original algorithm with no optimizations, and asof_index_search for the algorithm that leverages the symbol index. These hints apply to both ASOF and LT JOIN variants.
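A sketch of the new hint form; table names are illustrative, and the alias-based hint parameters are assumed to follow the same convention as the ENABLE_PRE_TOUCH hint shown earlier:

```sql
-- Request the symbol-index-driven ASOF algorithm for this query only
SELECT /*+ asof_index_search(t q) */ t.timestamp, t.price, q.bid
FROM trades t
ASOF JOIN quotes q ON (symbol);
```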
Bug Fixes
This fix resolves out-of-memory errors that occurred with queries like SELECT COUNT(*) FROM read_parquet('hits.parquet') WHERE AdvEngineID <> 0; when processing parquet files with large row groups (up to 600K rows). The issue was caused by keeping row group buffers around for each page frame reduce task, which, combined with the lack of projections, easily led to OOM errors.
This fix resolves an issue where replicated WAL segments exceeding 4GiB in size would fail consistency checks. The consistency header previously stored the segment size as a u32, causing failures when segments exceeded this limit. The consistency check now correctly casts the u64 length to a u32, effectively using modulo 2^32, allowing the check to pass for larger segments.
This fix addresses network freeze issues where hundreds of stuck connections could prevent replication progress. Replication uploads and downloads now include configurable timeouts based on a base timeout and minimum throughput parameters. The timeout is recalculated and extended whenever upload or download progress is made. The default configuration includes replication.requests.base.timeout=10s and replication.requests.min.throughput=262144 (256KiB/s). This improvement also resolves error handling when object store operations succeed but are considered failed due to timeouts.
This fix resolves authentication failures that occurred when replica instances reloaded access control lists. The issue occurred because transaction numbers were updated before ACL reload completed, causing other workers to believe ACLs were current while still reloading. Requests for newly applied ACL changes like new users or tokens could fail authentication during this window. The transaction number update now occurs after ACL reload completion, ensuring all requests are processed with fully refreshed ACL models.
This fix resolves a race condition where multiple Influx Line Protocol connections competing to create the same table could encounter access denied errors. The issue occurred when the first connection made a table visible to others before owner permissions were granted, creating a brief window where the creator lacked access. Table creation and permission granting are now performed as a single atomic operation, with permissions granted before exposing the table.
This fix resolves an issue where WHERE clause filters like sym_col1 = sym_col2 were mistakenly JIT compiled. The compiled filter function was using internal symbol codes for the equality check, which produced incorrect results returned from the query. This fix disables JIT compilation for such filters to ensure correct query results.
This fix addresses two nanosecond precision issues: incorrect nanosecond precision handling when calculating TOLERANCE values for ASOF joins, and automatic timestamp type selection during CSV imports when timestamp columns are explicitly typed.
The LEAD() window function's third parameter (the default placeholder value) was not initialized correctly when using bind variables in the position lead(value, 1, ?), causing null reference exceptions. This fix ensures proper initialization of bind variables in this context.
This fix resolves scenarios where unsafe memory access operations in compiled Java code would cause InternalError exceptions during table operations. The issue occurred when the O3PartitionPurgeJob attempted to read transaction data while tables were being processed under high system load, resulting in memory access faults in the TxReader component.
This fix resolves incorrect result set size display in the Web Console when the first partition of a query result exceeded the configured maximum page frame size. The improvement modifies size calculation in page frame cursors to properly account for the remaining size of backing partitions and fixes issues where some factories did not account for rows already consumed from cursors. It additionally resolves an issue in PgAttrDefFunctionFactory where cursors were resetting automatically without calling toTop().
This fix prevents TableReferenceOutOfDateException errors in CheckWalTransactionsJob when tables are dropped while background jobs are running. The issue occurred during concurrent operations where cached query plans became invalid due to table schema changes or table removal, causing async jobs to fail with metadata version mismatches.
Fixed a bug where queries with bound parameters executed in a batch would return incorrect results. When multiple SELECT statements with different parameter values were sent together (pipelining), all queries would incorrectly use the parameter values from the last query in the batch instead of their own values. This fix affects PostgreSQL Wire Protocol clients that use pipelining, such as Npgsql (.NET), where multiple queries are sent before waiting for responses. Each query now correctly uses its own bound parameter values.
Fixed a race condition where multiple clients trying to drop the same non-WAL table simultaneously could cause an internal error. This issue affected only legacy (non-WAL) tables and occurred because the table was unlocked too early, allowing another thread to acquire a lock for a non-existing table and attempt to delete an already-deleted directory.
Fixed protocol corruption that occurred when the server sent array data to PostgreSQL Wire Protocol clients. This fix resolves issues where queries returning array columns could cause communication errors between the server and client.
Enhanced the HTTP header parser to support legacy cookie Expires date formats, including HTTP 1.0 era 2-digit year formats and ANSI C asctime formats. This improvement adds fallback parsing patterns for deprecated but still-used date formats like Mon, 20-Oct-25 15:57:56 GMT, ensuring RFC 2616 compliance while maintaining backward compatibility. The enhancement includes automatic year adjustment for 2-digit years and minimal performance impact, as legacy patterns are only tried after standard formats fail.
QuestDB 9.2.2 is another stability release, primarily fixing some query snags. With this, we've packaged a performance improvement for vertically-scaled machines, and some new SQL functions.
Breaking Changes
This change removes transaction scoreboard v1. Deployments explicitly configured with cairo.txn.scoreboard.format=1 will continue to function, but the option will be ignored and the server will use scoreboard v2.
This fix resolves inconsistent materialized view data that could be observed on chained views with period refresh, where the dependent view may end up with rows that are no longer present in its base view. Additionally, MANUAL PERIOD views are no longer automatically refreshed when a period ends; the only way to refresh a MANUAL PERIOD view is now via REFRESH MATERIALIZED VIEW SQL.
New Features
These new functions provide weighted statistical calculations with support for both frequency weights (representing number of occurrences) and reliability weights (representing importance or trustworthiness). The WEIGHTED_STDDEV_REL() function calculates standard deviation for reliability weights, while WEIGHTED_STDDEV_FREQ() handles frequency weights. An alias WEIGHTED_STDDEV() is provided for WEIGHTED_STDDEV_REL(). These functions are particularly useful for financial calculations such as markout curve analysis.
Improvements
This improvement enhances query engine scalability through several optimizations: it fixes cardinality statistics calculation in AsyncGroupAtom to use sum instead of maximum values, sets queue capacities to 4 * cpu_cores with a cap of 256 for better worker thread utilization, introduces the cairo.sql.parallel.filter.dispatch.limit configuration option to improve LIMIT query latency, and adds padding to PerWorkerLocks to avoid false sharing. These changes significantly improve performance for parallel GROUP BY and filter queries on multi-core systems.
This improvement modernizes the Web Console login interface with an updated visual design and enhanced user experience.
Bug Fixes
This fix resolves a bug where queries using IN clauses with INT/LONG columns containing -1 could return incorrect results. The root cause was that the internal HashSet implementation used -1 as a special marker value, which conflicted with actual -1 values in the data, causing incorrect query results.
This fix resolves an issue where multiple clients trying to drop the same non-WAL table could result in an internal error. The problem occurred because CairoEngine unlocked the table too early, allowing another thread to acquire a lock for a non-existing table and attempt to delete a directory that no longer exists. This issue affects only legacy (non-WAL) tables.
This fix resolves protocol corruption that occurred when the server sent array data to PostgreSQL Wire Protocol clients. The issue was particularly noticeable when fetching large result sets containing array columns in batches.
This fix updates how null/default values for decimal types are handled in both TableWriter and WalWriter, switching from hardcoded primitive constants to centralized constants in the Decimals class. This improvement ensures that decimal columns default to null values when not provided in Influx Line Protocol data, improving consistency and maintainability in handling decimal nulls across the system.
This fix resolves internal errors that occurred when the database was too lenient with type checking, particularly affecting window functions with default values, array access operations, and random generation functions. The fix renames ColumnType.isAssignableFrom() to isConvertibleFrom() for clarity and replaces problematic invocations with isBuiltInWideningCast() or isSameOrBuiltInWideningCast(). These changes ensure that type conversion errors now fail gracefully with sensible error messages instead of causing internal failures.
This fix resolves a bug where queries with bound parameters executed in a batch would return incorrect results. When multiple SELECT statements with different parameter values were sent together using PostgreSQL Wire Protocol pipelining, all queries would incorrectly use the parameter values from the last query in the batch instead of their own values. This particularly affected PostgreSQL clients that use pipelining, such as Npgsql (.NET); each query now correctly uses its own bound parameter values.
This fix ensures that images are properly prefetched when handling connection error retry scenarios in the Web Console, preventing display issues during reconnection attempts.
This fix resolves issues with asset loading when QuestDB is deployed under a custom base path, ensuring that CSS, JavaScript, and other static resources are correctly served relative to the configured base URL.
QuestDB Enterprise 3.1.0 combines the latest advancements from OSS releases 9.1 and 9.2, bringing powerful new data types, nanosecond-precision time-series support, and significant query performance improvements. This release introduces native DECIMAL support with up to 76-digit precision, nanosecond timestamps, and delivers substantial performance gains through optimized ASOF JOINs.
Breaking Changes
This improvement enables multi-threaded execution for SAMPLE BY queries that combine timestamp and aggregate functions in selected expressions. The optimization improves execution plan selection for SAMPLE BY and GROUP BY queries by avoiding duplicate group keys and reusing existing ones. When the designated timestamp is selected multiple times under different aliases, the query engine now picks the first ordered column, with preference given to columns used in ORDER BY clauses or unaliased columns matching the designated timestamp name.
New Features
This feature integrates the DECIMAL type into the database, making it available through SQL, Influx Line Protocol, and Parquet. The implementation supports decimals with precision up to 76 digits using the syntax DECIMAL(precision, scale), where precision defines the maximum count of digits and scale defines the number of fractional digits. The feature includes basic arithmetic operations (+, -, /, *, %), comparison operators, implicit casting from integer types, and support for aggregate functions like min, max, avg, sum, and count. Unlike traditional floating-point types, operations that cannot provide accurate values will throw an error, except for division, which returns a HALF_DOWN rounded quotient. The type is fully integrated across all QuestDB interfaces including CSV import/export, PostgreSQL Wire Protocol, and the Web Console.
This feature introduces nanosecond precision timestamp support through the TIMESTAMP_NS data type. The implementation includes core nanosecond arithmetic and parser/formatter functionality, with the timestamp resolution carried by the TIMESTAMP data type, similar to GEOHASH types. The feature supports all major QuestDB operations including SAMPLE BY, ASOF JOIN, materialized views, O3 ingestion, partition management, and Parquet file writing. The timestamp type information is stored in table metadata and properly handled across CSV imports, Influx Line Protocol clients (Java, C, and Python), and partition purge operations.
This function generates high-cardinality symbols with skewed distribution patterns, making it particularly useful for testing algorithms that need to handle realistic symbol distributions with varying frequencies.
This function extends array aggregation capabilities by returning the last array value in a group. The implementation utilizes the specialized allocator for aggregate functions introduced in previous versions, providing efficient memory management for array operations.
This function returns the most frequent value in each group, eliminating the need for complex subqueries involving count() operations and sorting. The implementation supports BOOLEAN, LONG, VARCHAR, STRING, SYMBOL, and DOUBLE data types with parallel execution capabilities. It uses a map-of-maps architecture where outer maps store pointers to inner maps that aggregate per-key counts, which are then merged and scanned for the most frequent value. This is useful for scenarios like finding the most frequently traded symbol by a trader or identifying the most common sensor state.
This feature introduces support for RIGHT OUTER JOIN and FULL OUTER JOIN operations, expanding the range of join operations available. For HashLightInnerJoin and HashLightFullOuterJoin, the implementation now supports limited runtime reordering of the left and right cursors, allowing the join to dynamically choose the more efficient build/probe table combination when the right table is significantly larger than the left table. This provides users with alternative join strategies when LEFT OUTER JOIN performance is unsatisfactory.
Improvements
This improvement introduces the Dense ASOF JOIN algorithm, a variant of the Light algorithm that uses binary search to quickly jump to the timestamp matching the first left-hand row, rather than starting the scan from the top of the right-hand table. This optimization is particularly effective when the right-hand table contains extensive history that predates the first left-hand row, as the entire history is skipped except for recent data needed to match initial left-hand rows. The implementation also optimizes symbol-to-symbol joins through the RecordSink by copying symbol keys instead of symbol strings, and includes specialized implementations for critical performance scenarios like markout analysis.
This improvement leverages the symbol index on the right-hand side of ASOF JOIN operations, avoiding the need to visit rows that don't have matching symbols. This optimization is particularly beneficial for markout analysis scenarios where illiquid assets may have their last price posted far in the past. Instead of scanning backwards through all rows until finding a match, the implementation performs only one symbol index lookup per partition, dramatically reducing the search space for rare symbols.
This improvement optimizes keyed ASOF JOIN operations on non-indexed symbol columns by implementing memoization for symbol locations. When searching for matching right-hand side rows, the implementation now remembers the location of each symbol encountered during the search and reuses that knowledge for future lookups of the same symbol. It also remembers where symbols were not found, allowing it to skip already-scanned regions of the right-hand side table. This optimization is particularly effective for tables with rare symbols that occur infrequently, eliminating redundant scans when the same low-frequency symbol appears in multiple left-hand side rows.
This improvement introduces selective caching for the Memoized Scan algorithm in ASOF JOIN operations on symbol columns. The optimization keeps Match Caching (caching symbol locations after finding matches) active while placing Drive-By Caching (caching all symbols encountered during searches) behind a query hint. This change reduces hash lookup overhead in typical cases where few symbols occur in the deep past of the right-hand side table, resulting in better performance for most ASOF JOIN queries.
This improvement introduces a deferred merge strategy for parallel COUNT_DISTINCT() functions supporting int, ipv4, long, uuid and long256 types. Hash set merging now occurs during getLong() calls instead of when merging GROUP BY hash tables, avoiding redundant merges in queries with ORDER BY and LIMIT clauses. The optimization includes single-key inlining for COUNT_DISTINCT(int) and COUNT_DISTINCT(long) functions, and implements a faster FxHasher-based hash function with a reduced load factor for better performance.
ORDER BY+LIMITqueries to work with virtual columns, expressions, andVirtualRecordCursorFactorycases that were previously excluded. The optimization now supportsTIMESTAMPcolumns in addition toLONGcolumns, includes fast-path support for vectorizedGROUP BYoperations, and improvesCROSS JOINperformance for small right-hand side tables. The enhancement enables fast-path execution for queries with virtual columns, subqueries, and complex expressions in the select list.This improvement disables column pre-touch in parallel filters by default to improve cache efficiency and query performance in most scenarios. The
cairo.sql.parallel.filter.pretouch.enabledconfiguration property is now deprecated and ignored by the server. Column pre-touch can now be enabled per-query using theENABLE_PRE_TOUCHSQL hint, providing more granular control over when this optimization is applied.This improvement introduces a new hash map implementation that maps
This improvement introduces a new hash map implementation that maps CharSequence to int without relying on heap allocation or garbage collection. The WalWriter implementation is updated to use this new hash map, replacing the existing CharSequenceIntHashMap, which allocated heap-based String objects and caused GC churn during WAL transaction commits. Benchmarks show almost no allocations in the new implementation compared to the previous version, with a 12% performance improvement and reduced jitter.
This improvement deprecates the useFastAsOfJoin configuration property and the existing hints avoid_asof_binary_search and avoid_lt_binary_search. The new hints positively select algorithms: asof_linear_search selects the original algorithm with no binary search or index scan optimizations, while asof_index_search requests the algorithm that leverages the symbol index. These hints apply to both ASOF and LT JOIN variants.
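As a hedged sketch of positive algorithm selection (the tables and columns are hypothetical, and it is an assumption that these hints take the join aliases as arguments, like other QuestDB join hints):

    -- request the symbol-index-driven algorithm
    SELECT /*+ asof_index_search(t p) */ t.ts, t.sym, p.price
    FROM trades t
    ASOF JOIN prices p ON (sym);

    -- force the original linear-scan algorithm
    SELECT /*+ asof_linear_search(t p) */ t.ts, t.sym, p.price
    FROM trades t
    ASOF JOIN prices p ON (sym);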
Bug Fixes
This fix resolves out-of-memory errors that occurred with queries on large Parquet files with substantial row groups. Previously, row group buffers were kept around for each page frame reduce task, causing memory exhaustion when processing files with large row groups (up to 600K rows) combined with a lack of projections.
This fix resolves an issue where replicated WAL segments exceeding 4GiB in size failed consistency checks. The consistency header stored the segment size as a u32, causing failures when segments grew beyond this limit. The consistency check now correctly casts the u64 length to a u32, effectively using modulo 2^32, allowing the check to pass for larger segments.
This improvement addresses network freeze issues where stuck connections could prevent progress for the entire QuestDB process. Uploads and downloads now support configurable timeouts driven by a base timeout and minimum throughput parameters (replication.requests.base.timeout=10s and replication.requests.min.throughput=262144 by default). The timeout is recalculated and extended whenever progress is made during an upload or download, allowing connections to be closed and re-established when they become unresponsive.
This fix resolves authentication failures that occurred when replica instances reloaded access control lists. The issue was that the ACL reload updated transaction numbers before the reload actually completed, causing other workers to believe ACLs were up-to-date while they were still reloading. This resulted in requests that depended on newly applied ACL changes failing authentication during the reload window. The transaction number update has been moved to the end of the ACL reload to ensure all requests are processed after the ACL model is fully refreshed.
This fix resolves a race condition where multiple Influx Line Protocol connections competing to create the same table could encounter access denied errors. When the first connection created a table, it made the table visible to other connections before owner permissions were granted, leaving a brief window where the creator lacked access. This improvement makes table creation and permission granting a single atomic operation by granting permissions before exposing the table.
This fix resolves an issue where WHERE clause filters like sym_col1 = sym_col2 were mistakenly JIT compiled. The compiled filter function used internal symbol codes for the equality check, which produced incorrect query results. This patch disables JIT compilation for such filters to ensure correct behavior.
This fix resolves two nanosecond precision issues: incorrect nanosecond precision handling when calculating TOLERANCE values for ASOF joins, and automatic timestamp type selection during CSV imports when timestamp columns are explicitly typed.
This fix resolves an issue where a bind variable used as the third parameter (the default value) of the LEAD() window function, as in lead(value, 1, ?), was not initialized correctly, leading to null reference exceptions.
This fix resolves scenarios where unsafe memory access operations in compiled Java code caused InternalError exceptions during table operations, particularly affecting transaction reading and partition purge operations under high system stress.
This fix resolves incorrect result set size display in the Web Console when the first partition of a query's result exceeded the configured maximum page frame size. The improvement modifies size calculation in page frame cursors to account for the remaining partition size and fixes issues where some factories did not account for rows already consumed from the cursor.
This fix resolves TableReferenceOutOfDateException errors that occurred when async jobs attempted to access tables that had been dropped or recreated during concurrent operations. The issue manifested in the CheckWalTransactionsJob when cached query plans referenced outdated table metadata, causing the job to fail when verifying table tokens against tables that no longer existed or had different metadata versions.
This fix adds support for parsing legacy HTTP cookie Expires values that use deprecated date formats from the HTTP 1.0 era and ANSI C asctime formats. The parser now handles 2-digit year formats like Mon, 20-Oct-25 15:57:56 GMT and ANSI C formats like Sun Nov 6 08:49:37 1994. These patterns are added as fallback formats after the standard HTTP formats, ensuring backward compatibility while maintaining RFC 2616 compliance for all three recommended HTTP date formats.
This release is a stability and performance update in the QuestDB 9.x line. It focuses on tightening SQL correctness, improving latency, and hardening edge cases around the ingestion protocol and materialized views, with a few small quality-of-life features.
New Features
This feature extends the FIRST_NOT_NULL() function to support array column types, providing the ability to find the first non-null array value in a dataset. The implementation follows the same pattern as other first/last aggregate functions while optimizing memory usage by including the boolean column type only where strictly needed.
These functions provide file system access and pattern matching capabilities. The files(s) function performs recursive directory scanning, glob(s) accepts glob patterns for file matching, and glob(Ss) applies glob pattern matching to strings as an alternative to regex and LIKE matching. Example usage: SELECT glob('file1.txt', 'fi?e*.txt') or files('/import/trades') WHERE glob(path, '*') LIMIT 3.
This feature introduces new API endpoints /api/v1/sql/validate and /api/v1/sql/execute for SQL validation and execution. The existing /exec API remains supported but is now deprecated. The implementation includes enhanced fuzz testing infrastructure and fixes for the JSON processor, the CSV/Parquet export processor, and the HTTP client.
This feature introduces new PERIOD (SAMPLE BY INTERVAL) syntax for period materialized views, providing an easier way to write PERIOD (LENGTH 1d TIME ZONE 'Europe/London'). Period materialized views now support the second unit ('s') for length and delay intervals, in addition to the existing minutes ('m'), hours ('h'), and days ('d') units. The syntax cannot be combined with SAMPLE BY ... WITH OFFSET ....
This feature allows charts to automatically scale the Y-axis based on actual data values instead of forcing zero as the minimum. This improvement makes small variations in large baseline values clearly visible, which is particularly useful for monitoring scenarios like temperature or pressure readings where the variations are small but significant. The chart now dynamically adjusts to show the actual data range, eliminating wasted chart space and providing better data visualization.
Improvements
This improvement reduces allocation rate when parsing HTTP headers by extending the internal map with a method to store keys without creating defensive copies. Previously, the header parser would pool HTTP headers but still create defensive copies of all keys when storing them in a map, generating unnecessary garbage.
This improvement eliminates unnecessary DirectIntList instance creation during OrderedMap resizing operations and removes redundant address calculations for offset and hash code operations. The optimization reduces garbage collection pressure and improves performance across various map sizes, with particularly notable improvements in merge operations.
This improvement significantly reduces query latency spikes by moving expensive memory unmapping operations to background threads. Users running latency-sensitive workloads will experience more consistent query response times, particularly at p99 and above. The feature is opt-in and can be enabled by setting cairo.file.async.munmap.enabled=true in the server configuration. Benchmarks show substantial improvements in latency distribution, with average latency dropping from 46.46ms to 16.02ms and P99 latency improving from 359.68ms to 136.19ms.
This improvement introduces a specialized cursor factory that emits CROSS JOIN output directly in the required order without materializing the orders table. The optimization significantly improves performance for markout curve queries that generate sampling points over the markout horizon of orders. In benchmarks with 3 million orders and 121 sampling points each, query time improved from 135 seconds to 17 seconds while RAM usage dropped from 10.7 GB to 2.4 GB.
This improvement introduces cardinality statistics for count_distinct() functions and optimizes parallel keyed GROUP BY operations. The enhancement includes eager merging of parallel count_distinct() functions, lightweight cardinality reporting, and automatic switching to map sharding for high-cardinality operations. The number of map shards is now fixed at 256 for better CPU cache locality, and the default cairo.sql.parallel.groupby.sharding.threshold is reduced to 10k for more frequent parallel merge operations.
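A minimal sketch of the query shape these GROUP BY changes target (the events table and its columns are hypothetical):

    -- parallel keyed GROUP BY with count_distinct(); the engine switches to
    -- map sharding automatically for high-cardinality groups
    SELECT sym, count_distinct(user_id) AS uniques
    FROM events
    GROUP BY sym;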
Bug Fixes
This fix addresses a compatibility issue where Sender implementations were encoding ColumnType.NULL when sending null array values to the server. Since ColumnType.NULL changes each time a new column type is added, this created backwards compatibility problems. Senders now omit field values for null arrays, excluding the column from the Influx Line Protocol message entirely, consistent with how other column types handle null values.
This fix adds support for UUID binding variables within JIT-compiled filters. Previously, PostgreSQL Wire Protocol queries eligible for JIT compilation that used UUID parameters would fail with an "unsupported bind variable type: UUID" error. The fix eliminates the explicit casting workarounds that were previously required to disable JIT compilation.
This fix resolves a bug in the optimizeOrderBy pass that was incorrectly removing ORDER BY clauses in sub-queries of aggregations. Functions like first(), last(), first_not_null(), and last_not_null() require rows to be received in the expected order to work properly. The optimizer now correctly preserves ordering requirements for these aggregate functions.
This fix resolves a compilation error that occurred when using aggregate functions with unary minus operators. The optimizer code previously assumed binary - and + operations and failed to perform null checks for unary cases. The fix adds proper null checking and validation to ensure the operation is binary when expected.
This fix corrects the LimitRecordCursorFactory.size() method, which was returning incorrect results for LIMIT x, y queries when the underlying base cursor had fewer rows than y - x. The fix ensures proper size calculation by calling countRows() when the base cursor size is unknown, following the existing pattern used in other branches for consistency.
This fix resolves a SQL parsing issue for materialized view ADD INDEX statements when the SQL ends with a semicolon.
This fix ensures that when a SAMPLE BY query specifies only one fill option, the logic for reorganizing fill values based on the positional relationship between TopDownColumns and BottomUpColumns remains consistent across all aggregation columns.
This fix handles TableReferenceOutOfDateException when ALTER TABLE is re-compiled inside the WAL apply process. Additionally, it removes error logging for wa*.lock file lock errors (now logged at debug level) and fixes incorrect writerTxn values in wal_tables queries.
This fix resolves an internal error that caused certain ASOF JOIN queries to fail unexpectedly.
This fix prevents query hints specified in the main SELECT statement from incorrectly being applied to Common Table Expressions (CTEs) within the same query.
This fix resolves conditions within the Limit factory that were causing incorrect row handling in subqueries combined with UNION operations.
This fix ensures that Parquet export properly handles empty tables and partitions by generating valid Parquet files containing only schema information, preventing export failures that could previously occur.
This fix resolves an issue where the questdb_json_queries_connections gauge was being decremented by sources like 'export' and 'other' that would not increment the gauge, causing negative values.
This fix moves the column width saving logic before the setColumnWidth call to prevent using stale column width references when arranging displayed array cell contents.
This release introduces significant new features including support for Decimal data types and symbol capacity auto-scaling enabled by default. Key improvements include critical fixes for storage corruption during deduplicate writes, enhanced ASOF JOIN performance for dense data interleaving, and better handling of async jobs under extreme system load. The release also transitions to using GraalVM CE JDK and includes various performance optimizations and infrastructure improvements.
New Features
This feature integrates the DECIMAL type into the database, making it available through SQL, Influx Line Protocol, and PostgreSQL Wire Protocol. The implementation supports decimals with a precision of 76 or less using the syntax DECIMAL(precision, scale), where precision defines the maximum count of digits and scale defines the number of fractional digits. The feature includes basic arithmetic operations (+, -, /, *, %), comparison operators, type casting functions, aggregate functions (min, max, avg, sum, count), and math functions (abs, round, sign, floor, ceil). Unlike traditional floating-point types, when an operation cannot produce an accurate value, the database throws an error; division is the exception and returns a HALF_DOWN-rounded quotient.
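A brief sketch of the new type in use, assuming a hypothetical fills table (the precision and scale values are illustrative; precision must be 76 or less):

    CREATE TABLE fills (
      ts TIMESTAMP,
      qty DECIMAL(18, 4),
      price DECIMAL(18, 8)
    ) TIMESTAMP(ts) PARTITION BY DAY;

    -- exact arithmetic and aggregation on DECIMAL columns
    SELECT sum(qty * price) AS notional FROM fills;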
This feature introduces a new hash map implementation that maps CharSequence to int without heap allocation or garbage collection, replacing the existing CharSequenceIntHashMap in WalWriter, which allocated heap-based String objects and caused GC churn during WAL transaction commits. Benchmarks show approximately 12% better performance with almost zero allocations (0.622 B/op vs 72000.708 B/op) and reduced jitter compared to the original implementation.
Improvements
This improvement automatically adjusts symbol table capacity as needed, eliminating the need for manual configuration in most use cases.
This improvement introduces the Dense ASOF JOIN algorithm, a variant of the Light algorithm that uses binary search to quickly jump to the timestamp matching the first left-hand row instead of starting the scan at the top. This optimization is most valuable when the right-hand table has history that predates the first left-hand row, as the entire history is skipped except for the recent history needed to match the initial left-hand rows. The improvement also optimizes symbol-to-symbol joins for the existing Light and Fast algorithms by copying just the symbol key instead of the symbol string. In benchmarks with 167.5 million trade rows and 1.01 billion price rows, this feature achieved 730 milliseconds of execution time compared to 9.7 seconds with the Light algorithm.
This improvement transitions QuestDB releases to the GraalVM Community Edition JDK instead of the previous JDK distribution. The change has been validated through release CI and Docker builds to ensure compatibility and functionality.
Bug Fixes
This fix prevents TableReferenceOutOfDateException errors that could occur when async jobs attempt to access tables that have been dropped or recreated during concurrent operations.
This fix prevents InternalError crashes that could occur in async jobs when the system is under extreme memory pressure, particularly affecting operations like O3 partition purging.
This fix resolves a critical storage corruption issue that could occur when deduplication operations resulted in identical data, particularly during materialized view refreshes. The bug could cause double memory close errors and incorrect directory deletions.
This fix resolves incorrect result set size display in the Web Console when the size of the first partition exceeded the configured maximum page frame size. The fix modifies size calculation in page frame cursors to account for the remaining size of the backing partition and addresses related bugs where some factories did not account for rows already consumed from the cursor.
This fix prevents out-of-memory errors when executing queries on large Parquet files with substantial row groups. The issue occurred because row group buffers were being retained for each page frame reduce task during parallel execution, leading to memory exhaustion when processing files with large row groups (up to 600K rows).
This fix resolves an issue where the query extraction logic incorrectly processed single quotes inside SQL comments, causing multi-statement queries to be treated as a single query. The improvement adds proper comment detection when toggling quote state parsing.
This release resolves a critical storage corruption bug introduced in v3.0.0 that could cause database crashes when writing to materialized views or tables with deduplication enabled. The issue occurred when transactions committed without changing data, such as re-inserting identical rows. We strongly recommend upgrading to v3.0.5 immediately.
Bug Fixes
This fix resolves a critical storage issue that was very likely to occur during materialized view refresh operations that produced the same data as was already present in the view. The bug could cause double memory close operations and deletion of incorrect directories on the system.
QuestDB 9.1.1 focuses on data interoperability, nanosecond precision, and analytical performance. This release introduces Parquet export, a faster ASOF JOIN for sparse symbol data, and several engine optimizations improving both query throughput and stability.
Breaking Changes
This feature introduces several nanosecond timestamp functions, including now_ns() for the current timestamp with nanosecond precision, rnd_timestamp_ns() for generating random timestamps with nanoseconds, and timestamp_sequence_ns() for creating timestamp sequences with nanosecond precision. The extract() function now supports extracting nanoseconds from timestamps. Additionally, this introduces MillsTimestampDriver for better consistency between the Date and Timestamp data types. However, rnd_timestamp() now only returns timestamp_micro precision, and timestamp_sequence() only returns timestamp_micro precision without considering the first argument's type for timestamp type determination.
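A small sketch of the nanosecond variants (it is an assumption here that timestamp_sequence_ns() mirrors the signature of the microsecond timestamp_sequence(start, step); the argument values are illustrative):

    -- current wall-clock time at nanosecond resolution
    SELECT now_ns();

    -- a generated series of nanosecond timestamps, one per row
    SELECT timestamp_sequence_ns(0, 1000) AS ts
    FROM long_sequence(5);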
This fix enables multi-threaded execution for SAMPLE BY queries with selected expressions mixing timestamp and aggregate functions. It also improves execution plan selection for SAMPLE BY and GROUP BY queries by avoiding duplicate group keys and reusing existing ones. When the designated timestamp is selected multiple times under different aliases, the query engine now picks the first ordered column. If a selected column is used as the first ORDER BY column, it takes preference regardless of its position in the SELECT clause. Unaliased columns matching the designated column name are preferred over aliased ones.
New Features
This function is useful for testing algorithms with high-cardinality symbols that have a skewed distribution of rows.
This feature adds support for exporting Parquet files via COPY, improving ergonomics and making it easier for users to extract data as Parquet files without blocking writes. The feature supports exporting entire tables or subquery results with configurable options, including compression codec, compression level, row group size, data page size, statistics, Parquet version, and raw array encoding. HTTP export is also supported via the /exp endpoint with the fmt=parquet parameter.
This feature introduces protocol v2, where timestamps passed to the client in nanoseconds are sent as nanoseconds on the wire, while microseconds are sent as microseconds. The unit used to send data on the wire now drives the column type when tables or columns are created via Influx Line Protocol. If data is sent in nanoseconds, the column type will be TIMESTAMP_NS; if sent in microseconds, it will be TIMESTAMP. The feature also includes overflow checking for TIMESTAMP_NS columns and fixes the reset() method on the HTTP sender.
This feature implements the last(array) aggregate function using the specialized allocator for aggregate functions. The function allows users to retrieve the last array value in a group or partition, extending QuestDB's array handling capabilities for time-series analysis.
This feature puts Drive-By Caching behind a query hint for the Memoized Scan algorithm in ASOF JOIN on symbol columns. The algorithm uses two types of caching: Match Caching (a single hash lookup per row) and Drive-By Caching (a hash lookup for every visited row). While Drive-By Caching benefits queries with many symbols in the deep past of the table, it introduces overhead in typical cases where fewer such symbols exist, making the query hint necessary for optimal performance.
Improvements
This improvement allows the read_parquet() function to automatically detect when Parquet data is already sorted by timestamp, enabling more efficient query processing. The optimization also includes updates to the compression library versions for zstd, snappy and other algorithms in parquet2.
This improvement disables column pre-touch in parallel filters by default, as it negatively impacted query execution time in many scenarios. The cairo.sql.parallel.filter.pretouch.enabled configuration property is now deprecated and ignored. Column pre-touch can now be enabled per-query using the SQL hint /*+ ENABLE_PRE_TOUCH(table_name) */ for specific use cases where it provides performance benefits.
This improvement extends the fast-path optimization for ORDER BY with LIMIT queries to virtual columns, expressions, and VirtualRecordCursorFactory cases that were previously excluded. The optimization now supports vectorized GROUP BY, applies to TIMESTAMP columns in addition to LONG columns, and improves CROSS JOIN performance for small right-hand side tables. This enhancement allows more complex queries with subqueries, expressions, and virtual columns to benefit from the fast path.
This improvement introduces deferred merge in parallel COUNT_DISTINCT() functions for the int, ipv4, long, uuid and long256 types. Hash set merging is now performed during getLong() calls instead of when merging GROUP BY hash tables, avoiding redundant hash set merges in queries with ordering and limits. The optimization includes inlining single keys in MapValue for COUNT_DISTINCT(int) and COUNT_DISTINCT(long), and introduces a faster FxHasher-based hash function with the load factor reduced from 0.75 to 0.5.
This improvement addresses performance issues when keyed ASOF JOIN operations encounter symbols that occur rarely in the right-hand side table. Previously, the linear scan would restart from scratch for each left-hand side row, causing repeated expensive searches for the same rare symbols. This optimization introduces memoization that remembers the location of each symbol encountered during the search and reuses that knowledge for subsequent scans. It also tracks regions where symbols were not found, allowing the scan to skip already-searched areas of the right-hand side table.
This improvement provides clearer and more helpful error messages when users reference invalid columns in JOIN queries, making it easier to identify and resolve query issues.
This improvement optimizes ASOF JOIN operations on single symbols by re-indexing symbols and joining on integer values rather than symbol text. The optimization is activated with the ASOF_LINEAR_SEARCH hint and represents a variant of the linear ASOF JOIN algorithm, providing significant performance gains for single-symbol join scenarios.
Bug Fixes
When a query was cancelled, GroupByVectorAggregateJob would log the error at critical level because it wasn't handling errors properly. This fix ensures proper error handling during query cancellation.
In the case of a ~1B-row table and daily sampling, estimateRowsPerBucket() could overflow and return a very small estimate, leading to an overly large interval size being chosen for the materialized view refresh.
This fix resolves connection leaks caused by broken operation id ordering in the Windows I/O dispatcher. The bug could manifest when opening a TCP connection but writing to it only after some time, allowing the heartbeat mechanism to kick in. This fix also moves the operation and event id generation code to AbstractIODispatcher for consistency across all I/O dispatcher implementations, and switches the OSX dispatcher to a long counter to prevent overflow issues.
This fix prevents custom Web Console configurations from being overridden during QuestDB upgrades, preserving user customizations.
This fix resolves an overflow issue in SampleByIntervalIterator when calculating high boundaries on large SAMPLE BY intervals with relatively large step values. A new configuration property, cairo.mat.view.max.refresh.step (default 1y), was introduced to define the maximum step interval, after which refresh falls back to a single SAMPLE BY interval to prevent overflow issues.
This fix ensures that interval constants are properly recognized and applied as time intrinsics in SQL expressions.
This fix addresses two issues with the SAMPLE BY to parallel GROUP BY rewrite optimization. First, queries mixing timestamp and aggregate function expressions were producing unordered and incorrect results due to improper rewrite handling; the rewrite is now disabled for such queries. Second, the rewrite could incorrectly omit columns from the result set when expressions involved the designated timestamp column.
This fix resolves a null pointer exception that could occur during SQL query optimization in the ORDER BY rewrite process.
This fix resolves an issue where the lead() window function's third parameter (the default value) was not initialized correctly when a bind variable was used in that position, as in lead(value, 1, ?), causing null reference exceptions during execution.
This fix resolves a critical but rare edge case where the server could fail to start with a NullPointerException during initialization. The issue was triggered by a prior execution of a SQL UPDATE on a table that was open for reading.
This fix resolves an issue where the indexer column top was not reset on truncate when an indexed symbol column was added to a non-partitioned table, causing errors on subsequent index rebuilds. The issue was specific to non-partitioned tables and was discovered through fuzz testing.
This fix resolves an issue where WHERE clause filters like sym_col1 = sym_col2 were incorrectly JIT compiled. The compiled filter function used internal symbol codes for the equality check, which produced incorrect query results. This fix disables JIT compilation for such symbol column equality filters to ensure accurate results.
This fix resolves an issue where Parquet export would fail when processing timestamps in descending order. The problem occurred because checking the timestamp index alone was insufficient to determine whether the timestamp was designated; the scan direction must also be considered for proper validation.
This fix resolves parsing errors for cookie Expires values using legacy date formats. The HTTP header parser now supports HTTP 1.0 era 2-digit year formats and ANSI C asctime formats as fallback options after the standard formats. The new patterns include E, d-MMM-yy HH:mm:ss Z for HTTP 1.0 and E MMM dd HH:mm:ss yyyy for ANSI C formats. This maintains RFC 2616 compliance while ensuring backward compatibility with legacy systems that still use deprecated date formats.
This fix resolves an issue on busy Linux systems where the server could fail to accept connections when many clients attempt to open new connections simultaneously. The problem manifested as client connection timeouts during high-load scenarios.
This fix resolves an issue where the query extraction logic was incorrectly processing single quotes inside SQL comments, causing multiple separate queries to be treated as a single query. The improvement adds proper comment detection when toggling quote parsing states.
This release adds Parquet export support and OAuth2 token authentication for PostgreSQL Wire Protocol, along with various bug fixes for replication, ILP, and SQL operations.
New Features
This feature introduces OAuth2 token authentication support for PostgreSQL Wire Protocol connections. It includes a new configuration key to enable the feature (disabled by default), and when enabled, passwords are treated as OAuth2 tokens if the username is '_sso' or no username is provided.
This feature adds a new symbolTableSize column to the SHOW COLUMNS output that displays the number of symbol dictionary codes present in the symbol table. This improvement also fixes table reader initialization for zero-column tables, which previously failed when all columns were dropped.
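For example (the table name is a stand-in), the new column appears alongside the existing SHOW COLUMNS metadata:

    -- lists each column with its type, index details and, for symbol
    -- columns, the new symbolTableSize value
    SHOW COLUMNS FROM trades;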
Bug Fixes
This fix resolves an issue where replica tables could be suspended due to interaction between WAL downloader and WAL apply processes. The problem occurred when the Fd Cache cached sequencer file descriptors while the downloader used file renames to move files into correct directories, causing structure change metadata to become unavailable during WAL application.
This fix corrects the handling of double column data when ingested into symbol columns through Influx Line Protocol, ensuring proper symbol value conversion and storage.
This fix prevents automatic connection closure when cast errors occur during Influx Line Protocol processing, such as when sending a float value to a string column. When line.tcp.disconnect.on.error is set to false, the connection remains open and errors are no longer logged at critical level.
This fix resolves compilation errors that occurred when casting symbol columns to symbol type within GROUP BY queries. The fix also removes string sinks from IPv4Function, as they were only used in UPDATE statements, which now generate wrapper cast functions like other fixed-size types.
This fix adds proper validation to prevent multiple DECLARE statements from being used within a single query block, which was previously causing parsing issues.
This fix corrects null-handling issues in RecordToRowCopier utilities where null values were not properly converted during implicit casting. Previously, a null LONG column would incorrectly retain the minimum long value instead of becoming Double.NaN when cast to DOUBLE.
This fix ensures consistent permission checking across all EXPLAIN commands by adding compile-time authorization for CREATE TABLE and CREATE MATERIALIZED VIEW statements. It also provides proper error messages for the unsupported EXPLAIN DROP, EXPLAIN ALTER, and EXPLAIN REFRESH commands.
This fix resolves overflow issues that could occur during materialized view refresh operations, ensuring data integrity and preventing calculation errors.
This fix resolves a crash that occurred when using bind variables with the LEAD() window function, ensuring stable query execution with parameterized queries.
This fix ensures that the actual write failure error is reported instead of a misleading table locked error, providing clearer diagnostics for write operations.
This fix resolves a rare condition where dropping an index could cause table operations to become suspended, ensuring continued table availability during index management operations.
This fix resolves an assertion error that could occur when executing SQL queries with symbol index operations after table metadata changes, particularly affecting GROUP BY queries with symbol filters.
This fix resolves assertion exceptions that occur in the writer when encountering operating system errors during write operations.
This fix resolves a file descriptor leak that occurred when the ff.length(keyFd) call throws an exception during index writer initialization. The issue was caused by the kFdUnassigned flag being set to false before the potentially failing operation, preventing proper cleanup of the file descriptor.
This fix resolves a null pointer exception that could occur during column purge operations.
QuestDB 9.1.0 is a feature-rich release bringing nanosecond timestamp data type, performance and stability improvements.
Breaking Changes
This change replaces the binary search algorithm for serving IN (longlist...) predicates with a hashset implementation that scales better as list size increases. A new configuration setting, cairo.sql.jit.max.in.list.size.threshold, specifies the maximum size of IN lists accepted by the JIT compiler. When this limit is exceeded, the JIT compiler aborts and a Java filter is compiled instead, as JIT compilation unrolls IN lists into equality comparisons chained by logical OR operators, which degrades performance for large lists. This change also fixes an existing bug where empty lists could cause incorrect behavior.
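An illustrative sketch (the table, column, and threshold value are hypothetical): a long literal list is now served by the hashset, and lists larger than the configured threshold skip JIT compilation in favour of a Java filter:

    -- server.conf, illustrative value:
    --   cairo.sql.jit.max.in.list.size.threshold=100
    SELECT * FROM trades
    WHERE order_id IN (101, 102, 103, 104, 105, 106, 107, 108);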
New Features
This feature adds new window functions: first_value(timestamp), first_value(long), last_value(timestamp), last_value(long), max(timestamp), max(long), min(timestamp), and min(long).
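A short sketch of one of the new overloads in a window context (the trades table and its columns are hypothetical):

    -- running maximum volume per symbol, ordered by time
    SELECT ts, sym, max(volume) OVER (PARTITION BY sym ORDER BY ts) AS running_max
    FROM trades;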
This feature adds a new symbolTableSize column to the SHOW COLUMNS output that contains the number of symbol dictionary codes present in the symbol table. This improvement also fixes table reader initialization in cases where tables have zero columns after all columns are dropped.
This feature introduces comprehensive nanosecond timestamp support across QuestDB, including core arithmetic operations, parser/formatter functionality, SAMPLE BY operations, ASOF joins, materialized views, O3 ingestion, partition operations, Parquet export, CSV import, and Influx Line Protocol clients. The TIMESTAMP data type now carries resolution information similar to GEOHASH data types, with partition operations accompanied by timestampType metadata.
This feature integrates async-profiler 4.1 with two profiling modes: continuous profiling that starts with the agent, and on-demand profiling that attaches to running instances. The implementation supports a configurable loop duration (default 30 minutes), JFR-format output with timestamped files, custom profiling parameters, and automatic profile directory creation. It is currently available for Linux x86-64 distributions, with the libasyncProfiler.so library and the asprof/jfrconv binaries included in the lib/ directory.
This feature leverages the symbol index on the right-hand side of ASOF JOIN operations, avoiding the need to visit rows that don't have the matching symbol. This optimization is particularly beneficial for markout analysis scenarios where you need to look up matching currency rates or other symbol-based data across large time ranges. Instead of scanning backwards through all rows until finding a match, the feature performs only one symbol index lookup per partition, dramatically improving performance for queries involving illiquid assets or sparse data.
This feature introduces Right Outer Join and Full Outer Join functionality, expanding the range of supported join operations. For HashLightInnerJoin and HashLightFullOuterJoin, the implementation now supports limited runtime reordering of the left and right cursors, allowing joins to dynamically choose a more efficient build/probe table combination when the right table is significantly larger than the left.
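A minimal sketch of the new join types (the orders and fills tables are hypothetical):

    -- keep every order and every fill, matched where possible
    SELECT o.order_id, o.ts, f.fill_price
    FROM orders o
    FULL OUTER JOIN fills f ON o.order_id = f.order_id;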
This feature enables the Java Influx Line Protocol client to handle multiple possible endpoints. On first connection, the client polls /settings to determine whether the instance is read-write or read-only. The sender automatically retries HTTP sends on 404 or 421 responses if the instance status changes after initialization.
This feature returns the most frequent column value for each group, eliminating the need for complex subqueries involving COUNT() operations and sorting. The function is useful for identifying patterns like the most frequently traded symbol by a trader or the most common state for a sensor. The implementation uses a map-of-maps structure where outer maps store pointers to inner maps that aggregate per-key counts, with parallel execution support for improved performance.
This feature deprecates the useFastAsOfJoin configuration property and introduces positive algorithm-selection hints for the ASOF and LT JOIN variants. The asof_linear_search hint selects the original algorithm without binary search or index scan optimizations, while asof_index_search requests the algorithm that leverages symbol indexes. The existing avoid_asof_binary_search and avoid_lt_binary_search hints are deprecated but still honored.
This feature allows the server to accept gzip-encoded Influx Line Protocol data, addressing bandwidth inefficiencies in the text-based protocol. While compression introduces CPU overhead that reduces raw throughput, it significantly reduces bandwidth usage, which can raise the maximum throughput ceiling when receiving text-heavy data from multiple sources. The feature is particularly beneficial for scenarios involving logs and documents, where the client-side encoding overhead is masked by improved network efficiency.
This feature improves PostgreSQL Wire Protocol query performance when using IN() predicates on multiple indexed columns. The optimizer now considers column cardinality when selecting which indexed column to scan, even in the presence of binding variables, choosing the column with higher cardinality for better selectivity. This optimization applies when the new column is different from the current key column, avoiding unsafe key merging operations.
This function generates a Zipf/power-law distribution of symbols from a provided list, where the probability of each symbol decays from left to right at a rate set by the 'alpha' parameter. It is accompanied by the rnd_symbol_weighted() function for more verbose weighted symbol generation with explicit probability values for each symbol.
This feature enables searching for column names within the schema filter, automatically expanding tables that contain matching columns. The implementation optimizes performance by removing redundant column requests on schema reload, fetching symbol column details on demand using SHOW COLUMNS FROM, and includes a 10-second focus listener timeout for schema refresh.
This feature provides visual feedback when executing multiple queries simultaneously, showing loading indicators and notifications for each individual query's progress and completion status.
Improvements
This improvement reintroduces a dedicated thread pool for materialized view refresh operations to prevent resource contention with WAL apply jobs. Heavy materialized view workloads can no longer monopolize the shared writer pool, eliminating potential delays or starvation of Apply WAL jobs and preventing database write performance degradation during intensive MV operations. When the dedicated MV pool is disabled or has 0 workers assigned, the MV refresh job will not run.
This improvement optimizes the AVG(short) function by using a more appropriate accumulator type instead of long128, which was overkill for the value range. The Java implementation now uses long instead of double as the accumulator, eliminating short-to-double conversion overhead and enabling faster addition instructions. It additionally reduces the memory footprint of SUM() aggregate functions and adds SumShortGroupByFunction for more efficient processing of short and byte values without null checks.
This improvement rewrites trivial expressions that operate on the same column within GROUP BY queries for better performance. The optimization also includes integer value support for CSV import of IPv4 columns and adds the yyyy-mm-dd date format to the default text loader formats. Performance testing shows significant improvements, with ClickBench Q35 hot execution times improving from 724ms to 388ms.
Bug Fixes
This fix resolves a memory and file descriptor leak that occurred when a partition removal was followed by a schema change. The issue happened because stale bitmap indexes remained in the bitmapIndexes list beyond the valid range after partition deletion. When a table schema change triggered createNewColumnList(), it created a new bitmap index list but only copied indexes within the valid range, leaving the stale indexes orphaned and their file descriptors never closed.
This fix corrects an issue where tables suspended due to corrupt WAL files showed outdated writerTxn numbers in wal_tables queries. The problem occurred during exception handling, when the Seq Table Tracker was incorrectly updated to the table txn number instead of the seqTxn number.
This fix resolves timeout errors when opening a TableReader after dropping and recreating non-WAL tables. The issue occurred when purge jobs pushed the max txn using TxnScoreboardPoolV2.isRangeAvailable(), preventing the reader from locking the latest transaction in the scoreboard. The fix makes TxnScoreboardPoolV2.isRangeAvailable() a read-only operation that doesn't modify the max value.
This improvement enhances the reliability of the HTTP Influx Line Protocol sender by adding automatic retry logic during initial connection setup when fetching the server configuration. Retries are attempted for up to the configured maxRetriesNanos duration and only occur on server errors (500, 503, 504, 507, 509, 523, 524, 529, 599) and network failures. Success is now defined as any 2xx status code instead of only 200 for the settings fetch.
This fix resolves spurious memory mapping errors that occurred during stress testing of WAL purge operations, particularly when generating many WAL segments while running WalPurgeJob frequently. The issue was discovered through fuzz testing and could manifest after several minutes of operation on Windows systems.
This fix removes the implicit contract that failure to extend memory will close the memory, preventing table readers from being left in an inconsistent state during error conditions.
This fix resolves incorrect symbol value conversion when using double column data to ingest into symbol columns through Influx Line Protocol. The fix also enhances exception messages for invalid UTF-8 to UTF-16 conversions.
This fix resolves a race condition that created misleading "cannot lock table" errors when the actual issue was a failure in the ApplyWal job, such as missing WAL metadata changes. The fix prevents the creation of duplicate TableWriter objects in WriterPool that resulted from these race conditions.
This fix resolves compilation errors when casting symbol columns to symbol type within GROUP BY queries. Additionally, string sinks were removed from IPv4Function, as they were only used in UPDATE operations; UPDATE statements will now generate wrapper cast functions similar to other fixed-size types.
This fix addresses a critical issue where column rename failures in TableWriter could leave metadata files in an inconsistent state. The solution reorders operations so commits happen before _todo file cleanup, adds table transaction information to the _todo file for validation, and ensures TableWriter is marked as distressed if any rename step fails. The fix also modifies TableWriter initialization to read transaction information before processing _todo files.
This fix prevents multiple DECLARE statements from being executed within a single query block, ensuring proper SQL syntax validation and preventing potential parsing conflicts.
This fix resolves two nanosecond-related issues: incorrect nanosecond precision handling when calculating TOLERANCE values for ASOF joins, and improper automatic timestamp type selection during CSV imports when timestamp columns are explicitly typed.
This fix prevents the Influx Line Protocol TCP connection and writer from being closed on cast errors when line.tcp.disconnect.on.error is set to false. Previously, protocol errors like attempting to send a float value to a string column would always close the connection regardless of the configuration setting. The error logging level has also been reduced from critical to a more appropriate level.
This fix resolves an assertion error that could occur in SQL queries after metadata changes, particularly in GROUP BY operations involving symbol index lookups. The error was discovered through fuzz testing and manifested as an AssertionError in TableReader.getBitmapIndexReader during query execution with symbol filtering operations.
This fix ensures that all EXPLAIN commands consistently check user permissions before execution. Previously, EXPLAIN CREATE TABLE and EXPLAIN CREATE MATERIALIZED VIEW did not perform authorization checks, while other EXPLAIN commands did. This improvement also adds proper error messages for the EXPLAIN DROP, EXPLAIN ALTER, and EXPLAIN REFRESH commands.
This fix resolves multiple issues with bind variables in function calls. Previously, bind variables were defined as typeless arrays requiring explicit casting, unsupported types would default to STRING or cause confusing errors, and functions with array arguments wouldn't define types properly for PostgreSQL drivers. The server now correctly rejects incompatible and unsupported types with clear error messages, and the function parser treats UNDEFINED types so as to match the best function signature, enabling proper bind variable usage with drivers like tokio-postgres.
This fix resolves a file descriptor leak that occurred when the ff.length(keyFd) call throws an exception during index writer initialization. The problem was that the kFdUnassigned flag was being set to false before the potentially throwing call, causing the keyFd to leak in error scenarios.
When importing CSV files in QuestDB Enterprise, the Web Console queries user group memberships with SHOW GROUPS username. This fix properly quotes the username parameter to handle spaces and dot characters that would otherwise break the SQL query.
This release focuses on stability improvements with fixes for rare replication issues, database shutdown crashes, and OSS synchronization. It includes new window functions and symbol table visibility features from the OSS codebase.
New Features
This feature adds new window functions: first_value(timestamp), first_value(long), last_value(timestamp), last_value(long), max(timestamp), max(long), min(timestamp), and min(long).
This feature adds a new symbolTableSize column to the SHOW COLUMNS output, which contains the number of symbol dictionary codes present in the symbol table. This improvement also fixes table reader initialization in the case of zero-column tables.
Improvements
This improvement reintroduces a dedicated thread pool for materialized view refresh operations to prevent resource contention with WAL apply jobs. Heavy materialized view workloads can no longer monopolize the shared writer pool, preventing delays or starvation of Apply WAL jobs and maintaining database write performance during intensive MV operations.
This improvement changes the algorithm for serving the IN (longlist...) predicate to use a hashset instead of a binary search, which scales better as list size increases. A new configuration setting, cairo.sql.jit.max.in.list.size.threshold, specifies the maximum size of IN list accepted by the JIT compiler. If this limit is exceeded, the JIT compiler aborts and a Java filter is compiled instead, since JIT compilation unrolls IN lists into equality comparisons chained by logical OR operators, which degrades performance for large lists. This improvement also fixes a bug where an empty list would incorrectly attempt to read a value.
Bug Fixes
This fix addresses a race condition where the ApplyWal2TableJob could fail and suspend when a segment directory was temporarily missing during the move operation from the downloaded to the target location. The issue occurred because the directory move was implemented as a two-step process that left a gap where no segment directory existed. This fix reverts to moving files individually when the directory already exists, while maintaining performance optimizations for fresh new segments. Additionally, this fix resolves a race condition in file lock acquisition on Linux and MacOS, where the lock file could be deleted after opening but before locking.
This fix prevents a fatal error that could occur during database shutdown when notifications were sent to the WalUploader after it had already been destroyed. The crash would manifest as a Java Runtime Environment fatal error during the shutdown sequence.
This fix ensures that group memberships and service account assignments are properly cleaned up when a user, group, or service account is dropped. Previously, these linked objects would remain orphaned after the parent entity was removed. This change also includes improved logging of user information messages from OAuth2 providers when required claims cannot be found.
This fix addresses a memory and file descriptor leak that could occur when a partition removal was followed by a schema change. The issue happened because stale bitmap indexes were left in the bitmapIndexes list beyond the valid range after partition deletion. When createNewColumnList() was triggered by a schema change, these stale indexes were not included in the new list, causing them to become orphaned and never properly closed.
After a table was suspended due to a corrupt WAL, the wal_tables query displayed an outdated writerTxn number. This fix ensures that during exception handling, the Seq Table Tracker is updated to the table txn number rather than the seqTxn.
When a non-WAL table was dropped and re-created, purge jobs could push the max txn using TxnScoreboardPoolV2.isRangeAvailable(), leading to a timeout when opening a TableReader, which could not lock the latest transaction in the scoreboard. This fix makes TxnScoreboardPoolV2.isRangeAvailable() a read-only scoreboard operation that does not modify the max value.
This fix resolves memory mapping errors that occurred during stress testing with frequent WAL segment generation and purge job execution, particularly affecting Windows environments during extended test runs.
This fix removes the implicit contract that failure to extend memory will close the memory, preventing table readers from being left in an inconsistent state during error conditions.
This release is a maintenance release that fixes lock acquisition race conditions in replication and includes compatibility updates with the OSS codebase, including disabling FD cache by default in Enterprise.
Bug Fixes
This fix establishes the correct order for creating segment paths and acquiring locks to prevent race conditions. The process now pre-creates the table directory before acquiring any locks, then creates and acquires the WAL lock while creating the WAL path if needed, followed by acquiring the segment lock and creating the segment path, and finally moving segment files into the WAL/segment directory while both locks are held.
QuestDB 9.0.3 is a stability and compatibility release that resolves critical SQL issues and upgrades the web console. The web console brings a unified search across all tabs.
New Features
This feature introduces a search mechanism that works across multiple tabs, including both active and archived ones, with search results displayed in a dedicated Search Panel on the left. The feature adds support for temporary buffers used when previewing search results from closed tabs and includes a new shortcut
Ctrl/Cmd + Shift + F to open the search panel.
Bug Fixes
This fix reworks the state management of the metrics component, ensuring that all updates go through the updateBuffer call with no redundant local state, no unnecessary rerenders, and no mutations on context values. The metrics component was getting the buffers object from Dexie and setting local state without properly syncing with storage, causing infinite loops where effects would set local state because the context value changed, then run again to set the context value once more.
This fix resolves compatibility issues where upgrading to database version 4 prevented opening the Web Console with older QuestDB versions. The improvement updates Dexie to 4.2.0, which eliminates the compatibility issues, and repositioning/reordering now uses bulk updates in the database to eliminate race conditions, while temporary tab tracking is done in component state.
This release includes performance improvements for replication, with optimizations allowing replicas to skip unnecessary downloads and improved downloader performance. It also includes various bug fixes and OSS synchronization.
Improvements
This improvement introduces an important performance optimization for QuestDB Enterprise replication. Previously, when a replica lagged behind, it would download all missing sequencer
_txnlog parts at each iteration, repeatedly downloading the same parts if the replica was behind by several thousand transactions. This optimization tracks downloaded transactions separately, allowing the replica's downloader to skip parts it already has. While this provides minor benefits for replicas with sufficient resources, it significantly reduces the work for replicas that have fallen far behind, such as primaries recreated from backups, replicas that were offline, or underspecced replica boxes struggling with spikes.
This improvement introduces various performance enhancements that decrease the disk IO caused by replication, reduce unnecessary downloads, and improve concurrency by performing downloads in parallel. The changes include configuring the sleep time after replication errors to use the poll interval instead of a hardcoded 200ms value, and fixing summary logic to avoid using filtered indexes. These optimizations reduce the time required for a replica to catch up when it has fallen behind.
Bug Fixes
This fix addresses lingering table metrics that would consume space in DataDog or other scraping middleware after tables were dropped. Once a table is dropped on the primary and replica, the prometheus metrics for that table are now cleared after a configurable number of requests. Two new configuration options are introduced:
replication.metrics.dropped.table.poll.count=10 removes dropped table metrics after they have been scraped N times, and replication.metrics.per.table=true allows completely disabling per-table replication metrics.
9.0.2 is a performance and stability release. It accelerates common query patterns, reduces resource usage, and adds quality-of-life features across SQL, the PostgreSQL wire protocol (PGWire), ingestion, and operations.
New Features
This feature allows negative array indices to select elements from the back of arrays, where -1 represents the last element. For example, arr[-1] returns the last element, arr[-2] returns the second-to-last element, and range operations like arr[1:-1] and arr[-2:5] are also supported.
These new SQL functions calculate the minimum, maximum, and standard deviation (both sample and population) of double arrays, supporting both flat and multi-dimensional arrays. This feature includes efficient recursive and flat-array handling with improved processing of special values like NaN and Infinity to ensure correct results in array aggregation operations.
This feature extends the TRUNCATE TABLE syntax to support the full format: TRUNCATE TABLE [IF EXISTS] table1[,table2...] [KEEP SYMBOL TABLES]. The IF EXISTS clause prevents errors when attempting to truncate non-existent tables, and multiple tables can be truncated in a single statement.
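For example, following the grammar above (trades and quotes are stand-in table names):

    -- empty both tables, keep their symbol dictionaries, and do not
    -- fail if either table is missing
    TRUNCATE TABLE IF EXISTS trades, quotes KEEP SYMBOL TABLES;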
This feature introduces FdCache and MmapCache classes to cache file descriptor and memory map operations, reducing redundant system calls and improving resource reuse. The implementation includes new caching layers and updates native file operations to use overlapped I/O on Windows.
This feature adds reshape() methods to enable runtime dimension changes, with variants for 1D, 2D, and 3D arrays, plus a clear() method to reset the append position for error recovery after sender.cancelRow() operations. Arrays can now be resized at runtime, and single array instances can be reused across different shapes.
This feature enables tables with array columns to be converted to and from Apache Parquet format using ALTER TABLE CONVERT PARTITION TO PARQUET (see the sketch after this list). Arrays are exported as lists of double values by default, following Parquet list requirements. An alternative native binary format is available via the cairo.partition.encoder.parquet.raw.array.encoding.enabled=true configuration property.
This feature enforces case-insensitive uniqueness validation for column names in the table schema dialog, preventing duplicate columns that differ only by case, such as 'Age' and 'age'. Previously, the Web Console incorrectly allowed such duplicate column names to be created.
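As a hedged sketch of the partition conversion mentioned above (the table name is hypothetical, and the WHERE form for selecting partitions is an assumption; only the ALTER TABLE ... CONVERT PARTITION TO PARQUET statement itself is named in the item):

    -- convert older partitions of a table with array columns to Parquet
    ALTER TABLE trades CONVERT PARTITION TO PARQUET
    WHERE timestamp < '2025-01-01';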
Improvements
This improvement adds a fast path to double-quote replacement in VarcharAdapter. Performance testing with the ClickHouse hits dataset showed parsing and WAL writes improving from 480 seconds to 410 seconds when data was sent to the POST /imp endpoint.
This improvement changes QuestDB's default threading strategy from a single shared pool to three specialized thread pools. The Network thread pool handles network IO for the HTTP, PostgreSQL Wire Protocol, and Influx Line Protocol servers. The Query thread pool manages parallel queries, including parallel filters and GROUP BY operations. The Writer thread pool runs WAL Apply jobs, Table Writer jobs, materialized view refresh, and various purge and housekeeping jobs.
This improvement replaces eager size calculation with lazy evaluation, using partition metadata to avoid unnecessarily opening partitions and applying approximate timestamp filtering before expensive partition reads. For hash joins, size computation is deferred until explicitly requested via calculateSize(), eliminating upfront hash map traversal. This optimization particularly benefits LIMIT queries on large datasets, time-filtered queries on partitioned tables, and Parquet partitions, where metadata access is cheaper than full partition access.
This improvement enables parallel execution of top-K queries by computing the top K records on each worker thread, then merging the results into a single tree chain (see the sketch below). The feature can be disabled using the cairo.sql.parallel.topk.enabled=false configuration property. This change also fixes query cache clean-up on HTTP and PostgreSQL Wire Protocol connection close, improving cache efficiency for short-lived connections.
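A top-K query is simply an ORDER BY combined with a LIMIT. A minimal sketch of a query shape that benefits from the parallel path, using a hypothetical trades table:
-- Each worker thread computes its own top 10, and the partial results are merged.
SELECT symbol, price FROM trades ORDER BY price DESC LIMIT 10;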
Bug Fixes
This fix resolves a race condition where SqlParser instances shared a single stateful visitor used to replace variables. Under high query load, state from one SQL query could bleed into an unrelated query, causing spurious 'undeclared variable' errors.
This fix removes the minimal refresh interval defined by the cairo.mat.view.min.refresh.interval property. The previous logic led to overly large refresh intervals being used for base table commits spanning small time ranges, such as single-timestamp commits. For example, with SAMPLE BY 1s, even a single-timestamp commit would result in a 100-second refresh interval.
This fix resolves incorrect alignment of fill values in SAMPLE BY queries when specific columns are projected in the SELECT clause.
This fix prevents tables from becoming suspended when a replace commit operation fails on a column with a column top (i.e., a column added after the table already contained data).
This fix resolves race conditions in window function factories that were incorrectly designed as stateful, storing execution context as instance fields. Since QuestDB uses shared instances of function factories, concurrent SQL compilation would overwrite parameters, causing queries to fail or produce incorrect results. The fix makes window function factories stateless by removing all mutable state.
This fix resolves a bug in YearTimestampSampler where rounding calculations were incorrect when the rounded year was a leap year but the input year was not. The issue affected materialized view refresh operations, where the low bound of the range replace could end up greater than the minimal timestamp of the transaction.
This fix changes how TableUpdateDetails determines whether a table is a WAL table, preventing exceptions when writing to non-WAL tables through Influx Line Protocol over TCP.
This fix implements transaction-aware initialization and reuse of bitmap index readers for improved stability under concurrent access. It also includes enhanced null-value handling in forward readers and more consistent partition lifecycle handling and resource cleanup.
This fix addresses a rare memory leak that occurred when queries executed while the same table was being altered concurrently. The issue arose during internal query retries, when the SQL compiler attempted to create a filter during query compilation. If filter creation failed due to a concurrent schema modification, the factory was not freed before the retry, so memory from failed attempts was never released. Queries still completed successfully through automatic retries, but this edge case affected systems running concurrent schema modifications and queries on the same table.
This fix resolves a file descriptor leak that occurred when a table was dropped while WAL (Write-Ahead Log) files were being purged concurrently with TableWriter operations. The leak happened due to double-closing of file descriptors when exceptions occurred during WAL file operations. When TableWriter attempted to open WAL files that were being deleted by the WAL Purge job, it would get an exception and close the open WAL files, then attempt to close them again when the TableWriter object was closed, leading to a file descriptor leak.
This major release introduces Google Cloud Storage replication support, OIDC improvements for faster connection acceptance, and full synchronization with QuestDB 9.0 OSS features including multidimensional arrays and enhanced materialized views.
Breaking Changes
If the database is already set up to use the OAuth2 provider's configuration URL (acl.oidc.configuration.url is set in the config), no action is needed. If OIDC is set up with the old-style configuration (separate host, port, and endpoint settings), the acl.oidc.public.keys.endpoint property should be set in addition to the existing OIDC config. The public keys endpoint is specific to each OIDC provider but publicly available. Examples:
Okta and JumpCloud: acl.oidc.public.keys.endpoint=/.well-known/jwks.json
Entra ID: acl.oidc.public.keys.endpoint=/{tenant id}/discovery/v2.0/keys
PingFederate (the QuestDB default, so there is no need to add this line to the config): acl.oidc.public.keys.endpoint=/pf/JWKS
Release 9.0.1 brings notable enhancements to SYMBOL indexes, along with several bug fixes. Materialized Views now support indexed symbols, improving performance for workloads that rely on large views with numerous symbols. We've significantly optimized index performance under high concurrency, delivering 10x–100x speedup for index-intensive operations.
New Features
This feature introduces a unique data identifier to distinguish separate clusters and prevent conflicts during replication and restore operations. The identifier is accessible through the CURRENT_DATA_ID() SQL function and consists of a 128-bit UUID stored in the database root as .data_id. This enables future conflict detection during restore operations involving replicas and backups.
This feature enables adding and dropping indexes on symbol columns within materialized views using ALTER MATERIALIZED VIEW syntax. SQL automatically calculates the correct index capacity based on the current data in the view, though the capacity can still be specified manually. Example usage: ALTER MATERIALIZED VIEW view_1 ALTER COLUMN sym ADD INDEX;
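As a follow-up sketch, querying the new data identifier and dropping the index again — the DROP variant here assumes it mirrors the ADD INDEX syntax shown above:
SELECT CURRENT_DATA_ID(); -- returns the cluster's 128-bit data identifier
ALTER MATERIALIZED VIEW view_1 ALTER COLUMN sym DROP INDEX;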
Improvements
This improvement significantly enhances index lookup performance for partitions that do not change from query to query, replacing unnecessary unmap/remap operations with version checking. The optimization applies fully when partitions are unchanged or only appended to between queries, and it also improves the performance of index lookups during out-of-order (O3) operations by reducing map/unmap calls.
Bug Fixes
This fix resolves an internal error that occurred when comparing NULL arrays of the same type, such as (null::double[] = null::double[]) in SQL queries.
This fix provides clearer, actionable error messages when attempting to create materialized views from SELECT statements that contain invalid column aliases.
This fix resolves an issue where DISTINCT queries using table alias expressions would incorrectly throw an 'Invalid table name or alias' error.
This fix ensures that column aliases keep their specified casing in query results. Previously, aliases would revert to the original column name's casing regardless of how they were specified in the query.
This fix addresses an edge case in the unquoting logic that did not properly handle escaped quotes in symbol filters during JIT compilation.