This solution implements an adaptive query optimization system using DuckDB that automatically analyzes incoming queries and builds matching summary tables at runtime. Through this analysis and dynamic materialization, the system achieves sub-100ms query performance.
Demo at: https://youtu.be/XbhGLUYNuz8
| Query | Baseline (s) | Optimized (s) | Speedup |
|---|---|---|---|
| Q1 | 21.363 | 0.001 | 21,363x |
| Q2 | 15.938 | 0.006 | 2,656x |
| Q3 | 15.759 | 0.005 | 3,152x |
| Q4 | 17.185 | 0.006 | 2,864x |
| Q5 | 30.802 | 0.002 | 15,401x |
| Total | 101.047 | 0.032 | 3,158x |
Execution time, baseline vs. optimized (1 char ≈ 1s):

```
Q1:    █████████████████████ 21.363s
       ▏ 0.001s
Q2:    ████████████████ 15.938s
       ▏ 0.006s
Q3:    ████████████████ 15.759s
       ▏ 0.005s
Q4:    █████████████████ 17.185s
       ▏ 0.006s
Q5:    ███████████████████████████████ 30.802s
       ▏ 0.002s
Total: ████████████████████████████████████████████████████████████████████████████████████████████████████ 101.047s
       ▏ 0.032s
```

Speedup per query:

```
Q1:    21,363x ████████████████████████████████████████████████████████████████████████████████████████████████████
Q2:     2,656x ████████████
Q3:     3,152x ███████████████
Q4:     2,864x █████████████
Q5:    15,401x ████████████████████████████████████████████████████████████████████████
Total:  3,158x ███████████████
```
The optimized implementation achieves a 3,158x overall speedup, reducing total execution time from 101.047 seconds to just 0.032 seconds. The most dramatic improvement is Q1, with a 21,363x speedup.
The system operates in three distinct phases to achieve optimal query performance:
Phase 1: Query Analysis & Merging

The adaptive optimizer analyzes all incoming queries to understand their structure and intelligently merges compatible queries:
Query Structure Analysis:
- Extracts GROUP BY columns to determine aggregation granularity
- Identifies aggregation functions (SUM, AVG, COUNT, etc.)
- Distinguishes between constant filters (equality checks) and variable filters (ranges, BETWEEN)
- Detects high-cardinality columns (`user_id`, `auction_id`, `minute`) that could create large summary tables
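For illustration, here is a minimal sketch of this structure extraction, assuming the dictionary query format shown in the results section below; the helper and field names are hypothetical, not the actual implementation:

```python
# Hypothetical helper (not the actual implementation): extract the basic
# structure of one query, using the dict query format shown in the
# results section below.
HIGH_CARDINALITY_COLUMNS = {"user_id", "auction_id", "minute"}

def extract_structure(query: dict) -> dict:
    group_by = query.get("group_by", [])
    return {
        "group_by": group_by,
        # Aggregations appear as {'SUM': 'bid_price'}-style dicts in 'select'
        "aggregations": [s for s in query.get("select", []) if isinstance(s, dict)],
        "filters": query.get("where", []),
        # e.g. GROUP BY minute can yield a summary with 500K+ rows
        "high_cardinality": any(c in HIGH_CARDINALITY_COLUMNS for c in group_by),
    }
```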
Intelligent Filter Handling:
- Equality Filters on Non-GROUP BY Columns: Pre-applied during summary table creation to reduce data size
- Non-Equality Filters (BETWEEN, <, >): Column is added to the summary table's GROUP BY so filtering can happen at query time
- Filters on GROUP BY Columns: Never pre-applied; always filtered at query time for flexibility
Smart Merging Strategy:
- Queries with identical GROUP BY columns and constant filters are merged into a single summary table
- The merged table includes all aggregations from all compatible queries
- This prevents redundant summary tables and reduces storage overhead
- Example: If Q1 needs `SUM(bid_price)` and Q2 needs `AVG(bid_price)` with the same GROUP BY, one table is created with both aggregations
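A simplified sketch of this merging step, as described above; the spec fields (`constant_filters`, `filter_dimensions`, etc.) are assumed to come from the analysis step, and all names are illustrative:

```python
# Illustrative sketch: merge specs that share a signature of
# (GROUP BY columns, constant filters) into one summary-table spec.
from collections import defaultdict

def merge_specs(specs: list) -> list:
    groups = defaultdict(list)
    for spec in specs:
        signature = (
            tuple(sorted(spec["group_by"])),
            tuple(sorted((f["col"], str(f["val"])) for f in spec["constant_filters"])),
        )
        groups[signature].append(spec)

    merged = []
    for compatible in groups.values():
        base = compatible[0]
        for other in compatible[1:]:
            # Union of aggregations: one table serves SUM, AVG, COUNT, ...
            for agg in other["aggregations"]:
                if agg not in base["aggregations"]:
                    base["aggregations"].append(agg)
            # Union of range-filter columns: kept as extra GROUP BY
            # dimensions so each query can still filter at query time
            for col in other["filter_dimensions"]:
                if col not in base["filter_dimensions"]:
                    base["filter_dimensions"].append(col)
        merged.append(base)
    return merged
```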
Cardinality Optimization:
- Detects queries that would create extremely large summary tables (e.g., GROUP BY minute without time filters)
- Skips summary table creation for high-cardinality queries without selective filters
- These queries fall back to the sorted main table, which is still fast due to zonemap pruning
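The skip rule itself is simple; a sketch under the same assumed spec fields:

```python
# Sketch: build a summary only if it will actually be smaller than what
# zonemap pruning already gives us on the sorted main table.
def should_build_summary(spec: dict) -> bool:
    # e.g. GROUP BY minute with no equality filter => 500K+ row summary
    if spec["high_cardinality"] and not spec["constant_filters"]:
        return False
    return True
```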
Phase 2: Database Construction

Based on the analysis, the system builds an optimized database:
- Main Table: Loads data from Parquet files and physically sorts it by common filter columns (`day`, `type`, `country`, `publisher_id`) to enable DuckDB's zonemap pruning
- Summary Tables: Creates pre-aggregated tables for each unique query pattern (after merging), applying constant filters during materialization
- DISTINCT Tables: For non-aggregated SELECT queries, creates DISTINCT summary tables to eliminate duplicates and reduce data size
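A condensed sketch of this build phase using the `duckdb` Python API; the database path and the inlined summary statement are illustrative, since in the real system the statements come from the analyzer:

```python
import duckdb

con = duckdb.connect("optimized.db")  # illustrative path

# Main table: load Parquet and physically sort by the common filter columns
# so DuckDB's zonemaps (per-row-group min/max metadata) can prune scans.
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('../data_parquet/*.parquet')
    ORDER BY day, type, country, publisher_id
""")

# Summary tables: one CREATE TABLE per merged spec, with constant filters
# pre-applied. One example statement inlined here for illustration.
summary_table_statements = [
    """
    CREATE TABLE summary_day AS
    SELECT day, SUM(bid_price) AS sum_bid_price, COUNT(*) AS row_count
    FROM events
    WHERE type = 'impression'   -- pre-applied constant filter
    GROUP BY day
    """,
]
for sql in summary_table_statements:
    con.execute(sql)
```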
Phase 3: Query Routing

The adaptive router intelligently routes queries to the optimal data structure:
- Matches incoming queries against available summary tables using structural comparison
- Rewrites SQL to use pre-aggregated data when a match is found
- Two-Stage Filtering: Applies remaining filters (ranges, additional conditions) at query time, then re-aggregates if needed
- Falls back to the sorted main table for queries without matching summaries
- Maintains 100% routing accuracy through precise query pattern matching
Query Rewriting Strategy:
For queries using summary tables, the router:

- Selects from the pre-aggregated summary table
- Applies any non-constant filters (BETWEEN, ranges, etc.)
- Re-aggregates to the query's original GROUP BY level if the summary has additional dimensions
- Applies ORDER BY and LIMIT clauses
This approach eliminates the need for manual query optimization while maintaining 100% routing accuracy and sub-100ms query performance.
The core analysis engine that processes queries and generates summary table specifications:
Main Methods:
- `analyze_queries()`: Two-pass algorithm that first collects query specs, then merges compatible ones
- `_create_summary_spec()`: Analyzes individual queries to determine optimal summary table structure
- `_merge_summary_specs()`: Groups specs by signature (GROUP BY + filters) and merges compatible queries
- `generate_summary_table_sql()`: Generates CREATE TABLE SQL with pre-applied constant filters
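Putting these together, the analyzer's control flow looks roughly like this; the class name and signatures are inferred from the descriptions above, not copied from the source:

```python
class QueryAnalyzer:
    """Rough skeleton inferred from the method descriptions above."""

    def analyze_queries(self, queries: list) -> list:
        # Pass 1: one candidate spec per query
        specs = [self._create_summary_spec(q) for q in queries]
        # Pass 2: merge specs with matching signatures
        return self._merge_summary_specs(specs)

    def _create_summary_spec(self, query: dict) -> dict:
        ...  # classify filters, collect aggregations and dimensions

    def _merge_summary_specs(self, specs: list) -> list:
        ...  # group by (GROUP BY, constant filters), union aggregations

    def generate_summary_table_sql(self, spec: dict) -> str:
        ...  # CREATE TABLE ... with constant filters pre-applied
```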
Key Algorithm - Filter Classification:

```
For each WHERE condition:
    If column is in the query's GROUP BY:
        → Skip (will filter at query time)
    Else if operator is equality (=):
        → Pre-apply in summary table (constant_filters)
    Else (BETWEEN, <, >, etc.):
        → Add column to summary GROUP BY (filter_dimensions)
```
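Transcribed into Python, under the same illustrative spec fields used in the sketches above:

```python
# Direct transcription of the classification rules above (illustrative names).
def classify_filters(query: dict, spec: dict) -> None:
    for cond in query.get("where", []):
        if cond["col"] in query.get("group_by", []):
            continue                               # filter at query time
        elif cond["op"] == "eq":
            spec["constant_filters"].append(cond)  # pre-apply in summary
        else:  # between, <, >, ...
            spec["variable_filters"].append(cond)  # apply at query time
            spec["filter_dimensions"].append(cond["col"])  # extra GROUP BY dim
```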
Matches incoming queries to summary tables and rewrites SQL for optimal execution:

Matching Algorithm:
- Compares query's GROUP BY and constant filters against each summary table's signature
- Checks if summary table contains all required aggregations
- Verifies that summary's filter dimensions can satisfy the query's filters
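A sketch of this structural match; all field names are illustrative:

```python
# Sketch: a summary can serve a query iff it keeps the query's grouping
# columns, was built with the same constant filters, and contains every
# aggregation the query needs.
def matches(query_spec: dict, summary: dict) -> bool:
    if not set(query_spec["group_by"]) <= set(summary["dimensions"]):
        return False
    q_const = {(f["col"], str(f["val"])) for f in query_spec["constant_filters"]}
    s_const = {(f["col"], str(f["val"])) for f in summary["constant_filters"]}
    if q_const != s_const:
        return False
    if any(agg not in summary["aggregations"]
           for agg in query_spec["aggregations"]):
        return False
    # Range filters must be on columns the summary kept as dimensions
    return all(col in summary["dimensions"]
               for col in query_spec["filter_dimensions"])
```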
SQL Rewriting:
- Replaces main table with summary table name
- Removes pre-applied constant filters from WHERE clause
- Adds re-aggregation step if summary has more dimensions than query needs
- Preserves ORDER BY, LIMIT, and other query clauses
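A sketch of the rewrite for the aggregations seen in this workload (SUM, AVG, COUNT); the column naming conventions (`sum_<col>`, `row_count`) follow the worked example below and are illustrative:

```python
# Illustrative rewrite: read from the summary, keep only query-time filters,
# and re-aggregate partial results up to the query's GROUP BY level.
def rewrite(query_spec: dict, summary: dict) -> str:
    select_parts = list(query_spec["group_by"])
    for agg in query_spec["aggregations"]:
        (func, col), = agg.items()
        if func == "SUM":
            select_parts.append(f"SUM(sum_{col}) AS sum_{col}")
        elif func == "AVG":
            # An average of averages is wrong; rebuild from sum and count
            select_parts.append(f"SUM(sum_{col}) / SUM(row_count) AS avg_{col}")
        elif func == "COUNT":
            select_parts.append("SUM(row_count) AS count")
    sql = f"SELECT {', '.join(select_parts)} FROM {summary['name']}"
    ranges = [f"{f['col']} BETWEEN '{f['val'][0]}' AND '{f['val'][1]}'"
              for f in query_spec["variable_filters"] if f["op"] == "between"]
    if ranges:
        sql += " WHERE " + " AND ".join(ranges)
    if query_spec["group_by"]:
        sql += " GROUP BY " + ", ".join(query_spec["group_by"])
    return sql
```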
Constructs the optimized database with the sorted main table and summary tables:
Optimization Techniques:
- Physical sorting by `(day, type, country, publisher_id)` enables DuckDB's zonemap pruning
- Batch creation of summary tables with progress tracking
- Automatic detection and handling of high-cardinality scenarios
Key Benefits:

- Automatic Optimization: No manual query tuning required - the system analyzes and optimizes automatically
- Storage Efficiency: Intelligent merging prevents redundant summary tables; high-cardinality detection avoids bloat
- Query Flexibility: Two-stage filtering (pre-apply + query-time) supports both constant and variable filters
- Performance Guarantee: Pre-aggregation + zonemap pruning ensures sub-100ms queries even on 10M+ row datasets
- 100% Routing Accuracy: Precise structural matching guarantees correct results for every query
Why Pre-Apply Only Equality Filters?
- Equality filters (e.g., `type = 'impression'`) are constants that never change
- Pre-applying them reduces summary table size dramatically
- Range filters (BETWEEN, <, >) vary per query, so we keep the dimension and filter at query time
Why Add Filter Columns to GROUP BY?
- Allows query-time filtering on ranges without losing pre-aggregation benefits
- Example: For `day BETWEEN '2024-01-01' AND '2024-01-31'`, we GROUP BY `day` in the summary, then filter at query time
- Enables one summary table to serve multiple queries with different date ranges
Why Merge Compatible Queries?
- Reduces storage: One table with `SUM(x)`, `AVG(x)`, `COUNT(x)` instead of three separate tables
- Simplifies routing: Fewer tables to check during query matching
- Improves build time: Fewer CREATE TABLE operations
Why Skip High-Cardinality Summaries?
- GROUP BY `minute` creates 500K+ rows, larger than the filtered main table
- Without selective filters, the summary offers no performance benefit
- Sorted main table with zonemap pruning is faster for these queries
Consider these two queries:

```sql
-- Q1
SELECT publisher_id, SUM(bid_price)
FROM events
WHERE type='impression' AND country='US'
GROUP BY publisher_id

-- Q2
SELECT publisher_id, AVG(bid_price)
FROM events
WHERE type='impression' AND country='US' AND day BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY publisher_id
```

Phase 1: Analysis
- Both queries have identical GROUP BY (`publisher_id`) and constant filters (`type='impression'`, `country='US'`)
- Q2 has an additional BETWEEN filter on `day`
- System adds `day` to the summary GROUP BY to enable query-time filtering
- Queries are merged into one summary table with both `SUM` and `AVG` aggregations
Phase 2: Summary Table Creation

```sql
CREATE TABLE summary_q1_merged_day_publisher_id AS
SELECT
    day,                              -- Added for Q2's BETWEEN filter
    publisher_id,                     -- Original GROUP BY
    SUM(bid_price) AS sum_bid_price,
    AVG(bid_price) AS avg_bid_price,
    COUNT(*)       AS row_count      -- Needed to re-aggregate AVG at query time
FROM events
WHERE type='impression' AND country='US'  -- Pre-applied constant filters
GROUP BY day, publisher_id
```

Phase 3: Query Execution
- Q1 is rewritten to:

```sql
SELECT publisher_id, SUM(sum_bid_price) AS sum_bid_price
FROM summary_q1_merged_day_publisher_id
GROUP BY publisher_id  -- Re-aggregate across all days
```

- Q2 is rewritten to:

```sql
SELECT publisher_id, SUM(sum_bid_price) / SUM(row_count) AS avg_bid_price
FROM summary_q1_merged_day_publisher_id
WHERE day BETWEEN '2024-01-01' AND '2024-01-31'  -- Query-time filter
GROUP BY publisher_id  -- Re-aggregate for selected days
```

Note that `AVG` cannot simply be re-averaged across days; it is reconstructed as `SUM(sum_bid_price) / SUM(row_count)`, which is why the summary stores the row count.
Result: Both queries execute in <5ms using the same pre-aggregated summary table!
Example run output:

```
Query 1: {'select': ['day', {'SUM': 'bid_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'impression'}], 'group_by': ['day']}
Rows: 366 | Time: 0.001s

Query 2: {'select': ['publisher_id', {'SUM': 'bid_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'impression'}, {'col': 'country', 'op': 'eq', 'val': 'JP'}, {'col': 'day', 'op': 'between', 'val': ['2024-10-20', '2024-10-23']}], 'group_by': ['publisher_id']}
Rows: 1114 | Time: 0.006s

Query 3: {'select': ['country', {'AVG': 'total_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'purchase'}], 'group_by': ['country'], 'order_by': [{'col': 'AVG(total_price)', 'dir': 'desc'}]}
Rows: 12 | Time: 0.005s

Query 4: {'select': ['advertiser_id', 'type', {'COUNT': ''}], 'from': 'events', 'group_by': ['advertiser_id', 'type'], 'order_by': [{'col': 'COUNT()', 'dir': 'desc'}]}
Rows: 6616 | Time: 0.006s

Query 5: {'select': ['minute', {'SUM': 'bid_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'impression'}, {'col': 'day', 'op': 'eq', 'val': '2024-06-01'}], 'group_by': ['minute'], 'order_by': [{'col': 'minute', 'dir': 'asc'}]}
Rows: 1440 | Time: 0.002s

Summary:
  Q1: 0.001s (366 rows)
  Q2: 0.006s (1114 rows)
  Q3: 0.005s (12 rows)
  Q4: 0.006s (6616 rows)
  Q5: 0.002s (1440 rows)
  Total time: 0.032s
```
Install dependencies:

```
pip install -r requirements.txt
```

Convert the CSV data to Parquet:

```
python convert_csv_to_parquet.py --input-dir ../data --output-dir ../data_parquet
```

This will convert all CSV files in the `../data` directory to Parquet format and save them in `../data_parquet`.

First, create the output directory:

```
mkdir ../outputs
```

Then run the main script:

```
python run_all.py --data-dir ../data_parquet --out-dir ../outputs
```

To average timings over repeated runs, skip the main-table build on subsequent runs:

```
python run_all.py --data-dir ../data_parquet --out-dir ../outputs --skip-main-table
```

The main table created in phase 1 remains constant between runs.
This will analyze queries, build the optimized database with adaptive summary tables, execute all queries, and save results to CSV files in the ../outputs directory.