This solution implements an adaptive query optimization system using DuckDB that automatically analyzes incoming queries and builds matching summary tables at runtime. Through this analysis and dynamic materialization, the system achieves sub-100ms query performance.
Demo at: https://youtu.be/XbhGLUYNuz8
| Query | Baseline (s) | Optimized (s) | Speedup |
|---|---|---|---|
| Q1 | 21.363 | 0.001 | 21,363x |
| Q2 | 15.938 | 0.006 | 2,656x |
| Q3 | 15.759 | 0.005 | 3,152x |
| Q4 | 17.185 | 0.006 | 2,864x |
| Q5 | 30.802 | 0.002 | 15,401x |
| Total | 101.047 | 0.032 | 3,158x |
Execution time, baseline vs. optimized (1 char ≈ 1s):

```
Q1:    █████████████████████ 21.363s
       ▏ 0.001s
Q2:    ████████████████ 15.938s
       ▏ 0.006s
Q3:    ████████████████ 15.759s
       ▏ 0.005s
Q4:    █████████████████ 17.185s
       ▏ 0.006s
Q5:    ███████████████████████████████ 30.802s
       ▏ 0.002s
Total: ████████████████████████████████████████████████████████████████████████████████████████████████████ 101.047s
       ▏ 0.032s
```

Speedup per query:

```
Q1:    21,363x ████████████████████████████████████████████████████████████████████████████████████████████████████
Q2:     2,656x ████████████
Q3:     3,152x ███████████████
Q4:     2,864x █████████████
Q5:    15,401x ████████████████████████████████████████████████████████████████████████
Total:  3,158x ███████████████
```
The optimized implementation achieves a 3,158x overall speedup, reducing total execution time from 101.047 seconds to just 0.032 seconds. The most dramatic improvement is Q1, with a 21,363x speedup.
The system operates in three distinct phases to achieve optimal query performance:
Phase 1: Query Analysis & Merging

The adaptive optimizer analyzes all incoming queries to understand their structure and intelligently merges compatible queries:
Query Structure Analysis:
- Extracts GROUP BY columns to determine aggregation granularity
- Identifies aggregation functions (SUM, AVG, COUNT, etc.)
- Distinguishes between constant filters (equality checks) and variable filters (ranges, BETWEEN)
- Detects high-cardinality columns (`user_id`, `auction_id`, `minute`) that could create large summary tables
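For illustration, here is a minimal sketch of this structure extraction, assuming the dictionary query format shown in the results section below; the helper and field names are hypothetical, not the actual implementation:

```python
# Hypothetical helper (not the actual implementation): extract the basic
# structure of one query, using the dict query format shown in the
# results section below.
HIGH_CARDINALITY_COLUMNS = {"user_id", "auction_id", "minute"}

def extract_structure(query: dict) -> dict:
    group_by = query.get("group_by", [])
    return {
        "group_by": group_by,
        # Aggregations appear as {'SUM': 'bid_price'}-style dicts in 'select'
        "aggregations": [s for s in query.get("select", []) if isinstance(s, dict)],
        "filters": query.get("where", []),
        # e.g. GROUP BY minute can yield a summary with 500K+ rows
        "high_cardinality": any(c in HIGH_CARDINALITY_COLUMNS for c in group_by),
    }
```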
Intelligent Filter Handling:
- Equality Filters on Non-GROUP BY Columns: Pre-applied during summary table creation to reduce data size
- Non-Equality Filters (BETWEEN, <, >): Column is added to the summary table's GROUP BY so filtering can happen at query time
- Filters on GROUP BY Columns: Never pre-applied; always filtered at query time for flexibility
Smart Merging Strategy:
- Queries with identical GROUP BY columns and constant filters are merged into a single summary table
- The merged table includes all aggregations from all compatible queries
- This prevents redundant summary tables and reduces storage overhead
- Example: If Q1 needs `SUM(bid_price)` and Q2 needs `AVG(bid_price)` with the same GROUP BY, one table is created with both aggregations
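A simplified sketch of this merging step, as described above; the spec fields (`constant_filters`, `filter_dimensions`, etc.) are assumed to come from the analysis step, and all names are illustrative:

```python
# Illustrative sketch: merge specs that share a signature of
# (GROUP BY columns, constant filters) into one summary-table spec.
from collections import defaultdict

def merge_specs(specs: list) -> list:
    groups = defaultdict(list)
    for spec in specs:
        signature = (
            tuple(sorted(spec["group_by"])),
            tuple(sorted((f["col"], str(f["val"])) for f in spec["constant_filters"])),
        )
        groups[signature].append(spec)

    merged = []
    for compatible in groups.values():
        base = compatible[0]
        for other in compatible[1:]:
            # Union of aggregations: one table serves SUM, AVG, COUNT, ...
            for agg in other["aggregations"]:
                if agg not in base["aggregations"]:
                    base["aggregations"].append(agg)
            # Union of range-filter columns: kept as extra GROUP BY
            # dimensions so each query can still filter at query time
            for col in other["filter_dimensions"]:
                if col not in base["filter_dimensions"]:
                    base["filter_dimensions"].append(col)
        merged.append(base)
    return merged
```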
Cardinality Optimization:
- Detects queries that would create extremely large summary tables (e.g., GROUP BY minute without time filters)
- Skips summary table creation for high-cardinality queries without selective filters
- These queries fall back to the sorted main table, which is still fast due to zonemap pruning
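The skip rule itself is simple; a sketch under the same assumed spec fields:

```python
# Sketch: build a summary only if it will actually be smaller than what
# zonemap pruning already gives us on the sorted main table.
def should_build_summary(spec: dict) -> bool:
    # e.g. GROUP BY minute with no equality filter => 500K+ row summary
    if spec["high_cardinality"] and not spec["constant_filters"]:
        return False
    return True
```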
Phase 2: Database Construction

Based on the analysis, the system builds an optimized database:
- Main Table: Loads data from Parquet files and physically sorts it by common filter columns (`day`, `type`, `country`, `publisher_id`) to enable DuckDB's zonemap pruning
- Summary Tables: Creates pre-aggregated tables for each unique query pattern (after merging), applying constant filters during materialization
- DISTINCT Tables: For non-aggregated SELECT queries, creates DISTINCT summary tables to eliminate duplicates and reduce data size
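A condensed sketch of this build phase using the `duckdb` Python API; the database path and the inlined summary statement are illustrative, since in the real system the statements come from the analyzer:

```python
import duckdb

con = duckdb.connect("optimized.db")  # illustrative path

# Main table: load Parquet and physically sort by the common filter columns
# so DuckDB's zonemaps (per-row-group min/max metadata) can prune scans.
con.execute("""
    CREATE TABLE events AS
    SELECT * FROM read_parquet('../data_parquet/*.parquet')
    ORDER BY day, type, country, publisher_id
""")

# Summary tables: one CREATE TABLE per merged spec, with constant filters
# pre-applied. One example statement inlined here for illustration.
summary_table_statements = [
    """
    CREATE TABLE summary_day AS
    SELECT day, SUM(bid_price) AS sum_bid_price, COUNT(*) AS row_count
    FROM events
    WHERE type = 'impression'   -- pre-applied constant filter
    GROUP BY day
    """,
]
for sql in summary_table_statements:
    con.execute(sql)
```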
Phase 3: Query Routing

The adaptive router intelligently routes queries to the optimal data structure:
- Matches incoming queries against available summary tables using structural comparison
- Rewrites SQL to use pre-aggregated data when a match is found
- Two-Stage Filtering: Applies remaining filters (ranges, additional conditions) at query time, then re-aggregates if needed
- Falls back to the sorted main table for queries without matching summaries
- Maintains 100% routing accuracy through precise query pattern matching
Query Rewriting Strategy:
For queries using summary tables, the router:

- Selects from the pre-aggregated summary table
- Applies any non-constant filters (BETWEEN, ranges, etc.)
- Re-aggregates to the query's original GROUP BY level if the summary has additional dimensions
- Applies ORDER BY and LIMIT clauses
This approach eliminates the need for manual query optimization while maintaining 100% routing accuracy and sub-100ms query performance.
The core analysis engine that processes queries and generates summary table specifications:
Main Methods:
- `analyze_queries()`: Two-pass algorithm that first collects query specs, then merges compatible ones
- `_create_summary_spec()`: Analyzes individual queries to determine optimal summary table structure
- `_merge_summary_specs()`: Groups specs by signature (GROUP BY + filters) and merges compatible queries
- `generate_summary_table_sql()`: Generates CREATE TABLE SQL with pre-applied constant filters
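Putting these together, the analyzer's control flow looks roughly like this; the class name and signatures are inferred from the descriptions above, not copied from the source:

```python
class QueryAnalyzer:
    """Rough skeleton inferred from the method descriptions above."""

    def analyze_queries(self, queries: list) -> list:
        # Pass 1: one candidate spec per query
        specs = [self._create_summary_spec(q) for q in queries]
        # Pass 2: merge specs with matching signatures
        return self._merge_summary_specs(specs)

    def _create_summary_spec(self, query: dict) -> dict:
        ...  # classify filters, collect aggregations and dimensions

    def _merge_summary_specs(self, specs: list) -> list:
        ...  # group by (GROUP BY, constant filters), union aggregations

    def generate_summary_table_sql(self, spec: dict) -> str:
        ...  # CREATE TABLE ... with constant filters pre-applied
```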
Key Algorithm - Filter Classification:

```
For each WHERE condition:
    If column is in the query's GROUP BY:
        → Skip (will filter at query time)
    Else if operator is equality (=):
        → Pre-apply in summary table (constant_filters)
    Else (BETWEEN, <, >, etc.):
        → Add column to summary GROUP BY (filter_dimensions)
```
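Transcribed into Python, under the same illustrative spec fields used in the sketches above:

```python
# Direct transcription of the classification rules above (illustrative names).
def classify_filters(query: dict, spec: dict) -> None:
    for cond in query.get("where", []):
        if cond["col"] in query.get("group_by", []):
            continue                               # filter at query time
        elif cond["op"] == "eq":
            spec["constant_filters"].append(cond)  # pre-apply in summary
        else:  # between, <, >, ...
            spec["variable_filters"].append(cond)  # apply at query time
            spec["filter_dimensions"].append(cond["col"])  # extra GROUP BY dim
```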
Matches incoming queries to summary tables and rewrites SQL for optimal execution:

Matching Algorithm:
- Compares query's GROUP BY and constant filters against each summary table's signature
- Checks if summary table contains all required aggregations
- Verifies that summary's filter dimensions can satisfy the query's filters
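A sketch of this structural match; all field names are illustrative:

```python
# Sketch: a summary can serve a query iff it keeps the query's grouping
# columns, was built with the same constant filters, and contains every
# aggregation the query needs.
def matches(query_spec: dict, summary: dict) -> bool:
    if not set(query_spec["group_by"]) <= set(summary["dimensions"]):
        return False
    q_const = {(f["col"], str(f["val"])) for f in query_spec["constant_filters"]}
    s_const = {(f["col"], str(f["val"])) for f in summary["constant_filters"]}
    if q_const != s_const:
        return False
    if any(agg not in summary["aggregations"]
           for agg in query_spec["aggregations"]):
        return False
    # Range filters must be on columns the summary kept as dimensions
    return all(col in summary["dimensions"]
               for col in query_spec["filter_dimensions"])
```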
SQL Rewriting:
- Replaces main table with summary table name
- Removes pre-applied constant filters from WHERE clause
- Adds re-aggregation step if summary has more dimensions than query needs
- Preserves ORDER BY, LIMIT, and other query clauses
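A sketch of the rewrite for the aggregations seen in this workload (SUM, AVG, COUNT); the column naming conventions (`sum_<col>`, `row_count`) follow the worked example below and are illustrative:

```python
# Illustrative rewrite: read from the summary, keep only query-time filters,
# and re-aggregate partial results up to the query's GROUP BY level.
def rewrite(query_spec: dict, summary: dict) -> str:
    select_parts = list(query_spec["group_by"])
    for agg in query_spec["aggregations"]:
        (func, col), = agg.items()
        if func == "SUM":
            select_parts.append(f"SUM(sum_{col}) AS sum_{col}")
        elif func == "AVG":
            # An average of averages is wrong; rebuild from sum and count
            select_parts.append(f"SUM(sum_{col}) / SUM(row_count) AS avg_{col}")
        elif func == "COUNT":
            select_parts.append("SUM(row_count) AS count")
    sql = f"SELECT {', '.join(select_parts)} FROM {summary['name']}"
    ranges = [f"{f['col']} BETWEEN '{f['val'][0]}' AND '{f['val'][1]}'"
              for f in query_spec["variable_filters"] if f["op"] == "between"]
    if ranges:
        sql += " WHERE " + " AND ".join(ranges)
    if query_spec["group_by"]:
        sql += " GROUP BY " + ", ".join(query_spec["group_by"])
    return sql
```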
Constructs the optimized database with the sorted main table and summary tables:
Optimization Techniques:
- Physical sorting by `(day, type, country, publisher_id)` enables DuckDB's zonemap pruning
- Batch creation of summary tables with progress tracking
- Automatic detection and handling of high-cardinality scenarios
Key Benefits:

- Automatic Optimization: No manual query tuning required - the system analyzes and optimizes automatically
- Storage Efficiency: Intelligent merging prevents redundant summary tables; high-cardinality detection avoids bloat
- Query Flexibility: Two-stage filtering (pre-apply + query-time) supports both constant and variable filters
- Performance Guarantee: Pre-aggregation + zonemap pruning ensures sub-100ms queries even on 10M+ row datasets
- 100% Routing Accuracy: Precise structural matching guarantees correct results for every query
Why Pre-Apply Only Equality Filters?
- Equality filters (e.g., `type = 'impression'`) are constants that never change
- Pre-applying them reduces summary table size dramatically
- Range filters (BETWEEN, <, >) vary per query, so we keep the dimension and filter at query time
Why Add Filter Columns to GROUP BY?
- Allows query-time filtering on ranges without losing pre-aggregation benefits
- Example: For `day BETWEEN '2024-01-01' AND '2024-01-31'`, we GROUP BY `day` in the summary, then filter at query time
- Enables one summary table to serve multiple queries with different date ranges
Why Merge Compatible Queries?
- Reduces storage: One table with `SUM(x)`, `AVG(x)`, `COUNT(x)` instead of three separate tables
- Simplifies routing: Fewer tables to check during query matching
- Improves build time: Fewer CREATE TABLE operations
Why Skip High-Cardinality Summaries?
- GROUP BY `minute` creates 500K+ rows, larger than the filtered main table
- Without selective filters, the summary offers no performance benefit
- Sorted main table with zonemap pruning is faster for these queries
Consider these two queries:

```sql
-- Q1
SELECT publisher_id, SUM(bid_price)
FROM events
WHERE type='impression' AND country='US'
GROUP BY publisher_id

-- Q2
SELECT publisher_id, AVG(bid_price)
FROM events
WHERE type='impression' AND country='US' AND day BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY publisher_id
```

Phase 1: Analysis
- Both queries have identical GROUP BY (`publisher_id`) and constant filters (`type='impression'`, `country='US'`)
- Q2 has an additional BETWEEN filter on `day`
- System adds `day` to the summary GROUP BY to enable query-time filtering
- Queries are merged into one summary table with both `SUM` and `AVG` aggregations
Phase 2: Summary Table Creation

```sql
CREATE TABLE summary_q1_merged_day_publisher_id AS
SELECT
    day,                              -- Added for Q2's BETWEEN filter
    publisher_id,                     -- Original GROUP BY
    SUM(bid_price) AS sum_bid_price,
    AVG(bid_price) AS avg_bid_price,
    COUNT(*)       AS row_count      -- Needed to re-aggregate AVG at query time
FROM events
WHERE type='impression' AND country='US'  -- Pre-applied constant filters
GROUP BY day, publisher_id
```

Phase 3: Query Execution
- Q1 is rewritten to:

```sql
SELECT publisher_id, SUM(sum_bid_price) AS sum_bid_price
FROM summary_q1_merged_day_publisher_id
GROUP BY publisher_id  -- Re-aggregate across all days
```

- Q2 is rewritten to:

```sql
SELECT publisher_id, SUM(sum_bid_price) / SUM(row_count) AS avg_bid_price
FROM summary_q1_merged_day_publisher_id
WHERE day BETWEEN '2024-01-01' AND '2024-01-31'  -- Query-time filter
GROUP BY publisher_id  -- Re-aggregate for selected days
```

Note that `AVG` cannot simply be re-averaged across days; it is reconstructed as `SUM(sum_bid_price) / SUM(row_count)`, which is why the summary stores the row count.
Result: Both queries execute in <5ms using the same pre-aggregated summary table!
Example run output:

```
Query 1: {'select': ['day', {'SUM': 'bid_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'impression'}], 'group_by': ['day']}
Rows: 366 | Time: 0.001s

Query 2: {'select': ['publisher_id', {'SUM': 'bid_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'impression'}, {'col': 'country', 'op': 'eq', 'val': 'JP'}, {'col': 'day', 'op': 'between', 'val': ['2024-10-20', '2024-10-23']}], 'group_by': ['publisher_id']}
Rows: 1114 | Time: 0.006s

Query 3: {'select': ['country', {'AVG': 'total_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'purchase'}], 'group_by': ['country'], 'order_by': [{'col': 'AVG(total_price)', 'dir': 'desc'}]}
Rows: 12 | Time: 0.005s

Query 4: {'select': ['advertiser_id', 'type', {'COUNT': ''}], 'from': 'events', 'group_by': ['advertiser_id', 'type'], 'order_by': [{'col': 'COUNT()', 'dir': 'desc'}]}
Rows: 6616 | Time: 0.006s

Query 5: {'select': ['minute', {'SUM': 'bid_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'impression'}, {'col': 'day', 'op': 'eq', 'val': '2024-06-01'}], 'group_by': ['minute'], 'order_by': [{'col': 'minute', 'dir': 'asc'}]}
Rows: 1440 | Time: 0.002s

Summary:
  Q1: 0.001s (366 rows)
  Q2: 0.006s (1114 rows)
  Q3: 0.005s (12 rows)
  Q4: 0.006s (6616 rows)
  Q5: 0.002s (1440 rows)
  Total time: 0.032s
```
Install dependencies:

```
pip install -r requirements.txt
```

Convert the CSV data to Parquet:

```
python convert_csv_to_parquet.py --input-dir ../data --output-dir ../data_parquet
```

This will convert all CSV files in the `../data` directory to Parquet format and save them in `../data_parquet`.

First, create the output directory:

```
mkdir ../outputs
```

Then run the main script:

```
python run_all.py --data-dir ../data_parquet --out-dir ../outputs
```

To average timings over repeated runs, skip the main-table build on subsequent runs:

```
python run_all.py --data-dir ../data_parquet --out-dir ../outputs --skip-main-table
```

The main table created in phase 1 remains constant between runs.
This will analyze queries, build the optimized database with adaptive summary tables, execute all queries, and save results to CSV files in the ../outputs directory.