Skip to content

kee5625/QP_fast

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

17 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Adaptive Query Optimization System

Overview

This solution implements an adaptive query optimization system using DuckDB that automatically analyzes queries and builds optimal summary tables at runtime. The system achieves sub-100ms query performance through intelligent query analysis and dynamic materialization.

Demo at: https://youtu.be/XbhGLUYNuz8

Performance Comparison

Execution Time Comparison

Query Baseline (s) Optimized (s) Speedup
Q1 21.363 0.001 21,363x
Q2 15.938 0.006 2,656x
Q3 15.759 0.005 3,152x
Q4 17.185 0.006 2,864x
Q5 30.802 0.002 15,401x
Total 101.047 0.032 3,158x

Visual Comparison

Baseline vs Optimized (seconds)

Q1:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 21.363s
     ▏ 0.001s

Q2:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 15.938s
     ▏ 0.006s

Q3:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 15.759s
     ▏ 0.005s

Q4:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 17.185s
     ▏ 0.006s

Q5:  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 30.802s
     ▏ 0.002s

Total: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ 101.047s
       ▏ 0.032s

Speedup Factor

Q1:     21000x β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ
Q2:      2000x β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ
Q3:      3000x β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰
Q4:      2800x β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ
Q5:     15000x β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž
Total:   3000x β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰

Summary

The optimized implementation achieves a ~3000x speedup overall, reducing total execution time from 101.047 seconds to just 0.032 seconds. The most dramatic improvement is seen in Q1 with a ~21000x speedup.

How It Works

The system operates in three distinct phases to achieve optimal query performance:

Phase 1: Query Analysis & Intelligent Merging

The adaptive optimizer analyzes all incoming queries to understand their structure and intelligently merges compatible queries:

Query Structure Analysis:

  • Extracts GROUP BY columns to determine aggregation granularity
  • Identifies aggregation functions (SUM, AVG, COUNT, etc.)
  • Distinguishes between constant filters (equality checks) and variable filters (ranges, BETWEEN)
  • Detects high-cardinality columns (user_id, auction_id, minute) that could create large summary tables

Intelligent Filter Handling:

  • Equality Filters on Non-GROUP BY Columns: Pre-applied during summary table creation to reduce data size
  • Non-Equality Filters (BETWEEN, <, >): Column is added to the summary table's GROUP BY so filtering can happen at query time
  • Filters on GROUP BY Columns: Never pre-applied; always filtered at query time for flexibility

Smart Merging Strategy:

  • Queries with identical GROUP BY columns and constant filters are merged into a single summary table
  • The merged table includes all aggregations from all compatible queries
  • This prevents redundant summary tables and reduces storage overhead
  • Example: If Q1 needs SUM(bid_price) and Q2 needs AVG(bid_price) with the same GROUP BY, one table is created with both aggregations

Cardinality Optimization:

  • Detects queries that would create extremely large summary tables (e.g., GROUP BY minute without time filters)
  • Skips summary table creation for high-cardinality queries without selective filters
  • These queries fall back to the sorted main table, which is still fast due to zonemap pruning

Phase 2: Database Materialization

Based on the analysis, the system builds an optimized database:

  • Main Table: Loads data from Parquet files and physically sorts by common filter columns (day, type, country, publisher_id) to enable DuckDB's zonemap pruning
  • Summary Tables: Creates pre-aggregated tables for each unique query pattern (after merging), applying constant filters during materialization
  • DISTINCT Tables: For non-aggregated SELECT queries, creates DISTINCT summary tables to eliminate duplicates and reduce data size

Phase 3: Query Execution & Adaptive Routing

The adaptive router intelligently routes queries to the optimal data structure:

  • Matches incoming queries against available summary tables using structural comparison
  • Rewrites SQL to use pre-aggregated data when a match is found
  • Two-Stage Filtering: Applies remaining filters (ranges, additional conditions) at query time, then re-aggregates if needed
  • Falls back to the sorted main table for queries without matching summaries
  • Maintains 100% routing accuracy through precise query pattern matching

Query Rewriting Strategy:

  • For queries using summary tables, the router:
    1. Selects from the pre-aggregated summary table
    2. Applies any non-constant filters (BETWEEN, ranges, etc.)
    3. Re-aggregates to the query's original GROUP BY level if the summary has additional dimensions
    4. Applies ORDER BY and LIMIT clauses

This approach eliminates the need for manual query optimization while maintaining 100% routing accuracy and sub-100ms query performance.

Key Technical Components

1. AdaptiveOptimizer (adaptive_optimizer.py)

The core analysis engine that processes queries and generates summary table specifications:

Main Methods:

  • analyze_queries(): Two-pass algorithm that first collects query specs, then merges compatible ones
  • _create_summary_spec(): Analyzes individual queries to determine optimal summary table structure
  • _merge_summary_specs(): Groups specs by signature (GROUP BY + filters) and merges compatible queries
  • generate_summary_table_sql(): Generates CREATE TABLE SQL with pre-applied constant filters

Key Algorithm - Filter Classification:

For each WHERE condition:
  If column in query's GROUP BY:
    β†’ Skip (will filter at query time)
  Else if operator is equality (=):
    β†’ Pre-apply in summary table (constant_filters)
  Else (BETWEEN, <, >, etc.):
    β†’ Add column to summary GROUP BY (filter_dimensions)

2. AdaptiveRouter (adaptive_router.py)

Matches incoming queries to summary tables and rewrites SQL for optimal execution:

Matching Algorithm:

  • Compares query's GROUP BY and constant filters against each summary table's signature
  • Checks if summary table contains all required aggregations
  • Verifies that summary's filter dimensions can satisfy the query's filters

SQL Rewriting:

  • Replaces main table with summary table name
  • Removes pre-applied constant filters from WHERE clause
  • Adds re-aggregation step if summary has more dimensions than query needs
  • Preserves ORDER BY, LIMIT, and other query clauses

3. Database Builder (database_builder.py)

Constructs the optimized database with sorted main table and summary tables:

Optimization Techniques:

  • Physical sorting by (day, type, country, publisher_id) enables DuckDB's zonemap pruning
  • Batch creation of summary tables with progress tracking
  • Automatic detection and handling of high-cardinality scenarios

Design Advantages

Why This Approach Works

  1. Automatic Optimization: No manual query tuning required - the system analyzes and optimizes automatically
  2. Storage Efficiency: Intelligent merging prevents redundant summary tables; high-cardinality detection avoids bloat
  3. Query Flexibility: Two-stage filtering (pre-apply + query-time) supports both constant and variable filters
  4. Performance Guarantee: Pre-aggregation + zonemap pruning ensures sub-100ms queries even on 10M+ row datasets
  5. 100% Routing Accuracy: Precise structural matching guarantees correct results for every query

Key Design Decisions

Why Pre-Apply Only Equality Filters?

  • Equality filters (e.g., type = 'impression') are constants that never change
  • Pre-applying them reduces summary table size dramatically
  • Range filters (BETWEEN, <, >) vary per query, so we keep the dimension and filter at query time

Why Add Filter Columns to GROUP BY?

  • Allows query-time filtering on ranges without losing pre-aggregation benefits
  • Example: For day BETWEEN '2024-01-01' AND '2024-01-31', we GROUP BY day in the summary, then filter at query time
  • Enables one summary table to serve multiple queries with different date ranges

Why Merge Compatible Queries?

  • Reduces storage: One table with SUM(x), AVG(x), COUNT(x) instead of three separate tables
  • Simplifies routing: Fewer tables to check during query matching
  • Improves build time: Fewer CREATE TABLE operations

Why Skip High-Cardinality Summaries?

  • GROUP BY minute creates 500K+ rows - larger than the filtered main table
  • Without selective filters, the summary offers no performance benefit
  • Sorted main table with zonemap pruning is faster for these queries

Example: How It Works End-to-End

Consider these two queries:

Q1: SELECT publisher_id, SUM(bid_price) 
    FROM events 
    WHERE type='impression' AND country='US'
    GROUP BY publisher_id

Q2: SELECT publisher_id, AVG(bid_price) 
    FROM events 
    WHERE type='impression' AND country='US' AND day BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY publisher_id

Phase 1: Analysis

  • Both queries have identical GROUP BY (publisher_id) and constant filters (type='impression', country='US')
  • Q2 has an additional BETWEEN filter on day
  • System adds day to the summary GROUP BY to enable query-time filtering
  • Queries are merged into one summary table with both SUM and AVG aggregations

Phase 2: Summary Table Creation

CREATE TABLE summary_q1_merged_day_publisher_id AS
SELECT 
    day,                           -- Added for Q2's BETWEEN filter
    publisher_id,                  -- Original GROUP BY
    SUM(bid_price) AS sum_bid_price,
    AVG(bid_price) AS avg_bid_price
FROM events
WHERE type='impression' AND country='US'  -- Pre-applied constant filters
GROUP BY day, publisher_id

Phase 3: Query Execution

  • Q1 is rewritten to:

    SELECT publisher_id, SUM(sum_bid_price) AS sum_bid_price
    FROM summary_q1_merged_day_publisher_id
    GROUP BY publisher_id  -- Re-aggregate across all days
  • Q2 is rewritten to:

    SELECT publisher_id, SUM(sum_bid_price * row_count) / SUM(row_count) AS avg_bid_price
    FROM summary_q1_merged_day_publisher_id
    WHERE day BETWEEN '2024-01-01' AND '2024-01-31'  -- Query-time filter
    GROUP BY publisher_id  -- Re-aggregate for selected days

Result: Both queries execute in <5ms using the same pre-aggregated summary table!

Benchmarks against sample queries:

Query 1: {'select': ['day', {'SUM': 'bid_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'impression'}], 'group_by': ['day']}

Rows: 366 | Time: 0.001s

Query 2: {'select': ['publisher_id', {'SUM': 'bid_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'impression'}, {'col': 'country', 'op': 'eq', 'val': 'JP'}, {'col': 'day', 'op': 'between', 'val': ['2024-10-20', '2024-10-23']}], 'group_by': ['publisher_id']}

Rows: 1114 | Time: 0.006s

Query 3: {'select': ['country', {'AVG': 'total_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'purchase'}], 'group_by': ['country'], 'order_by': [{'col': 'AVG(total_price)', 'dir': 'desc'}]}

Rows: 12 | Time: 0.005s

Query 4: {'select': ['advertiser_id', 'type', {'COUNT': ''}], 'from': 'events', 'group_by': ['advertiser_id', 'type'], 'order_by': [{'col': 'COUNT()', 'dir': 'desc'}]}

Rows: 6616 | Time: 0.006s

Query 5: {'select': ['minute', {'SUM': 'bid_price'}], 'from': 'events', 'where': [{'col': 'type', 'op': 'eq', 'val': 'impression'}, {'col': 'day', 'op': 'eq', 'val': '2024-06-01'}], 'group_by': ['minute'], 'order_by': [{'col': 'minute', 'dir': 'asc'}]}

Rows: 1440 | Time: 0.002s

Summary: Q1: 0.001s (366 rows) Q2: 0.006s (1114 rows) Q3: 0.005s (12 rows) Q4: 0.006s (6616 rows) Q5: 0.002s (1440 rows) Total time: 0.032s

How to Setup

1. Install the requirements

pip install -r requirements.txt

2. Convert CSV files to Parquet format

python convert_csv_to_parquet.py --input-dir ../data --output-dir ../data_parquet

This will convert all CSV files in the ../data directory to Parquet format and save them in ../data_parquet.

3. Load data and execute queries

First, create the output directory:

mkdir ../outputs

Then run the main script:

python run_all.py --data-dir ../data_parquet --out-dir ../outputs

For averaging, simply run:

python run_all.py --data-dir ../data_paraquet --out-dir ../outputs --skip-main-table

The main table created in phase 1 remains constant.

This will analyze queries, build the optimized database with adaptive summary tables, execute all queries, and save results to CSV files in the ../outputs directory.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages