-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Is your feature request related to a problem or challenge?
IN list filters have become critical-path operations for dynamic filter pushdown. When scanning large tables with partition pruning or dynamic filters, the IN expression is evaluated millions of times per query. The current generic implementation leaves significant performance on the table.
Describe the solution you'd like
Implement type-specialized StaticFilter implementations that exploit the properties of different data types to achieve substantial performance improvements.
Optimization Strategies
1. Bitmap Filters for Small Integer Types
For 1-byte and 2-byte types, use fixed-size bitmaps for O(1) membership testing:
- UInt8/Int8: 256-bit bitmap (32 bytes, stack-allocated, fits in cache line)
- UInt16/Int16: 65536-bit bitmap (8KB, heap-allocated, fits in L1 cache)
2. Const-Generic Branchless Evaluation
For small IN lists on larger primitives, use compile-time-known array sizes:
- Loop unrolling via
fold(false, |acc, &v| acc | (v == needle)) - Branch elimination (no conditional jumps)
- Thresholds: ≤32 for 4-byte, ≤16 for 8-byte, ≤4 for 16-byte types
3. Direct Probe Hash Filter
For larger primitive lists, use a custom open-addressing hash table:
- Linear probing with 25% load factor
- Golden-ratio multiply hash function
- Direct array storage (no indirection like
std::HashSet)
4. Type Normalization (Zero-Copy)
Normalize types via buffer reinterpretation (only bit patterns matter for equality):
- Signed integers → Unsigned (
Int32→UInt32) - Floats → Unsigned (
Float32→UInt32) - Timestamp/Date/Duration → underlying integer type
5. Two-Stage String Filters
- Stage 1: O(1) rejection using length + prefix encoded as i128
- Stage 2: Full verification only for long string candidates
Filter Selection Strategy
instantiate_static_filter(array)
│
├─ Utf8View + all_short_strings? ──────────► Branchless/Hash i128
│
├─ primitive_width=1 ──────────────────────► Bitmap (32 bytes)
├─ primitive_width=2 ──────────────────────► Bitmap (8 KB)
│
├─ primitive_width=4
│ ├─ len ≤ 32 ────────────────────────────► Branchless<UInt32, N>
│ └─ len > 32 ────────────────────────────► DirectProbeFilter<UInt32>
│
├─ primitive_width=8
│ ├─ len ≤ 16 ────────────────────────────► Branchless<UInt64, N>
│ └─ len > 16 ────────────────────────────► DirectProbeFilter<UInt64>
│
├─ primitive_width=16
│ ├─ len ≤ 4 ─────────────────────────────► Branchless<Decimal128, N>
│ └─ len > 4 ─────────────────────────────► DirectProbeFilter<Decimal128>
│
├─ Utf8 / LargeUtf8 ───────────────────────► Utf8TwoStageFilter
├─ Utf8View / BinaryView ──────────────────► ByteViewMaskedFilter
│
└─ Everything else ────────────────────────► NestedTypeFilter (fallback)
Benchmark Results
Benchmarks from the reference implementation (see #19390 ):
| Category | Representative Benchmark | Speedup |
|---|---|---|
| Bitmap (1-byte) | u8/list=16/match=50% | 8.5× |
| Bitmap (2-byte) | i16/list=64/match=50% | 8.5× |
| Branchless (4-byte) | i32/list=4/match=50% | 10.4× |
| Branchless (8-byte) | i64/list=4/match=50% | 11.1× |
| Branchless (timestamp) | timestamp_ns/list=4/match=50% | 13.6× |
| Hash (large lists) | i32/hash/list=256/match=50% | 2.4× |
| Utf8View (short) | utf8view/short_8b/nulls=50% | 6.3× |
| Utf8View (boundary) | utf8view/boundary_12b/match=50% | 7.0× |
| Utf8 (short) | utf8/short_8b/list=256/match=50% | 2.3× |
| Realistic | port_numbers/pool=1000/filter=10 | 3.9× |
| Realistic | status_codes/nulls=20% | 4.8× |
| Realistic | email_addresses/list=20 | 3.0× |
Implementation Plan
A reference implementation is available in #19390 . To facilitate review, it will be submitted as a series of incremental PRs:
PR 1: Comprehensive Benchmark Suite
- Add
benches/in_list_optimizations.rscovering all optimization paths
PR 2: Modular StaticFilter Architecture
- Refactor
InListExprto use trait-basedStaticFilterarchitecture - Add
NestedTypeFilteras fallback for complex types
PR 3: Bitmap Filter for UInt8 (stack-based)
- Add
BitmapFilter<U8Config>(stack-allocated, 32 bytes)
PR 4: Bitmap Filter for UInt16 (heap-based)
- Add
BitmapFilter<U16Config>(heap-allocated, 8KB)
PR 5: Zero-Copy Type Reinterpretation
- Add
reinterpret_any_primitive_to<T>()for buffer reinterpretation - Enable Int8/Int16 to use bitmap filters
PR 6: Branchless Filter for Small Primitive Lists
- Add
BranchlessFilter<T, N>with const-generic sizes (0-32) - Implement
ReinterpretedBranchlessfor type normalization
PR 7: Direct Probe Hash Filter for Large Primitive Lists
- Add
DirectProbeFilter<T>with open-addressing hash table - Implement
DirectProbeHashabletrait with golden-ratio hash
PR 8: ByteView Two-Stage Filter (Utf8View/BinaryView)
- Add
ByteViewMaskedFilter<T>for Utf8View/BinaryView - Implement masked view quick rejection + HashTable verification
PR 9: Utf8/LargeUtf8 Two-Stage Filter
- Add
Utf8TwoStageFilter<O>for Utf8/LargeUtf8 - Implement length+prefix encoding as i128
Additional Context
- Follow-up to Restore IN_LIST performance -- Implement specialized
StaticFiltersfor different data types #18824 (Restore IN_LIST performance) - Thresholds (32/16/4) were tuned via microbenchmarks; binary search was tested but consistently lost to branchless/hash
- Dictionary arrays are handled transparently via unpacking, evaluation, and re-indexing
- Bitmap filters outperform branchless/hash at ALL list sizes for 1-2 byte types