Skip to content

Further improve performance of IN list evaluation #19241

@geoffreyclaude

Description

@geoffreyclaude

Is your feature request related to a problem or challenge?

IN list filters have become critical-path operations for dynamic filter pushdown. When scanning large tables with partition pruning or dynamic filters, the IN expression is evaluated millions of times per query. The current generic implementation leaves significant performance on the table.

Describe the solution you'd like

Implement type-specialized StaticFilter implementations that exploit the properties of different data types to achieve substantial performance improvements.

Optimization Strategies

1. Bitmap Filters for Small Integer Types

For 1-byte and 2-byte types, use fixed-size bitmaps for O(1) membership testing:

  • UInt8/Int8: 256-bit bitmap (32 bytes, stack-allocated, fits in cache line)
  • UInt16/Int16: 65536-bit bitmap (8KB, heap-allocated, fits in L1 cache)

2. Const-Generic Branchless Evaluation

For small IN lists on larger primitives, use compile-time-known array sizes:

  • Loop unrolling via fold(false, |acc, &v| acc | (v == needle))
  • Branch elimination (no conditional jumps)
  • Thresholds: ≤32 for 4-byte, ≤16 for 8-byte, ≤4 for 16-byte types

3. Direct Probe Hash Filter

For larger primitive lists, use a custom open-addressing hash table:

  • Linear probing with 25% load factor
  • Golden-ratio multiply hash function
  • Direct array storage (no indirection like std::HashSet)

4. Type Normalization (Zero-Copy)

Normalize types via buffer reinterpretation (only bit patterns matter for equality):

  • Signed integers → Unsigned (Int32UInt32)
  • Floats → Unsigned (Float32UInt32)
  • Timestamp/Date/Duration → underlying integer type

5. Two-Stage String Filters

  • Stage 1: O(1) rejection using length + prefix encoded as i128
  • Stage 2: Full verification only for long string candidates

Filter Selection Strategy

instantiate_static_filter(array)
│
├─ Utf8View + all_short_strings? ──────────► Branchless/Hash i128
│
├─ primitive_width=1 ──────────────────────► Bitmap (32 bytes)
├─ primitive_width=2 ──────────────────────► Bitmap (8 KB)
│
├─ primitive_width=4
│  ├─ len ≤ 32 ────────────────────────────► Branchless<UInt32, N>
│  └─ len > 32 ────────────────────────────► DirectProbeFilter<UInt32>
│
├─ primitive_width=8
│  ├─ len ≤ 16 ────────────────────────────► Branchless<UInt64, N>
│  └─ len > 16 ────────────────────────────► DirectProbeFilter<UInt64>
│
├─ primitive_width=16
│  ├─ len ≤ 4 ─────────────────────────────► Branchless<Decimal128, N>
│  └─ len > 4 ─────────────────────────────► DirectProbeFilter<Decimal128>
│
├─ Utf8 / LargeUtf8 ───────────────────────► Utf8TwoStageFilter
├─ Utf8View / BinaryView ──────────────────► ByteViewMaskedFilter
│
└─ Everything else ────────────────────────► NestedTypeFilter (fallback)

Benchmark Results

Benchmarks from the reference implementation (see #19390 ):

Category Representative Benchmark Speedup
Bitmap (1-byte) u8/list=16/match=50% 8.5×
Bitmap (2-byte) i16/list=64/match=50% 8.5×
Branchless (4-byte) i32/list=4/match=50% 10.4×
Branchless (8-byte) i64/list=4/match=50% 11.1×
Branchless (timestamp) timestamp_ns/list=4/match=50% 13.6×
Hash (large lists) i32/hash/list=256/match=50% 2.4×
Utf8View (short) utf8view/short_8b/nulls=50% 6.3×
Utf8View (boundary) utf8view/boundary_12b/match=50% 7.0×
Utf8 (short) utf8/short_8b/list=256/match=50% 2.3×
Realistic port_numbers/pool=1000/filter=10 3.9×
Realistic status_codes/nulls=20% 4.8×
Realistic email_addresses/list=20 3.0×

Implementation Plan

A reference implementation is available in #19390 . To facilitate review, it will be submitted as a series of incremental PRs:

PR 1: Comprehensive Benchmark Suite

  • Add benches/in_list_optimizations.rs covering all optimization paths

PR 2: Modular StaticFilter Architecture

  • Refactor InListExpr to use trait-based StaticFilter architecture
  • Add NestedTypeFilter as fallback for complex types

PR 3: Bitmap Filter for UInt8 (stack-based)

  • Add BitmapFilter<U8Config> (stack-allocated, 32 bytes)

PR 4: Bitmap Filter for UInt16 (heap-based)

  • Add BitmapFilter<U16Config> (heap-allocated, 8KB)

PR 5: Zero-Copy Type Reinterpretation

  • Add reinterpret_any_primitive_to<T>() for buffer reinterpretation
  • Enable Int8/Int16 to use bitmap filters

PR 6: Branchless Filter for Small Primitive Lists

  • Add BranchlessFilter<T, N> with const-generic sizes (0-32)
  • Implement ReinterpretedBranchless for type normalization

PR 7: Direct Probe Hash Filter for Large Primitive Lists

  • Add DirectProbeFilter<T> with open-addressing hash table
  • Implement DirectProbeHashable trait with golden-ratio hash

PR 8: ByteView Two-Stage Filter (Utf8View/BinaryView)

  • Add ByteViewMaskedFilter<T> for Utf8View/BinaryView
  • Implement masked view quick rejection + HashTable verification

PR 9: Utf8/LargeUtf8 Two-Stage Filter

  • Add Utf8TwoStageFilter<O> for Utf8/LargeUtf8
  • Implement length+prefix encoding as i128

Additional Context

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions