Skip to content

Improve LIKE performance for Dictionary arrays #11032

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

We have a predicate like col LIKE <...> where col is a DictionaryArray. The filter ends up looking like this:

FilterExec: CAST(col@3 AS Utf8) LIKE '%CONST%' elapsed_compute=38.62343607s]

This is non ideal (in our case this requires many seconds of time) because it will copy each dictionary value column into a StringArray before it applies LIKE

Here is an self contained reproducer:

create or replace table strings as values
  ('FooBar'),
  ('Foo'),
  ('Foo'),
  ('Bar'),
  ('FooBar'),
  ('Bar'),
  ('Baz');
  
create or replace table dict_table as
select arrow_cast(column1, 'Dictionary(Int32, Utf8)') as column1
from strings;

> select column1, arrow_typeof(column1) from dict_table;
+---------+----------------------------------+
| column1 | arrow_typeof(dict_table.column1) |
+---------+----------------------------------+
| FooBar  | Dictionary(Int32, Utf8)          |
| Foo     | Dictionary(Int32, Utf8)          |
| Foo     | Dictionary(Int32, Utf8)          |
| Bar     | Dictionary(Int32, Utf8)          |
| FooBar  | Dictionary(Int32, Utf8)          |
| Bar     | Dictionary(Int32, Utf8)          |
| Baz     | Dictionary(Int32, Utf8)          |
+---------+----------------------------------+
7 row(s) fetched.
Elapsed 0.002 seconds.

The query gets the correct anser:

> select column1 from dict_table where column1 LIKE '%oo%';
+---------+
| column1 |
+---------+
| FooBar  |
| Foo     |
| Foo     |
| FooBar  |
+---------+
4 row(s) fetched.
Elapsed 0.010 seconds.

However, the plan shows it is casting to Utf8 which is quite expensive:

> explain select column1 from dict_table where column1 LIKE '%oo%';
+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| logical_plan  | Filter: CAST(dict_table.column1 AS Utf8) LIKE Utf8("%oo%") |
|               |   TableScan: dict_table projection=[column1]               |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                |
|               |   FilterExec: CAST(column1@0 AS Utf8) LIKE %oo%            | <--- NOTE THIS HAS A CAST
|               |     MemoryExec: partitions=1, partition_sizes=[1]          |
|               |                                                            |
+---------------+------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.004 seconds.

Describe the solution you'd like

I would like LIKE to operate much faster on DictionaryArrays (without a cast)

Describe alternatives you've considered

What would likely be much faster is done is to evaluate LIKE on the dictionary values first to find any matching keys, and then look up the values

There is code in the arrow like kernels that appears to handle dictionary values, so this "just may work" if we update the coercion rules to avoid the cast.

https://docs.rs/arrow-string/52.0.0/src/arrow_string/like.rs.html#122-126

However, we may also have to implement something like

// find all keys in the dictionary
let matching_values = like(dictionary_array.values(), scalar);
// figure out which values in the dictionary match by callking take
let dict_indicies = dictionary_array.keys();
// now, leave the overall result here 
let like_result = take(matching_values, dict_indices);

Additional context

Note potentially related to LIKE for StringViewArray -- see #11024

Also potentially related to #5222

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions