-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Our project (IOx) uses dictionary arrays heavily to compress arrays of strings (that are mostly the same value). When DictionaryArrays are concatenated together the contents of the dictionaries are also concatenated resulting in duplication and some dictionary entries which are unused.
Similarly, when filtering a DictionaryArray some values may be filtered entirely and thus the dictionary may include values that are repeated.
Describe the solution you'd like
- Add an
optimize_dictionariesfunction which would return a potentially new array such that its dictionary had the following properties:
a) Every value in the dictionary is unique
b) Every value in the dictionary has at least one use in the array' values
So given an input like
Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C, D, E, F, A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}
The output of optimize_dictionaroes would be (note no duplication in dictionary)
Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}
- Add a function such as
concat_and_optimize_batchesthat would do the array compaction as part ofconcatenation(and thus avoid creating an intermediateconcatresult which was then optimized)
The solution described in #504 is designed to avoid copying dictionaries when they are identical (aka literally point at the same array)
Describe alternatives you've considered
An alternative might be to extend concat_batches directly to always try and compact arrays. However, since this involves a tradeoff of more CPU for potentially less memory usage, I am not sure it is appropriate to always apply.
Additional context
@tustvold has implemented a version of this code in IOx: https://github.com/influxdata/influxdb_iox/pull/1830 which could be suitable for inclusion in arrow
See also #504,