Skip to content

"Optimize" Dictionary contents in DictionaryArray / concat_batches #506

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Our project (IOx) uses dictionary arrays heavily to compress arrays of strings (that are mostly the same value). When DictionaryArrays are concatenated together the contents of the dictionaries are also concatenated resulting in duplication and some dictionary entries which are unused.

Similarly, when filtering a DictionaryArray some values may be filtered entirely and thus the dictionary may include values that are repeated.

Describe the solution you'd like

  1. Add an optimize_dictionaries function which would return a potentially new array such that its dictionary had the following properties:
    a) Every value in the dictionary is unique
    b) Every value in the dictionary has at least one use in the array' values

So given an input like

Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C, D, E, F, A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}

The output of optimize_dictionaroes would be (note no duplication in dictionary)
Array: "A, C, A, B, C, C, A"
Dictionary: {A, B, C}
Values: {0, 2, 0, 1, 2, 2, 0}

  1. Add a function such as concat_and_optimize_batches that would do the array compaction as part of concatenation (and thus avoid creating an intermediate concat result which was then optimized)

The solution described in #504 is designed to avoid copying dictionaries when they are identical (aka literally point at the same array)

Describe alternatives you've considered
An alternative might be to extend concat_batches directly to always try and compact arrays. However, since this involves a tradeoff of more CPU for potentially less memory usage, I am not sure it is appropriate to always apply.

Additional context

@tustvold has implemented a version of this code in IOx: https://github.com/influxdata/influxdb_iox/pull/1830 which could be suitable for inclusion in arrow

See also #504,

Metadata

Metadata

Assignees

No one assigned

    Labels

    arrowChanges to the arrow crateenhancementAny new improvement worthy of a entry in the changelog

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions