Ability to append non contiguous strings to StringBuilder #6347

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

DataFusion has an optimized version of concat(col1, ...) for strings (I believe contributed by @JasonLi-cn) that avoids:

  1. Copying strings multiple times
  2. Manipulating nulls unnecessarily

To do this today, we added StringArrayBuilder, which is similar to, but not the same as, StringBuilder in arrow:
https://github.com/apache/datafusion/blob/4838cfbf453f3c21d9c5a84f9577329dd78aa763/datafusion/functions/src/string/common.rs#L354-L417

The major differences are:

  1. You can call write repeatedly to build up each string from several pieces, and then call append_offset to mark the end of that string. StringBuilder instead requires each value to be passed as a single contiguous string to append_value: https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html#method.append_value
  2. You can call finish() with a null buffer that the caller already has (rather than building the null buffer up incrementally)
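
To make the two differences above concrete, here is a minimal std-only sketch of the pattern. SketchStringBuilder and concat_columns are hypothetical names made up for illustration, not the DataFusion or arrow types, and validity is modeled as a plain Vec<bool> standing in for an arrow null buffer.

```rust
/// Hypothetical stand-in for a string array builder (NOT the DataFusion or arrow type).
struct SketchStringBuilder {
    /// All string bytes, stored back to back.
    values: Vec<u8>,
    /// offsets[i]..offsets[i + 1] is the byte range of row i.
    offsets: Vec<usize>,
}

impl SketchStringBuilder {
    fn new() -> Self {
        Self { values: Vec::new(), offsets: vec![0] }
    }

    /// Append a fragment to the in-progress string; no row is created yet.
    fn write(&mut self, fragment: &str) {
        self.values.extend_from_slice(fragment.as_bytes());
    }

    /// Close the in-progress string by recording where it ends.
    fn append_offset(&mut self) {
        self.offsets.push(self.values.len());
    }

    /// Finish with validity supplied by the caller (true = valid),
    /// rather than validity tracked row by row during the build.
    fn finish(self, validity: Option<Vec<bool>>) -> Vec<Option<String>> {
        let rows = self.offsets.len() - 1;
        if let Some(v) = &validity {
            assert_eq!(v.len(), rows, "validity length must match row count");
        }
        (0..rows)
            .map(|row| {
                let is_valid = validity.as_ref().map_or(true, |v| v[row]);
                if is_valid {
                    let bytes = &self.values[self.offsets[row]..self.offsets[row + 1]];
                    Some(String::from_utf8_lossy(bytes).into_owned())
                } else {
                    None
                }
            })
            .collect()
    }
}

/// Concatenate two columns row by row; each input byte is copied once.
fn concat_columns(
    col1: &[&str],
    col2: &[&str],
    validity: Option<Vec<bool>>,
) -> Vec<Option<String>> {
    let mut builder = SketchStringBuilder::new();
    for (a, b) in col1.iter().zip(col2.iter()) {
        builder.write(a);
        builder.write(b);
        builder.append_offset();
    }
    builder.finish(validity)
}
```

In this sketch the fragments of each row go straight into the shared value buffer, so there is no intermediate String per row, and the null information is attached once at the end.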

Describe the solution you'd like
I think it is worth figuring out how to create a similar API for StringBuilder

Incrementally write values

Here is one suggestion for how writing values could look, which I think would be relatively easy to use:

use std::io::Write;

let mut builder = StringBuilder::with_capacity(...);
// scope for lifetime
{
  // get something that implements std::io::Write
  // (writeable() is the proposed method; it does not exist yet)
  let mut writeable = builder.writeable();
  write!(writeable, "foo").unwrap(); // append "foo" to the in-progress string
  write!(writeable, "bar").unwrap(); // append "bar" to the in-progress string
} // scope close, finishes the string "foobar"
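
One way the scoped writer could work is a guard type that appends bytes on write and records the end offset when it is dropped. The following is only a std-only sketch under that assumption: SketchStringBuilder, StringWriter, strings and build_example are hypothetical names, and nothing here is the actual arrow API.

```rust
use std::io::Write;

/// Hypothetical builder (NOT arrow's StringBuilder).
struct SketchStringBuilder {
    values: Vec<u8>,
    offsets: Vec<usize>,
}

/// Hypothetical guard returned by `writeable()`; dropping it finishes the string.
struct StringWriter<'a> {
    builder: &'a mut SketchStringBuilder,
}

impl SketchStringBuilder {
    fn new() -> Self {
        Self { values: Vec::new(), offsets: vec![0] }
    }

    /// Borrow the builder mutably for the duration of one string.
    fn writeable(&mut self) -> StringWriter<'_> {
        StringWriter { builder: self }
    }

    /// Materialize the finished rows (for inspection in this sketch only).
    fn strings(&self) -> Vec<String> {
        self.offsets
            .windows(2)
            .map(|w| String::from_utf8_lossy(&self.values[w[0]..w[1]]).into_owned())
            .collect()
    }
}

impl Write for StringWriter<'_> {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        // Assumption: no UTF-8 validation here; a real string builder would need it,
        // because std::io::Write accepts arbitrary bytes.
        self.builder.values.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        Ok(())
    }
}

impl Drop for StringWriter<'_> {
    fn drop(&mut self) {
        // Scope close: everything written since the last offset becomes one row.
        self.builder.offsets.push(self.builder.values.len());
    }
}

fn build_example() -> Vec<String> {
    let mut builder = SketchStringBuilder::new();
    {
        let mut writeable = builder.writeable();
        write!(writeable, "foo").unwrap();
        write!(writeable, "bar").unwrap();
    } // guard dropped here, row "foobar" is finished
    {
        let mut writeable = builder.writeable();
        write!(writeable, "baz").unwrap();
    }
    builder.strings()
}
```

Because std::io::Write takes raw bytes, a string builder built this way would have to validate UTF-8 somewhere; a std::fmt::Write based design, which only accepts &str, might sidestep that, but this is a design consideration rather than a claim about what arrow should do.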

Similarly, adding a finish_with_nulls(..) style function that takes a NullBuffer would be beneficial when the caller already knows which values are null.

Describe alternatives you've considered

We could choose not to do this at all (or just keep the code downstream in DataFusion).

Additional context


Labels: arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)
