Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFusion has an optimized version of concat(col1, ...) for strings (I believe contributed by @JasonLi-cn) that avoids:
- Copying strings multiple times
- Manipulating nulls unnecessarily
To do this today, we added StringArrayBuilder, which is similar to, but not the same as, StringBuilder in arrow:
https://github.com/apache/datafusion/blob/4838cfbf453f3c21d9c5a84f9577329dd78aa763/datafusion/functions/src/string/common.rs#L354-L417
The major differences are:
- You can call write to incrementally build up each string and then call append_offset to create each string. StringBuilder requires each input to be a single contiguous string to call https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html#method.append_value
- You can call finish() with the specified null buffer (rather than building it up incrementally)
Describe the solution you'd like
I think it is worth figuring out how to create a similar API for StringBuilder
Incrementally write values
Here is one ideal suggestion of how to write values that I think would be relatively easy to use:
let mut builder = StringBuilder::with_capacity(...);
// scope for lifetime
{
    // get something that implements std::io::Write
    let mut writeable = builder.writeable();
    write!(writeable, "foo"); // append "foo" to the in-progress string
    write!(writeable, "bar"); // append "bar" to the in-progress string
} // scope close, finishes the string "foobar"

Similarly, adding a finish_with_nulls(..) type function that took a NullBuffer would be beneficial if the caller already knew about nulls.
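To make the two ideas concrete (incremental writes into the in-progress string, and finishing with a null mask the caller already has), here is a minimal, standard-library-only sketch. Everything in it is hypothetical and for illustration only: ToyStringBuilder, demo, and the Vec<bool> validity mask are stand-ins, not the arrow StringBuilder, the DataFusion StringArrayBuilder, or NullBuffer. It also assumes std::fmt::Write rather than std::io::Write, since the values being written are text.

```rust
use std::fmt::Write;

/// Toy stand-in for an Arrow-style string builder (hypothetical type,
/// not part of arrow or DataFusion).
struct ToyStringBuilder {
    /// All string bytes stored back to back.
    values: String,
    /// offsets[i]..offsets[i + 1] delimits the bytes of string i.
    offsets: Vec<usize>,
}

impl ToyStringBuilder {
    fn new() -> Self {
        Self { values: String::new(), offsets: vec![0] }
    }

    /// Close the in-progress string by recording the current end offset.
    fn append_offset(&mut self) {
        self.offsets.push(self.values.len());
    }

    /// Finish using a validity mask supplied by the caller, instead of
    /// tracking nulls on every append. `valid[i] == false` marks row i null.
    fn finish_with_nulls(self, valid: Vec<bool>) -> Vec<Option<String>> {
        assert_eq!(valid.len(), self.offsets.len() - 1);
        self.offsets
            .windows(2)
            .zip(valid)
            .map(|(w, v)| v.then(|| self.values[w[0]..w[1]].to_string()))
            .collect()
    }
}

/// Implementing fmt::Write lets callers use `write!` to append pieces
/// directly to the in-progress string, with no intermediate String.
impl Write for ToyStringBuilder {
    fn write_str(&mut self, s: &str) -> std::fmt::Result {
        self.values.push_str(s);
        Ok(())
    }
}

/// Builds three rows: "foobar", a null, and "a-1".
fn demo() -> Vec<Option<String>> {
    let mut builder = ToyStringBuilder::new();
    write!(builder, "foo").unwrap();
    write!(builder, "bar").unwrap();
    builder.append_offset(); // row 0 is "foobar"
    builder.append_offset(); // row 1 is an empty slot, marked null below
    write!(builder, "{}-{}", "a", 1).unwrap();
    builder.append_offset(); // row 2 is "a-1"
    builder.finish_with_nulls(vec![true, false, true])
}
```

In this sketch the builder itself is the write target and append_offset ends the row explicitly. The scoped writeable() guard proposed above would instead end the row when the guard is dropped; which of the two fits StringBuilder better is an open design question.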
Describe alternatives you've considered
We could choose not to do this at all (or just keep the code downstream in DataFusion).
Additional context