[C#] Performance issue of reading StringArray #41047
Description
Describe the enhancement requested
The general principle of zero-copy does not work well for StringArray in the C# library. The value buffer is UTF-8 encoded, while C# strings are UTF-16, so reading each value requires UTF-8 decoding. This is especially bad when reading a DictionaryArray with a string value type: the value array stores unique strings, but the StringArray API forces reader code to decode the same string again every time an index points to it. In our profiling, we tested dictionary-encoded string columns and int columns in the same RecordBatch, and calls to StringArray.GetString() dominated CPU usage compared to reading the int columns.
The C++ library, on the other hand, does not have this issue because the Arrow C++ API exposes std::string and std::string_view, which work with UTF-8 natively.
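To make the request concrete, here is a minimal sketch of the decode-once idea, written against plain arrays rather than the Arrow C# types. It assumes the usual variable-length layout (one UTF-8 value buffer plus an offsets array with one more entry than there are dictionary values) and ignores nulls. The class name `DictionaryDecodeSketch` and the method `DecodeRows` are hypothetical and not part of any library.

```csharp
using System;
using System.Text;

public static class DictionaryDecodeSketch
{
    // Hypothetical helper: decodes every dictionary entry exactly once,
    // then resolves each row by index lookup instead of re-decoding UTF-8.
    // Assumes no null entries and that offsets has (dictionary length + 1) items.
    public static string[] DecodeRows(byte[] valueBuffer, int[] offsets, int[] indices)
    {
        int dictionaryLength = offsets.Length - 1;

        // One UTF-8 decode per unique dictionary value.
        var decoded = new string[dictionaryLength];
        for (int i = 0; i < dictionaryLength; i++)
        {
            int start = offsets[i];
            int length = offsets[i + 1] - start;
            decoded[i] = Encoding.UTF8.GetString(valueBuffer, start, length);
        }

        // Rows share the already-decoded string instances.
        var rows = new string[indices.Length];
        for (int row = 0; row < indices.Length; row++)
        {
            rows[row] = decoded[indices[row]];
        }
        return rows;
    }
}
```

For example, with the value buffer holding the UTF-8 bytes of "redgreen", offsets { 0, 3, 8 } and indices { 0, 1, 1, 0 }, the decoder runs twice rather than four times, and the two "green" rows refer to the same string object. With the real types, the same effect could presumably be had today by decoding each entry of the dictionary's value array once and caching the results on the reader side, but that is a workaround each caller has to write for themselves.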
Component(s)
C#