[C#] Performance issue of reading StringArray #41047

@keshen-msft

Description

Describe the enhancement requested

The general Zero Copy principle does not hold for StringArray in the C# library. The value buffer is UTF-8 encoded, while C# strings use UTF-16, so reading each value requires a UTF-8 decode. This is especially costly when reading a DictionaryArray with a string value type: the value array is guaranteed to store unique strings, yet the StringArray API forces reader code to re-decode the string at each encountered offset, over and over. In our profiling of a RecordBatch containing both a dictionary-encoded string column and an int column, the dominant CPU cost was in StringArray.GetString() calls, compared to reading the int column.
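To illustrate the cost pattern, here is a minimal self-contained sketch (using a plain byte buffer and offset array as stand-ins for the Arrow layout, not the actual Arrow C# API). The naive path decodes UTF-8 on every row access even when the dictionary index repeats; memoizing one decode per unique dictionary entry avoids the redundant work:

```csharp
using System;
using System.Text;

class DictionaryDecodeSketch
{
    // Hypothetical stand-ins for a StringArray's layout: a UTF-8 value
    // buffer plus offsets, and per-row indices into the dictionary.
    static readonly byte[] ValueBuffer = Encoding.UTF8.GetBytes("applebananacherry");
    static readonly int[] Offsets = { 0, 5, 11, 17 };       // 3 unique values
    static readonly int[] Indices = { 0, 1, 0, 2, 1, 0 };   // 6 rows

    // Naive per-row read: decodes UTF-8 on every access,
    // even when the same dictionary entry repeats across rows.
    static string GetStringNaive(int dictIndex) =>
        Encoding.UTF8.GetString(ValueBuffer, Offsets[dictIndex],
                                Offsets[dictIndex + 1] - Offsets[dictIndex]);

    // Memoized read: decode each unique dictionary entry exactly once.
    static string[] BuildCache()
    {
        var cache = new string[Offsets.Length - 1];
        for (int i = 0; i < cache.Length; i++)
            cache[i] = GetStringNaive(i);
        return cache;
    }

    static void Main()
    {
        var cache = BuildCache();
        foreach (int row in Indices)
            Console.WriteLine(cache[row]);  // no per-row decode
    }
}
```

With the current StringArray API, only the naive path is available to callers reading a dictionary's value array row by row.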

The C++ library, on the other hand, does not have this issue because the Arrow C++ API exposes std::string and std::string_view, which work with UTF-8 natively.
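An analogous approach is possible in C# with ReadOnlySpan&lt;byte&gt; views over the raw UTF-8 buffer, which allow slicing and comparing values without decoding or allocating, much like std::string_view. A sketch, again using a plain buffer and offsets as stand-ins rather than the Arrow C# API:

```csharp
using System;
using System.Text;

class SpanViewSketch
{
    static void Main()
    {
        // A UTF-8 value buffer with offsets, mimicking how a StringArray
        // stores its data (hypothetical stand-ins, not the Arrow API).
        byte[] valueBuffer = Encoding.UTF8.GetBytes("foobarfoo");
        int[] offsets = { 0, 3, 6, 9 };

        // Slice values as ReadOnlySpan<byte>: no decoding, no allocation,
        // similar in spirit to std::string_view over the raw buffer.
        ReadOnlySpan<byte> v0 = valueBuffer.AsSpan(offsets[0], offsets[1] - offsets[0]);
        ReadOnlySpan<byte> v2 = valueBuffer.AsSpan(offsets[2], offsets[3] - offsets[2]);

        // Byte-wise comparison works directly on the UTF-8 data.
        Console.WriteLine(v0.SequenceEqual(v2));  // prints "True"
    }
}
```

Exposing span-based accessors like this on StringArray would let callers defer (or skip) UTF-16 decoding entirely.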

Component(s)

C#
