[C++] [Parquet] Use std::count in parquet ColumnReader

### Describe the enhancement requested

I found that the for loop here
https://github.com/apache/arrow/blob/7c3480e2f028f5881242f227f42155cf833efee7/cpp/src/parquet/column_reader.cc#L1055-L1073
compiles to

	0xc0c2f0 <ReadLevels()+96>      inc    %rdx
	0xc0c2f3 <ReadLevels()+99>      cmp    %rax,%rdx
	0xc0c2f6 <ReadLevels()+102>     jge    0xc0c30c <ReadLevels()+124>
	0xc0c2f8 <ReadLevels()+104>     cmp    %cx,(%r14,%rdx,2)
	0xc0c2fd <ReadLevels()+109>     jne    0xc0c2f0 <ReadLevels()+96>
	0xc0c2ff <ReadLevels()+111>     incq   0x0(%rbp)
	0xc0c303 <ReadLevels()+115>     mov    (%rbx),%rax
	0xc0c306 <ReadLevels()+118>     jmp    0xc0c2f0 <ReadLevels()+96>

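That means the loop iterates element by element, increments the counter in memory through the pointer with `incq` on every match, and then reloads the loop bound with `mov (%rbx),%rax`.
I think the reason is that `values_to_read` and `num_def_levels` are not marked as restrict, so the compiler has to assume they may alias and cannot optimize the loop into something more efficient (for example, using SIMD).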

In my flame graph, this loop accounted for ~10% of the time spent.
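
As a rough sketch of the proposed change, the tally could be expressed with `std::count`, which accumulates into a local value instead of writing through `values_to_read` on every match. The helper name `CountValuesToRead` is hypothetical and not part of Arrow, and `max_def_level_` is assumed here to be an `int16_t`:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the tallying step in ReadLevels();
// CountValuesToRead is an illustrative name, not an Arrow function.
inline int64_t CountValuesToRead(const int16_t* def_levels, int64_t num_def_levels,
                                 int16_t max_def_level) {
  // The bound is passed by value and the count is kept in a local, so no
  // store through a possibly aliasing pointer happens inside the loop.
  return std::count(def_levels, def_levels + num_def_levels, max_def_level);
}
```

Inside `ReadLevels`, the loop would then reduce to something like `*values_to_read += CountValuesToRead(def_levels, *num_def_levels, this->max_def_level_);`. Whether the compiler actually vectorizes this depends on the compiler and flags, so the generated assembly would need to be checked again.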

### Component(s)

C++, Parquet

	void ReadLevels(int64_t batch_size, int16_t* def_levels, int16_t* rep_levels,
	                int64_t* num_def_levels, int64_t* values_to_read) {
	  batch_size =
	      std::min(batch_size, this->num_buffered_values_ - this->num_decoded_values_);

	  // If the field is required and non-repeated, there are no definition levels
	  if (this->max_def_level_ > 0 && def_levels != nullptr) {
	    *num_def_levels = this->ReadDefinitionLevels(batch_size, def_levels);
	    // TODO(wesm): this tallying of values-to-decode can be performed with better
	    // cache-efficiency if fused with the level decoding.
	    for (int64_t i = 0; i < *num_def_levels; ++i) {
	      if (def_levels[i] == this->max_def_level_) {
	        ++(*values_to_read);
	      }
	    }
	  } else {
	    // Required field, read all values
	    *values_to_read = batch_size;
	  }

[C++] [Parquet] Use std::count in parquet ColumnReader #39398
