ColumnReader ReadBatch ignores definition levels when they weren't requested to output #39381

@Hattonuri

Description

Describe the bug, including details regarding any error messages, version, and platform.

I write optional ints and read them back with parquet::Int64Reader, calling
reader->ReadBatch(kBatchSizeOne, nullptr, nullptr, &tmp, &values_read)
(the same way parquet::StreamReader does).
This call always sets values_read to 1 (and advances the reader's internal value index). It should set values_read to 0 for nulls and 1 for non-null values.
When I change the nullptr passed for def_levels to a pointer to a real variable (and ignore what is written to it), everything works fine.
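A minimal sketch of that workaround, assuming the nullptr being replaced is the def_levels argument. ReadOneOptional and FakeOptionalReader are hypothetical names introduced only for illustration. The helper is a template so that it compiles without Parquet; it assumes the reader exposes ReadBatch(batch_size, def_levels, rep_levels, values, values_read), as parquet::Int64Reader does.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper, not part of Parquet. It always passes a real
// def_levels pointer so that nulls are reported through values_read.
// Note: an empty result is also produced at end of data, so a caller
// would check for remaining rows first.
template <typename ReaderT>
std::optional<int64_t> ReadOneOptional(ReaderT& reader) {
  int16_t def_level = 0;  // requested only to trigger level decoding
  int64_t value = 0;
  int64_t values_read = 0;
  reader.ReadBatch(/*batch_size=*/1, &def_level, /*rep_levels=*/nullptr,
                   &value, &values_read);
  if (values_read == 0) return std::nullopt;
  return value;
}

// Hypothetical stand-in for a column reader over an optional int64 column.
// It only exists so the sketch can be exercised without a Parquet file.
struct FakeOptionalReader {
  std::vector<std::optional<int64_t>> cells;
  std::size_t pos = 0;

  int64_t ReadBatch(int64_t batch_size, int16_t* def_levels,
                    int16_t* /*rep_levels*/, int64_t* values,
                    int64_t* values_read) {
    *values_read = 0;
    int64_t levels = 0;
    while (levels < batch_size && pos < cells.size()) {
      const bool present = cells[pos].has_value();
      if (def_levels != nullptr) def_levels[levels] = present ? 1 : 0;
      if (present) values[(*values_read)++] = *cells[pos];
      ++pos;
      ++levels;
    }
    return levels;  // number of levels (rows) consumed
  }
};
```

With a real parquet::Int64Reader the same helper should apply unchanged, under the signature assumption stated above.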

I assume that the issue is here

  if (this->max_def_level_ > 0 && def_levels != nullptr) {
    *num_def_levels = this->ReadDefinitionLevels(batch_size, def_levels);
    // TODO(wesm): this tallying of values-to-decode can be performed with better
    // cache-efficiency if fused with the level decoding.
    for (int64_t i = 0; i < *num_def_levels; ++i) {
      if (def_levels[i] == this->max_def_level_) {
        ++(*values_to_read);
      }
    }
  } else {
    // Required field, read all values
    *values_to_read = batch_size;
  }

This code is used in ReadBatch.
The else branch is wrong for my case: def_levels == nullptr does not mean that the field is required, only that the caller did not ask for the levels.
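To make the suspected problem concrete, here is a small self-contained model of the counting logic. It is a sketch, not the Arrow implementation: DecodeLevels, ValuesToReadAsQuoted and ValuesToReadWithScratch are hypothetical names, and decoding into a scratch buffer when def_levels is null is only one possible fix, not necessarily the one the project would choose.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for ReadDefinitionLevels: copies up to batch_size
// stored levels into out and returns how many were copied.
int64_t DecodeLevels(const std::vector<int16_t>& stored, int64_t batch_size,
                     int16_t* out) {
  assert(batch_size >= 0);
  const int64_t n = std::min(batch_size, static_cast<int64_t>(stored.size()));
  std::copy_n(stored.begin(), n, out);
  return n;
}

// Same branching as the quoted snippet: a null def_levels falls into the
// "required field" branch even when max_def_level > 0.
int64_t ValuesToReadAsQuoted(const std::vector<int16_t>& stored,
                             int16_t max_def_level, int64_t batch_size,
                             int16_t* def_levels) {
  if (max_def_level > 0 && def_levels != nullptr) {
    const int64_t n = DecodeLevels(stored, batch_size, def_levels);
    return std::count(def_levels, def_levels + n, max_def_level);
  }
  return batch_size;
}

// Assumed alternative: branch only on max_def_level and decode into a
// scratch buffer when the caller did not provide one.
int64_t ValuesToReadWithScratch(const std::vector<int16_t>& stored,
                                int16_t max_def_level, int64_t batch_size,
                                int16_t* def_levels) {
  if (max_def_level == 0) return batch_size;  // truly required field
  std::vector<int16_t> scratch;
  if (def_levels == nullptr) {
    scratch.resize(static_cast<std::size_t>(batch_size));
    def_levels = scratch.data();
  }
  const int64_t n = DecodeLevels(stored, batch_size, def_levels);
  return std::count(def_levels, def_levels + n, max_def_level);
}
```

For stored levels {1, 0, 1} with a maximum definition level of 1, the quoted logic counts every row as a value when def_levels is null, while the scratch-buffer variant counts only the rows whose level equals the maximum.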

Component(s)

C++, Parquet
