GH-39336: [C++][Parquet] Minor: Style enhancement for parquet::FileMetaData#39337
GH-39336: [C++][Parquet] Minor: Style enhancement for parquet::FileMetaData#39337mapleFU merged 7 commits intoapache:mainfrom
Conversation
|
|
pitrou
left a comment
There was a problem hiding this comment.
LGTM, but see docstring suggestions
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
|
comment fixed now |
cpp/src/parquet/metadata.cc
Outdated
| // Discard row groups that are not in the subset | ||
| output_metadata->num_rows = 0; | ||
| output_metadata->row_groups.clear(); |
There was a problem hiding this comment.
Note:
Here will introduce a more round for copying metadata_->row_groups.
Should I:
bool isset_rowgroup = metadata->isset.row_groups;
metadata->isset.row_groups = false;
auto row_groups = std::move(metadata->row_groups);
// Copying thrift
// ...
// Assign row_groups back
metadata->row_groups = std::move(row_groups);
metadata->isset.row_groups = ...
// .. Handle RowGroups
| const std::shared_ptr<const KeyValueMetadata>& key_value_metadata) { | ||
| int64_t total_rows = 0; | ||
| for (auto row_group : row_groups_) { | ||
| for (const auto& row_group : row_groups_) { |
There was a problem hiding this comment.
This is another problem I found during reading the code
cpp/src/parquet/metadata.cc
Outdated
| { | ||
| // GH-39336: Copy the metadata, but only the row groups we need. | ||
| auto temp_row_groups = std::move(this->metadata_->row_groups); | ||
| try { |
There was a problem hiding this comment.
I think make_unique and copying might throw exception( like std::bad_alloc), and even when ex is thrown, we'd better keep this not changed (otherwise it would be hard to debugging?)
There was a problem hiding this comment.
I'm not sure if it worths this kind of complexity to handle the exception just in order to avoid the copy.
cpp/src/parquet/metadata.cc
Outdated
| metadata->schema = metadata_->schema; | ||
| { | ||
| // GH-39336: Copy the metadata, but only the row groups we need. | ||
| auto temp_row_groups = std::move(this->metadata_->row_groups); |
There was a problem hiding this comment.
What is the point of doing this? Overall, are these changes actually bringing a tangible benefit?
There was a problem hiding this comment.
The origin code don't copying the previous row-group. If we have a large file with many row-groups (it's possible), the more around of copy waste a lot of cpu time. So I do this as an optimization.
There was a problem hiding this comment.
My problem is that you're temporarily mutating this in a method that's not supposed to mutate it. This is brittle (look at that try/except) and also thread-unsafe.
There was a problem hiding this comment.
You're right, the thread-safety is a problem here. Maybe keeping the current code is ok
|
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit f15a700. There were 3 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
…FileMetaData (apache#39337) ### Rationale for this change Enhance the style of `parquet::FileMetaData::Subset`. ### Are these changes tested? Already tested ### Are there any user-facing changes? no * Closes: apache#39336 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: mwish <1506118561@qq.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Signed-off-by: mwish <maplewish117@gmail.com>
Rationale for this change
Enhance the style of
parquet::FileMetaData::Subset.Are these changes tested?
Already tested
Are there any user-facing changes?
no