GH-39965: [C++] DatasetWriter avoid creating zero-sized batch when max_rows_per_file enabled #39995
Conversation
@bkietz comment resolved
```cpp
int num_batches = 0;
AssertBatchesEqual(*MakeBatch(expected_file.start, expected_file.num_rows),
                   *ReadAsBatch(written_file->data, &num_batches));
if (check_num_record_batches) {
```
Can we remove the `bool check_num_record_batches = true` argument?
Sure, this was added in my previous patch #38885. Previously, if `check_num_record_batches = false` was not passed, the check would fail because of the zero-sized batch.
With this patch the zero-sized batch is no longer produced, so I can remove the argument.
OK. Please remove it before we merge this.
Will merge if there are no negative comments by tomorrow.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5f75dbf. There were 10 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
`DatasetWriter` might create an empty `RecordBatch` when `max_rows_per_file` is enabled. This is because `NextWritableChunk` might return a zero-sized batch when the file exactly contains the destination data.
What changes are included in this PR?
Check whether the batch size is 0 before appending it to the file queue.
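The guard can be sketched as follows. This is a minimal illustration only: `Batch` and `EnqueueChunks` are hypothetical stand-ins, not Arrow's actual `DatasetWriter` internals, which operate on `std::shared_ptr<RecordBatch>` inside the writer's queue machinery.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical stand-in for arrow::RecordBatch: only the row count matters here.
struct Batch {
  int64_t num_rows;
};

// Sketch of the fix: when input is sliced against max_rows_per_file, the
// remainder chunk can be empty if the file boundary lands exactly on the end
// of a batch. Skip appending such zero-sized batches to the file queue.
std::deque<Batch> EnqueueChunks(const std::vector<Batch>& chunks) {
  std::deque<Batch> queue;
  for (const auto& chunk : chunks) {
    if (chunk.num_rows == 0) continue;  // the guard this PR adds
    queue.push_back(chunk);
  }
  return queue;
}
```

With this guard, a file that is filled exactly to `max_rows_per_file` no longer ends with an empty trailing row group or batch.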
Are these changes tested?
Yes
Are there any user-facing changes?
Users can avoid zero-sized row groups/batches.
ds.write_dataset #39965