Skip to content

GH-39789: [Go][Parquet] Close current row group when finished writing unbuffered batch#43326

Merged
zeroshade merged 1 commit intoapache:mainfrom
joellubi:gh-1997-writer
Jul 19, 2024
Merged

GH-39789: [Go][Parquet] Close current row group when finished writing unbuffered batch#43326
zeroshade merged 1 commit intoapache:mainfrom
joellubi:gh-1997-writer

Conversation

@joellubi
Copy link
Copy Markdown
Member

@joellubi joellubi commented Jul 18, 2024

Rationale for this change

Fixes: #39789

The number of bytes reported by FileWriter.RowGroupTotalBytesWritten() was consistently lower than the actual bytes in the output buffer, if it was read before closing the writer. The issue is that the last column's data page was not flushed until the entire writer was closed, causing it's bytes not to be included in the total. By closing the row group writer before returning from Write(), we can ensure all pages are flushed and the totalBytesWritten will be accurate.

What changes are included in this PR?

  • Close row group writer before returning from FileWriter.Write()
  • Test to ensure stats are up to date before closing the writer

Are these changes tested?

Yes

Are there any user-facing changes?

FileWriter.RowGroupTotalBytesWritten() will be accurate when read while still writing to the file.

@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #39789 has been automatically assigned in GitHub to PR creator.

@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 0 benchmarking runs that have been run so far on merge-commit ed67a42.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

zeroshade pushed a commit to apache/arrow-adbc that referenced this pull request Jul 29, 2024
…rgetSize on ingestion (#2026)

Fixes: #1997 

**Core Changes**
- Change ingestion `writeParquet` function to use unbuffered writer,
skipping 0-row records to avoid recurrence of #1847
- Use parquet writer's internal `RowGroupTotalBytesWritten()` method to
track output file size in favor of `limitWriter`
- Unit test to validate that file cutoff occurs precisely when expected

**Secondary Changes**
- Bump arrow dependency to `v18` to pull in the changes from
[ARROW-43326](apache/arrow#43326)
- Fix flightsql test that depends on hardcoded arrow version
lidavidm pushed a commit to adbc-drivers/snowflake that referenced this pull request Oct 9, 2025
…(#2026)

Fixes: #1997 

**Core Changes**
- Change ingestion `writeParquet` function to use unbuffered writer,
skipping 0-row records to avoid recurrence of #1847
- Use parquet writer's internal `RowGroupTotalBytesWritten()` method to
track output file size in favor of `limitWriter`
- Unit test to validate that file cutoff occurs precisely when expected

**Secondary Changes**
- Bump arrow dependency to `v18` to pull in the changes from
[ARROW-43326](apache/arrow#43326)
- Fix flightsql test that depends on hardcoded arrow version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Go][Parquet] Potential inconsistency between TotalBytesWritten tracked by RowGroupWriter and actual bytes written to io.Writer

2 participants