
Better memory limiting in parquet ArrowWriter  #5484

@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@DDtKey suggested this in #5457 (review).

Describe the solution you'd like

I still think it would be nice to have an additional config (or method) to "enforce flush on buffer size", to be able to encapsulate this logic away from user code 🤔

The idea is to add an additional option to force the writer to flush when its buffered data hits a certain limit.

Describe alternatives you've considered

The challenge is how to enforce a buffer limit without slowing down encoding. One idea would be to check memory usage after encoding each RecordBatch. This would be imprecise (the writer could go over the limit), as noted by @tustvold, but the overage would be bounded by the size of one RecordBatch (which the user can control).
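As a rough sketch of that alternative, here is what the check could look like in user code today, assuming the in_progress_size and flush methods on ArrowWriter; the 10MB threshold and the batch construction are purely illustrative:

use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));

    // Illustrative input: in practice these batches come from the application
    let batches: Vec<RecordBatch> = (0..1000)
        .map(|_| {
            RecordBatch::try_new(
                schema.clone(),
                vec![Arc::new(Int32Array::from(vec![1, 2, 3])) as ArrayRef],
            )
            .unwrap()
        })
        .collect();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, schema, None)?;

    // Flush when roughly 10MB of encoded data is buffered in the writer
    let target: usize = 10 * 1024 * 1024;

    for batch in &batches {
        writer.write(batch)?;
        // Imprecise: the check only runs after the whole batch has been
        // encoded, so the buffer can overshoot by up to one RecordBatch
        if writer.in_progress_size() > target {
            // Close the current row group and write it to the output
            writer.flush()?;
        }
    }
    writer.close()?;
    Ok(())
}

A with_target_buffer_size style option would essentially move this check inside write so users don't have to hand-roll it.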

This might look like adding something like the following to the ArrowWriter:

let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None)
    .unwrap()
    // flush when buffered parquet data exceeds 10MB
    .with_target_buffer_size(10 * 1024 * 1024);

Since not all of the parquet writers buffer their data like this, I don't think it makes sense to put the buffer size on the WriterProperties struct.

Additional context
@tustvold documented the current behavior in more detail in #5457.

Labels

enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)
