Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@DDtKey suggested in #5457 (review):
Describe the solution you'd like
I still think would be nice to have an additional config(or method) to "enforce flush on buffer size". To be able to encapsulate this logic for user's code 🤔
The idea is to add an additional option to force the writer to flush when its buffered data hits a certain limit.
Describe alternatives you've considered
The challenge is how to enforce buffer limiting without slowing down encoding. One idea would be to check memory usage after encoding each RecordBatch. This would be imprecise (the writer could go over the limit), as noted by @tustvold, but the overage would be bounded by the size of one RecordBatch (which the user could control).
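For reference, here is a minimal sketch of doing this check by hand in user code today. It assumes the `in_progress_size` and `flush` methods on `ArrowWriter` from recent parquet releases; the 10MB threshold and the toy batch are purely illustrative:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let batch = RecordBatch::try_from_iter([(
        "x",
        Arc::new(Int64Array::from(vec![1_i64, 2, 3])) as ArrayRef,
    )])?;

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None)?;

    // Flush whenever the in-progress row group exceeds ~10MB of buffered data
    const TARGET: usize = 10 * 1024 * 1024;

    for _ in 0..1_000 {
        writer.write(&batch)?;
        // The check happens only after the batch is fully encoded, so the
        // buffered size can overshoot TARGET by at most one RecordBatch
        if writer.in_progress_size() > TARGET {
            writer.flush()?;
        }
    }
    writer.close()?;
    Ok(())
}
```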
This might look like adding something like the following to the ArrowWriter:

```rust
let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None)
    .unwrap()
    // flush when buffered parquet data exceeds 10MB
    .with_target_buffer_size(10 * 1024 * 1024);
```

Since not all the parquet writers buffer their data like this, I think it doesn't make sense to put the buffer size on the WriterProperties struct.
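For illustration only, the logic such an option would encapsulate could look roughly like the helper below. The function name and signature are hypothetical, not part of the parquet API, and it again assumes the existing `in_progress_size` and `flush` methods:

```rust
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Hypothetical helper: write a batch, then flush the current row group if the
/// buffered data exceeds `target` bytes. The proposed option would encapsulate
/// a check like this inside the writer itself.
fn write_with_target<W: std::io::Write + Send>(
    writer: &mut ArrowWriter<W>,
    batch: &RecordBatch,
    target: usize,
) -> Result<()> {
    writer.write(batch)?;
    if writer.in_progress_size() > target {
        writer.flush()?;
    }
    Ok(())
}
```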
Additional context
@tustvold documented the current behavior better in #5457