@jqnatividad
Collaborator

    --limit <n>              Limit the number of simultaneously open files.
                             Useful for partitioning large datasets with many
                             unique values to avoid "too many open files" errors.
                             Data is processed in batches until all unique values
                             are processed.
                             If not set, it will be automatically set to the
                             system limit with a 10% safety margin.
                             If set to 0, it will process all data at once,
                             regardless of the system's open files limit.
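The three cases described above (unset, `0`, explicit value) can be sketched as a small resolver. This is only an illustration, not the code in `src/cmd/partition.rs`: the function name `effective_limit` is hypothetical, and the system open-files limit is passed in as a plain number because how it is detected is not shown here.

```rust
/// Hypothetical helper: resolves the `--limit` value into an effective cap
/// on simultaneously open files. `None` as a result means "no cap".
fn effective_limit(requested: Option<usize>, system_limit: usize) -> Option<usize> {
    match requested {
        // Not set: use the system limit minus a 10% safety margin,
        // but never go below one open file.
        None => Some((system_limit * 9 / 10).max(1)),
        // Explicit 0: process everything at once, ignoring the system limit.
        Some(0) => None,
        // Any other value: use it as given.
        Some(n) => Some(n),
    }
}
```

With an assumed system limit of 1024, an unset option resolves to a cap of 921 open files, while `--limit 0` resolves to no cap at all.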

resolves #2959

cc @harto

@jqnatividad jqnatividad requested a review from Copilot September 3, 2025 14:39
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a --limit option to the partition command to control the number of simultaneously open files, addressing the "too many open files" error when partitioning large datasets with many unique values. The implementation processes data in batches when a limit is specified, automatically defaults to 90% of the system limit for safety, and allows unlimited processing when set to 0.

Key changes:

  • Added --limit parameter with automatic system limit detection and validation
  • Refactored partition logic to support batched processing with file limit constraints
  • Added comprehensive test coverage for the new batching functionality
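To make the batching idea concrete, here is a minimal sketch under stated assumptions. It is not the PR's implementation: `partition_in_batches` is a hypothetical name, in-memory buffers stand in for output files, and the strategy (collect the unique keys first, then re-scan the input once per batch of keys) is one plausible way to keep at most `limit` outputs open at a time.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical sketch: groups `(key, record)` rows by key while keeping at
/// most `limit` output buffers "open" at once (`None` means no limit).
/// Returns the grouped records and the number of passes over the input.
fn partition_in_batches(
    rows: &[(&str, &str)],
    limit: Option<usize>,
) -> (BTreeMap<String, Vec<String>>, usize) {
    // Collect the unique partition keys in a deterministic order.
    let keys: Vec<&str> = rows
        .iter()
        .map(|(k, _)| *k)
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect();
    // No limit means a single batch; `max(1)` guards against a zero chunk size.
    let batch_size = limit.unwrap_or(keys.len()).max(1);

    let mut output = BTreeMap::new();
    let mut passes = 0;
    for batch in keys.chunks(batch_size) {
        // "Open" one buffer per key in this batch only.
        let mut open: BTreeMap<&str, Vec<String>> =
            batch.iter().map(|k| (*k, Vec::new())).collect();
        // Re-scan the input, keeping only rows whose key is in this batch.
        for (key, record) in rows {
            if let Some(buf) = open.get_mut(key) {
                buf.push((*record).to_string());
            }
        }
        passes += 1;
        // "Close" the batch's buffers before the next batch is opened.
        for (key, buf) in open {
            output.insert(key.to_string(), buf);
        }
    }
    (output, passes)
}
```

In this sketch the trade-off is explicit: a smaller limit means fewer simultaneously open outputs but more passes over the input, and the grouped result is the same regardless of the limit.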

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
--- | ---
src/cmd/partition.rs | Implements the core batching logic, system limit detection, and refactors existing partition methods to support file limits
tests/test_partition.rs | Adds test case verifying that partition works correctly with file limits while maintaining data integrity

also applied clippy lint below

warning: deref on an immutable reference
   --> src/cmd/partition.rs:296:38
    |
296 |                     wtr.write_record(&*headers)?;
    |                                      ^^^^^^^^^ help: if you would like to reborrow, try removing `&*`: `headers`
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#borrow_deref_ref
    = note: `#[warn(clippy::borrow_deref_ref)]` on by default
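
For readers unfamiliar with this lint, a minimal illustration follows. The function names are hypothetical and a byte slice stands in for the header record type; the point is only that when a value is already an immutable reference, `&*value` dereferences and re-borrows it to produce the same reference, so passing it directly is equivalent.

```rust
/// Stand-in for a writer method that takes a record by reference.
fn record_len(record: &[u8]) -> usize {
    record.len()
}

/// `headers` is already a `&[u8]`, so no reborrow is needed.
fn header_len(headers: &[u8]) -> usize {
    // Flagged form: record_len(&*headers)
    // Equivalent form after applying the lint's suggestion:
    record_len(headers)
}
```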
…word and the r# prefix is just confusing

[skip ci]
@jqnatividad jqnatividad merged commit 28bab5f into master Sep 3, 2025
1 check was pending
@jqnatividad jqnatividad deleted the 2959-partition-limit-option branch September 3, 2025 15:25
Development

Successfully merging this pull request may close these issues.

feat: partition add option to limit number of simultaneously open files

2 participants