@norberttech
Member

Resolves: #xxx

Change Log


Added

Fixed

  • Adjusted parquet default values for page/row group size

Changed

Removed

Deprecated

Security

I noticed that with the old defaults (and the new approach to memory management), parquet files become pretty big.

The benchmark below helped me choose the new defaults:

<?php

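// Convert a byte count into a human-readable string, e.g. 1536 -> "1.5 KB".
function formatBytes($bytes, $precision = 2) {
    $units = array('B', 'KB', 'MB', 'GB', 'TB');

    // Use >= so that exactly 1024 bytes is reported as "1 KB" rather than "1024 B".
    for ($i = 0; $bytes >= 1024 && $i < count($units) - 1; $i++) {
        $bytes /= 1024;
    }

    return round($bytes, $precision) . ' ' . $units[$i];
}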

$testCases = [
    // Current defaults
    ['page_size_kb' => 8, 'row_group_mb' => 4, 'name' => 'Current Default'],

    // Moderate increases
    ['page_size_kb' => 64, 'row_group_mb' => 16, 'name' => 'Moderate (64KB/16MB)'],
    ['page_size_kb' => 128, 'row_group_mb' => 16, 'name' => 'Moderate_v1 (128KB/16MB)'],
    ['page_size_kb' => 128, 'row_group_mb' => 32, 'name' => 'Moderate_v2 (128KB/32MB)'],
    ['page_size_kb' => 256, 'row_group_mb' => 64, 'name' => 'Large (256KB/64MB)'],

    // Aggressive increases
    ['page_size_kb' => 512, 'row_group_mb' => 128, 'name' => 'Very Large (512KB/128MB)'],
    ['page_size_kb' => 1024, 'row_group_mb' => 128, 'name' => 'Maximum (1MB/128MB)'],
];

$rows = 100_000;
echo "Testing with {$rows} rows:\n";
echo str_repeat("=", 80) . "\n";

foreach ($testCases as $test) {
    echo "\nTesting: {$test['name']}\n";
    echo str_repeat("-", 40) . "\n";

    $pageSize = $test['page_size_kb'] * 1024;
    $rowGroupSize = $test['row_group_mb'] * 1024 * 1024;

    $report = df()
        ->read(new FakeStaticOrdersExtractor($rows))
        ->drop('enum')
        ->mode(overwrite())
        ->write(
            to_parquet(__DIR__.'/test_orders.parquet', compressions: Compressions::SNAPPY)
                ->withSchema(FakeStaticOrdersExtractor::schema())
                ->withOptions(
                    Options::default()
                        ->set(Option::PAGE_SIZE_BYTES, $pageSize)
                        ->set(Option::ROW_GROUP_SIZE_BYTES, $rowGroupSize)
                        ->set(Option::VALIDATE_DATA, false)
                )
        )
        ->run(analyze: analyze());

    $fileSize = filesize(__DIR__.'/test_orders.parquet');
    $processingTime = $report->statistics()->executionTime->highResolutionTime->toString();
    $memoryUsage = $report->statistics()->memory->max()->inMb();

    echo "  File size: " . formatBytes($fileSize) . "\n";
    echo "  Memory usage: {$memoryUsage} MB\n";
    echo "  Processing time: {$processingTime}\n";
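    // Parquet size divided by the ~40MB CSV baseline: values below 1 mean the parquet file is smaller than the CSV.
    echo "  Compression ratio: " . round($fileSize / (40 * 1024 * 1024), 2) . "x vs CSV\n";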

    unlink(__DIR__.'/test_orders.parquet');
}

echo "\n" . str_repeat("=", 80) . "\n";
echo "CSV baseline: ~40MB\n";
echo "Target: <10MB file size, <20MB memory usage\n";

This produced the following output:

Testing with 100000 rows:
================================================================================

Testing: Current Default
----------------------------------------
  File size: 208.23 MB
  Memory usage: 18.87 MB
  Processing time: 13.200125875s
  Compression ratio: 5.21x vs CSV

Testing: Moderate (64KB/16MB)
----------------------------------------
  File size: 130.14 MB
  Memory usage: 27.26 MB
  Processing time: 14.493385583s
  Compression ratio: 3.25x vs CSV

Testing: Moderate_v1 (128KB/16MB)
----------------------------------------
  File size: 128.41 MB
  Memory usage: 27.26 MB
  Processing time: 16.052906084s
  Compression ratio: 3.21x vs CSV

Testing: Moderate_v2 (128KB/32MB)
----------------------------------------
  File size: 3.98 MB
  Memory usage: 27.26 MB
  Processing time: 16.862200541s
  Compression ratio: 0.1x vs CSV

Testing: Large (256KB/64MB)
----------------------------------------
  File size: 3.95 MB
  Memory usage: 29.36 MB
  Processing time: 21.26919125s
  Compression ratio: 0.1x vs CSV

Testing: Very Large (512KB/128MB)
----------------------------------------
  File size: 3.94 MB
  Memory usage: 35.65 MB
  Processing time: 30.702356542s
  Compression ratio: 0.1x vs CSV

Testing: Maximum (1MB/128MB)
----------------------------------------
  File size: 3.94 MB
  Memory usage: 48.23 MB
  Processing time: 48.204672083s
  Compression ratio: 0.1x vs CSV

================================================================================
CSV baseline: ~40MB
Target: <10MB file size, <20MB memory usage
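
To make the trade-off in the output above easier to read, here is a minimal sketch that checks each configuration against the two stated targets. The figures are copied from the output (times rounded to two decimals); the variable names (`$results`, `$maxFileMb`, `$maxMemoryMb`, `$fastest`) are hypothetical, and the sketch does not claim to show which values this PR adopted as defaults.

```php
<?php

declare(strict_types=1);

// Figures copied from the benchmark output above:
// file size and peak memory in MB, processing time in seconds (rounded).
$results = [
    ['name' => 'Current Default',          'file_mb' => 208.23, 'memory_mb' => 18.87, 'time_s' => 13.20],
    ['name' => 'Moderate (64KB/16MB)',     'file_mb' => 130.14, 'memory_mb' => 27.26, 'time_s' => 14.49],
    ['name' => 'Moderate_v1 (128KB/16MB)', 'file_mb' => 128.41, 'memory_mb' => 27.26, 'time_s' => 16.05],
    ['name' => 'Moderate_v2 (128KB/32MB)', 'file_mb' => 3.98,   'memory_mb' => 27.26, 'time_s' => 16.86],
    ['name' => 'Large (256KB/64MB)',       'file_mb' => 3.95,   'memory_mb' => 29.36, 'time_s' => 21.27],
    ['name' => 'Very Large (512KB/128MB)', 'file_mb' => 3.94,   'memory_mb' => 35.65, 'time_s' => 30.70],
    ['name' => 'Maximum (1MB/128MB)',      'file_mb' => 3.94,   'memory_mb' => 48.23, 'time_s' => 48.20],
];

// Targets printed at the end of the benchmark script.
$maxFileMb = 10.0;
$maxMemoryMb = 20.0;

// Configurations that satisfy both targets at once.
$meetsBoth = array_values(array_filter(
    $results,
    fn (array $r): bool => $r['file_mb'] < $maxFileMb && $r['memory_mb'] < $maxMemoryMb
));

// Configurations that satisfy only the file size target, fastest first.
$meetsFileSize = array_values(array_filter(
    $results,
    fn (array $r): bool => $r['file_mb'] < $maxFileMb
));
usort($meetsFileSize, fn (array $a, array $b): int => $a['time_s'] <=> $b['time_s']);

$fastest = $meetsFileSize[0]['name'] ?? null;

echo 'Configurations meeting both targets: ' . count($meetsBoth) . "\n";
echo 'Fastest configuration under the file size target: ' . $fastest . "\n";
```

On these numbers no configuration meets both targets: the old defaults stay under 20 MB of memory but produce a 208 MB file, while every configuration with a row group of 32 MB or more produces a file of about 4 MB at 27 MB of memory or more.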

@codecov

codecov bot commented Jul 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.79%. Comparing base (bf700fc) to head (06bbd6e).
Report is 1 commit behind head on 1.x.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##              1.x    #1774   +/-   ##
=======================================
  Coverage   81.78%   81.79%           
=======================================
  Files         726      726           
  Lines       20829    20835    +6     
=======================================
+ Hits        17035    17041    +6     
  Misses       3794     3794           
Components Coverage Δ
etl 88.41% <ø> (ø)
cli 85.46% <ø> (ø)
lib-array-dot 94.56% <ø> (ø)
lib-azure-sdk 61.35% <ø> (ø)
lib-doctrine-dbal-bulk 93.88% <ø> (ø)
lib-filesystem 78.02% <ø> (ø)
lib-types 53.55% <ø> (ø)
lib-parquet 85.50% <100.00%> (+0.01%) ⬆️
lib-parquet-viewer 83.11% <ø> (ø)
lib-snappy 89.76% <ø> (ø)
bridge-filesystem-async-aws 90.38% <ø> (ø)
bridge-filesystem-azure 89.92% <ø> (ø)
bridge-monolog-http 97.04% <ø> (ø)
bridge-openapi-specification 93.16% <ø> (ø)
symfony-http-foundation 74.41% <ø> (ø)
adapter-chartjs 86.70% <ø> (ø)
adapter-csv 88.85% <ø> (ø)
adapter-doctrine 89.89% <ø> (ø)
adapter-elasticsearch 97.23% <ø> (ø)
adapter-google-sheet 83.87% <ø> (ø)
adapter-http 58.10% <ø> (ø)
adapter-json 87.98% <ø> (ø)
adapter-logger 53.84% <ø> (ø)
adapter-meilisearch 97.95% <ø> (ø)
adapter-parquet 78.92% <ø> (ø)
adapter-text 84.44% <ø> (ø)
adapter-xml 82.73% <ø> (ø)

@github-actions
Contributor

github-actions bot commented Jul 17, 2025

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+------------------------+------+-----+-----------------+------------------+-----------------+
| benchmark             | subject                | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------+------------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k      | 1    | 3   | 4.871mb -0.02%  | 435.292ms -1.43% | ±0.09% -79.59%  |
| ExcelExtractorBench   | bench_extract_10k_ods  | 1    | 3   | 65.566mb -0.00% | 1.068s -0.04%    | ±0.68% -35.53%  |
| ExcelExtractorBench   | bench_extract_10k_xlsx | 1    | 3   | 67.666mb -0.00% | 1.684s -0.23%    | ±0.26% -42.39%  |
| JsonExtractorBench    | bench_extract_10k      | 1    | 3   | 5.463mb -0.02%  | 1.133s -1.26%    | ±0.33% -77.66%  |
| ParquetExtractorBench | bench_extract_10k      | 1    | 3   | 10.670mb -0.19% | 9.216s -19.55%   | ±0.56% +23.17%  |
| TextExtractorBench    | bench_extract_10k      | 1    | 3   | 4.593mb -0.02%  | 41.733ms +0.38%  | ±0.99% +409.88% |
| XmlExtractorBench     | bench_extract_10k      | 1    | 3   | 4.579mb -0.02%  | 601.821ms +0.59% | ±0.82% +43.84%  |
+-----------------------+------------------------+------+-----+-----------------+------------------+-----------------+
Transformers
+---------------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
| benchmark                       | subject                  | revs | its | mem_peak         | mode            | rstdev           |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 18.590mb -0.01%  | 73.560ms +0.40% | ±1.76% +1448.68% |
| RenameEntryTransformerBench     | bench_transform_10k_rows | 1    | 3   | 123.328mb -0.00% | 67.857ms +1.06% | ±1.59% +27.36%   |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 62.532mb -0.00%  | 86.058ms +0.88%  | ±0.69% +0.49%   |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 80.613mb -0.00%  | 105.859ms +4.52% | ±1.53% +357.32% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 835.158mb -1.86% | 20.342s -24.34%  | ±0.61% +85.61%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.896mb -0.01%  | 29.699ms +1.07%  | ±1.19% -18.08%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+------------------+
| benchmark         | subject                    | revs | its | mem_peak         | mode             | rstdev           |
+-------------------+----------------------------+------+-----+------------------+------------------+------------------+
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 106.010mb -0.00% | 659.902ms +0.29% | ±0.95% +61.54%   |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 55.284mb -0.00%  | 334.942ms +0.09% | ±0.71% -44.72%   |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.870mb -0.01%  | 70.658ms +0.83%  | ±1.49% +92.42%   |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 42.540mb -0.00%  | 409.653ms +2.57% | ±0.74% -16.70%   |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 11.598mb -0.01%  | 80.946ms -0.27%  | ±0.98% +1019.21% |
| RowsBench         | bench_chunk_10_on_10k      | 2    | 3   | 93.481mb -0.00%  | 3.683ms +11.62%  | ±3.57% +378.04%  |
| RowsBench         | bench_diff_left_1k_on_10k  | 2    | 3   | 110.851mb -0.00% | 238.236ms -0.25% | ±0.29% -62.72%   |
| RowsBench         | bench_diff_right_1k_on_10k | 2    | 3   | 93.571mb -0.00%  | 24.298ms +1.49%  | ±1.08% +98.07%   |
| RowsBench         | bench_drop_1k_on_10k       | 2    | 3   | 94.355mb -0.00%  | 1.741ms +32.37%  | ±2.47% +136.12%  |
| RowsBench         | bench_drop_right_1k_on_10k | 2    | 3   | 94.355mb -0.00%  | 1.678ms +24.37%  | ±0.78% -49.44%   |
| RowsBench         | bench_entries_on_10k       | 2    | 3   | 92.516mb -0.00%  | 3.557ms +6.01%   | ±2.43% +57.33%   |
| RowsBench         | bench_filter_on_10k        | 2    | 3   | 93.045mb -0.00%  | 17.432ms +3.43%  | ±1.61% +94.85%   |
| RowsBench         | bench_find_on_10k          | 2    | 3   | 93.045mb -0.00%  | 17.626ms +5.51%  | ±2.17% +41.62%   |
| RowsBench         | bench_find_one_on_10k      | 10   | 3   | 91.734mb -0.00%  | 2.000μs +11.49%  | ±0.00% -100.00%  |
| RowsBench         | bench_first_on_10k         | 10   | 3   | 91.734mb -0.00%  | 0.400μs 0.00%    | ±0.00% 0.00%     |
| RowsBench         | bench_flat_map_on_1k       | 2    | 3   | 100.794mb -0.00% | 14.984ms +4.16%  | ±3.48% +1128.44% |
| RowsBench         | bench_map_on_10k           | 2    | 3   | 130.222mb -0.00% | 73.813ms +8.99%  | ±1.21% +43.38%   |
| RowsBench         | bench_merge_1k_on_10k      | 2    | 3   | 93.565mb -0.00%  | 1.497ms +32.56%  | ±1.80% -23.43%   |
| RowsBench         | bench_partition_by_on_10k  | 2    | 3   | 96.934mb -0.00%  | 63.149ms +5.13%  | ±1.11% +339.49%  |
| RowsBench         | bench_remove_on_10k        | 2    | 3   | 94.618mb -0.00%  | 4.046ms +15.66%  | ±3.21% +145.48%  |
| RowsBench         | bench_sort_asc_on_1k       | 2    | 3   | 92.096mb -0.00%  | 40.321ms +1.25%  | ±1.46% +45.84%   |
| RowsBench         | bench_sort_by_on_1k        | 2    | 3   | 92.096mb -0.00%  | 41.043ms +3.16%  | ±1.02% +57.28%   |
| RowsBench         | bench_sort_desc_on_1k      | 2    | 3   | 92.096mb -0.00%  | 40.753ms +2.78%  | ±1.21% +204.13%  |
| RowsBench         | bench_sort_entries_on_1k   | 2    | 3   | 94.177mb -0.00%  | 8.071ms +2.13%   | ±2.07% +60.22%   |
| RowsBench         | bench_sort_on_1k           | 2    | 3   | 91.927mb -0.00%  | 29.346ms +1.72%  | ±0.87% +27.95%   |
| RowsBench         | bench_take_1k_on_10k       | 10   | 3   | 91.734mb -0.00%  | 15.333μs +8.33%  | ±3.45% +9.18%    |
| RowsBench         | bench_take_right_1k_on_10k | 10   | 3   | 91.734mb -0.00%  | 17.100μs +7.44%  | ±0.95% -68.85%   |
| RowsBench         | bench_unique_on_1k         | 2    | 3   | 110.852mb -0.00% | 243.139ms +0.81% | ±0.15% -63.14%   |
+-------------------+----------------------------+------+-----+------------------+------------------+------------------+
Parquet Library
+--------------------+---------------------------------+------+-----+----------------+-------------------+-----------------+
| benchmark          | subject                         | revs | its | mem_peak       | mode              | rstdev          |
+--------------------+---------------------------------+------+-----+----------------+-------------------+-----------------+
| ParquetReaderBench | bench_page_headers              | 1    | 3   | 6.668mb -0.02% | 3.327s -0.39%     | ±0.47% +73.65%  |
| ParquetReaderBench | bench_read_metadata             | 1    | 3   | 5.353mb -0.02% | 18.189ms +0.02%   | ±0.46% +96.45%  |
| ParquetReaderBench | bench_read_schema               | 1    | 3   | 5.353mb -0.02% | 18.256ms +0.73%   | ±0.89% +66.68%  |
| ParquetReaderBench | bench_read_values_all_columns   | 1    | 3   | 9.102mb -0.22% | 5.618s -28.97%    | ±0.56% -25.19%  |
| ParquetReaderBench | bench_read_values_single_column | 1    | 3   | 6.400mb -0.31% | 230.481ms -49.71% | ±0.11% -55.48%  |
| ParquetReaderBench | bench_read_values_with_limit    | 1    | 3   | 6.930mb -0.47% | 29.108ms -12.82%  | ±1.21% +103.06% |
| ParquetWriterBench | bench_write_batch               | 1    | 3   | 9.855mb +1.73% | 194.017ms -2.55%  | ±0.20% -36.55%  |
| ParquetWriterBench | bench_write_gzip                | 1    | 3   | 9.816mb +5.40% | 219.824ms +23.03% | ±0.54% -19.78%  |
| ParquetWriterBench | bench_write_row_by_row          | 1    | 3   | 9.855mb +1.73% | 192.057ms -2.84%  | ±0.52% -33.87%  |
| ParquetWriterBench | bench_write_snappy              | 1    | 3   | 9.855mb +1.73% | 192.069ms -3.86%  | ±1.13% +466.78% |
| ParquetWriterBench | bench_write_uncompressed        | 1    | 3   | 9.632mb +4.04% | 192.264ms +17.49% | ±0.81% +113.78% |
+--------------------+---------------------------------+------+-----+----------------+-------------------+-----------------+

@norberttech norberttech merged commit b65cc2e into 1.x Jul 18, 2025
21 checks passed
@norberttech norberttech deleted the bug/parquet-balance-default-values branch July 18, 2025 08:26
