@norberttech norberttech commented Jul 13, 2025

Resolves: #697

Change Log


Added

  • Support for Delta Binary Packed encoding in Parquet

Fixed

Changed

Removed

Deprecated

Security

Delta Binary Packed applies to Int32 and Int64 columns, which means it also covers Timestamps and other logical types stored as int32/int64.

The goal is to significantly reduce output file size by storing only the deltas between consecutive integers in a column rather than the integers themselves.

This is especially useful for incremental datasets, for example orders.
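
As a rough illustration of why this helps, here is a simplified Python sketch of the delta idea. It is not the code from this PR and not the exact Parquet DELTA_BINARY_PACKED layout, which additionally splits values into blocks and miniblocks and bit-packs them. The function names and the sample order ids are hypothetical.

```python
def delta_encode(values):
    """Return (first_value, min_delta, adjusted_deltas) for a list of ints."""
    if not values:
        return None, 0, []
    # Differences between consecutive values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    # Subtracting the smallest delta keeps every stored number non-negative.
    min_delta = min(deltas, default=0)
    adjusted = [d - min_delta for d in deltas]
    return values[0], min_delta, adjusted


def delta_decode(first, min_delta, adjusted):
    """Rebuild the original list from the output of delta_encode."""
    if first is None:
        return []
    out = [first]
    for d in adjusted:
        out.append(out[-1] + d + min_delta)
    return out


def bit_width(adjusted):
    """Bits needed to store the largest adjusted delta."""
    return max(adjusted, default=0).bit_length()


# Hypothetical incremental order ids.
order_ids = [1000, 1001, 1002, 1004, 1005]
first, min_delta, adjusted = delta_encode(order_ids)
print(first, min_delta, adjusted)  # 1000 1 [0, 0, 1, 0]
print(bit_width(adjusted))         # 1
```

In this toy case each remaining value fits in 1 bit instead of the 32 or 64 bits of a plain integer, which is where the size reduction for slowly growing columns comes from.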

@norberttech norberttech linked an issue Jul 13, 2025 that may be closed by this pull request

codecov bot commented Jul 13, 2025

Codecov Report

Attention: Patch coverage is 85.54455% with 73 lines in your changes missing coverage. Please review.

Project coverage is 81.79%. Comparing base (b58621d) to head (c2b4eb7).
Report is 5 commits behind head on 1.x.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##              1.x    #1764      +/-   ##
==========================================
+ Coverage   81.68%   81.79%   +0.10%     
==========================================
  Files         718      725       +7     
  Lines       20279    20771     +492     
==========================================
+ Hits        16565    16989     +424     
- Misses       3714     3782      +68     
+------------------------------+-----------------------------+
| Components                   | Coverage Δ                  |
+------------------------------+-----------------------------+
| etl                          | 88.41% <ø> (ø)              |
| cli                          | 85.46% <ø> (ø)              |
| lib-array-dot                | 94.56% <ø> (ø)              |
| lib-azure-sdk                | 61.35% <ø> (ø)              |
| lib-doctrine-dbal-bulk       | 93.88% <ø> (ø)              |
| lib-filesystem               | 78.02% <ø> (ø)              |
| lib-types                    | 53.55% <ø> (ø)              |
| lib-parquet                  | 85.56% <85.54%> (+0.10%) ⬆️ |
| lib-parquet-viewer           | 83.11% <ø> (ø)              |
| lib-snappy                   | 89.76% <ø> (-0.47%) ⬇️      |
| bridge-filesystem-async-aws  | 90.38% <ø> (ø)              |
| bridge-filesystem-azure      | 89.92% <ø> (ø)              |
| bridge-monolog-http          | 97.04% <ø> (ø)              |
| bridge-openapi-specification | 93.16% <ø> (ø)              |
| symfony-http-foundation      | 74.41% <ø> (ø)              |
| adapter-chartjs              | 86.70% <ø> (ø)              |
| adapter-csv                  | 88.85% <ø> (ø)              |
| adapter-doctrine             | 89.89% <ø> (ø)              |
| adapter-elasticsearch        | 97.23% <ø> (ø)              |
| adapter-google-sheet         | 83.87% <ø> (ø)              |
| adapter-http                 | 58.10% <ø> (ø)              |
| adapter-json                 | 87.98% <ø> (ø)              |
| adapter-logger               | 53.84% <ø> (ø)              |
| adapter-meilisearch          | 97.95% <ø> (ø)              |
| adapter-parquet              | 78.92% <ø> (ø)              |
| adapter-text                 | 84.44% <ø> (ø)              |
| adapter-xml                  | 82.73% <ø> (ø)              |
+------------------------------+-----------------------------+

github-actions bot commented Jul 13, 2025

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+------------------------+------+-----+-----------------+------------------+-----------------+
| benchmark             | subject                | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------+------------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k      | 1    | 3   | 4.865mb +0.08%  | 439.630ms +0.85% | ±0.30% -51.34%  |
| ExcelExtractorBench   | bench_extract_10k_ods  | 1    | 3   | 65.560mb +0.01% | 1.065s +1.43%    | ±0.45% -56.97%  |
| ExcelExtractorBench   | bench_extract_10k_xlsx | 1    | 3   | 67.660mb +0.01% | 1.690s -0.42%    | ±0.38% -50.93%  |
| JsonExtractorBench    | bench_extract_10k      | 1    | 3   | 5.457mb +0.07%  | 1.143s +0.65%    | ±0.03% -97.41%  |
| ParquetExtractorBench | bench_extract_10k      | 1    | 3   | 10.647mb -0.03% | 9.321s -17.99%   | ±0.84% +143.97% |
| TextExtractorBench    | bench_extract_10k      | 1    | 3   | 4.588mb +0.09%  | 42.683ms +1.99%  | ±1.27% +54.36%  |
| XmlExtractorBench     | bench_extract_10k      | 1    | 3   | 4.573mb +0.09%  | 607.992ms +2.22% | ±0.86% -45.52%  |
+-----------------------+------------------------+------+-----+-----------------+------------------+-----------------+
Transformers
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                       | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 18.584mb +0.02%  | 71.826ms -1.50% | ±0.30% -75.91% |
| RenameEntryTransformerBench     | bench_transform_10k_rows | 1    | 3   | 123.322mb +0.00% | 66.268ms -2.11% | ±1.14% -33.08% |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev         |
+--------------------+----------------+------+-----+------------------+------------------+----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 62.526mb +0.01%  | 83.876ms -3.58%  | ±1.59% -24.18% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 80.607mb +0.00%  | 100.408ms -5.25% | ±0.45% -84.68% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 835.122mb +0.03% | 18.928s -29.86%  | ±0.69% -18.79% |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.890mb +0.02%  | 29.640ms -2.15%  | ±0.34% -67.49% |
+--------------------+----------------+------+-----+------------------+------------------+----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark         | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 106.004mb +0.00% | 644.658ms -1.88% | ±1.20% -36.05%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 55.279mb +0.01%  | 326.411ms -0.97% | ±2.84% +0.92%   |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.865mb +0.03%  | 68.935ms -2.60%  | ±0.95% -10.12%  |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 42.535mb +0.01%  | 403.678ms -1.69% | ±1.49% +39.23%  |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 11.592mb +0.03%  | 81.170ms -1.70%  | ±0.35% -80.85%  |
| RowsBench         | bench_chunk_10_on_10k      | 2    | 3   | 93.475mb +0.00%  | 3.420ms -7.53%   | ±0.34% -75.11%  |
| RowsBench         | bench_diff_left_1k_on_10k  | 2    | 3   | 110.846mb +0.00% | 237.985ms -0.51% | ±0.50% -22.15%  |
| RowsBench         | bench_diff_right_1k_on_10k | 2    | 3   | 93.566mb +0.00%  | 23.679ms -0.30%  | ±0.35% +15.16%  |
| RowsBench         | bench_drop_1k_on_10k       | 2    | 3   | 94.350mb +0.00%  | 1.506ms -14.66%  | ±2.60% -26.56%  |
| RowsBench         | bench_drop_right_1k_on_10k | 2    | 3   | 94.350mb +0.00%  | 1.514ms -9.34%   | ±0.56% -78.74%  |
| RowsBench         | bench_entries_on_10k       | 2    | 3   | 92.510mb +0.00%  | 3.457ms +1.75%   | ±0.50% -84.70%  |
| RowsBench         | bench_filter_on_10k        | 2    | 3   | 93.039mb +0.00%  | 16.954ms +9.85%  | ±1.85% +50.93%  |
| RowsBench         | bench_find_on_10k          | 2    | 3   | 93.039mb +0.00%  | 16.527ms +3.08%  | ±1.47% -39.39%  |
| RowsBench         | bench_find_one_on_10k      | 10   | 3   | 91.728mb +0.00%  | 1.794μs -5.88%   | ±2.67% +9.43%   |
| RowsBench         | bench_first_on_10k         | 10   | 3   | 91.728mb +0.00%  | 0.300μs -25.00%  | ±0.00% -100.00% |
| RowsBench         | bench_flat_map_on_1k       | 2    | 3   | 100.789mb +0.00% | 14.569ms -1.25%  | ±0.47% -82.66%  |
| RowsBench         | bench_map_on_10k           | 2    | 3   | 130.216mb +0.00% | 68.100ms -2.22%  | ±3.21% +214.92% |
| RowsBench         | bench_merge_1k_on_10k      | 2    | 3   | 93.559mb +0.00%  | 1.348ms -2.22%   | ±1.37% -51.96%  |
| RowsBench         | bench_partition_by_on_10k  | 2    | 3   | 96.928mb +0.00%  | 61.820ms -1.61%  | ±0.62% -65.34%  |
| RowsBench         | bench_remove_on_10k        | 2    | 3   | 94.612mb +0.00%  | 3.666ms -6.65%   | ±1.33% -61.45%  |
| RowsBench         | bench_sort_asc_on_1k       | 2    | 3   | 92.090mb +0.00%  | 39.279ms -1.82%  | ±1.42% +73.96%  |
| RowsBench         | bench_sort_by_on_1k        | 2    | 3   | 92.090mb +0.00%  | 39.311ms -2.32%  | ±0.98% +198.05% |
| RowsBench         | bench_sort_desc_on_1k      | 2    | 3   | 92.090mb +0.00%  | 39.701ms -1.31%  | ±2.13% +47.75%  |
| RowsBench         | bench_sort_entries_on_1k   | 2    | 3   | 94.171mb +0.00%  | 8.047ms -1.86%   | ±0.24% -84.05%  |
| RowsBench         | bench_sort_on_1k           | 2    | 3   | 91.921mb +0.00%  | 29.222ms -2.05%  | ±0.84% -50.79%  |
| RowsBench         | bench_take_1k_on_10k       | 10   | 3   | 91.728mb +0.00%  | 13.841μs -13.11% | ±1.82% +23.81%  |
| RowsBench         | bench_take_right_1k_on_10k | 10   | 3   | 91.728mb +0.00%  | 15.543μs -5.99%  | ±2.11% -14.58%  |
| RowsBench         | bench_unique_on_1k         | 2    | 3   | 110.847mb +0.00% | 240.563ms -0.59% | ±0.13% -80.10%  |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
Parquet Library
+--------------------+---------------------------------+------+-----+----------------+-------------------+------------------+
| benchmark          | subject                         | revs | its | mem_peak       | mode              | rstdev           |
+--------------------+---------------------------------+------+-----+----------------+-------------------+------------------+
| ParquetReaderBench | bench_page_headers              | 1    | 3   | 6.656mb +0.07% | 3.339s +0.70%     | ±0.84% -29.29%   |
| ParquetReaderBench | bench_read_metadata             | 1    | 3   | 5.341mb +0.08% | 18.055ms -0.07%   | ±0.37% -25.06%   |
| ParquetReaderBench | bench_read_schema               | 1    | 3   | 5.341mb +0.08% | 18.122ms -0.32%   | ±0.27% +18.84%   |
| ParquetReaderBench | bench_read_values_all_columns   | 1    | 3   | 9.079mb -0.03% | 5.657s -28.22%    | ±0.68% -21.93%   |
| ParquetReaderBench | bench_read_values_single_column | 1    | 3   | 6.377mb -0.05% | 229.631ms -50.28% | ±0.83% +38.75%   |
| ParquetReaderBench | bench_read_values_with_limit    | 1    | 3   | 6.907mb -0.22% | 28.849ms -14.23%  | ±0.28% +27.19%   |
| ParquetWriterBench | bench_write_batch               | 1    | 3   | 9.823mb -3.46% | 164.869ms -17.17% | ±0.68% +4186.18% |
| ParquetWriterBench | bench_write_gzip                | 1    | 3   | 9.785mb +0.10% | 178.742ms +0.88%  | ±1.40% +110.14%  |
| ParquetWriterBench | bench_write_row_by_row          | 1    | 3   | 9.823mb -3.46% | 165.357ms -16.70% | ±1.11% +106.85%  |
| ParquetWriterBench | bench_write_snappy              | 1    | 3   | 9.823mb -3.46% | 164.906ms -17.58% | ±0.12% -41.53%   |
| ParquetWriterBench | bench_write_uncompressed        | 1    | 3   | 9.600mb +0.11% | 163.845ms -0.13%  | ±1.25% +366.20%  |
+--------------------+---------------------------------+------+-----+----------------+-------------------+------------------+

@norberttech norberttech merged commit fba7e26 into 1.x Jul 13, 2025
23 checks passed
@norberttech norberttech deleted the 697-parquet---implement-delta-encoding branch July 13, 2025 15:10

Successfully merging this pull request may close these issues.

Parquet - Implement DELTA encoding
