Improve the CSVExtractor by removing duplicated operations
#1665
Conversation
Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from the 1.x branch.

Extractors
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
| CSVExtractorBench | bench_extract_10k | 1 | 3 | 4.776mb -0.01% | 417.596ms -28.25% | ±0.57% -7.90% |
| ExcelExtractorBench | bench_extract_10k_ods | 1 | 3 | 65.486mb +0.00% | 1.044s -2.68% | ±0.65% +170.21% |
| ExcelExtractorBench | bench_extract_10k_xlsx | 1 | 3 | 67.532mb +0.00% | 1.670s -0.38% | ±0.20% -58.57% |
| JsonExtractorBench | bench_extract_10k | 1 | 3 | 5.018mb +0.00% | 1.284s +0.31% | ±2.71% +475.46% |
| ParquetExtractorBench | bench_extract_10k | 1 | 3 | 86.321mb +0.00% | 921.218ms -0.33% | ±0.33% -14.05% |
| TextExtractorBench | bench_extract_10k | 1 | 3 | 4.499mb +0.01% | 38.380ms -1.73% | ±0.26% -76.39% |
| XmlExtractorBench | bench_extract_10k | 1 | 3 | 4.494mb +0.01% | 604.170ms -0.01% | ±0.07% -77.08% |
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
Transformers
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 123.236mb +0.00% | 66.514ms +1.53% | ±0.81% +4.55% |
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 18.498mb +0.00% | 72.976ms -0.43% | ±0.17% -52.31% |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| CSVLoaderBench | bench_load_10k | 1 | 3 | 62.435mb -0.00% | 85.168ms -4.23% | ±0.94% +20.56% |
| JsonLoaderBench | bench_load_10k | 1 | 3 | 79.706mb +0.00% | 96.908ms -0.57% | ±1.09% -43.40% |
| ParquetLoaderBench | bench_load_10k | 1 | 3 | 165.387mb +0.00% | 20.705s -0.73% | ±0.10% -69.05% |
| TextLoaderBench | bench_load_10k | 1 | 3 | 17.805mb +0.00% | 30.994ms -2.80% | ±0.31% -0.25% |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 101.784mb +0.00% | 648.727ms -1.19% | ±0.67% -43.26% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 53.134mb +0.00% | 329.581ms +1.98% | ±0.96% +130.07% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 14.384mb +0.00% | 68.681ms -5.04% | ±0.66% -76.11% |
| RowsBench | bench_chunk_10_on_10k | 2 | 3 | 93.389mb +0.00% | 3.516ms -5.99% | ±2.91% -16.13% |
| RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 110.758mb +0.00% | 235.154ms -0.44% | ±0.34% -67.90% |
| RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 93.478mb +0.00% | 23.450ms -3.43% | ±0.97% -25.93% |
| RowsBench | bench_drop_1k_on_10k | 2 | 3 | 94.264mb +0.00% | 1.689ms +4.65% | ±3.75% +5.03% |
| RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 94.264mb +0.00% | 1.583ms -6.10% | ±2.93% +30.87% |
| RowsBench | bench_entries_on_10k | 2 | 3 | 92.424mb +0.00% | 3.431ms -6.01% | ±2.23% -1.16% |
| RowsBench | bench_filter_on_10k | 2 | 3 | 92.953mb +0.00% | 16.304ms -2.26% | ±2.97% +138.88% |
| RowsBench | bench_find_on_10k | 2 | 3 | 92.953mb +0.00% | 15.472ms -3.45% | ±0.51% -23.66% |
| RowsBench | bench_find_one_on_10k | 10 | 3 | 91.642mb +0.00% | 2.000μs +0.30% | ±0.00% -100.00% |
| RowsBench | bench_first_on_10k | 10 | 3 | 91.642mb +0.00% | 0.400μs -20.00% | ±0.00% +0.00% |
| RowsBench | bench_flat_map_on_1k | 2 | 3 | 100.703mb +0.00% | 14.648ms -7.26% | ±0.55% -48.40% |
| RowsBench | bench_map_on_10k | 2 | 3 | 130.130mb +0.00% | 67.425ms -3.83% | ±1.01% -19.47% |
| RowsBench | bench_merge_1k_on_10k | 2 | 3 | 93.473mb +0.00% | 1.526ms +1.88% | ±0.61% -78.41% |
| RowsBench | bench_partition_by_on_10k | 2 | 3 | 96.841mb +0.00% | 62.472ms -1.39% | ±0.14% -82.36% |
| RowsBench | bench_remove_on_10k | 2 | 3 | 94.526mb +0.00% | 4.166ms +8.75% | ±2.86% -18.40% |
| RowsBench | bench_sort_asc_on_1k | 2 | 3 | 92.003mb +0.00% | 39.951ms -2.00% | ±0.82% -73.92% |
| RowsBench | bench_sort_by_on_1k | 2 | 3 | 92.004mb +0.00% | 40.135ms +0.25% | ±1.31% -43.94% |
| RowsBench | bench_sort_desc_on_1k | 2 | 3 | 92.003mb +0.00% | 39.377ms -4.71% | ±1.98% +25.67% |
| RowsBench | bench_sort_entries_on_1k | 2 | 3 | 94.085mb +0.00% | 8.285ms +0.49% | ±0.75% +35.89% |
| RowsBench | bench_sort_on_1k | 2 | 3 | 91.835mb +0.00% | 29.668ms -0.04% | ±2.58% +57.62% |
| RowsBench | bench_take_1k_on_10k | 10 | 3 | 91.642mb +0.00% | 14.382μs -2.03% | ±0.99% -14.08% |
| RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 91.642mb +0.00% | 17.290μs +4.49% | ±2.64% -18.17% |
| RowsBench | bench_unique_on_1k | 2 | 3 | 110.759mb +0.00% | 239.544ms +1.49% | ±1.30% +221.91% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 42.070mb +0.00% | 430.178ms +2.05% | ±0.44% -33.82% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 11.448mb +0.00% | 85.267ms -0.21% | ±0.83% +57.51% |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@ Coverage Diff @@
## 1.x #1665 +/- ##
==========================================
- Coverage 82.08% 82.08% -0.01%
==========================================
Files 703 703
Lines 19064 19059 -5
==========================================
- Hits 15649 15644 -5
Misses 3415 3415
What are the performance benefits of this?
I would say it's worth considering: a ~25-30% performance boost when reading 10k rows.
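For context, the general pattern behind this kind of change can be sketched as follows. This is an illustrative example only — the function and variable names are hypothetical and this is not the actual CSVExtractor internals — showing how work that is recomputed for every row can be hoisted out of the per-row loop:

```php
<?php

declare(strict_types=1);

// Illustrative sketch only -- not the actual CSVExtractor code.

// Before: per-row invariants (here, the header column count) are
// recomputed on every iteration of the loop.
function extractSlow(array $header, array $rows) : array
{
    $out = [];
    foreach ($rows as $row) {
        if (\count($row) === \count($header)) { // duplicated work each iteration
            $out[] = \array_combine($header, $row);
        }
    }

    return $out;
}

// After: compute the invariant once, outside the loop.
function extractFast(array $header, array $rows) : array
{
    $columns = \count($header); // hoisted out of the loop

    $out = [];
    foreach ($rows as $row) {
        if (\count($row) === $columns) {
            $out[] = \array_combine($header, $row);
        }
    }

    return $out;
}
```

With 10k+ rows, even a cheap duplicated operation per row adds up, which is consistent with the extractor-only speedup in the benchmark table above.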
Ha! It's indeed fast enough to make a difference! 🎉 So here are the results of the following benchmark.

Code used to generate benchmark dataset

<?php
declare(strict_types=1);
use function Flow\ETL\DSL\from_array;
use function Flow\ETL\DSL\data_frame;
use function Flow\ETL\DSL\overwrite;
use function Flow\ETL\Adapter\CSV\to_csv;
use Faker\Factory;
use Flow\ETL\Rows;
include __DIR__ . '/../../../vendor/autoload.php';
$faker = Factory::create();
$skus = [
['sku' => 'SKU_0001', 'name' => 'Product 1', 'price' => $faker->randomFloat(2, 0, 500)],
['sku' => 'SKU_0002', 'name' => 'Product 2', 'price' => $faker->randomFloat(2, 0, 500)],
['sku' => 'SKU_0003', 'name' => 'Product 3', 'price' => $faker->randomFloat(2, 0, 500)],
['sku' => 'SKU_0004', 'name' => 'Product 4', 'price' => $faker->randomFloat(2, 0, 500)],
['sku' => 'SKU_0005', 'name' => 'Product 5', 'price' => $faker->randomFloat(2, 0, 500)],
];
function generateOrders($faker, array $skus, int $count) : \Generator {
for ($i = 0; $i < $count; $i++) {
yield [
'order_id' => $faker->uuid,
'created_at' => $faker->dateTimeThisYear,
'updated_at' => \random_int(0, 1) === 1 ? $faker->dateTimeThisMonth : null,
'discount' => \random_int(0, 1) === 1 ? $faker->randomFloat(2, 0, 50) : null,
'email' => $faker->email,
'customer' => $faker->firstName . ' ' . $faker->lastName,
'address' => [
'street' => $faker->streetAddress,
'city' => $faker->city,
'zip' => $faker->postcode,
'country' => $faker->country,
],
'notes' => \array_map(
static fn($i) => $faker->sentence,
\range(1, $faker->numberBetween(1, 5))
),
'items' => \array_map(
static fn(int $index) => [
'sku' => $skus[$skuIndex = $faker->numberBetween(0, 4)]['sku'],
'quantity' => $faker->numberBetween(1, 10),
'price' => $skus[$skuIndex]['price']
],
\range(1, $faker->numberBetween(1, 4))
),
];
}
}
$ordersSchema = require __DIR__ . '/schema.php';
data_frame()
->read(from_array(generateOrders($faker, $skus, 1_000_000))->withSchema($ordersSchema))
->saveMode(overwrite())
->write(to_csv(__DIR__ . '/dataset/orders.csv'))
->batchSize(10_000)
->run(function (Rows $rows) {
echo "Generated {$rows->count()} rows\n";
});

Schema Code

<?php
use function Flow\ETL\DSL\schema;
use function Flow\ETL\DSL\uuid_schema;
use function Flow\ETL\DSL\datetime_schema;
use function Flow\ETL\DSL\float_schema;
use function Flow\ETL\DSL\str_schema;
use function Flow\ETL\DSL\struct_schema;
use function Flow\Types\DSL\type_structure;
use function Flow\Types\DSL\type_string;
use function Flow\Types\DSL\type_list;
use function Flow\ETL\DSL\list_schema;
use function Flow\Types\DSL\type_integer;
use function Flow\Types\DSL\type_float;
return schema(
uuid_schema('order_id'),
datetime_schema('created_at'),
datetime_schema('updated_at', true),
float_schema('discount', true),
str_schema('email'),
str_schema('customer'),
struct_schema(
'address',
type_structure([
'street' => type_string(),
'city' => type_string(),
'zip' => type_string(),
'country' => type_string(),
])
),
list_schema('notes', type_list(type_string())),
list_schema('items', type_list(
type_structure([
'sku' => type_string(),
'quantity' => type_integer(),
'price' => type_float(),
])
))
);

Benchmark Code

<?php
declare(strict_types=1);
use function Flow\ETL\DSL\data_frame;
use function Flow\ETL\Adapter\CSV\from_csv;
use Flow\ETL\Monitoring\Memory\Consumption;
include __DIR__ . '/../../../vendor/autoload.php';
$schema = require __DIR__ . '/schema.php';
$memory = new Consumption();
$report = data_frame()
->read(from_csv(__DIR__ . '/dataset/orders.csv')->withSchema($schema))
->run(function() use ($memory) {
$memory->current();
}, analyze: true);
echo "Total rows: " . \number_format($report->statistics()->totalRows()) . "\n";
echo "Processing time : {$report->statistics()->executionTime->highResolutionTime->toString()}\n";
echo "Memory Max usage : {$memory->max()->inMb()}Mb\n";Benchmarks executed in nix-shell (the one from monorepo)
With the following php.ini:

```ini
date.timezone = UTC
max_execution_time = 0
error_reporting = 0
display_errors = Off
log_errors = Off
opcache.enable = 0
opcache.enable_cli = 0
realpath_cache_size = 0
zend.assertions = -1
max_input_time = 3600
max_input_nesting_level = 64
memory_limit = -1
post_max_size = 200M
upload_max_filesize = 150M
file_uploads = On
max_file_uploads = 20
short_open_tag = off
```

To make sure the results are stable, I executed each benchmark 3 times on each branch. Branch