@stloyd
Member

@stloyd stloyd commented May 20, 2025

Change Log

Added

Fixed

Changed

  • Improve the `CSVExtractor` by removing duplicated operations

Removed

Deprecated

Security


Description

@github-actions
Contributor

github-actions bot commented May 20, 2025

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
| benchmark             | subject                | revs | its | mem_peak        | mode              | rstdev          |
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k      | 1    | 3   | 4.776mb -0.01%  | 417.596ms -28.25% | ±0.57% -7.90%   |
| ExcelExtractorBench   | bench_extract_10k_ods  | 1    | 3   | 65.486mb +0.00% | 1.044s -2.68%     | ±0.65% +170.21% |
| ExcelExtractorBench   | bench_extract_10k_xlsx | 1    | 3   | 67.532mb +0.00% | 1.670s -0.38%     | ±0.20% -58.57%  |
| JsonExtractorBench    | bench_extract_10k      | 1    | 3   | 5.018mb +0.00%  | 1.284s +0.31%     | ±2.71% +475.46% |
| ParquetExtractorBench | bench_extract_10k      | 1    | 3   | 86.321mb +0.00% | 921.218ms -0.33%  | ±0.33% -14.05%  |
| TextExtractorBench    | bench_extract_10k      | 1    | 3   | 4.499mb +0.01%  | 38.380ms -1.73%   | ±0.26% -76.39%  |
| XmlExtractorBench     | bench_extract_10k      | 1    | 3   | 4.494mb +0.01%  | 604.170ms -0.01%  | ±0.07% -77.08%  |
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
Transformers
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                       | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench     | bench_transform_10k_rows | 1    | 3   | 123.236mb +0.00% | 66.514ms +1.53% | ±0.81% +4.55%  |
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 18.498mb +0.00%  | 72.976ms -0.43% | ±0.17% -52.31% |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode            | rstdev         |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 62.435mb -0.00%  | 85.168ms -4.23% | ±0.94% +20.56% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 79.706mb +0.00%  | 96.908ms -0.57% | ±1.09% -43.40% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 165.387mb +0.00% | 20.705s -0.73%  | ±0.10% -69.05% |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.805mb +0.00%  | 30.994ms -2.80% | ±0.31% -0.25%  |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark         | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 101.784mb +0.00% | 648.727ms -1.19% | ±0.67% -43.26%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 53.134mb +0.00%  | 329.581ms +1.98% | ±0.96% +130.07% |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.384mb +0.00%  | 68.681ms -5.04%  | ±0.66% -76.11%  |
| RowsBench         | bench_chunk_10_on_10k      | 2    | 3   | 93.389mb +0.00%  | 3.516ms -5.99%   | ±2.91% -16.13%  |
| RowsBench         | bench_diff_left_1k_on_10k  | 2    | 3   | 110.758mb +0.00% | 235.154ms -0.44% | ±0.34% -67.90%  |
| RowsBench         | bench_diff_right_1k_on_10k | 2    | 3   | 93.478mb +0.00%  | 23.450ms -3.43%  | ±0.97% -25.93%  |
| RowsBench         | bench_drop_1k_on_10k       | 2    | 3   | 94.264mb +0.00%  | 1.689ms +4.65%   | ±3.75% +5.03%   |
| RowsBench         | bench_drop_right_1k_on_10k | 2    | 3   | 94.264mb +0.00%  | 1.583ms -6.10%   | ±2.93% +30.87%  |
| RowsBench         | bench_entries_on_10k       | 2    | 3   | 92.424mb +0.00%  | 3.431ms -6.01%   | ±2.23% -1.16%   |
| RowsBench         | bench_filter_on_10k        | 2    | 3   | 92.953mb +0.00%  | 16.304ms -2.26%  | ±2.97% +138.88% |
| RowsBench         | bench_find_on_10k          | 2    | 3   | 92.953mb +0.00%  | 15.472ms -3.45%  | ±0.51% -23.66%  |
| RowsBench         | bench_find_one_on_10k      | 10   | 3   | 91.642mb +0.00%  | 2.000μs +0.30%   | ±0.00% -100.00% |
| RowsBench         | bench_first_on_10k         | 10   | 3   | 91.642mb +0.00%  | 0.400μs -20.00%  | ±0.00% +0.00%   |
| RowsBench         | bench_flat_map_on_1k       | 2    | 3   | 100.703mb +0.00% | 14.648ms -7.26%  | ±0.55% -48.40%  |
| RowsBench         | bench_map_on_10k           | 2    | 3   | 130.130mb +0.00% | 67.425ms -3.83%  | ±1.01% -19.47%  |
| RowsBench         | bench_merge_1k_on_10k      | 2    | 3   | 93.473mb +0.00%  | 1.526ms +1.88%   | ±0.61% -78.41%  |
| RowsBench         | bench_partition_by_on_10k  | 2    | 3   | 96.841mb +0.00%  | 62.472ms -1.39%  | ±0.14% -82.36%  |
| RowsBench         | bench_remove_on_10k        | 2    | 3   | 94.526mb +0.00%  | 4.166ms +8.75%   | ±2.86% -18.40%  |
| RowsBench         | bench_sort_asc_on_1k       | 2    | 3   | 92.003mb +0.00%  | 39.951ms -2.00%  | ±0.82% -73.92%  |
| RowsBench         | bench_sort_by_on_1k        | 2    | 3   | 92.004mb +0.00%  | 40.135ms +0.25%  | ±1.31% -43.94%  |
| RowsBench         | bench_sort_desc_on_1k      | 2    | 3   | 92.003mb +0.00%  | 39.377ms -4.71%  | ±1.98% +25.67%  |
| RowsBench         | bench_sort_entries_on_1k   | 2    | 3   | 94.085mb +0.00%  | 8.285ms +0.49%   | ±0.75% +35.89%  |
| RowsBench         | bench_sort_on_1k           | 2    | 3   | 91.835mb +0.00%  | 29.668ms -0.04%  | ±2.58% +57.62%  |
| RowsBench         | bench_take_1k_on_10k       | 10   | 3   | 91.642mb +0.00%  | 14.382μs -2.03%  | ±0.99% -14.08%  |
| RowsBench         | bench_take_right_1k_on_10k | 10   | 3   | 91.642mb +0.00%  | 17.290μs +4.49%  | ±2.64% -18.17%  |
| RowsBench         | bench_unique_on_1k         | 2    | 3   | 110.759mb +0.00% | 239.544ms +1.49% | ±1.30% +221.91% |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 42.070mb +0.00%  | 430.178ms +2.05% | ±0.44% -33.82%  |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 11.448mb +0.00%  | 85.267ms -0.21%  | ±0.83% +57.51%  |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@codecov

codecov bot commented May 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.08%. Comparing base (981b9ba) to head (d9d65d8).
Report is 2 commits behind head on 1.x.

Additional details and impacted files
@@            Coverage Diff             @@
##              1.x    #1665      +/-   ##
==========================================
- Coverage   82.08%   82.08%   -0.01%     
==========================================
  Files         703      703              
  Lines       19064    19059       -5     
==========================================
- Hits        15649    15644       -5     
  Misses       3415     3415              
Components Coverage Δ
etl 88.27% <ø> (ø)
cli 84.42% <ø> (ø)
lib-array-dot 94.53% <ø> (ø)
lib-azure-sdk 62.56% <ø> (ø)
lib-doctrine-dbal-bulk 90.11% <ø> (ø)
lib-filesystem 78.02% <ø> (ø)
lib-parquet 84.37% <ø> (ø)
lib-parquet-viewer 82.02% <ø> (ø)
lib-snappy 90.69% <ø> (-0.47%) ⬇️
bridge-filesystem-async-aws 90.38% <ø> (ø)
bridge-filesystem-azure 89.92% <ø> (ø)
bridge-monolog-http 96.38% <ø> (ø)
symfony-http-foundation 74.41% <ø> (ø)
adapter-chartjs 86.45% <ø> (ø)
adapter-csv 90.18% <100.00%> (+0.18%) ⬆️
adapter-doctrine 89.69% <ø> (ø)
adapter-elasticsearch 97.19% <ø> (ø)
adapter-google-sheet 83.87% <ø> (ø)
adapter-http 59.15% <ø> (ø)
adapter-json 90.62% <ø> (ø)
adapter-logger 53.84% <ø> (ø)
adapter-meilisearch 97.75% <ø> (ø)
adapter-parquet 78.42% <ø> (ø)
adapter-text 84.44% <ø> (ø)
adapter-xml 83.15% <ø> (ø)

@norberttech
Member

what are the performance benefits of this?

@stloyd stloyd force-pushed the improve-csv-extractor branch from 60b5542 to d9d65d8 Compare May 20, 2025 17:50
@stloyd
Member Author

stloyd commented May 20, 2025

I would say a ~25-30% performance boost when reading 10k rows is worth considering.

@stloyd stloyd marked this pull request as ready for review May 20, 2025 17:51
@stloyd stloyd requested a review from norberttech as a code owner May 20, 2025 17:51
@norberttech
Member

Ha! It's indeed fast enough to make a difference! 🎉

So here are the results of the following benchmark.
(From now on, this is how I'm going to ask contributors to prepare benchmark code for optimization PRs.)

Code used to generate benchmark dataset
<?php

declare(strict_types=1);

use function Flow\ETL\DSL\from_array;
use function Flow\ETL\DSL\data_frame;
use function Flow\ETL\DSL\overwrite;
use function Flow\ETL\Adapter\CSV\to_csv;
use Faker\Factory;
use Flow\ETL\Rows;

include __DIR__ . '/../../../vendor/autoload.php';

$faker = Factory::create();

$skus = [
    ['sku' => 'SKU_0001', 'name' => 'Product 1', 'price' => $faker->randomFloat(2, 0, 500)],
    ['sku' => 'SKU_0002', 'name' => 'Product 2', 'price' => $faker->randomFloat(2, 0, 500)],
    ['sku' => 'SKU_0003', 'name' => 'Product 3', 'price' => $faker->randomFloat(2, 0, 500)],
    ['sku' => 'SKU_0004', 'name' => 'Product 4', 'price' => $faker->randomFloat(2, 0, 500)],
    ['sku' => 'SKU_0005', 'name' => 'Product 5', 'price' => $faker->randomFloat(2, 0, 500)],
];

function generateOrders($faker, array $skus, int $count) : \Generator {
    for ($i = 0; $i < $count; $i++) {
        yield [
            'order_id' => $faker->uuid,
            'created_at' => $faker->dateTimeThisYear,
            'updated_at' => \random_int(0, 1) === 1 ? $faker->dateTimeThisMonth : null,
            'discount' => \random_int(0, 1) === 1 ? $faker->randomFloat(2, 0, 50) : null,
            'email' => $faker->email,
            'customer' => $faker->firstName . ' ' . $faker->lastName,
            'address' => [
                'street' => $faker->streetAddress,
                'city' => $faker->city,
                'zip' => $faker->postcode,
                'country' => $faker->country,
            ],
            'notes' => \array_map(
                static fn($i) => $faker->sentence,
                \range(1, $faker->numberBetween(1, 5))
            ),
            'items' => \array_map(
                static fn(int $index) => [
                    'sku' => $skus[$skuIndex = $faker->numberBetween(1, 4)]['sku'],
                    'quantity' => $faker->numberBetween(1, 10),
                    'price' => $skus[$skuIndex]['price']
                ],
                \range(1, $faker->numberBetween(1, 4))
            ),
        ];
    }
}

$ordersSchema = require __DIR__ . '/schema.php';

data_frame()
    ->read(from_array(generateOrders($faker, $skus, 1_000_000))->withSchema($ordersSchema))
    ->saveMode(overwrite())
    ->write(to_csv(__DIR__ . '/dataset/orders.csv'))
    ->batchSize(10_000)
    ->run(function (Rows $rows) {
        echo "Generated {$rows->count()} rows\n";
    });
Schema Code
<?php

use function Flow\ETL\DSL\schema;
use function Flow\ETL\DSL\uuid_schema;
use function Flow\ETL\DSL\datetime_schema;
use function Flow\ETL\DSL\float_schema;
use function Flow\ETL\DSL\str_schema;
use function Flow\ETL\DSL\struct_schema;
use function Flow\Types\DSL\type_structure;
use function Flow\Types\DSL\type_string;
use function Flow\Types\DSL\type_list;
use function Flow\ETL\DSL\list_schema;
use function Flow\Types\DSL\type_integer;
use function Flow\Types\DSL\type_float;

return schema(
    uuid_schema('order_id'),
    datetime_schema('created_at'),
    datetime_schema('updated_at', true),
    float_schema('discount', true),
    str_schema('email'),
    str_schema('customer'),
    struct_schema(
        'address',
        type_structure([
            'street' => type_string(),
            'city' => type_string(),
            'zip' => type_string(),
            'country' => type_string(),
        ])
    ),
    list_schema('notes', type_list(type_string())),
    list_schema('items', type_list(
        type_structure([
            'sku' => type_string(),
            'quantity' => type_integer(),
            'price' => type_float(),
        ])
    ))
);
Benchmark Code
<?php

declare(strict_types=1);

use function Flow\ETL\DSL\data_frame;
use function Flow\ETL\Adapter\CSV\from_csv;
use Flow\ETL\Monitoring\Memory\Consumption;

include __DIR__ . '/../../../vendor/autoload.php';

$schema = require __DIR__ . '/schema.php';

$memory = new Consumption();
$report = data_frame()
    ->read(from_csv(__DIR__ . '/dataset/orders.csv')->withSchema($schema))
    ->run(function() use ($memory) {
        $memory->current();
    }, analyze: true);

echo "Total rows: " . \number_format($report->statistics()->totalRows()) . "\n";
echo "Processing time : {$report->statistics()->executionTime->highResolutionTime->toString()}\n";
echo "Memory Max usage : {$memory->max()->inMb()}Mb\n";

Benchmarks were executed in nix-shell (the one from the monorepo):

nix-shell --arg php-version 8.4 --arg with-pcov false --pure

With the following php.ini:

```ini
date.timezone = UTC
max_execution_time = 0
error_reporting = 0
display_errors = Off
log_errors = Off
opcache.enable = 0
opcache.enable_cli = 0
realpath_cache_size = 0
zend.assertions = -1
max_input_time = 3600
max_input_nesting_level = 64
memory_limit = -1
post_max_size = 200M
upload_max_filesize = 150M
file_uploads = On
max_file_uploads = 20
short_open_tag = off
```

To make sure the results are stable, I executed each benchmark 3 times on each branch.
After switching to a branch and opening nix-shell, I also executed:

composer install --optimize-autoloader

Branch 1.x

Execution #1

❯ php .scratchpad/performance/csv-operations/benchmark.php 
Total rows: 1,000,000
Processing time : 35.955235s
Memory Max usage : 5.64Mb

Execution #2

❯ php .scratchpad/performance/csv-operations/benchmark.php 
Total rows: 1,000,000
Processing time : 36.880880417s
Memory Max usage : 5.64Mb

Execution #3

❯ php .scratchpad/performance/csv-operations/benchmark.php 
Total rows: 1,000,000
Processing time : 36.701315166s
Memory Max usage : 5.64Mb

Branch stloyd/improve-csv-extractor

Execution #1

❯ php .scratchpad/performance/csv-operations/benchmark.php 
Total rows: 1,000,000
Processing time : 29.8215515s
Memory Max usage : 5.64Mb

Execution #2

❯ php .scratchpad/performance/csv-operations/benchmark.php 
Total rows: 1,000,000
Processing time : 30.291809208s
Memory Max usage : 5.64Mb

Execution #3

❯ php .scratchpad/performance/csv-operations/benchmark.php 
Total rows: 1,000,000
Processing time : 29.585039959s
Memory Max usage : 5.64Mb
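
For a quick summary of the six runs above, the processing times can be averaged per branch and compared. This is only a minimal sketch: the variable names are illustrative, the values are copied by hand from the output above, and a mean over three runs is a rough indicator rather than a rigorous statistic.

```php
<?php

declare(strict_types=1);

// Processing times in seconds, copied from the six runs above.
$baseline = [35.955235, 36.880880417, 36.701315166];   // branch 1.x
$candidate = [29.8215515, 30.291809208, 29.585039959]; // branch stloyd/improve-csv-extractor

// Arithmetic mean of a list of floats.
$mean = static fn (array $xs) : float => \array_sum($xs) / \count($xs);

$baselineMean = $mean($baseline);
$candidateMean = $mean($candidate);

// Relative reduction in processing time, as a percentage of the baseline mean.
$reduction = ($baselineMean - $candidateMean) / $baselineMean * 100;

\printf("1.x mean: %.2fs\n", $baselineMean);
\printf("improve-csv-extractor mean: %.2fs\n", $candidateMean);
\printf("Reduction: %.1f%%\n", $reduction);
```

On these numbers the means come out at roughly 36.51s and 29.90s, a reduction of about 18.1% on the 1,000,000-row dataset.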

Blackfire Profile

(on 5k rows)

nix-shell --arg php-version 8.3 --arg with-pcov false --pure --arg with-blackfire true

Branch 1.x

https://blackfire.io/profiles/74c38883-740f-4256-b750-07c29059b0ae/graph

Wall Time     2.31s
I/O Wait     24.7ms
CPU Time      2.28s
Memory       6.48MB
Network         n/a     n/a     n/a
SQL             n/a     n/a

Branch stloyd/improve-csv-extractor

https://blackfire.io/profiles/c02de39f-a63d-4af2-8b13-cf80c997fa76/graph

Wall Time     2.33s
I/O Wait     25.1ms
CPU Time      2.31s
Memory       6.48MB
Network         n/a     n/a     n/a
SQL             n/a     n/a

@norberttech norberttech moved this from Todo to In Progress in Roadmap May 22, 2025
@norberttech norberttech merged commit 180f06c into flow-php:1.x May 22, 2025
24 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Roadmap May 22, 2025
@stloyd stloyd deleted the improve-csv-extractor branch May 22, 2025 10:42