Improve the CSVExtractor by removing duplicated operations
#1665
Conversation
Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from the 1.x branch.

Extractors
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
| CSVExtractorBench | bench_extract_10k | 1 | 3 | 4.776mb -0.01% | 417.596ms -28.25% | ±0.57% -7.90% |
| ExcelExtractorBench | bench_extract_10k_ods | 1 | 3 | 65.486mb +0.00% | 1.044s -2.68% | ±0.65% +170.21% |
| ExcelExtractorBench | bench_extract_10k_xlsx | 1 | 3 | 67.532mb +0.00% | 1.670s -0.38% | ±0.20% -58.57% |
| JsonExtractorBench | bench_extract_10k | 1 | 3 | 5.018mb +0.00% | 1.284s +0.31% | ±2.71% +475.46% |
| ParquetExtractorBench | bench_extract_10k | 1 | 3 | 86.321mb +0.00% | 921.218ms -0.33% | ±0.33% -14.05% |
| TextExtractorBench | bench_extract_10k | 1 | 3 | 4.499mb +0.01% | 38.380ms -1.73% | ±0.26% -76.39% |
| XmlExtractorBench | bench_extract_10k | 1 | 3 | 4.494mb +0.01% | 604.170ms -0.01% | ±0.07% -77.08% |
+-----------------------+------------------------+------+-----+-----------------+-------------------+-----------------+
Transformers
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 123.236mb +0.00% | 66.514ms +1.53% | ±0.81% +4.55% |
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 18.498mb +0.00% | 72.976ms -0.43% | ±0.17% -52.31% |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| CSVLoaderBench | bench_load_10k | 1 | 3 | 62.435mb -0.00% | 85.168ms -4.23% | ±0.94% +20.56% |
| JsonLoaderBench | bench_load_10k | 1 | 3 | 79.706mb +0.00% | 96.908ms -0.57% | ±1.09% -43.40% |
| ParquetLoaderBench | bench_load_10k | 1 | 3 | 165.387mb +0.00% | 20.705s -0.73% | ±0.10% -69.05% |
| TextLoaderBench | bench_load_10k | 1 | 3 | 17.805mb +0.00% | 30.994ms -2.80% | ±0.31% -0.25% |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 101.784mb +0.00% | 648.727ms -1.19% | ±0.67% -43.26% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 53.134mb +0.00% | 329.581ms +1.98% | ±0.96% +130.07% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 14.384mb +0.00% | 68.681ms -5.04% | ±0.66% -76.11% |
| RowsBench | bench_chunk_10_on_10k | 2 | 3 | 93.389mb +0.00% | 3.516ms -5.99% | ±2.91% -16.13% |
| RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 110.758mb +0.00% | 235.154ms -0.44% | ±0.34% -67.90% |
| RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 93.478mb +0.00% | 23.450ms -3.43% | ±0.97% -25.93% |
| RowsBench | bench_drop_1k_on_10k | 2 | 3 | 94.264mb +0.00% | 1.689ms +4.65% | ±3.75% +5.03% |
| RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 94.264mb +0.00% | 1.583ms -6.10% | ±2.93% +30.87% |
| RowsBench | bench_entries_on_10k | 2 | 3 | 92.424mb +0.00% | 3.431ms -6.01% | ±2.23% -1.16% |
| RowsBench | bench_filter_on_10k | 2 | 3 | 92.953mb +0.00% | 16.304ms -2.26% | ±2.97% +138.88% |
| RowsBench | bench_find_on_10k | 2 | 3 | 92.953mb +0.00% | 15.472ms -3.45% | ±0.51% -23.66% |
| RowsBench | bench_find_one_on_10k | 10 | 3 | 91.642mb +0.00% | 2.000μs +0.30% | ±0.00% -100.00% |
| RowsBench | bench_first_on_10k | 10 | 3 | 91.642mb +0.00% | 0.400μs -20.00% | ±0.00% +0.00% |
| RowsBench | bench_flat_map_on_1k | 2 | 3 | 100.703mb +0.00% | 14.648ms -7.26% | ±0.55% -48.40% |
| RowsBench | bench_map_on_10k | 2 | 3 | 130.130mb +0.00% | 67.425ms -3.83% | ±1.01% -19.47% |
| RowsBench | bench_merge_1k_on_10k | 2 | 3 | 93.473mb +0.00% | 1.526ms +1.88% | ±0.61% -78.41% |
| RowsBench | bench_partition_by_on_10k | 2 | 3 | 96.841mb +0.00% | 62.472ms -1.39% | ±0.14% -82.36% |
| RowsBench | bench_remove_on_10k | 2 | 3 | 94.526mb +0.00% | 4.166ms +8.75% | ±2.86% -18.40% |
| RowsBench | bench_sort_asc_on_1k | 2 | 3 | 92.003mb +0.00% | 39.951ms -2.00% | ±0.82% -73.92% |
| RowsBench | bench_sort_by_on_1k | 2 | 3 | 92.004mb +0.00% | 40.135ms +0.25% | ±1.31% -43.94% |
| RowsBench | bench_sort_desc_on_1k | 2 | 3 | 92.003mb +0.00% | 39.377ms -4.71% | ±1.98% +25.67% |
| RowsBench | bench_sort_entries_on_1k | 2 | 3 | 94.085mb +0.00% | 8.285ms +0.49% | ±0.75% +35.89% |
| RowsBench | bench_sort_on_1k | 2 | 3 | 91.835mb +0.00% | 29.668ms -0.04% | ±2.58% +57.62% |
| RowsBench | bench_take_1k_on_10k | 10 | 3 | 91.642mb +0.00% | 14.382μs -2.03% | ±0.99% -14.08% |
| RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 91.642mb +0.00% | 17.290μs +4.49% | ±2.64% -18.17% |
| RowsBench | bench_unique_on_1k | 2 | 3 | 110.759mb +0.00% | 239.544ms +1.49% | ±1.30% +221.91% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 42.070mb +0.00% | 430.178ms +2.05% | ±0.44% -33.82% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 11.448mb +0.00% | 85.267ms -0.21% | ±0.83% +57.51% |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@ Coverage Diff @@
## 1.x #1665 +/- ##
==========================================
- Coverage 82.08% 82.08% -0.01%
==========================================
Files 703 703
Lines 19064 19059 -5
==========================================
- Hits 15649 15644 -5
Misses 3415 3415
What are the performance benefits of this?
I would say it's worth considering: a ~25-30% performance boost when reading 10k rows.
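For context, the general pattern behind this kind of change can be sketched as follows. This is an illustrative example only — the function and variable names are hypothetical and this is not the actual CSVExtractor internals — showing how work that is recomputed for every row can be hoisted out of the per-row loop:

```php
<?php

declare(strict_types=1);

// Illustrative sketch only -- not the actual CSVExtractor code.

// Before: per-row invariants (here, the header column count) are
// recomputed on every iteration of the loop.
function extractSlow(array $header, array $rows) : array
{
    $out = [];
    foreach ($rows as $row) {
        if (\count($row) === \count($header)) { // duplicated work each iteration
            $out[] = \array_combine($header, $row);
        }
    }

    return $out;
}

// After: compute the invariant once, outside the loop.
function extractFast(array $header, array $rows) : array
{
    $columns = \count($header); // hoisted out of the loop

    $out = [];
    foreach ($rows as $row) {
        if (\count($row) === $columns) {
            $out[] = \array_combine($header, $row);
        }
    }

    return $out;
}
```

With 10k+ rows, even a cheap duplicated operation per row adds up, which is consistent with the extractor-only speedup in the benchmark table above.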
Ha! It's indeed fast enough to make a difference! 🎉 So here are the results of the following benchmark.

Code used to generate benchmark dataset

<?php
declare(strict_types=1);
use function Flow\ETL\DSL\from_array;
use function Flow\ETL\DSL\data_frame;
use function Flow\ETL\DSL\overwrite;
use function Flow\ETL\Adapter\CSV\to_csv;
use Faker\Factory;
use Flow\ETL\Rows;
include __DIR__ . '/../../../vendor/autoload.php';
$faker = Factory::create();
$skus = [
['sku' => 'SKU_0001', 'name' => 'Product 1', 'price' => $faker->randomFloat(2, 0, 500)],
['sku' => 'SKU_0002', 'name' => 'Product 2', 'price' => $faker->randomFloat(2, 0, 500)],
['sku' => 'SKU_0003', 'name' => 'Product 3', 'price' => $faker->randomFloat(2, 0, 500)],
['sku' => 'SKU_0004', 'name' => 'Product 4', 'price' => $faker->randomFloat(2, 0, 500)],
['sku' => 'SKU_0005', 'name' => 'Product 5', 'price' => $faker->randomFloat(2, 0, 500)],
];
function generateOrders($faker, array $skus, int $count) : \Generator {
for ($i = 0; $i < $count; $i++) {
yield [
'order_id' => $faker->uuid,
'created_at' => $faker->dateTimeThisYear,
'updated_at' => \random_int(0, 1) === 1 ? $faker->dateTimeThisMonth : null,
'discount' => \random_int(0, 1) === 1 ? $faker->randomFloat(2, 0, 50) : null,
'email' => $faker->email,
'customer' => $faker->firstName . ' ' . $faker->lastName,
'address' => [
'street' => $faker->streetAddress,
'city' => $faker->city,
'zip' => $faker->postcode,
'country' => $faker->country,
],
'notes' => \array_map(
static fn($i) => $faker->sentence,
\range(1, $faker->numberBetween(1, 5))
),
'items' => \array_map(
static fn(int $index) => [
'sku' => $skus[$skuIndex = $faker->numberBetween(0, 4)]['sku'],
'quantity' => $faker->numberBetween(1, 10),
'price' => $skus[$skuIndex]['price']
],
\range(1, $faker->numberBetween(1, 4))
),
];
}
}
$ordersSchema = require __DIR__ . '/schema.php';
data_frame()
->read(from_array(generateOrders($faker, $skus, 1_000_000))->withSchema($ordersSchema))
->saveMode(overwrite())
->write(to_csv(__DIR__ . '/dataset/orders.csv'))
->batchSize(10_000)
->run(function (Rows $rows) {
echo "Generated {$rows->count()} rows\n";
});

Schema Code

<?php
use function Flow\ETL\DSL\schema;
use function Flow\ETL\DSL\uuid_schema;
use function Flow\ETL\DSL\datetime_schema;
use function Flow\ETL\DSL\float_schema;
use function Flow\ETL\DSL\str_schema;
use function Flow\ETL\DSL\struct_schema;
use function Flow\Types\DSL\type_structure;
use function Flow\Types\DSL\type_string;
use function Flow\Types\DSL\type_list;
use function Flow\ETL\DSL\list_schema;
use function Flow\Types\DSL\type_integer;
use function Flow\Types\DSL\type_float;
return schema(
uuid_schema('order_id'),
datetime_schema('created_at'),
datetime_schema('updated_at', true),
float_schema('discount', true),
str_schema('email'),
str_schema('customer'),
struct_schema(
'address',
type_structure([
'street' => type_string(),
'city' => type_string(),
'zip' => type_string(),
'country' => type_string(),
])
),
list_schema('notes', type_list(type_string())),
list_schema('items', type_list(
type_structure([
'sku' => type_string(),
'quantity' => type_integer(),
'price' => type_float(),
])
))
);

Benchmark Code

<?php
declare(strict_types=1);
use function Flow\ETL\DSL\data_frame;
use function Flow\ETL\Adapter\CSV\from_csv;
use Flow\ETL\Monitoring\Memory\Consumption;
include __DIR__ . '/../../../vendor/autoload.php';
$schema = require __DIR__ . '/schema.php';
$memory = new Consumption();
$report = data_frame()
->read(from_csv(__DIR__ . '/dataset/orders.csv')->withSchema($schema))
->run(function() use ($memory) {
$memory->current();
}, analyze: true);
echo "Total rows: " . \number_format($report->statistics()->totalRows()) . "\n";
echo "Processing time : {$report->statistics()->executionTime->highResolutionTime->toString()}\n";
echo "Memory Max usage : {$memory->max()->inMb()}Mb\n";Benchmarks executed in nix-shell (the one from monorepo)
With the following php.ini:

```ini
date.timezone = UTC
max_execution_time = 0
error_reporting = 0
display_errors = Off
log_errors = Off
opcache.enable = 0
opcache.enable_cli = 0
realpath_cache_size = 0
zend.assertions = -1
max_input_time = 3600
max_input_nesting_level = 64
memory_limit = -1
post_max_size = 200M
upload_max_filesize = 150M
file_uploads = On
max_file_uploads = 20
short_open_tag = off
```

To make sure the results are stable, I executed each benchmark 3 times on each branch. Branch