Skip to content

Conversation

@mleczakm
Copy link
Contributor

Change Log

Added

  • UniqueFactory as source of random ints/strings with given length

Fixed

Changed

Removed

Deprecated

Security


Description

Solves #1124

@github-actions
Copy link
Contributor

github-actions bot commented Jul 20, 2024

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 3.937mb +0.21%  | 510.985ms +0.64% | ±2.89% +482.09% |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 3.969mb +0.21%  | 1.058s -1.14%    | ±0.80% +16.37%  |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 28.530mb +0.03% | 422.100ms -3.32% | ±0.57% +11.36%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 3.696mb +0.22%  | 34.270ms +1.36%  | ±2.14% +137.70% |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 3.643mb +0.22%  | 430.277ms -0.67% | ±0.29% -67.77%  |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 116.038mb +0.01% | 59.390ms -0.97% | ±0.37% -78.93% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode            | rstdev          |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.159mb +0.02%  | 83.817ms -4.08% | ±3.62% +59.10%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 106.526mb +0.01% | 51.253ms -2.13% | ±0.54% +2.92%   |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 123.808mb +0.01% | 1.230s -0.32%   | ±0.61% -79.90%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 16.953mb +0.05%  | 43.739ms -2.65% | ±1.10% +452.90% |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.711mb +0.01%  | 426.753ms -0.44% | ±0.28% -50.46%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.250mb +0.03%  | 85.273ms +0.65%  | ±1.05% +64.91%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.529mb +0.00% | 498.090ms -5.32% | ±0.91% -70.48%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 60.007mb +0.01%  | 248.786ms -1.16% | ±2.15% +466.60% |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.941mb +0.03%  | 54.709ms +0.60%  | ±1.50% +859.45% |
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 86.800mb +0.00%  | 3.226ms -6.76%   | ±3.19% +27.24%  |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 102.397mb +0.00% | 190.432ms +0.20% | ±1.01% -0.81%   |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 85.117mb +0.00%  | 19.307ms +3.46%  | ±1.69% +8.97%   |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 88.040mb +0.00%  | 1.578ms -4.87%   | ±1.15% -63.34%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 88.040mb +0.00%  | 1.635ms -9.12%   | ±1.52% +60.95%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 85.152mb +0.00%  | 2.497ms -4.79%   | ±0.43% -50.92%  |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 85.681mb +0.00%  | 14.853ms -0.19%  | ±2.46% +121.71% |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 85.681mb +0.00%  | 14.606ms -2.27%  | ±1.11% -4.98%   |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 83.584mb +0.00%  | 1.606μs -11.07%  | ±2.89% +12.24%  |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 83.584mb +0.00%  | 0.300μs -25.00%  | ±0.00% -100.00% |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 92.935mb +0.00%  | 12.098ms -0.69%  | ±1.19% +186.43% |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 122.305mb +0.00% | 60.230ms -2.24%  | ±0.35% -79.31%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 86.200mb +0.00%  | 1.231ms +0.60%   | ±1.89% +6.74%   |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 89.546mb +0.00%  | 62.001ms -4.43%  | ±1.21% +92.18%  |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 88.302mb +0.00%  | 3.825ms -1.45%   | ±0.64% -59.05%  |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 83.728mb +0.00%  | 39.437ms -3.62%  | ±0.53% -74.37%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 83.729mb +0.00%  | 38.985ms -2.36%  | ±1.76% +57.43%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 83.728mb +0.00%  | 40.850ms -0.26%  | ±1.75% +31.82%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 86.026mb +0.00%  | 7.344ms -0.05%   | ±0.58% -44.39%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 83.584mb +0.00%  | 29.063ms -1.90%  | ±0.55% -56.78%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 83.584mb +0.00%  | 13.288μs -1.37%  | ±1.95% -43.91%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 83.584mb +0.00%  | 16.010μs +0.44%  | ±2.76% -12.17%  |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 102.399mb +0.00% | 194.288ms +0.24% | ±0.40% -49.17%  |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+

function (int $i) : array {
$data = [];

$maxItems = \random_int(2, 10);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More I look at this, more I think I would add new function for unique int/string to "hide" object call & improve DX perspective, WDYT?

@norberttech
Copy link
Member

norberttech commented Jul 21, 2024

The PR look good and we could merge it like that (beside few small things @stloyd pointed out) but I think we might be able to still improve even more. Here are my thoughts.

Naming - now when I look at it something is missing in the name, like RandomValueGenerator would be more descriptive.

Dependency Injection - I think I would turn RandomValueGenerator into an interface with NativePHPRandomValueGenerator that we can simply inject into objects that would need a unique values (I'm thinking about caching strategies here and more advanced data processing algorithms).

Scalar Functions - Additionally (can go to a standalone PR) we could also create Flow scalar functions that could be used during data processing, like:

df()
   ->read(...) 
   ->withEntry('random_string', random_string(length: 16))
   ->write(to_output())
   ->run()

df()
   ->read(...) 
   ->withEntry('random_string', random_string(length: ref('some_int_column'))))
   ->write(to_output())
   ->run()

This function would implement Flow\ETL\Function\ScalarFunction and could take $lenght as another ScalarFunction so this way length could be taken from a column or output of chained functions (some calculations). Additionally it would also need RandomValueGenerator $generator = new NatievPHPRandomValueGenerator() in case anyone would like to change the generation strategy.

DSL Functions - as pointed by @stloyd we probably should also add generate_random_string(int $length = 16) and generate_random_int(int $start = PHP_INT_MIN, int $end = PHP_INT_MAX, RandomValueGenerator $generator = new NatievPHPRandomValueGenerator()) functions so we can use them in tests instead of static factories.

Thoughts?

Copy link
Member

@norberttech norberttech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes described above are required to merge this PR :)

@github-actions github-actions bot added size: L and removed size: M labels Jul 31, 2024
@norberttech norberttech merged commit 63ef619 into flow-php:1.x Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants