-
-
Notifications
You must be signed in to change notification settings - Fork 48
Reimplemented sorting algorithms #1147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplemented sorting algorithms #1147
Conversation
|
some results of manual checks in local:
^ stats of tmp directory after this error: ^ If possible we should try to implement closing of handles to files which are not used at certain moment. Or advice to increase file system open files limit? Checked example file "/tmp/flow-php-external-sort/*/rows.php.cache"
which base64 decoded is for example: ^ the info about object properties etc is making serialized value to be ~x10 times bigger than original. Probably different serialization method should be used? |
|
thanks @domis86 this is super helpful, will take a look into this after weekend!! 🍻 |
|
I just pushed improved version, some answers below. It's because the default batch size that was used by the implementation is 1. This way external sort was creating as many buckets as many rows you tried to sort.
This was a side effect of batch size issue, it wont happen anymore.
Well, you are comparing plain CSV with a serialized collection of Row objects, it can't be smaller or even similar in size.
I don't think this have any relevant impact, sort cache is anyway purged immediately. |
|
Ok, I pushed what I believe can become a final version, some stats: Before: after: So it's not only consuming less memory but it's also significantly faster now. |
b2d5472 to
6add818
Compare
Flow PHP - BenchmarksResults of the benchmarks from this PR are compared with the results from 1.x branch. Extractors+-----------------------+-------------------+------+-----+-----------------+------------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------+-------------------+------+-----+-----------------+------------------+----------------+
| CSVExtractorBench | bench_extract_10k | 1 | 3 | 3.922mb +0.38% | 517.929ms +1.14% | ±0.77% +4.64% |
| JsonExtractorBench | bench_extract_10k | 1 | 3 | 3.954mb +0.38% | 1.079s +0.99% | ±0.62% -25.65% |
| ParquetExtractorBench | bench_extract_10k | 1 | 3 | 28.515mb +0.05% | 439.460ms +3.19% | ±0.28% -3.24% |
| TextExtractorBench | bench_extract_10k | 1 | 3 | 3.681mb +0.41% | 34.545ms -0.71% | ±0.64% -61.80% |
| XmlExtractorBench | bench_extract_10k | 1 | 3 | 3.628mb +0.41% | 445.429ms +0.28% | ±0.94% +47.84% |
+-----------------------+-------------------+------+-----+-----------------+------------------+----------------+
Transformers+-----------------------------+--------------------------+------+-----+------------------+-----------------+---------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+---------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 115.976mb +0.05% | 59.443ms +0.11% | ±0.45% -3.32% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+---------------+
Loaders+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| CSVLoaderBench | bench_load_10k | 1 | 3 | 54.077mb +0.15% | 86.377ms +3.15% | ±1.16% +31.08% |
| JsonLoaderBench | bench_load_10k | 1 | 3 | 106.509mb +0.02% | 54.186ms +6.49% | ±0.49% +102.89% |
| ParquetLoaderBench | bench_load_10k | 1 | 3 | 123.792mb +0.01% | 1.237s +0.74% | ±0.76% +127.93% |
| TextLoaderBench | bench_load_10k | 1 | 3 | 16.931mb +0.13% | 43.142ms -0.18% | ±0.32% +70.47% |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
Building Blocks+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| TypeDetectorBench | bench_type_detector | 1 | 3 | 59.706mb +0.01% | 451.808ms +6.19% | ±2.89% +628.17% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 14.245mb +0.04% | 88.029ms +2.87% | ±0.57% -83.17% |
| NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 116.524mb +0.00% | 509.572ms +2.41% | ±2.75% +82.50% |
| NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 60.002mb +0.01% | 253.368ms +2.01% | ±0.94% +96.39% |
| NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 14.936mb +0.04% | 54.598ms +2.32% | ±3.14% +191.22% |
| RowsBench | bench_chunk_10_on_10k | 2 | 3 | 86.794mb +0.01% | 3.597ms +5.40% | ±2.53% +111.33% |
| RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 102.392mb +0.01% | 189.250ms +0.69% | ±0.09% -93.88% |
| RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 85.112mb +0.01% | 19.075ms +0.92% | ±0.72% -55.51% |
| RowsBench | bench_drop_1k_on_10k | 2 | 3 | 88.034mb +0.01% | 1.878ms +7.95% | ±1.37% -60.86% |
| RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 88.034mb +0.01% | 1.793ms +5.60% | ±2.52% -17.94% |
| RowsBench | bench_entries_on_10k | 2 | 3 | 85.146mb +0.01% | 2.876ms +9.86% | ±0.52% -57.77% |
| RowsBench | bench_filter_on_10k | 2 | 3 | 85.675mb +0.01% | 15.233ms -5.54% | ±1.13% +49.62% |
| RowsBench | bench_find_on_10k | 2 | 3 | 85.675mb +0.01% | 14.961ms -6.75% | ±1.04% -14.74% |
| RowsBench | bench_find_one_on_10k | 10 | 3 | 83.579mb +0.01% | 1.894μs +17.92% | ±2.53% -12.50% |
| RowsBench | bench_first_on_10k | 10 | 3 | 83.579mb +0.01% | 0.300μs 0.00% | ±0.00% 0.00% |
| RowsBench | bench_flat_map_on_1k | 2 | 3 | 92.929mb +0.01% | 12.549ms +4.32% | ±2.49% +171.00% |
| RowsBench | bench_map_on_10k | 2 | 3 | 122.300mb +0.00% | 63.602ms +4.46% | ±3.09% +168.58% |
| RowsBench | bench_merge_1k_on_10k | 2 | 3 | 86.195mb +0.01% | 1.600ms +35.31% | ±0.64% +45.88% |
| RowsBench | bench_partition_by_on_10k | 2 | 3 | 89.541mb +0.01% | 64.764ms +3.34% | ±0.73% -60.78% |
| RowsBench | bench_remove_on_10k | 2 | 3 | 88.296mb +0.01% | 4.043ms +5.81% | ±2.36% +677.89% |
| RowsBench | bench_sort_asc_on_1k | 2 | 3 | 83.723mb +0.01% | 40.121ms +3.29% | ±0.91% -22.39% |
| RowsBench | bench_sort_by_on_1k | 2 | 3 | 83.723mb +0.01% | 40.060ms +3.66% | ±0.60% -43.95% |
| RowsBench | bench_sort_desc_on_1k | 2 | 3 | 83.723mb +0.01% | 39.267ms +1.45% | ±0.85% -62.50% |
| RowsBench | bench_sort_entries_on_1k | 2 | 3 | 86.021mb +0.01% | 7.539ms +3.28% | ±0.73% -13.81% |
| RowsBench | bench_sort_on_1k | 2 | 3 | 83.579mb +0.01% | 30.432ms +7.54% | ±0.78% +43.36% |
| RowsBench | bench_take_1k_on_10k | 10 | 3 | 83.579mb +0.01% | 14.242μs +4.57% | ±2.09% +66.24% |
| RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 83.579mb +0.01% | 16.876μs +6.94% | ±1.12% +4.99% |
| RowsBench | bench_unique_on_1k | 2 | 3 | 102.393mb +0.01% | 191.713ms +0.09% | ±0.81% +2.06% |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
|
|
sorry for late, but i re-ran local checks (previously #1147 (comment) ) for external sort:
summary: great job 👍 ✔️ |
Change Log
Added
Fixed
Changed
Removed
Deprecated
Security
Description
Resolves: #1142
Resolves: #1035