-
-
Notifications
You must be signed in to change notification settings - Fork 48
[Excel] Add a new Excel Adapter with an extractor #1643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Flow PHP - BenchmarksResults of the benchmarks from this PR are compared with the results from 1.x branch. Extractors+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench | bench_extract_10k | 1 | 3 | 4.727mb +0.87% | 598.298ms -1.33% | ±0.59% +3.18% |
| ExcelExtractorBench | bench_extract_10k | 1 | 3 | ERR | 1.497s +0.00% | ±0.56% +0.00% |
| JsonExtractorBench | bench_extract_10k | 1 | 3 | 4.971mb +0.82% | 1.278s +0.65% | ±0.67% +137.39% |
| ParquetExtractorBench | bench_extract_10k | 1 | 3 | 86.273mb +0.05% | 916.389ms +1.29% | ±1.45% -16.54% |
| TextExtractorBench | bench_extract_10k | 1 | 3 | 4.451mb +0.92% | 39.327ms +2.51% | ±0.24% +81.21% |
| XmlExtractorBench | bench_extract_10k | 1 | 3 | 4.442mb +0.92% | 603.413ms -0.65% | ±0.28% -49.97% |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
Transformers+---------------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 123.187mb +0.03% | 64.774ms -0.74% | ±1.00% +287.99% |
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 18.449mb +0.22% | 71.937ms -1.68% | ±0.75% -6.27% |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
Loaders+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| CSVLoaderBench | bench_load_10k | 1 | 3 | 62.386mb +0.07% | 86.491ms +1.28% | ±0.81% +43.94% |
| JsonLoaderBench | bench_load_10k | 1 | 3 | 79.658mb +0.05% | 95.552ms -1.49% | ±0.46% -47.30% |
| ParquetLoaderBench | bench_load_10k | 1 | 3 | 165.338mb +0.03% | 20.798s +1.69% | ±0.10% -56.68% |
| TextLoaderBench | bench_load_10k | 1 | 3 | 17.757mb +0.23% | 30.613ms +0.27% | ±1.24% +46.28% |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
Building Blocks+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 101.736mb +0.04% | 646.933ms +0.12% | ±0.07% -93.78% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 53.087mb +0.08% | 321.174ms -0.46% | ±2.10% +263.11% |
| EntryFactoryBench | bench_entry_factory | 1 | 3 | 14.337mb +0.29% | 69.136ms +0.76% | ±2.33% +148.48% |
| RowsBench | bench_chunk_10_on_10k | 2 | 3 | 93.341mb +0.04% | 3.458ms +9.86% | ±1.30% +256.56% |
| RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 110.710mb +0.04% | 240.130ms -0.36% | ±0.64% +55.06% |
| RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 93.430mb +0.04% | 24.176ms +2.67% | ±0.08% -81.95% |
| RowsBench | bench_drop_1k_on_10k | 2 | 3 | 94.216mb +0.04% | 1.296ms -1.23% | ±2.30% +134.64% |
| RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 94.216mb +0.04% | 1.310ms +3.68% | ±1.37% -49.96% |
| RowsBench | bench_entries_on_10k | 2 | 3 | 92.376mb +0.04% | 3.414ms +3.89% | ±3.01% +77.30% |
| RowsBench | bench_filter_on_10k | 2 | 3 | 92.905mb +0.04% | 15.470ms -0.30% | ±1.31% -60.44% |
| RowsBench | bench_find_on_10k | 2 | 3 | 92.905mb +0.04% | 15.278ms -1.03% | ±1.10% -26.68% |
| RowsBench | bench_find_one_on_10k | 10 | 3 | 91.594mb +0.04% | 1.794μs -0.67% | ±2.67% +3.77% |
| RowsBench | bench_first_on_10k | 10 | 3 | 91.594mb +0.04% | 0.400μs 0.00% | ±0.00% 0.00% |
| RowsBench | bench_flat_map_on_1k | 2 | 3 | 100.655mb +0.04% | 14.024ms -0.51% | ±1.10% +223.29% |
| RowsBench | bench_map_on_10k | 2 | 3 | 130.082mb +0.03% | 66.642ms +0.77% | ±1.20% -32.07% |
| RowsBench | bench_merge_1k_on_10k | 2 | 3 | 93.425mb +0.04% | 1.159ms +1.17% | ±2.42% +33.92% |
| RowsBench | bench_partition_by_on_10k | 2 | 3 | 96.793mb +0.04% | 62.339ms +2.34% | ±1.09% +454.51% |
| RowsBench | bench_remove_on_10k | 2 | 3 | 94.478mb +0.04% | 3.507ms +4.88% | ±3.60% +199.47% |
| RowsBench | bench_sort_asc_on_1k | 2 | 3 | 91.955mb +0.04% | 40.353ms +2.28% | ±1.05% -13.62% |
| RowsBench | bench_sort_by_on_1k | 2 | 3 | 91.956mb +0.04% | 40.447ms +1.61% | ±2.89% +403.76% |
| RowsBench | bench_sort_desc_on_1k | 2 | 3 | 91.955mb +0.04% | 39.940ms -1.08% | ±1.03% +34.39% |
| RowsBench | bench_sort_entries_on_1k | 2 | 3 | 94.037mb +0.04% | 8.152ms +1.06% | ±0.20% -60.52% |
| RowsBench | bench_sort_on_1k | 2 | 3 | 91.787mb +0.04% | 29.473ms +0.16% | ±1.69% +23.31% |
| RowsBench | bench_take_1k_on_10k | 10 | 3 | 91.594mb +0.04% | 13.972μs -4.07% | ±3.36% +15.25% |
| RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 91.594mb +0.04% | 15.957μs +0.65% | ±2.08% -26.16% |
| RowsBench | bench_unique_on_1k | 2 | 3 | 110.711mb +0.04% | 243.629ms +1.02% | ±0.54% +100.79% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 42.022mb +0.10% | 427.126ms +1.65% | ±0.55% +18.70% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 11.400mb +0.36% | 85.881ms +1.05% | ±0.79% +53.52% |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
|
Codecov ReportAttention: Patch coverage is
✅ All tests successful. No failed tests found. Additional details and impacted files@@ Coverage Diff @@
## 1.x #1643 +/- ##
==========================================
+ Coverage 82.26% 82.32% +0.06%
==========================================
Files 694 701 +7
Lines 18852 18993 +141
==========================================
+ Hits 15508 15636 +128
- Misses 3344 3357 +13
🚀 New features to boost your workflow:
|
|
Shit, maybe reach out next time before you work on something, I hate to say it but I already took one attempt to build an Excel adapter (still have the codebase locally) and there is a roadmap item for it: #1500 There is one massive problem with Excel, it does not support remote filesystems :| The problem comes not really from openspout library but from ZipArchive that is used there under the hood. I did some initial research and it is possible to create an implementation of ZipArchive that supports reading/writing from/to remote locations and I believe that openspout authors would have nothing against introducing reading/writing from/to remote locations: |
|
You can check my progress here in case you would like to investigate this one more: 87ebf91 |
|
Remote support filesystem is nice, but I don't have use cases for remote filesystem, but a few with normal files. This solves a particular problem, similar to yours (while mine does full reading), and solves a particular issue: reading an Excel file, which is IMO a valid use case even without support for remote filesystem. I'm open to closing this and doing it in my project internally, but I guess that reading Excel files is quite a common case, which would be nice to cover. |
Totally, I meant more that if you would reach out before you could simply take over what I wrote and adjust it, saving yourself some typing. Let me think about and look at your code later, I will came up with a proper definition of done for this PR and then you can decide if you are willing to modify your code or drop it and implement internally. To your point, Excel Adapter is super valuable even without remote filesystem support and I would love to bring it to Flow. Although we could potentially support remote reading/writing by either doing it all in memory or buffering somewhere on the disk first before moving to remote storage - but that's out of the scope for now and not super critical as most of the use cases of this adapter are going to be anyway operating on local files only. |
src/adapter/etl-adapter-excel/src/Flow/ETL/Adapter/Excel/ExcelExtractor.php
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for definition of done for Extractors here is what I usually have in mind for new adapters:
Features
Below are the minimum features that should be supported by this extractor:
- limit number of extracting rows
- partitioning
- reading from xlsx/ods
- throw exception when path is not local fs
- make it possible to read data only from specific sheet (excel files are in most cases comes with multiple sheets and the interesting one is usually not the first one)
- support offset (reading from specific row), when withHeader is set to true, extractor should still read the first row but not consider is as a part of dataset. So when there are 10 rows and withHeader is true, we have only 9 rows.
Dependencies
When adding new dependency make sure that it's added to the composer.json of the adapter and to the composer.json of the monorepo. Use monorepo builder tool (from tools) to validate it.
As for the dependency version, try to support as wide range as possible. Try to not limit it jus to the latest version as many projects might not be able to upgrade and we still want them to use Flow.
Obviously there needs to be some balance, so if too many things changed between versions then drop the support for older ones. But if it's about few methods being renamed or moved to another NS - we cover that. If you wont be sure how to tackle this, just leave a comment with what would need to happen to support version X and I will let you know if it's worth it or not (most likely not).
Good example is Doctrine Dbal adapter where version 3.6 and 4 are supported. But again, that depends on how strict the external library is with semantic versioning.
Tests
Make sure that tests are executed at CI/CD: .github/workflows/test-suite.yml
Options and Configuration
Each extractor should have at least one test for each of his options, so here with that reader type you should put a test will read a fixture file as xlsx and ods formats.
Those tests should not only check how many rows are extracted but also check the values.
Remote Filesystem
There should be a test for using remote filesystems (when extractor does not support remote FS it should simply test that proper exception is thrown)
Partitioning
All extractors must be able to extract from partitioned datasets.
The test should not only cover read that throws no exception but should actually confirm that the data read from source is exactly how we expect it. This means that you need to prepare an excel files with let say 10 rows (or less), partition them manually (just create folders for partitions and put those files) and then read them and compare the results.
Limit
There need to be a test proving that limiting behavior is working as expected and that when we set limit to 5 on 10 rows dataset we only get 5 rows.
Benchmarks
Like most of the other extractors we should put benchmarks to monitor potentnial performance degradations.
For that we need a dataset with 10k rows, for excel you might be able to create one by converting src/adapter/etl-adapter-csv/tests/Flow/ETL/Adapter/CSV/Tests/Benchmark/Fixtures/orders_flow.csv into excel file.
CLI
New extractors that operate on files should also be added to the CLI, there are few commands that are critical:
- src/cli/src/Flow/CLI/Command/FileReadCommand.php
- src/cli/src/Flow/CLI/Command/FileSchemaCommand.php
- src/cli/src/Flow/CLI/Command/FileRowsCountCommand.php
- src/cli/src/Flow/CLI/Command/FileConvertCommand.php
- src/cli/src/Flow/CLI/Command/FileAnalyzeCommand.php
In this case you will need to also add trait ExcelOptions that would define options to pass into those commands.
Codebase
Try to use dsl functions across the codebase, especially for types, rows, row, schema, extractors, loaders, dataframe, transformations, scalar functions.
That applies also to tests (actually mostly to tests)
Documentation
As you pointed out many times, docs and examples sucks right now (I'm perfectly aware and constantly trying to find time and motivation to improve it).
Each new extractor should come with:
Examples
You should add an example under Data reading topic.
When extractor provides options they should be also documented in example description. (put a description.md file in the example folder)
On top of that it would be good to put some explanation in the dsl function docblock as those functions are all part of the documentation later.
Since this extractor can't read from remote filesystems it should also be mentioned in the example description and maybe also in dsl function docblocks.
Api References
Look at phpdoc folder and build:docs:api composer command.
Documentation
If this is a new adapter, make sure it's available on the website: web/landing/templates/documentation/navigation.html.twig
And that it comes with it's own readme file in documentation/components/adapters that is also listed in documentation/installation.md.
In that readme file there should be a simple example about how to use that extractor with Flow.
In order to generate commits stats new adapters/components needs to be also added to manifest.json as well as here: .github/workflows/monorepo-split.yml
I believe that should cover most of it, I already created a repo for this adapter http://github.com/flow-php/etl-adapter-excel
If I will think about anything else I will let you know as it's hard to write complete spec at one go but I really think that should be the most critical part.
As for Loader it should also be added but this can happen later
f040d58 to
537afc7
Compare
norberttech
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wen through this PR from my phone, will do my best to review it later from a laptop but its a bit busy weekend so no promises.
In general it looks good, maybe I missed it but I haven't seen any tests for CLI commands that would cover excel input? (review from phone is not the most convenient so I might just miss them).
src/adapter/etl-adapter-excel/src/Flow/ETL/Adapter/Excel/DSL/functions.php
Outdated
Show resolved
Hide resolved
src/adapter/etl-adapter-excel/src/Flow/ETL/Adapter/Excel/ExcelExtractor.php
Show resolved
Hide resolved
src/adapter/etl-adapter-excel/tests/Flow/ETL/Adapter/Excel/Tests/Unit/ExcelExtractorTest.php
Show resolved
Hide resolved
bfa9bb3 to
e8b2c43
Compare
|
I think the code looks great, but there are two small things still missing:
|
Change Log
Added
Fixed
Changed
Removed
Deprecated
Security
Description