Skip to content

Conversation

@stloyd
Copy link
Member

@stloyd stloyd commented May 10, 2025

Change Log

Added

  • Add a new Excel Adapter with simple extractor

Fixed

Changed

Removed

Deprecated

Security


Description

@stloyd stloyd requested a review from norberttech as a code owner May 10, 2025 10:29
@github-actions
Copy link
Contributor

github-actions bot commented May 10, 2025

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.727mb +0.87%  | 598.298ms -1.33% | ±0.59% +3.18%   |
| ExcelExtractorBench   | bench_extract_10k | 1    | 3   | ERR             | 1.497s +0.00%    | ±0.56% +0.00%   |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.971mb +0.82%  | 1.278s +0.65%    | ±0.67% +137.39% |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 86.273mb +0.05% | 916.389ms +1.29% | ±1.45% -16.54%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.451mb +0.92%  | 39.327ms +2.51%  | ±0.24% +81.21%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.442mb +0.92%  | 603.413ms -0.65% | ±0.28% -49.97%  |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
Transformers
+---------------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| benchmark                       | subject                  | revs | its | mem_peak         | mode            | rstdev          |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| RenameEntryTransformerBench     | bench_transform_10k_rows | 1    | 3   | 123.187mb +0.03% | 64.774ms -0.74% | ±1.00% +287.99% |
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 18.449mb +0.22%  | 71.937ms -1.68% | ±0.75% -6.27%   |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode            | rstdev         |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 62.386mb +0.07%  | 86.491ms +1.28% | ±0.81% +43.94% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 79.658mb +0.05%  | 95.552ms -1.49% | ±0.46% -47.30% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 165.338mb +0.03% | 20.798s +1.69%  | ±0.10% -56.68% |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.757mb +0.23%  | 30.613ms +0.27% | ±1.24% +46.28% |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark         | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 101.736mb +0.04% | 646.933ms +0.12% | ±0.07% -93.78%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 53.087mb +0.08%  | 321.174ms -0.46% | ±2.10% +263.11% |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.337mb +0.29%  | 69.136ms +0.76%  | ±2.33% +148.48% |
| RowsBench         | bench_chunk_10_on_10k      | 2    | 3   | 93.341mb +0.04%  | 3.458ms +9.86%   | ±1.30% +256.56% |
| RowsBench         | bench_diff_left_1k_on_10k  | 2    | 3   | 110.710mb +0.04% | 240.130ms -0.36% | ±0.64% +55.06%  |
| RowsBench         | bench_diff_right_1k_on_10k | 2    | 3   | 93.430mb +0.04%  | 24.176ms +2.67%  | ±0.08% -81.95%  |
| RowsBench         | bench_drop_1k_on_10k       | 2    | 3   | 94.216mb +0.04%  | 1.296ms -1.23%   | ±2.30% +134.64% |
| RowsBench         | bench_drop_right_1k_on_10k | 2    | 3   | 94.216mb +0.04%  | 1.310ms +3.68%   | ±1.37% -49.96%  |
| RowsBench         | bench_entries_on_10k       | 2    | 3   | 92.376mb +0.04%  | 3.414ms +3.89%   | ±3.01% +77.30%  |
| RowsBench         | bench_filter_on_10k        | 2    | 3   | 92.905mb +0.04%  | 15.470ms -0.30%  | ±1.31% -60.44%  |
| RowsBench         | bench_find_on_10k          | 2    | 3   | 92.905mb +0.04%  | 15.278ms -1.03%  | ±1.10% -26.68%  |
| RowsBench         | bench_find_one_on_10k      | 10   | 3   | 91.594mb +0.04%  | 1.794μs -0.67%   | ±2.67% +3.77%   |
| RowsBench         | bench_first_on_10k         | 10   | 3   | 91.594mb +0.04%  | 0.400μs 0.00%    | ±0.00% 0.00%    |
| RowsBench         | bench_flat_map_on_1k       | 2    | 3   | 100.655mb +0.04% | 14.024ms -0.51%  | ±1.10% +223.29% |
| RowsBench         | bench_map_on_10k           | 2    | 3   | 130.082mb +0.03% | 66.642ms +0.77%  | ±1.20% -32.07%  |
| RowsBench         | bench_merge_1k_on_10k      | 2    | 3   | 93.425mb +0.04%  | 1.159ms +1.17%   | ±2.42% +33.92%  |
| RowsBench         | bench_partition_by_on_10k  | 2    | 3   | 96.793mb +0.04%  | 62.339ms +2.34%  | ±1.09% +454.51% |
| RowsBench         | bench_remove_on_10k        | 2    | 3   | 94.478mb +0.04%  | 3.507ms +4.88%   | ±3.60% +199.47% |
| RowsBench         | bench_sort_asc_on_1k       | 2    | 3   | 91.955mb +0.04%  | 40.353ms +2.28%  | ±1.05% -13.62%  |
| RowsBench         | bench_sort_by_on_1k        | 2    | 3   | 91.956mb +0.04%  | 40.447ms +1.61%  | ±2.89% +403.76% |
| RowsBench         | bench_sort_desc_on_1k      | 2    | 3   | 91.955mb +0.04%  | 39.940ms -1.08%  | ±1.03% +34.39%  |
| RowsBench         | bench_sort_entries_on_1k   | 2    | 3   | 94.037mb +0.04%  | 8.152ms +1.06%   | ±0.20% -60.52%  |
| RowsBench         | bench_sort_on_1k           | 2    | 3   | 91.787mb +0.04%  | 29.473ms +0.16%  | ±1.69% +23.31%  |
| RowsBench         | bench_take_1k_on_10k       | 10   | 3   | 91.594mb +0.04%  | 13.972μs -4.07%  | ±3.36% +15.25%  |
| RowsBench         | bench_take_right_1k_on_10k | 10   | 3   | 91.594mb +0.04%  | 15.957μs +0.65%  | ±2.08% -26.16%  |
| RowsBench         | bench_unique_on_1k         | 2    | 3   | 110.711mb +0.04% | 243.629ms +1.02% | ±0.54% +100.79% |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 42.022mb +0.10%  | 427.126ms +1.65% | ±0.55% +18.70%  |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 11.400mb +0.36%  | 85.881ms +1.05%  | ±0.79% +53.52%  |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@codecov
Copy link

codecov bot commented May 10, 2025

Codecov Report

Attention: Patch coverage is 92.05298% with 12 lines in your changes missing coverage. Please review.

Project coverage is 82.32%. Comparing base (850f2c5) to head (11f831a).
Report is 2 commits behind head on 1.x.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##              1.x    #1643      +/-   ##
==========================================
+ Coverage   82.26%   82.32%   +0.06%     
==========================================
  Files         694      701       +7     
  Lines       18852    18993     +141     
==========================================
+ Hits        15508    15636     +128     
- Misses       3344     3357      +13     
Components Coverage Δ
etl 88.04% <ø> (ø)
cli 84.42% <87.50%> (-0.17%) ⬇️
lib-array-dot 94.53% <ø> (ø)
lib-azure-sdk 62.56% <ø> (ø)
lib-doctrine-dbal-bulk 90.11% <ø> (ø)
lib-filesystem 78.02% <ø> (ø)
lib-parquet 84.37% <ø> (ø)
lib-parquet-viewer 82.02% <ø> (ø)
lib-snappy 91.16% <ø> (ø)
bridge-filesystem-async-aws 90.38% <ø> (ø)
bridge-filesystem-azure 89.92% <ø> (ø)
bridge-monolog-http 96.38% <ø> (ø)
symfony-http-foundation 74.41% <ø> (ø)
adapter-chartjs 86.45% <ø> (ø)
adapter-csv 89.57% <ø> (ø)
adapter-doctrine 89.69% <ø> (ø)
adapter-elasticsearch 97.19% <ø> (ø)
adapter-google-sheet 80.00% <ø> (ø)
adapter-http 59.15% <ø> (ø)
adapter-json 90.62% <ø> (ø)
adapter-logger 53.84% <ø> (ø)
adapter-meilisearch 97.75% <ø> (ø)
adapter-parquet 78.42% <ø> (ø)
adapter-text 84.44% <ø> (ø)
adapter-xml 83.15% <ø> (ø)
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@norberttech
Copy link
Member

Shit, maybe reach out next time before you work on something, I hate to say it but I already took one attempt to build an Excel adapter (still have the codebase locally) and there is a roadmap item for it: #1500

There is one massive problem with Excel, it does not support remote filesystems :|

The problem comes not really from openspout library but from ZipArchive that is used there under the hood.
PHP implementation of ZipArchive makes it impossible to extract from archive in chunks. You need to pass entire zip archive at once, what means we can't read/write through byte range.

I did some initial research and it is possible to create an implementation of ZipArchive that supports reading/writing from/to remote locations and I believe that openspout authors would have nothing against introducing reading/writing from/to remote locations:

openspout/openspout#120 (comment)

@norberttech
Copy link
Member

You can check my progress here in case you would like to investigate this one more: 87ebf91

@stloyd
Copy link
Member Author

stloyd commented May 10, 2025

Remote support filesystem is nice, but I don't have use cases for remote filesystem, but a few with normal files.

This solves a particular problem, similar to yours (while mine does full reading), and solves a particular issue: reading an Excel file, which is IMO a valid use case even without support for remote filesystem.

I'm open to closing this and doing it in my project internally, but I guess that reading Excel files is quite a common case, which would be nice to cover.

@norberttech
Copy link
Member

reading an Excel file, which is IMO a valid use case even without support for remote filesystem.

Totally, I meant more that if you would reach out before you could simply take over what I wrote and adjust it, saving yourself some typing.

Let me think about and look at your code later, I will came up with a proper definition of done for this PR and then you can decide if you are willing to modify your code or drop it and implement internally.

To your point, Excel Adapter is super valuable even without remote filesystem support and I would love to bring it to Flow.

Although we could potentially support remote reading/writing by either doing it all in memory or buffering somewhere on the disk first before moving to remote storage - but that's out of the scope for now and not super critical as most of the use cases of this adapter are going to be anyway operating on local files only.

Copy link
Member

@norberttech norberttech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for definition of done for Extractors here is what I usually have in mind for new adapters:

Features

Below are the minimum features that should be supported by this extractor:

  • limit number of extracting rows
  • partitioning
  • reading from xlsx/ods
  • throw exception when path is not local fs
  • make it possible to read data only from specific sheet (excel files are in most cases comes with multiple sheets and the interesting one is usually not the first one)
  • support offset (reading from specific row), when withHeader is set to true, extractor should still read the first row but not consider is as a part of dataset. So when there are 10 rows and withHeader is true, we have only 9 rows.

Dependencies

When adding new dependency make sure that it's added to the composer.json of the adapter and to the composer.json of the monorepo. Use monorepo builder tool (from tools) to validate it.

As for the dependency version, try to support as wide range as possible. Try to not limit it jus to the latest version as many projects might not be able to upgrade and we still want them to use Flow.

Obviously there needs to be some balance, so if too many things changed between versions then drop the support for older ones. But if it's about few methods being renamed or moved to another NS - we cover that. If you wont be sure how to tackle this, just leave a comment with what would need to happen to support version X and I will let you know if it's worth it or not (most likely not).

Good example is Doctrine Dbal adapter where version 3.6 and 4 are supported. But again, that depends on how strict the external library is with semantic versioning.

Tests

Make sure that tests are executed at CI/CD: .github/workflows/test-suite.yml

Options and Configuration

Each extractor should have at least one test for each of his options, so here with that reader type you should put a test will read a fixture file as xlsx and ods formats.
Those tests should not only check how many rows are extracted but also check the values.

Remote Filesystem

There should be a test for using remote filesystems (when extractor does not support remote FS it should simply test that proper exception is thrown)

Partitioning

All extractors must be able to extract from partitioned datasets.

The test should not only cover read that throws no exception but should actually confirm that the data read from source is exactly how we expect it. This means that you need to prepare an excel files with let say 10 rows (or less), partition them manually (just create folders for partitions and put those files) and then read them and compare the results.

Limit

There need to be a test proving that limiting behavior is working as expected and that when we set limit to 5 on 10 rows dataset we only get 5 rows.

Benchmarks

Like most of the other extractors we should put benchmarks to monitor potentnial performance degradations.
For that we need a dataset with 10k rows, for excel you might be able to create one by converting src/adapter/etl-adapter-csv/tests/Flow/ETL/Adapter/CSV/Tests/Benchmark/Fixtures/orders_flow.csv into excel file.

CLI

New extractors that operate on files should also be added to the CLI, there are few commands that are critical:

  • src/cli/src/Flow/CLI/Command/FileReadCommand.php
  • src/cli/src/Flow/CLI/Command/FileSchemaCommand.php
  • src/cli/src/Flow/CLI/Command/FileRowsCountCommand.php
  • src/cli/src/Flow/CLI/Command/FileConvertCommand.php
  • src/cli/src/Flow/CLI/Command/FileAnalyzeCommand.php

In this case you will need to also add trait ExcelOptions that would define options to pass into those commands.

Codebase

Try to use dsl functions across the codebase, especially for types, rows, row, schema, extractors, loaders, dataframe, transformations, scalar functions.

That applies also to tests (actually mostly to tests)

Documentation

As you pointed out many times, docs and examples sucks right now (I'm perfectly aware and constantly trying to find time and motivation to improve it).

Each new extractor should come with:

Examples

You should add an example under Data reading topic.

When extractor provides options they should be also documented in example description. (put a description.md file in the example folder)

On top of that it would be good to put some explanation in the dsl function docblock as those functions are all part of the documentation later.

Since this extractor can't read from remote filesystems it should also be mentioned in the example description and maybe also in dsl function docblocks.

Api References

Look at phpdoc folder and build:docs:api composer command.

Documentation

If this is a new adapter, make sure it's available on the website: web/landing/templates/documentation/navigation.html.twig

And that it comes with it's own readme file in documentation/components/adapters that is also listed in documentation/installation.md.
In that readme file there should be a simple example about how to use that extractor with Flow.

In order to generate commits stats new adapters/components needs to be also added to manifest.json as well as here: .github/workflows/monorepo-split.yml


I believe that should cover most of it, I already created a repo for this adapter http://github.com/flow-php/etl-adapter-excel

If I will think about anything else I will let you know as it's hard to write complete spec at one go but I really think that should be the most critical part.

As for Loader it should also be added but this can happen later

@norberttech norberttech modified the milestone: 0.17.0 May 11, 2025
@norberttech norberttech moved this from Todo to In Progress in Roadmap May 17, 2025
@stloyd stloyd force-pushed the excel-adapter branch 4 times, most recently from f040d58 to 537afc7 Compare May 17, 2025 12:53
@stloyd stloyd requested a review from norberttech May 17, 2025 12:56
Copy link
Member

@norberttech norberttech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wen through this PR from my phone, will do my best to review it later from a laptop but its a bit busy weekend so no promises.

In general it looks good, maybe I missed it but I haven't seen any tests for CLI commands that would cover excel input? (review from phone is not the most convenient so I might just miss them).

@stloyd stloyd requested a review from norberttech May 17, 2025 16:10
@stloyd stloyd force-pushed the excel-adapter branch 3 times, most recently from bfa9bb3 to e8b2c43 Compare May 18, 2025 07:54
@stloyd stloyd changed the title Add a new Excel Adapter with simple extractor [Excel] Add a new Excel Adapter with simple extractor May 18, 2025
@stloyd stloyd added this to the 0.17.0 milestone May 18, 2025
@norberttech
Copy link
Member

I think the code looks great, but there are two small things still missing:

@stloyd stloyd changed the title [Excel] Add a new Excel Adapter with simple extractor [Excel] Add a new Excel Adapter with an extractor May 19, 2025
@norberttech norberttech merged commit 8ff01f8 into flow-php:1.x May 19, 2025
25 of 26 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Roadmap May 19, 2025
@stloyd stloyd deleted the excel-adapter branch May 19, 2025 07:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants