[Excel] Add a new Excel Adapter with an extractor #1643

stloyd · 2025-05-10T10:29:26Z

Change Log

Added

Add a new Excel Adapter with simple extractor

Fixed

Changed

Removed

Deprecated

Security

Description

github-actions · 2025-05-10T10:33:26Z

Flow PHP - Benchmarks

^{_{Results of the benchmarks from this PR are compared with the results from 1.x branch.}}

Extractors

+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.727mb +0.87%  | 598.298ms -1.33% | ±0.59% +3.18%   |
| ExcelExtractorBench   | bench_extract_10k | 1    | 3   | ERR             | 1.497s +0.00%    | ±0.56% +0.00%   |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.971mb +0.82%  | 1.278s +0.65%    | ±0.67% +137.39% |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 86.273mb +0.05% | 916.389ms +1.29% | ±1.45% -16.54%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.451mb +0.92%  | 39.327ms +2.51%  | ±0.24% +81.21%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.442mb +0.92%  | 603.413ms -0.65% | ±0.28% -49.97%  |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+

Transformers

+---------------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| benchmark                       | subject                  | revs | its | mem_peak         | mode            | rstdev          |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+
| RenameEntryTransformerBench     | bench_transform_10k_rows | 1    | 3   | 123.187mb +0.03% | 64.774ms -0.74% | ±1.00% +287.99% |
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 18.449mb +0.22%  | 71.937ms -1.68% | ±0.75% -6.27%   |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+-----------------+

Loaders

+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode            | rstdev         |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 62.386mb +0.07%  | 86.491ms +1.28% | ±0.81% +43.94% |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 79.658mb +0.05%  | 95.552ms -1.49% | ±0.46% -47.30% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 165.338mb +0.03% | 20.798s +1.69%  | ±0.10% -56.68% |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.757mb +0.23%  | 30.613ms +0.27% | ±1.24% +46.28% |
+--------------------+----------------+------+-----+------------------+-----------------+----------------+

Building Blocks

+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark         | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 101.736mb +0.04% | 646.933ms +0.12% | ±0.07% -93.78%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 53.087mb +0.08%  | 321.174ms -0.46% | ±2.10% +263.11% |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.337mb +0.29%  | 69.136ms +0.76%  | ±2.33% +148.48% |
| RowsBench         | bench_chunk_10_on_10k      | 2    | 3   | 93.341mb +0.04%  | 3.458ms +9.86%   | ±1.30% +256.56% |
| RowsBench         | bench_diff_left_1k_on_10k  | 2    | 3   | 110.710mb +0.04% | 240.130ms -0.36% | ±0.64% +55.06%  |
| RowsBench         | bench_diff_right_1k_on_10k | 2    | 3   | 93.430mb +0.04%  | 24.176ms +2.67%  | ±0.08% -81.95%  |
| RowsBench         | bench_drop_1k_on_10k       | 2    | 3   | 94.216mb +0.04%  | 1.296ms -1.23%   | ±2.30% +134.64% |
| RowsBench         | bench_drop_right_1k_on_10k | 2    | 3   | 94.216mb +0.04%  | 1.310ms +3.68%   | ±1.37% -49.96%  |
| RowsBench         | bench_entries_on_10k       | 2    | 3   | 92.376mb +0.04%  | 3.414ms +3.89%   | ±3.01% +77.30%  |
| RowsBench         | bench_filter_on_10k        | 2    | 3   | 92.905mb +0.04%  | 15.470ms -0.30%  | ±1.31% -60.44%  |
| RowsBench         | bench_find_on_10k          | 2    | 3   | 92.905mb +0.04%  | 15.278ms -1.03%  | ±1.10% -26.68%  |
| RowsBench         | bench_find_one_on_10k      | 10   | 3   | 91.594mb +0.04%  | 1.794μs -0.67%   | ±2.67% +3.77%   |
| RowsBench         | bench_first_on_10k         | 10   | 3   | 91.594mb +0.04%  | 0.400μs 0.00%    | ±0.00% 0.00%    |
| RowsBench         | bench_flat_map_on_1k       | 2    | 3   | 100.655mb +0.04% | 14.024ms -0.51%  | ±1.10% +223.29% |
| RowsBench         | bench_map_on_10k           | 2    | 3   | 130.082mb +0.03% | 66.642ms +0.77%  | ±1.20% -32.07%  |
| RowsBench         | bench_merge_1k_on_10k      | 2    | 3   | 93.425mb +0.04%  | 1.159ms +1.17%   | ±2.42% +33.92%  |
| RowsBench         | bench_partition_by_on_10k  | 2    | 3   | 96.793mb +0.04%  | 62.339ms +2.34%  | ±1.09% +454.51% |
| RowsBench         | bench_remove_on_10k        | 2    | 3   | 94.478mb +0.04%  | 3.507ms +4.88%   | ±3.60% +199.47% |
| RowsBench         | bench_sort_asc_on_1k       | 2    | 3   | 91.955mb +0.04%  | 40.353ms +2.28%  | ±1.05% -13.62%  |
| RowsBench         | bench_sort_by_on_1k        | 2    | 3   | 91.956mb +0.04%  | 40.447ms +1.61%  | ±2.89% +403.76% |
| RowsBench         | bench_sort_desc_on_1k      | 2    | 3   | 91.955mb +0.04%  | 39.940ms -1.08%  | ±1.03% +34.39%  |
| RowsBench         | bench_sort_entries_on_1k   | 2    | 3   | 94.037mb +0.04%  | 8.152ms +1.06%   | ±0.20% -60.52%  |
| RowsBench         | bench_sort_on_1k           | 2    | 3   | 91.787mb +0.04%  | 29.473ms +0.16%  | ±1.69% +23.31%  |
| RowsBench         | bench_take_1k_on_10k       | 10   | 3   | 91.594mb +0.04%  | 13.972μs -4.07%  | ±3.36% +15.25%  |
| RowsBench         | bench_take_right_1k_on_10k | 10   | 3   | 91.594mb +0.04%  | 15.957μs +0.65%  | ±2.08% -26.16%  |
| RowsBench         | bench_unique_on_1k         | 2    | 3   | 110.711mb +0.04% | 243.629ms +1.02% | ±0.54% +100.79% |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 42.022mb +0.10%  | 427.126ms +1.65% | ±0.55% +18.70%  |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 11.400mb +0.36%  | 85.881ms +1.05%  | ±0.79% +53.52%  |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+

codecov · 2025-05-10T10:36:51Z

Codecov Report

Attention: Patch coverage is 92.05298% with 12 lines in your changes missing coverage. Please review.

Project coverage is 82.32%. Comparing base (850f2c5) to head (11f831a).
Report is 2 commits behind head on 1.x.

✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##              1.x    #1643      +/-   ##
==========================================
+ Coverage   82.26%   82.32%   +0.06%     
==========================================
  Files         694      701       +7     
  Lines       18852    18993     +141     
==========================================
+ Hits        15508    15636     +128     
- Misses       3344     3357      +13

Components	Coverage Δ
etl	`88.04% <ø> (ø)`
cli	`84.42% <87.50%> (-0.17%)`	⬇️
lib-array-dot	`94.53% <ø> (ø)`
lib-azure-sdk	`62.56% <ø> (ø)`
lib-doctrine-dbal-bulk	`90.11% <ø> (ø)`
lib-filesystem	`78.02% <ø> (ø)`
lib-parquet	`84.37% <ø> (ø)`
lib-parquet-viewer	`82.02% <ø> (ø)`
lib-snappy	`91.16% <ø> (ø)`
bridge-filesystem-async-aws	`90.38% <ø> (ø)`
bridge-filesystem-azure	`89.92% <ø> (ø)`
bridge-monolog-http	`96.38% <ø> (ø)`
symfony-http-foundation	`74.41% <ø> (ø)`
adapter-chartjs	`86.45% <ø> (ø)`
adapter-csv	`89.57% <ø> (ø)`
adapter-doctrine	`89.69% <ø> (ø)`
adapter-elasticsearch	`97.19% <ø> (ø)`
adapter-google-sheet	`80.00% <ø> (ø)`
adapter-http	`59.15% <ø> (ø)`
adapter-json	`90.62% <ø> (ø)`
adapter-logger	`53.84% <ø> (ø)`
adapter-meilisearch	`97.75% <ø> (ø)`
adapter-parquet	`78.42% <ø> (ø)`
adapter-text	`84.44% <ø> (ø)`
adapter-xml	`83.15% <ø> (ø)`

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

norberttech · 2025-05-10T10:37:38Z

Shit, maybe reach out next time before you work on something, I hate to say it but I already took one attempt to build an Excel adapter (still have the codebase locally) and there is a roadmap item for it: #1500

There is one massive problem with Excel, it does not support remote filesystems :|

The problem comes not really from openspout library but from ZipArchive that is used there under the hood.
PHP implementation of ZipArchive makes it impossible to extract from archive in chunks. You need to pass entire zip archive at once, what means we can't read/write through byte range.

I did some initial research and it is possible to create an implementation of ZipArchive that supports reading/writing from/to remote locations and I believe that openspout authors would have nothing against introducing reading/writing from/to remote locations:

openspout/openspout#120 (comment)

norberttech · 2025-05-10T10:39:31Z

You can check my progress here in case you would like to investigate this one more: 87ebf91

stloyd · 2025-05-10T10:47:52Z

Remote support filesystem is nice, but I don't have use cases for remote filesystem, but a few with normal files.

This solves a particular problem, similar to yours (while mine does full reading), and solves a particular issue: reading an Excel file, which is IMO a valid use case even without support for remote filesystem.

I'm open to closing this and doing it in my project internally, but I guess that reading Excel files is quite a common case, which would be nice to cover.

norberttech · 2025-05-10T10:58:59Z

reading an Excel file, which is IMO a valid use case even without support for remote filesystem.

Totally, I meant more that if you would reach out before you could simply take over what I wrote and adjust it, saving yourself some typing.

Let me think about and look at your code later, I will came up with a proper definition of done for this PR and then you can decide if you are willing to modify your code or drop it and implement internally.

To your point, Excel Adapter is super valuable even without remote filesystem support and I would love to bring it to Flow.

Although we could potentially support remote reading/writing by either doing it all in memory or buffering somewhere on the disk first before moving to remote storage - but that's out of the scope for now and not super critical as most of the use cases of this adapter are going to be anyway operating on local files only.

src/adapter/etl-adapter-excel/src/Flow/ETL/Adapter/Excel/ExcelExtractor.php

norberttech

As for definition of done for Extractors here is what I usually have in mind for new adapters:

Features

Below are the minimum features that should be supported by this extractor:

limit number of extracting rows
partitioning
reading from xlsx/ods
throw exception when path is not local fs
make it possible to read data only from specific sheet (excel files are in most cases comes with multiple sheets and the interesting one is usually not the first one)
support offset (reading from specific row), when withHeader is set to true, extractor should still read the first row but not consider is as a part of dataset. So when there are 10 rows and withHeader is true, we have only 9 rows.

Dependencies

When adding new dependency make sure that it's added to the composer.json of the adapter and to the composer.json of the monorepo. Use monorepo builder tool (from tools) to validate it.

As for the dependency version, try to support as wide range as possible. Try to not limit it jus to the latest version as many projects might not be able to upgrade and we still want them to use Flow.

Obviously there needs to be some balance, so if too many things changed between versions then drop the support for older ones. But if it's about few methods being renamed or moved to another NS - we cover that. If you wont be sure how to tackle this, just leave a comment with what would need to happen to support version X and I will let you know if it's worth it or not (most likely not).

Good example is Doctrine Dbal adapter where version 3.6 and 4 are supported. But again, that depends on how strict the external library is with semantic versioning.

Tests

Make sure that tests are executed at CI/CD: .github/workflows/test-suite.yml

Options and Configuration

Each extractor should have at least one test for each of his options, so here with that reader type you should put a test will read a fixture file as xlsx and ods formats.
Those tests should not only check how many rows are extracted but also check the values.

Remote Filesystem

There should be a test for using remote filesystems (when extractor does not support remote FS it should simply test that proper exception is thrown)

Partitioning

All extractors must be able to extract from partitioned datasets.

The test should not only cover read that throws no exception but should actually confirm that the data read from source is exactly how we expect it. This means that you need to prepare an excel files with let say 10 rows (or less), partition them manually (just create folders for partitions and put those files) and then read them and compare the results.

Limit

There need to be a test proving that limiting behavior is working as expected and that when we set limit to 5 on 10 rows dataset we only get 5 rows.

Benchmarks

Like most of the other extractors we should put benchmarks to monitor potentnial performance degradations.
For that we need a dataset with 10k rows, for excel you might be able to create one by converting src/adapter/etl-adapter-csv/tests/Flow/ETL/Adapter/CSV/Tests/Benchmark/Fixtures/orders_flow.csv into excel file.

CLI

New extractors that operate on files should also be added to the CLI, there are few commands that are critical:

src/cli/src/Flow/CLI/Command/FileReadCommand.php
src/cli/src/Flow/CLI/Command/FileSchemaCommand.php
src/cli/src/Flow/CLI/Command/FileRowsCountCommand.php
src/cli/src/Flow/CLI/Command/FileConvertCommand.php
src/cli/src/Flow/CLI/Command/FileAnalyzeCommand.php

In this case you will need to also add trait ExcelOptions that would define options to pass into those commands.

Codebase

Try to use dsl functions across the codebase, especially for types, rows, row, schema, extractors, loaders, dataframe, transformations, scalar functions.

That applies also to tests (actually mostly to tests)

Documentation

As you pointed out many times, docs and examples sucks right now (I'm perfectly aware and constantly trying to find time and motivation to improve it).

Each new extractor should come with:

Examples

You should add an example under Data reading topic.

When extractor provides options they should be also documented in example description. (put a description.md file in the example folder)

On top of that it would be good to put some explanation in the dsl function docblock as those functions are all part of the documentation later.

Since this extractor can't read from remote filesystems it should also be mentioned in the example description and maybe also in dsl function docblocks.

Api References

Look at phpdoc folder and build:docs:api composer command.

Documentation

If this is a new adapter, make sure it's available on the website: web/landing/templates/documentation/navigation.html.twig

And that it comes with it's own readme file in documentation/components/adapters that is also listed in documentation/installation.md.
In that readme file there should be a simple example about how to use that extractor with Flow.

In order to generate commits stats new adapters/components needs to be also added to manifest.json as well as here: .github/workflows/monorepo-split.yml

I believe that should cover most of it, I already created a repo for this adapter http://github.com/flow-php/etl-adapter-excel

If I will think about anything else I will let you know as it's hard to write complete spec at one go but I really think that should be the most critical part.

As for Loader it should also be added but this can happen later

norberttech

I just wen through this PR from my phone, will do my best to review it later from a laptop but its a bit busy weekend so no promises.

In general it looks good, maybe I missed it but I haven't seen any tests for CLI commands that would cover excel input? (review from phone is not the most convenient so I might just miss them).

src/adapter/etl-adapter-excel/src/Flow/ETL/Adapter/Excel/DSL/functions.php

src/adapter/etl-adapter-excel/src/Flow/ETL/Adapter/Excel/ExcelExtractor.php

src/adapter/etl-adapter-excel/tests/Flow/ETL/Adapter/Excel/Tests/Unit/ExcelExtractorTest.php

src/cli/src/Flow/CLI/Command/Traits/CSVOptions.php

documentation/components/adapters/excel.md

norberttech · 2025-05-18T16:08:26Z

I think the code looks great, but there are two small things still missing:

test suite workflow needs to be modified to execute excel test suite
monorepo-split needs to modified to split excel into https://github.com/flow-php/etl-adapter-excel

stloyd requested a review from norberttech as a code owner May 10, 2025 10:29

github-actions bot added the size: L label May 10, 2025

stloyd force-pushed the excel-adapter branch from df31287 to 44d3058 Compare May 10, 2025 10:39

github-actions bot added size: XL and removed size: L labels May 10, 2025

norberttech reviewed May 10, 2025

View reviewed changes

src/adapter/etl-adapter-excel/src/Flow/ETL/Adapter/Excel/ExcelExtractor.php Outdated Show resolved Hide resolved

norberttech requested changes May 10, 2025

View reviewed changes

norberttech modified the milestone: 0.17.0 May 11, 2025

norberttech added this to Roadmap May 17, 2025

github-project-automation bot moved this to Todo in Roadmap May 17, 2025

norberttech moved this from Todo to In Progress in Roadmap May 17, 2025

stloyd force-pushed the excel-adapter branch 4 times, most recently from f040d58 to 537afc7 Compare May 17, 2025 12:53

stloyd requested a review from norberttech May 17, 2025 12:56

stloyd force-pushed the excel-adapter branch from 537afc7 to 74562db Compare May 17, 2025 13:01

norberttech reviewed May 17, 2025

View reviewed changes

stloyd force-pushed the excel-adapter branch from 74562db to 1744b2e Compare May 17, 2025 16:04

stloyd requested a review from norberttech May 17, 2025 16:10

stloyd force-pushed the excel-adapter branch 3 times, most recently from bfa9bb3 to e8b2c43 Compare May 18, 2025 07:54

stloyd force-pushed the excel-adapter branch from e8b2c43 to 762d288 Compare May 18, 2025 08:33

stloyd changed the title ~~Add a new Excel Adapter with simple extractor~~ [Excel] Add a new Excel Adapter with simple extractor May 18, 2025

stloyd added this to the 0.17.0 milestone May 18, 2025

norberttech approved these changes May 18, 2025

View reviewed changes

documentation/components/adapters/excel.md Outdated Show resolved Hide resolved

Add a new Excel Adapter with an extractor

11f831a

stloyd force-pushed the excel-adapter branch from c479d29 to 11f831a Compare May 19, 2025 07:13

stloyd changed the title ~~[Excel] Add a new Excel Adapter with simple extractor~~ [Excel] Add a new Excel Adapter with an extractor May 19, 2025

norberttech merged commit 8ff01f8 into flow-php:1.x May 19, 2025
25 of 26 checks passed

github-project-automation bot moved this from In Progress to Done in Roadmap May 19, 2025

stloyd deleted the excel-adapter branch May 19, 2025 07:33

Uh oh!

[Excel] Add a new Excel Adapter with an extractor #1643

[Excel] Add a new Excel Adapter with an extractor #1643

Uh oh!

Conversation

stloyd commented May 10, 2025

Change Log

Added

Fixed

Changed

Removed

Deprecated

Security

Description

Uh oh!

github-actions bot commented May 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Flow PHP - Benchmarks

Uh oh!

codecov bot commented May 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

norberttech commented May 10, 2025

Uh oh!

norberttech commented May 10, 2025

Uh oh!

stloyd commented May 10, 2025

Uh oh!

norberttech commented May 10, 2025

Uh oh!

Uh oh!

norberttech left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Features

Dependencies

Tests

Options and Configuration

Remote Filesystem

Partitioning

Limit

Benchmarks

CLI

Codebase

Documentation

Examples

Api References

Documentation

Uh oh!

norberttech left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

norberttech commented May 18, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented May 10, 2025 •

edited

Loading

codecov bot commented May 10, 2025 •

edited

Loading

norberttech left a comment •

edited

Loading