@norberttech
Member

Resolves: #1738

Change Log


Added

Fixed

  • reading multiline strings in CSV files

Changed

Removed

Deprecated

Security

@github-actions
Contributor

github-actions bot commented Jun 27, 2025

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+------------------------+------+-----+-----------------+------------------+------------------+
| benchmark             | subject                | revs | its | mem_peak        | mode             | rstdev           |
+-----------------------+------------------------+------+-----+-----------------+------------------+------------------+
| CSVExtractorBench     | bench_extract_10k      | 1    | 3   | 4.828mb +0.36%  | 443.738ms +1.96% | ±0.75% +2937.01% |
| ExcelExtractorBench   | bench_extract_10k_ods  | 1    | 3   | 65.540mb +0.00% | 1.050s -0.88%    | ±0.77% -20.53%   |
| ExcelExtractorBench   | bench_extract_10k_xlsx | 1    | 3   | 67.585mb +0.00% | 1.683s -2.90%    | ±0.91% -60.70%   |
| JsonExtractorBench    | bench_extract_10k      | 1    | 3   | 5.437mb +0.01%  | 1.147s -1.30%    | ±1.24% +185.57%  |
| ParquetExtractorBench | bench_extract_10k      | 1    | 3   | 86.398mb +0.00% | 884.377ms -0.85% | ±1.66% +139.99%  |
| TextExtractorBench    | bench_extract_10k      | 1    | 3   | 4.568mb +0.03%  | 43.865ms +3.88%  | ±2.87% +342.07%  |
| XmlExtractorBench     | bench_extract_10k      | 1    | 3   | 4.553mb +0.03%  | 602.740ms -0.30% | ±2.04% +233.84%  |
+-----------------------+------------------------+------+-----+-----------------+------------------+------------------+
Transformers
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                       | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEachEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 18.563mb +0.01%  | 71.242ms -0.57% | ±0.40% -35.26% |
| RenameEntryTransformerBench     | bench_transform_10k_rows | 1    | 3   | 123.302mb +0.00% | 65.033ms -1.62% | ±0.92% +66.42% |
+---------------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+------------------+
| benchmark          | subject        | revs | its | mem_peak         | mode            | rstdev           |
+--------------------+----------------+------+-----+------------------+-----------------+------------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 62.490mb +0.03%  | 87.186ms +1.21% | ±2.43% +60.97%   |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 80.586mb +0.00%  | 99.726ms -3.14% | ±1.86% +1154.91% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 166.298mb +0.00% | 1.996s -0.70%   | ±1.07% +505.09%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.869mb +0.01%  | 30.192ms +0.18% | ±0.93% +31.81%   |
+--------------------+----------------+------+-----+------------------+-----------------+------------------+
Building Blocks
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark         | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 42.513mb +0.00%  | 405.047ms -0.04% | ±0.77% -16.17%  |
| TypeDetectorBench | bench_type_detector        | 1    | 3   | 11.571mb +0.01%  | 82.244ms +0.18%  | ±0.97% +117.46% |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 105.983mb +0.00% | 652.668ms +1.04% | ±0.81% -47.57%  |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 55.258mb +0.00%  | 323.040ms -1.29% | ±1.06% +264.09% |
| EntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.844mb +0.01%  | 69.080ms -0.59%  | ±0.43% -26.84%  |
| RowsBench         | bench_chunk_10_on_10k      | 2    | 3   | 93.454mb +0.00%  | 3.183ms -1.48%   | ±2.70% -10.94%  |
| RowsBench         | bench_diff_left_1k_on_10k  | 2    | 3   | 110.824mb +0.00% | 237.151ms -0.15% | ±1.10% +327.49% |
| RowsBench         | bench_diff_right_1k_on_10k | 2    | 3   | 93.544mb +0.00%  | 23.920ms -0.08%  | ±0.56% -33.43%  |
| RowsBench         | bench_drop_1k_on_10k       | 2    | 3   | 94.328mb +0.00%  | 1.232ms -2.84%   | ±1.79% +9.87%   |
| RowsBench         | bench_drop_right_1k_on_10k | 2    | 3   | 94.328mb +0.00%  | 1.240ms -2.40%   | ±2.64% +202.80% |
| RowsBench         | bench_entries_on_10k       | 2    | 3   | 92.489mb +0.00%  | 3.117ms -2.49%   | ±2.49% +167.83% |
| RowsBench         | bench_filter_on_10k        | 2    | 3   | 93.018mb +0.00%  | 15.181ms -3.85%  | ±1.56% +70.68%  |
| RowsBench         | bench_find_on_10k          | 2    | 3   | 93.018mb +0.00%  | 15.225ms -3.04%  | ±2.34% +2.10%   |
| RowsBench         | bench_find_one_on_10k      | 10   | 3   | 91.707mb +0.00%  | 1.806μs +0.68%   | ±2.57% -3.64%   |
| RowsBench         | bench_first_on_10k         | 10   | 3   | 91.707mb +0.00%  | 0.400μs 0.00%    | ±0.00% 0.00%    |
| RowsBench         | bench_flat_map_on_1k       | 2    | 3   | 100.767mb +0.00% | 14.234ms -1.48%  | ±0.47% -49.37%  |
| RowsBench         | bench_map_on_10k           | 2    | 3   | 130.194mb +0.00% | 65.951ms +1.07%  | ±0.67% +135.99% |
| RowsBench         | bench_merge_1k_on_10k      | 2    | 3   | 93.538mb +0.00%  | 1.472ms +34.57%  | ±0.41% -7.61%   |
| RowsBench         | bench_partition_by_on_10k  | 2    | 3   | 96.907mb +0.00%  | 62.687ms +1.68%  | ±1.58% -1.14%   |
| RowsBench         | bench_remove_on_10k        | 2    | 3   | 94.591mb +0.00%  | 3.759ms +10.91%  | ±3.64% +300.34% |
| RowsBench         | bench_sort_asc_on_1k       | 2    | 3   | 92.069mb +0.00%  | 40.564ms +1.99%  | ±0.29% -72.32%  |
| RowsBench         | bench_sort_by_on_1k        | 2    | 3   | 92.069mb +0.00%  | 39.673ms +0.05%  | ±0.25% -83.01%  |
| RowsBench         | bench_sort_desc_on_1k      | 2    | 3   | 92.069mb +0.00%  | 40.129ms +1.57%  | ±1.14% -38.89%  |
| RowsBench         | bench_sort_entries_on_1k   | 2    | 3   | 94.150mb +0.00%  | 8.035ms -0.38%   | ±1.33% +267.78% |
| RowsBench         | bench_sort_on_1k           | 2    | 3   | 91.900mb +0.00%  | 29.349ms -0.44%  | ±1.40% +37.64%  |
| RowsBench         | bench_take_1k_on_10k       | 10   | 3   | 91.707mb +0.00%  | 13.836μs -21.89% | ±2.02% -43.25%  |
| RowsBench         | bench_take_right_1k_on_10k | 10   | 3   | 91.707mb +0.00%  | 15.430μs -11.91% | ±1.51% +21.96%  |
| RowsBench         | bench_unique_on_1k         | 2    | 3   | 110.825mb +0.00% | 237.726ms -2.77% | ±0.35% -44.19%  |
+-------------------+----------------------------+------+-----+------------------+------------------+-----------------+

@codecov

codecov bot commented Jun 27, 2025

Codecov Report

Attention: Patch coverage is 96.96970% with 2 lines in your changes missing coverage. Please review.

Project coverage is 81.31%. Comparing base (1f550f0) to head (9de7d92).
Report is 3 commits behind head on 1.x.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##              1.x    #1740      +/-   ##
==========================================
+ Coverage   81.26%   81.31%   +0.04%     
==========================================
  Files         715      717       +2     
  Lines       19901    19949      +48     
==========================================
+ Hits        16173    16221      +48     
  Misses       3728     3728              
Components Coverage Δ
etl 88.40% <ø> (ø)
cli 85.46% <ø> (ø)
lib-array-dot 94.56% <ø> (ø)
lib-azure-sdk 61.35% <ø> (ø)
lib-doctrine-dbal-bulk 93.88% <ø> (ø)
lib-filesystem 78.02% <ø> (ø)
lib-types 53.43% <ø> (ø)
lib-parquet 84.17% <ø> (ø)
lib-parquet-viewer 83.11% <ø> (ø)
lib-snappy 91.62% <ø> (+0.93%) ⬆️
bridge-filesystem-async-aws 90.38% <ø> (ø)
bridge-filesystem-azure 89.92% <ø> (ø)
bridge-monolog-http 97.04% <ø> (ø)
symfony-http-foundation 74.41% <ø> (ø)
adapter-chartjs 86.70% <ø> (ø)
adapter-csv 88.85% <96.96%> (+1.10%) ⬆️
adapter-doctrine 89.89% <ø> (ø)
adapter-elasticsearch 97.23% <ø> (ø)
adapter-google-sheet 83.87% <ø> (ø)
adapter-http 58.10% <ø> (ø)
adapter-json 87.98% <ø> (ø)
adapter-logger 53.84% <ø> (ø)
adapter-meilisearch 97.95% <ø> (ø)
adapter-parquet 78.64% <ø> (ø)
adapter-text 84.44% <ø> (ø)
adapter-xml 82.73% <ø> (ø)

$buffer .= $rawLine;

if (!\str_contains($buffer, $this->enclosure)) {
    yield $this->removeBOM && $lineNumber === 0 ? $this->removeBOMFromLine(\rtrim($buffer, "\r\n")) : rtrim($buffer, "\r\n");
Member

Shouldn't the BOM check be only on line zero? There is no point in repeating the check on every line, no?

Member Author

That's how it works now: it first checks whether removeBOM is enabled, and only if it is does it check whether the line number is 0 (the first line).

Comment on lines +29 to +48
foreach ($stream->readLines(length: $this->charactersReadInLine) as $rawLine) {
    $buffer .= $rawLine;

    if (!\str_contains($buffer, $this->enclosure)) {
        yield $this->removeBOM && $lineNumber === 0 ? $this->removeBOMFromLine(\rtrim($buffer, "\r\n")) : rtrim($buffer, "\r\n");
        $lineNumber++;
        $buffer = '';
    } else {
        if ($this->isCompleteCSVRecord($buffer)) {
            yield $this->removeBOM && $lineNumber === 0 ? $this->removeBOMFromLine(\rtrim($buffer, "\r\n")) : \rtrim($buffer, "\r\n");
            $lineNumber++;
            $buffer = '';
        } else {
            $buffer .= "\n";
        }
    }
}

if ($buffer !== '') {
    yield $this->removeBOM && $lineNumber === 0 ? $this->removeBOMFromLine(\rtrim($buffer, "\r\n")) : \rtrim($buffer, "\r\n");
Member

Something like this, probably:

Suggested change
- foreach ($stream->readLines(length: $this->charactersReadInLine) as $rawLine) {
-     $buffer .= $rawLine;
-     if (!\str_contains($buffer, $this->enclosure)) {
-         yield $this->removeBOM && $lineNumber === 0 ? $this->removeBOMFromLine(\rtrim($buffer, "\r\n")) : rtrim($buffer, "\r\n");
-         $lineNumber++;
-         $buffer = '';
-     } else {
-         if ($this->isCompleteCSVRecord($buffer)) {
-             yield $this->removeBOM && $lineNumber === 0 ? $this->removeBOMFromLine(\rtrim($buffer, "\r\n")) : \rtrim($buffer, "\r\n");
-             $lineNumber++;
-             $buffer = '';
-         } else {
-             $buffer .= "\n";
-         }
-     }
- }
- if ($buffer !== '') {
-     yield $this->removeBOM && $lineNumber === 0 ? $this->removeBOMFromLine(\rtrim($buffer, "\r\n")) : \rtrim($buffer, "\r\n");
+ foreach ($stream->readLines(length: $this->charactersReadInLine) as $rawLine) {
+     if ($this->removeBOM && $lineNumber === 0) {
+         $rawLine = $this->removeBOMFromLine($rawLine);
+     }
+     $buffer .= $rawLine;
+     if (!\str_contains($buffer, $this->enclosure)) {
+         yield rtrim($buffer, "\r\n");
+         $lineNumber++;
+         $buffer = '';
+     } else {
+         if ($this->isCompleteCSVRecord($buffer)) {
+             yield \rtrim($buffer, "\r\n");
+             $lineNumber++;
+             $buffer = '';
+         } else {
+             $buffer .= "\n";
+         }
+     }
+ }
+ if ($buffer !== '') {
+     yield \rtrim($buffer, "\r\n");

Member Author

Yeah, but the line number is not incremented until a full record has been read. So if the first record contains a multiline string, BOM removal in the suggested version will run once for every physical line that the multiline string spans.
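
For anyone following along, here is a minimal, self-contained sketch of that behaviour. It is not the Flow PHP implementation: `read_records()`, `is_complete_record()`, and the `$bomCalls` counter are hypothetical names made up for illustration. The quote-parity check is only an assumed stand-in for `isCompleteCSVRecord()`, whose body is not shown in this thread, and it also covers the `str_contains` fast path, since zero quotes is an even count. Raw lines are assumed to arrive without trailing newlines.

```php
<?php

declare(strict_types=1);

const BOM = "\xEF\xBB\xBF"; // UTF-8 byte order mark

/**
 * Hypothetical completeness check: a record is treated as complete when it
 * contains an even number of enclosure characters. The real
 * isCompleteCSVRecord() may work differently.
 */
function is_complete_record(string $buffer, string $enclosure = '"'): bool
{
    return \substr_count($buffer, $enclosure) % 2 === 0;
}

/**
 * Simplified version of the suggested loop: BOM stripping is attempted on
 * every raw line for as long as $lineNumber is still 0.
 *
 * @param list<string> $rawLines physical lines, without trailing newlines
 * @param int $bomCalls incremented each time BOM stripping is attempted
 *
 * @return list<string> logical CSV records
 */
function read_records(array $rawLines, int &$bomCalls): array
{
    $records = [];
    $buffer = '';
    $lineNumber = 0;

    foreach ($rawLines as $rawLine) {
        if ($lineNumber === 0) {
            $bomCalls++;

            if (\str_starts_with($rawLine, BOM)) {
                $rawLine = \substr($rawLine, \strlen(BOM));
            }
        }

        $buffer .= $rawLine;

        if (is_complete_record($buffer)) {
            $records[] = $buffer;
            $lineNumber++; // only advances once a whole record was read
            $buffer = '';
        } else {
            $buffer .= "\n"; // restore the newline inside the quoted field
        }
    }

    if ($buffer !== '') {
        $records[] = $buffer; // unterminated trailing record
    }

    return $records;
}

// The first record spans three physical lines, the second record spans one.
$bomCalls = 0;
$records = read_records([BOM . 'id,"first', 'second', 'third"', '2,plain'], $bomCalls);
```

In this sketch `$bomCalls` ends at 3 while only two records are produced, because `$lineNumber` stays at 0 until the quoted field is closed.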

@norberttech norberttech merged commit 7f0cd6d into 1.x Jun 27, 2025
21 checks passed
@norberttech norberttech deleted the 1738-bug-reading-csv-with-multiline-strings branch June 27, 2025 15:10
Development

Successfully merging this pull request may close these issues.

[Bug]: Reading CSV with multiline strings

3 participants