gh-137627: Make `csv.Sniffer.sniff()` delimiter detection 1.6x faster by maurycy · Pull Request #137628 · python/cpython

maurycy · 2025-08-11T02:57:13Z

The basic idea is to stop iterating over every ASCII character and counting its frequency on each line in _guess_delimiter. Instead, count only the characters actually present on each line, and backfill zeros for the lines where a character is absent.
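A minimal sketch of the idea follows. It is not the code from the patch: the function names and the table layout (character -> {per-line count -> number of lines}) are assumptions made for illustration. It compares counting every ASCII character on every line with counting only the characters that are present and adding the zero bucket afterwards.

```python
from collections import Counter, defaultdict

ASCII = [chr(c) for c in range(128)]


def freq_table_all_ascii(lines):
    """Count every ASCII character on every line (the costly approach)."""
    table = {}
    for line in lines:
        for ch in ASCII:
            freq = line.count(ch)
            bucket = table.setdefault(ch, {})
            bucket[freq] = bucket.get(freq, 0) + 1
    return table


def freq_table_present_only(lines):
    """Count only the characters present, then backfill the zero bucket."""
    table = defaultdict(Counter)
    for line in lines:
        for ch, freq in Counter(line).items():
            table[ch][freq] += 1
    for buckets in table.values():
        # Lines on which the character never appeared were not counted yet.
        missing = len(lines) - sum(buckets.values())
        if missing:
            buckets[0] = missing
    return {ch: dict(buckets) for ch, buckets in table.items()}


lines = ["a,b,c", "1,2,3", "x;y,z"]
old = freq_table_all_ascii(lines)
new = freq_table_present_only(lines)

# Characters that never occur only get a zero bucket in the old table,
# so drop them before comparing the two results.
old_present = {ch: b for ch, b in old.items() if set(b) != {0}}
print(old_present == new)
```

The second function does work proportional to the characters actually seen rather than 128 calls to str.count() per line, which is where a speedup of this kind would come from.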

Benchmark

There is no csv.Sniffer benchmark in pyperformance, so I created a simple benchmark instead:

import csv
import pathlib

import pyperf


def sniff(s):
    csv.Sniffer()._guess_delimiter(s, None)


r = pyperf.Runner()
sizes = [1024, 2048, 4096]
for file in pathlib.Path("/home/maurycy/CSVsniffer/CSV/").glob("*.csv"):
    for s in sizes:
        with file.open() as f:
            try:
                # Benchmark only the first s characters of each file.
                r.bench_func(f"csv_sniff({file.name}, {s})", sniff, f.read(s))
            except UnicodeDecodeError:
                # Skip files that cannot be decoded with the default encoding.
                pass

The benchmark uses all 149 files from CSVSniffer (MIT License) and reads only a sample of each file, as recommended by the example on docs.python.org. That is what real users do, too. Unfortunately, it takes a few hours to run.
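For context, the usage pattern referred to here looks roughly like the sketch below. The in-memory file, its contents, and the SAMPLE_SIZE name are hypothetical; a real caller would open a file on disk instead.

```python
import csv
import io

# Hypothetical in-memory "file"; a real caller would use open(path, newline="").
rows = ["id;double;triple"] + [f"{i};{i * 2};{i * 3}" for i in range(2000)]
csvfile = io.StringIO("\n".join(rows) + "\n")

SAMPLE_SIZE = 1024  # one of the sample sizes used in the benchmark

# Sniff only the beginning of the file instead of the whole content.
sample = csvfile.read(SAMPLE_SIZE)
dialect = csv.Sniffer().sniff(sample)

# Rewind and parse the full content with the detected dialect.
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)
header = next(reader)
print(dialect.delimiter, header)
```

Because only the sample is passed to sniff(), the cost of delimiter detection depends on the sample size rather than on the file size, which is why the benchmark varies the sample length.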

The result:

Geometric mean: 1.60x faster
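As a reminder of what this figure means: the geometric mean combines the per-benchmark speedup ratios multiplicatively, so a 2x speedup on one input and a 2x slowdown on another cancel out. The ratios in the sketch below are hypothetical and are not taken from the results.

```python
from statistics import geometric_mean

# Hypothetical per-benchmark speedups (old time / new time); not measured data.
speedups = [2.0, 1.25, 1.6, 1.0]

overall = geometric_mean(speedups)
print(f"Geometric mean: {overall:.2f}x faster")

# A speedup and an equal slowdown cancel out under the geometric mean.
balanced = geometric_mean([2.0, 0.5])
print(f"Balanced case: {balanced:.2f}x")
```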

The full results (compare_to --table --table-format=md, and JSON files):

https://github.com/maurycy/cpython-test/tree/main/csv-sniffer-counter-set

Correctness

I created a simple script to confirm that the output of Sniffer()._guess_delimiter() did not change. It uses ~1100 CSV files from the CSVSniffer project (MIT License).

The script:

from pathlib import Path

import csv_96b7a2eba423b42320f15fd4974740e3e930bb8b as csv_original
import csv as csv_modified

# https://github.com/ws-garcia/CSVsniffer
CSV_DIR_PATH = Path("/home/maurycy/CSVsniffer/")
SAMPLE = 1024

csv_files = list(CSV_DIR_PATH.rglob("*.[cC][sS][vV]"))


def differ(path, original, modified):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        data = f.read(SAMPLE)
        return original(data) != modified(data)


original_guess_delimiter = (
    lambda data: csv_original.Sniffer()._guess_delimiter(data, delimiters=None)
)
modified_guess_delimiter = (
    lambda data: csv_modified.Sniffer()._guess_delimiter(data, delimiters=None)
)

differences, total = (
    sum(
        differ(path, original_guess_delimiter, modified_guess_delimiter)
        for path in csv_files
    ),
    len(csv_files),
)
print(f"{total} CSV files in '{CSV_DIR_PATH}', {differences} differences.")

where Lib/csv_96b7a2eba423b42320f15fd4974740e3e930bb8b.py is https://github.com/python/cpython/blob/96b7a2eba423b42320f15fd4974740e3e930bb8b/Lib/csv.py from the main branch, and:

% git rev-parse HEAD
8614756ea48060aa8c4d4c1508b0221e1b50d263
% git status
## csv-sniffer-counter-set...origin/csv-sniffer-counter-set
?? Lib/csv_96b7a2eba423b42320f15fd4974740e3e930bb8b.py
?? csv-sniffer-diff.py

The result:

% ./python csv-sniffer-diff.py
1118 CSV files in '/home/maurycy/CSVsniffer/', 0 differences.

Environment

% ./python -c "import sysconfig; print(sysconfig.get_config_var('CONFIG_ARGS'))"
'--with-lto' '--enable-optimizations'

I made sure to run sudo ./python -m pyperf system tune before benchmarking.

Notes

The optimization is in csv.Sniffer()._guess_delimiter(), which runs only if the regular expressions in csv.Sniffer()._guess_quote_and_delimiter() fail, so there is no guarantee that csv.Sniffer().sniff() will always be faster.
My original patch relied on confusing set operations.
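The first note can be observed with a small sketch. TracingSniffer and sniff_and_trace are hypothetical helpers written for this illustration; they override the private _guess_delimiter() method only to record whether sniff() reached it.

```python
import csv


class TracingSniffer(csv.Sniffer):
    """Hypothetical helper that records whether the fallback path ran."""

    def __init__(self):
        super().__init__()
        self.fallback_used = False

    def _guess_delimiter(self, data, delimiters):
        # Private method; overridden here only to observe that it was called.
        self.fallback_used = True
        return super()._guess_delimiter(data, delimiters)


def sniff_and_trace(sample):
    """Return the detected delimiter and whether the fallback was used."""
    sniffer = TracingSniffer()
    dialect = sniffer.sniff(sample)
    return dialect.delimiter, sniffer.fallback_used


quoted = '"a","b","c"\n"1","2","3"\n'
unquoted = "a,b,c\n1,2,3\n4,5,6\n"

print(sniff_and_trace(quoted))    # quoted fields: the regular expressions succeed
print(sniff_and_trace(unquoted))  # no quotes: the frequency-based fallback runs
```

Only samples of the second kind exercise the optimized code path, so a speedup should be expected mainly for unquoted input.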

Issue: csv.Sniffer._guess_delimiter() iterates over all ASCII on each line #137627

picnixz

Are the benchmarks done with a PGO+LTO build?

AA-Turner

I'm not a CSV expert, but here is a cursory review of the set logic. You should provide a (range of) benchmarks to back up the claim that it is twice as fast, though, ideally using pyperformance.

maurycy · 2025-08-11T15:50:36Z

@picnixz @AA-Turner I really appreciate your feedback! It's great. I will provide more benchmarks, including with optimizations enabled and ideally with pyperformance, rephrase the NEWS entry, and add a What's New entry.

picnixz · 2025-08-12T07:43:48Z

Benchmarks without optimizations are not relevant, so just run them with optimizations enabled.

maurycy · 2025-08-13T10:43:09Z

@picnixz @AA-Turner @ZeroIntensity

Thank you for all the comments:

I created a much more rigorous benchmark than before, one that is on par with how people use the class. See the results: https://github.com/maurycy/cpython-test/tree/main/csv-sniffer-counter-set
I've changed the approach, ditching the confusing set operations. The speedup now comes mostly from not iterating over all ASCII characters again and again, and from more efficient zero backfilling.
There is nothing about csv.Sniffer() in pyperformance, so I was a bit stuck here.
I updated the docs.

maurycy · 2025-08-19T10:49:14Z

@picnixz

Thank you for yesterday's review:

I renamed the variables and simplified the zero-bucket, as per your suggestion,
I added tests for the tie-break (insert vs. append of the zero bucket) to confirm that the behavior has not changed,
I updated the docs and the benchmark (it's now 1.6x faster!)
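To illustrate why the position of the zero bucket can matter, here is a minimal sketch. It assumes that the most common per-line frequency is picked with max() over (frequency, number_of_lines) pairs; that is an assumption about the implementation, and the mode() helper and the bucket values are hypothetical. Since max() returns the first maximal element it encounters, a tie is decided by ordering.

```python
def mode(items):
    """Return the (frequency, line_count) pair with the highest line_count."""
    return max(items, key=lambda item: item[1])


# Hypothetical buckets: the character occurs once on 2 lines
# and is absent from 2 other lines, so the two buckets tie.
present = [(1, 2)]
zero_bucket = (0, 2)

appended = present + [zero_bucket]   # zero bucket added last
inserted = [zero_bucket] + present   # zero bucket added first

print(mode(appended))
print(mode(inserted))
```

With the zero bucket appended, the non-zero frequency wins the tie; with it inserted first, the zero frequency wins. A test that pins this behavior guards against an accidental change in ordering.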

Thank you!

ZeroIntensity · 2025-08-19T11:00:13Z

Love the enthusiasm, but please try to avoid continuously rebasing (pressing "update branch"). It wastes CI time and also puts a ding in all of our inboxes. Updating the branch should generally only be done to resolve merge conflicts.

maurycy · 2025-08-24T12:50:20Z

FYI: I created a simple check ensuring that the output of csv.Sniffer()._guess_delimiter() did not change:

gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.6x faster #137628 (comment)

picnixz

Overall, looks good but please add a test with corner cases when the guessing algorithm makes multiple rounds.

picnixz

This looks fine to me but I want another core dev to have a look just in case I missed some obvious things (maybe @serhiy-storchaka?)

maurycy · 2025-09-08T21:04:34Z

@ZeroIntensity @AA-Turner What do you think about this PR now?

hugovk

Thanks!

hugovk · 2025-10-23T12:28:31Z

Thanks!

bedevere-bot · 2025-10-23T13:47:18Z

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot AMD64 Windows11 Bigmem 3.x (tier-1) has failed when building commit 6be6f8f.

What do you need to do:

Don't panic.
Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/1079/builds/7354) and take a look at the build logs.
Check if the failure is related to this commit (6be6f8f) or if it is a false positive.
If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/1079/builds/7354

Failed tests:

test_zipimport

Failed subtests:

testZip64LargeFile - test.test_zipimport.UncompressedZipImportTestCase.testZip64LargeFile
testZip64LargeFile - test.test_zipimport.ZStdCompressedZipImportTestCase.testZip64LargeFile
testZip64LargeFile - test.test_zipimport.DeflateCompressedZipImportTestCase.testZip64LargeFile

Summary of the results of the build (if available):

==

Traceback logs:

Traceback (most recent call last):
  File "R:\buildarea\3.x.ambv-bb-win11.bigmem\build\Lib\test\test_zipimport.py", line 934, in testZip64LargeFile
    with open(os_helper.TESTFN, "wb") as f:
         ~~~~^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on device


Traceback (test.test_zipimport.DeflateCompressedZipImportTestCase.testTraceback) ... ok


Traceback (test.test_zipimport.UncompressedZipImportTestCase.testTraceback) ... ok


Traceback (most recent call last):
  File "R:\buildarea\3.x.ambv-bb-win11.bigmem\build\Lib\test\test_zipimport.py", line 997, in testZip64LargeFile
    with open(TEMP_ZIP, "wb") as f:
         ~~~~^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on device


Traceback (most recent call last):
  File "R:\buildarea\3.x.ambv-bb-win11.bigmem\build\Lib\test\test_zipimport.py", line 1000, in testZip64LargeFile
    f.seek(offset, os.SEEK_SET)
    ~~~~~~^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on device


Traceback (test.test_zipimport.ZStdCompressedZipImportTestCase.testTraceback) ... ok

hugovk · 2025-10-23T14:02:12Z

Buildbot failure unrelated: "OSError: [Errno 28] No space left on device".

I'll inform the owner.

…faster (python#137628) Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

maurycy added 3 commits August 11, 2025 04:40

do not iterate over all ascii

80be530

NEWS entry

2d636cf

bring back the comment

1f0b25e

bedevere-app Bot added the awaiting review label Aug 11, 2025

bedevere-app Bot mentioned this pull request Aug 11, 2025

csv.Sniffer._guess_delimiter() iterates over all ASCII on each line #137627

Closed

bang

601b2f1

maurycy changed the title ~~gh-137627: Make csv.Sniffer._guess_delimiter() 2x faster~~ gh-137627: Make csv.Sniffer.sniff() 2x faster Aug 11, 2025

document the public method

2dc0d41

AA-Turner reviewed Aug 11, 2025

Comment thread Lib/csv.py Outdated

import within Sniffer

f106da2

maurycy requested a review from AA-Turner August 11, 2025 05:37

Merge branch 'main' into csv-sniffer-counter-set

7f7dca1

picnixz reviewed Aug 11, 2025

Comment thread Lib/csv.py Outdated

Comment thread Misc/NEWS.d/next/Library/2025-08-11-04-52-18.gh-issue-137627.Ku5Yi2.rst Outdated

AA-Turner reviewed Aug 11, 2025

Comment thread Lib/csv.py Outdated

Comment thread Lib/csv.py Outdated

Comment thread Lib/csv.py Outdated

Comment thread Lib/csv.py Outdated

Comment thread Lib/csv.py Outdated

_ASCII_CHARS, set operators

2f1ea73

ZeroIntensity reviewed Aug 11, 2025

Comment thread Lib/csv.py Outdated

maurycy added 2 commits August 13, 2025 04:16

Merge branch 'main' into csv-sniffer-counter-set

4b50610

update docs, no set operations

07a336b

maurycy requested review from AA-Turner, ZeroIntensity and picnixz August 13, 2025 10:38

maurycy changed the title ~~gh-137627: Make csv.Sniffer.sniff() 2x faster~~ gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.5x faster Aug 13, 2025

maurycy added 2 commits August 15, 2025 19:45

Merge branch 'main' into csv-sniffer-counter-set

080dbfc

Merge branch 'main' into csv-sniffer-counter-set

0b5bcdd

picnixz reviewed Aug 18, 2025

Comment thread Doc/whatsnew/3.15.rst Outdated

Comment thread Lib/csv.py Outdated

Comment thread Lib/csv.py Outdated

Comment thread Lib/csv.py Outdated

Comment thread Lib/csv.py Outdated

maurycy and others added 2 commits August 18, 2025 20:38

Update Lib/csv.py

2ccaac0

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

move whatsnew to Optimizations

36fc9d9

maurycy requested a review from picnixz August 19, 2025 10:49

Merge branch 'main' into csv-sniffer-counter-set

8614756

picnixz reviewed Aug 28, 2025

Comment thread Lib/test/test_csv.py

Comment thread Lib/test/test_csv.py Outdated

Comment thread Lib/test/test_csv.py

maurycy and others added 2 commits August 28, 2025 17:11

Update Lib/test/test_csv.py

a95b16a

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

vs

59c9395

picnixz reviewed Aug 28, 2025

Comment thread Doc/whatsnew/3.15.rst Outdated

Comment thread Lib/csv.py Outdated

maurycy and others added 4 commits August 28, 2025 17:24

Update Doc/whatsnew/3.15.rst

3e7beaa

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

Update Lib/csv.py

c9cd73e

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

,: tie

7a3c974

be consistent

cecb897

picnixz reviewed Aug 30, 2025

Comment thread Lib/test/test_csv.py Outdated

check the exc msg

ab871e0

picnixz reviewed Aug 30, 2025

Comment thread Lib/test/test_csv.py

Update Lib/test/test_csv.py

5181f5d

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>

picnixz approved these changes Aug 30, 2025

bedevere-app Bot added awaiting merge and removed awaiting review labels Aug 30, 2025

hugovk approved these changes Sep 11, 2025

Comment thread Lib/csv.py Outdated

Comment thread Lib/csv.py Outdated

s/charFrequency/char_frequency/

018e580

hugovk merged commit 6be6f8f into python:main Oct 23, 2025
45 checks passed

bedevere-app Bot removed the awaiting merge label Oct 23, 2025

maurycy deleted the csv-sniffer-counter-set branch October 23, 2025 13:13

StanFromIreland pushed a commit to StanFromIreland/cpython that referenced this pull request Dec 6, 2025

pythongh-137627: Make csv.Sniffer.sniff() delimiter detection 1.6x …

662abf9

…faster (python#137628) Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
