gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.6x faster#137628
gh-137627: Make csv.Sniffer.sniff() delimiter detection 1.6x faster#137628hugovk merged 31 commits intopython:mainfrom
csv.Sniffer.sniff() delimiter detection 1.6x faster#137628Conversation
csv.Sniffer._guess_delimiter() 2x fastercsv.Sniffer.sniff() 2x faster
picnixz
left a comment
There was a problem hiding this comment.
Are the benchmarks done with a POG+LTO build?
|
@picnixz @AA-Turner I really appreciate your feedback! It's great. I will provide more benchmarks, including with enabled optimizations and ideally with |
|
Benchmarks without optimizations are not relevant so just run those with. |
csv.Sniffer.sniff() 2x fastercsv.Sniffer.sniff() delimiter detection 1.5x faster
|
@picnixz @AA-Turner @ZeroIntensity Thank you for all the comments:
|
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
|
Thank you for your yesterday's review:
Thank you! |
|
Love the enthusiasm, but please try to avoid continuously rebasing (pressing "update branch"). It wastes CI time and also puts a ding in all of our inboxes. Updating the branch should generally only be done to resolve merge conflicts. |
|
FYI: I created a simple check ensuring that the output of |
picnixz
left a comment
There was a problem hiding this comment.
Overall, looks good but please add a test with corner cases when the guessing algorithm makes multiple rounds.
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
picnixz
left a comment
There was a problem hiding this comment.
This looks fine to me but I want another core dev to have a look just in case I missed some obvious things (maybe @serhiy-storchaka?)
|
@ZeroIntensity @AA-Turner What do you think about this PR now? |
|
Thanks! |
|
|
Buildbot failure unrelated: "OSError: [Errno 28] No space left on device". I'll inform the owner. |
…faster (python#137628) Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
The basic idea is not to iterate over all ASCII characters and count their frequency on each line in
_guess_delimiterbut only over present characters, and just backfill zeros.Benchmark
There is no
csv.Snifferbenchmark inpyperformance, so I created a simple benchmark instead:using all 149 files from CSVSniffer (MIT License), reading only the sample, as recommended in docs.python.org example. That's what real users do, too. Unfortunately, it takes a few hours to run.
The result:
The full results (
compare_to --table --table-format=md, and JSON files):Correctness
I created a simple script to confirm that the output of
Sniffer()._guess_delimiter()did not change. It uses ~1100 CSV files from the CSVSniffer project (MIT License).The script:
where
Lib/csv_96b7a2eba423b42320f15fd4974740e3e930bb8b.pyis https://github.com/python/cpython/blob/96b7a2eba423b42320f15fd4974740e3e930bb8b/Lib/csv.py from themainbranch, and:The result:
Environment
sudo ./python -m pyperf system tuneensured.Notes
csv.Sniffer()._read_delimiter()which runs only if regular expressions incsv.Sniffer()._guess_quote_and_delimiter()failed, so there's no guarantee thatcsv.Sniffer().sniff()will always be fastercsv.Sniffer._guess_delimiter()iterates over all ASCII on each line #137627