DuckDB, Polars and Pandas performance comparison

This repo contains code that highlights that pandas, despite its maturity and importance to the Python data science ecosystem, is far slower than either polars or duckdb, while being less efficient with memory usage, because it materializes the entire data into memory (as opposed to polars and duckdb which are lazy and only materialize the data when needed).

pandas is a popular Python DataFrame library for data manipulation and analysis, whose internals are in C++ -- the latest version of pandas (2.x) is used in this benchmark.
polars is a newer DataFrame library for data manipulation and analysis, whose internals are in Rust, and is far more amenable to multi-threading DataFrame ops than pandas.
duckdb is a high-performance embedded database that can be queried via a rich SQL dialect. Its core is written in C++, and is designed to be fast, reliable and easy to use, and very amenable to transformation to either pandas or polars

Both polars and duckdb (being written in Rust and C++ respectively) have had ample opportunity to learn from the pain points of pandas and the general PyData ecosystem over the years, allowing them to leverage lessons from modern database theory and the power of the Apache Arrow ecosystem.

Setup

Install a Python virtual environment and then install the dependencies via requirements.txt as follows.

python -m venv .venv  # python -> python 3.10+
source .venv/bin/activate
python -m pip install -U pip wheel  # Upgrade pip and install the wheel package first
python -m pip install -r requirements.txt

Generate dataset

The benchmark task is basically to use any one of pandas, polars, or duckdb to generate an artificial dataset of persons, the companies they held work positions in, and their locations. The input dataset we begin with is the 7+ million companies dataset from Kaggle, which is preprocessed into a parquet file.

The dataset is generated by randomly sampling from the input dataset, and then randomly generating positions for each person with repetition, allowing a one-to-many relationship (the same person and multiple company positions).

Benchmark

The example results and timing benchmarks to generate the full dataset are shown below. The benchmarks are run using the pytest-benchmark library which itself depends on pytest.

Conditions

Raw dataset being generated contains 1M persons and 10M positions at companies that these 1M persons have held
Macbook Pro M2, 16 GB RAM
Average of 3 runs (for each of pandas, polars and duckdb)
Garbage collector timing disabled during benchmark

Results

cd perf_study

$ pytest benchmark.py --benchmark-min-rounds=3 --benchmark-disable-gc

==================================================================================================== test session starts ====================================================================================================
platform darwin -- Python 3.11.2, pytest-7.4.0, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=3 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /code/embedded-dbs/perf_study
plugins: benchmark-4.0.0
collected 3 items                                                                                                                                                                                                           

benchmark.py ...                                                                                                                                                                                                      [100%]


---------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------
Name (time in s)              Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_duckdb      3.6492 (1.0)       4.0477 (1.0)       3.8405 (1.0)      0.1998 (1.0)       3.8245 (1.0)      0.2989 (1.0)           1;0  0.2604 (1.0)           3           1
test_benchmark_polars      5.4167 (1.48)      6.4685 (1.60)      5.7704 (1.50)     0.6046 (3.03)      5.4260 (1.42)     0.7888 (2.64)          1;0  0.1733 (0.67)          3           1
test_benchmark_pandas     19.3788 (5.31)     19.8074 (4.89)     19.5716 (5.10)     0.2175 (1.09)     19.5287 (5.11)     0.3215 (1.08)          1;0  0.0511 (0.20)          3           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=============================================================================================== 3 passed in 144.33s (0:02:24) ===============================================================================================

Results

Average over 3 runs:

Approach	Time (sec)	Runtime increase vs. DuckDB
`duckdb`	3.84	1.0
`polars`	5.77	1.5
`pandas`	19.57	5.1

As can be seen, duckdb is the fastest 🔥, followed cloesly by polars. pandas is the slowest by a factor of 5x when compared to DuckDB.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
data		data
intro		intro
perf_study		perf_study
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DuckDB, Polars and Pandas performance comparison

Setup

Generate dataset

Benchmark

Conditions

Results

Results

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

DuckDB, Polars and Pandas performance comparison

Setup

Generate dataset

Benchmark

Conditions

Results

Results

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors 1

Languages

Packages