Skip to content

prrao87/duckdb-study

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DuckDB, Polars and Pandas performance comparison

This repo contains code that highlights that pandas, despite its maturity and importance to the Python data science ecosystem, is far slower than either polars or duckdb, while being less efficient with memory usage, because it materializes the entire data into memory (as opposed to polars and duckdb which are lazy and only materialize the data when needed).

  • pandas is a popular Python DataFrame library for data manipulation and analysis, whose internals are in C++ -- the latest version of pandas (2.x) is used in this benchmark.
  • polars is a newer DataFrame library for data manipulation and analysis, whose internals are in Rust, and is far more amenable to multi-threading DataFrame ops than pandas.
  • duckdb is a high-performance embedded database that can be queried via a rich SQL dialect. Its core is written in C++, and is designed to be fast, reliable and easy to use, and very amenable to transformation to either pandas or polars

Both polars and duckdb (being written in Rust and C++ respectively) have had ample opportunity to learn from the pain points of pandas and the general PyData ecosystem over the years, allowing them to leverage lessons from modern database theory and the power of the Apache Arrow ecosystem.

Setup

Install a Python virtual environment and then install the dependencies via requirements.txt as follows.

python -m venv .venv  # python -> python 3.10+
source .venv/bin/activate
python -m pip install -U pip wheel  # Upgrade pip and install the wheel package first
python -m pip install -r requirements.txt

Generate dataset

The benchmark task is basically to use any one of pandas, polars, or duckdb to generate an artificial dataset of persons, the companies they held work positions in, and their locations. The input dataset we begin with is the 7+ million companies dataset from Kaggle, which is preprocessed into a parquet file.

The dataset is generated by randomly sampling from the input dataset, and then randomly generating positions for each person with repetition, allowing a one-to-many relationship (the same person and multiple company positions).

Benchmark

The example results and timing benchmarks to generate the full dataset are shown below. The benchmarks are run using the pytest-benchmark library which itself depends on pytest.

Conditions

  • Raw dataset being generated contains 1M persons and 10M positions at companies that these 1M persons have held
  • Macbook Pro M2, 16 GB RAM
  • Average of 3 runs (for each of pandas, polars and duckdb)
  • Garbage collector timing disabled during benchmark

Results

cd perf_study
$ pytest benchmark.py --benchmark-min-rounds=3 --benchmark-disable-gc

==================================================================================================== test session starts ====================================================================================================
platform darwin -- Python 3.11.2, pytest-7.4.0, pluggy-1.2.0
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=True min_rounds=3 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /code/embedded-dbs/perf_study
plugins: benchmark-4.0.0
collected 3 items                                                                                                                                                                                                           

benchmark.py ...                                                                                                                                                                                                      [100%]


---------------------------------------------------------------------------------- benchmark: 3 tests ----------------------------------------------------------------------------------
Name (time in s)              Min                Max               Mean            StdDev             Median               IQR            Outliers     OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_duckdb      3.6492 (1.0)       4.0477 (1.0)       3.8405 (1.0)      0.1998 (1.0)       3.8245 (1.0)      0.2989 (1.0)           1;0  0.2604 (1.0)           3           1
test_benchmark_polars      5.4167 (1.48)      6.4685 (1.60)      5.7704 (1.50)     0.6046 (3.03)      5.4260 (1.42)     0.7888 (2.64)          1;0  0.1733 (0.67)          3           1
test_benchmark_pandas     19.3788 (5.31)     19.8074 (4.89)     19.5716 (5.10)     0.2175 (1.09)     19.5287 (5.11)     0.3215 (1.08)          1;0  0.0511 (0.20)          3           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=============================================================================================== 3 passed in 144.33s (0:02:24) ===============================================================================================

Results

Average over 3 runs:

Approach Time (sec) Runtime increase vs. DuckDB
duckdb 3.84 1.0
polars 5.77 1.5
pandas 19.57 5.1

As can be seen, duckdb is the fastest 🔥, followed cloesly by polars. pandas is the slowest by a factor of 5x when compared to DuckDB.

About

Compare DuckDB, Polars and Pandas for generating an artificial dataset of persons and companies

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages