Skip to content

danielbeach/testingArrowADBCdrivers

Repository files navigation

Testing Database Insert Performance

This project compares different methods for reading CSV files and inserting them into databases, including:

  • PyArrow + ADBC - Using PyArrow to read CSVs and ADBC driver for Postgres
  • Polars + ADBC - Using Polars to read CSVs and ADBC driver for Postgres
  • psycopg2 + COPY - Using Python's csv module and psycopg2's COPY command
  • DuckDB + COPY - Using DuckDB's native COPY command

Prerequisites

  • Python 3.12+
  • Docker and Docker Compose (for local Postgres)
  • uv package manager (or install dependencies manually)

Setup

  1. Install dependencies:
uv sync

Or with pip:

pip install -e .

Starting Local Postgres Database

Start a local PostgreSQL database using Docker Compose:

docker compose up -d

This will start a Postgres 16 container with:

  • Host: localhost
  • Port: 5432
  • Database: postgres
  • Username: postgres
  • Password: postgres

The container will be named adbc-postgres and data will be persisted in a Docker volume.

Checking Postgres Status

To check if Postgres is running:

docker-compose ps

To view logs:

docker-compose logs postgres

Stopping Postgres

To stop the Postgres container:

docker-compose down

To stop and remove all data:

docker-compose down -v

Available Scripts

All scripts read CSV files from data/uncompressed/ and insert them into databases. Each creates a different table name so you can compare results.

1. PyArrow + ADBC (pyarrow_adbc_driver.py)

Uses PyArrow's dataset API to read CSVs and ADBC driver to insert into Postgres.

python pyarrow_adbc_driver.py

Table: divvy_tripdata

Features:

  • Uses PyArrow's efficient dataset API for reading multiple CSV files
  • Converts to Arrow table format
  • Uses ADBC's adbc_ingest() for bulk insertion
  • Creates table with proper type mapping

2. Polars + ADBC (polars_adbc_driver.py)

Uses Polars to read CSVs and Polars' built-in write_database() with ADBC engine.

python polars_adbc_driver.py

Table: divvy_tripdata_polars

Features:

  • Uses Polars for fast CSV reading
  • Leverages Polars' native write_database() method
  • Automatic schema inference and table creation
  • Uses ADBC engine under the hood

3. psycopg2 + COPY (psycopg2_driver.py)

Uses Python's built-in csv module and psycopg2's COPY command (fastest psycopg2 method).

python psycopg2_driver.py

Table: divvy_tripdata_psycopg2_noarrow

Features:

  • Pure Python CSV reading (no Arrow dependencies)
  • Uses PostgreSQL's native COPY command via copy_expert()
  • Type inference from sample data
  • Optimized for bulk inserts

4. DuckDB + COPY (duckdb_copy_driver.py)

Uses DuckDB's native COPY command to load CSV files directly.

python duckdb_copy_driver.py

Table: divvy_tripdata_duckdb (in DuckDB database file)

Features:

  • Creates a local DuckDB database file (divvy_data.duckdb)
  • Uses DuckDB's COPY FROM command
  • Automatic schema inference
  • Stores data in DuckDB's columnar format

Connection Configuration

Using Default Connection (Docker Postgres)

All Postgres scripts (except DuckDB) use the Docker Postgres setup by default:

# Start Postgres
docker-compose up -d

# Run any script
python pyarrow_adbc_driver.py
python polars_adbc_driver.py
python psycopg2_driver.py

Using Custom Connection

You can override Postgres connection parameters using environment variables:

export PGHOST=your-host
export PGPORT=5432
export PGDATABASE=your-database
export PGUSER=your-user
export PGPASSWORD=your-password

python pyarrow_adbc_driver.py

Performance Comparison

All scripts include timing information showing:

  • Total insertion time
  • Rows per second throughput
  • Row count verification

This allows you to compare the performance of different approaches:

  • ADBC methods (PyArrow/Polars) - Leverage Arrow format for zero-copy transfers
  • psycopg2 COPY - Uses PostgreSQL's native bulk loading
  • DuckDB COPY - Optimized columnar database with native CSV loading

Run all scripts and compare the timing results to see which method works best for your use case!

About

testing Python's Arrow ADBC CSV to Postgres

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages