This project compares different methods for reading CSV files and inserting them into databases, including:
- PyArrow + ADBC - Using PyArrow to read CSVs and the ADBC driver for Postgres
- Polars + ADBC - Using Polars to read CSVs and the ADBC driver for Postgres
- psycopg2 + COPY - Using Python's csv module and psycopg2's COPY command
- DuckDB + COPY - Using DuckDB's native COPY command
## Prerequisites

- Python 3.12+
- Docker and Docker Compose (for local Postgres)
- `uv` package manager (or install dependencies manually)
## Setup

Install dependencies:

```bash
uv sync
```

Or with pip:

```bash
pip install -e .
```

## Database Setup

Start a local PostgreSQL database using Docker Compose:

```bash
docker compose up -d
```

This will start a Postgres 16 container with:
- Host: localhost
- Port: 5432
- Database: postgres
- Username: postgres
- Password: postgres
The container will be named `adbc-postgres`, and data will be persisted in a Docker volume.
To check if Postgres is running:

```bash
docker compose ps
```

To view logs:

```bash
docker compose logs postgres
```

To stop the Postgres container:

```bash
docker compose down
```

To stop and remove all data:

```bash
docker compose down -v
```

## Scripts

All scripts read CSV files from `data/uncompressed/` and insert them into databases. Each creates a different table name so you can compare results.
### PyArrow + ADBC

Uses PyArrow's dataset API to read CSVs and the ADBC driver to insert into Postgres (sketched below).

```bash
python pyarrow_adbc_driver.py
```

Table: `divvy_tripdata`

Features:
- Uses PyArrow's efficient dataset API for reading multiple CSV files
- Converts to Arrow table format
- Uses ADBC's `adbc_ingest()` for bulk insertion
- Creates the table with proper type mapping
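A minimal sketch of this pattern, assuming the Docker Compose defaults above; the connection URI and the `mode="replace"` choice are assumptions, not necessarily what the script does:

```python
import pyarrow.dataset as ds
import adbc_driver_postgresql.dbapi as pg_dbapi

# Assumed URI built from the Docker Compose defaults above
URI = "postgresql://postgres:postgres@localhost:5432/postgres"

# Treat every CSV under data/uncompressed/ as one logical dataset
dataset = ds.dataset("data/uncompressed/", format="csv")
table = dataset.to_table()  # materialize as a single Arrow table

with pg_dbapi.connect(URI) as conn:
    with conn.cursor() as cur:
        # Bulk ingest: mode="replace" drops and recreates the table,
        # mapping column types from the Arrow schema
        cur.adbc_ingest("divvy_tripdata", table, mode="replace")
    conn.commit()
```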
### Polars + ADBC

Uses Polars to read CSVs and Polars' built-in `write_database()` with the ADBC engine (sketched below).

```bash
python polars_adbc_driver.py
```

Table: `divvy_tripdata_polars`

Features:
- Uses Polars for fast CSV reading
- Leverages Polars' native `write_database()` method
- Automatic schema inference and table creation
- Uses the ADBC engine under the hood
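A minimal sketch of this approach; the URI and the `if_table_exists` choice are assumptions, and the ADBC engine requires the `adbc-driver-postgresql` package:

```python
import polars as pl

# Assumed URI built from the Docker Compose defaults above
URI = "postgresql://postgres:postgres@localhost:5432/postgres"

# Polars expands the glob and reads all CSVs into one DataFrame
df = pl.read_csv("data/uncompressed/*.csv")

# One call infers the schema, creates the table, and bulk-inserts
# the rows through the ADBC driver
df.write_database(
    table_name="divvy_tripdata_polars",
    connection=URI,
    engine="adbc",
    if_table_exists="replace",
)
```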
### psycopg2 + COPY

Uses Python's built-in csv module and psycopg2's COPY support (the fastest psycopg2 method; sketched below).

```bash
python psycopg2_driver.py
```

Table: `divvy_tripdata_psycopg2_noarrow`

Features:
- Pure Python CSV reading (no Arrow dependencies)
- Uses PostgreSQL's native COPY command via `copy_expert()`
- Type inference from sample data
- Optimized for bulk inserts
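A minimal sketch of the COPY pattern, using the Docker defaults above. It assumes the target table already exists; per the feature list, the actual script first infers column types from sample rows to create it:

```python
import glob
import psycopg2

# Connection settings matching the Docker Compose defaults above
conn = psycopg2.connect(
    host="localhost", port=5432, dbname="postgres",
    user="postgres", password="postgres",
)

with conn, conn.cursor() as cur:  # the connection context commits on success
    for path in sorted(glob.glob("data/uncompressed/*.csv")):
        with open(path) as f:
            # Stream each file through COPY ... FROM STDIN;
            # HEADER makes Postgres skip each file's header row
            cur.copy_expert(
                "COPY divvy_tripdata_psycopg2_noarrow "
                "FROM STDIN WITH (FORMAT csv, HEADER)",
                f,
            )
```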
### DuckDB + COPY

Uses DuckDB's native COPY command to load CSV files directly (sketched below).

```bash
python duckdb_copy_driver.py
```

Table: `divvy_tripdata_duckdb` (in a DuckDB database file)

Features:
- Creates a local DuckDB database file (`divvy_data.duckdb`)
- Uses DuckDB's `COPY ... FROM` command
- Automatic schema inference
- Stores data in DuckDB's columnar format
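A minimal sketch; the two-step create-then-COPY flow is an assumption about how the script is structured:

```python
import duckdb

# Creates (or opens) the local database file
con = duckdb.connect("divvy_data.duckdb")

# Let DuckDB sniff the schema from the CSVs, then create an empty table
con.execute("""
    CREATE OR REPLACE TABLE divvy_tripdata_duckdb AS
    SELECT * FROM read_csv_auto('data/uncompressed/*.csv') LIMIT 0
""")

# COPY ... FROM bulk-loads every file matching the glob
con.execute("COPY divvy_tripdata_duckdb FROM 'data/uncompressed/*.csv' (HEADER)")

print(con.execute("SELECT count(*) FROM divvy_tripdata_duckdb").fetchone())
```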
## Database Configuration

All Postgres-backed scripts (everything except the DuckDB script) use the Docker Postgres setup by default:

```bash
# Start Postgres
docker compose up -d

# Run any script
python pyarrow_adbc_driver.py
python polars_adbc_driver.py
python psycopg2_driver.py
```

You can override the Postgres connection parameters with environment variables:

```bash
export PGHOST=your-host
export PGPORT=5432
export PGDATABASE=your-database
export PGUSER=your-user
export PGPASSWORD=your-password
python pyarrow_adbc_driver.py
```

## Performance Comparison

All scripts include timing information (see the sketch after this list) showing:
- Total insertion time
- Rows per second throughput
- Row count verification
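A rough sketch of the measurement; `load_csvs_into_table()` is a hypothetical stand-in for any of the loaders above:

```python
import time

start = time.perf_counter()
row_count = load_csvs_into_table()  # hypothetical stand-in for any loader above
elapsed = time.perf_counter() - start

print(f"Inserted {row_count:,} rows in {elapsed:.2f} s "
      f"({row_count / elapsed:,.0f} rows/s)")
```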
This allows you to compare the performance of different approaches:
- ADBC methods (PyArrow/Polars) - Leverage the Arrow format for zero-copy transfers
- psycopg2 COPY - Uses PostgreSQL's native bulk loading
- DuckDB COPY - An optimized columnar database with native CSV loading
Run all scripts and compare the timing results to see which method works best for your use case!