PostgreSQL's COPY command offers a high-performance way to load external data files into tables. The COPY FROM STDIN form allows piping input directly into PostgreSQL, making data imports easy without needing intermediate disk storage.
In this advanced guide, we'll cover everything from use cases to optimized bulk loading techniques using COPY FROM STDIN.
When and Why to Use COPY FROM STDIN
COPY FROM STDIN shines for fast imports of large structured datasets. The two most common use cases are:
1. Initial bulk loading of data warehouses and databases – Need to populate an empty analytic database with 100s of GB/TB of historical data? COPY FROM STDIN can reliably ingest data 5-10x faster than repeated INSERT statements.
2. Periodic ETL batch updates – For routine needs like nightly imports of changed data, COPY FROM STDIN does a great job bringing in new rows from files with minimal coding, and is often more efficient than external tables or ETL tools.
Compared to INSERTs, COPY wins on:
- Raw speed – Minimizes parse/commit overhead by staging then writing data in bulk
- Simplicity – Single command instead of complex scripts, just pipe to stdin
- Reliability – Each COPY runs as a single atomic statement, so a failed load leaves no partial rows behind
Other ideal use cases:
- Loading dataset backups from dump files
- Database migrations and consolidations
- Capturing change data from external sources
- Streaming CSV log data into analytics tables
In summary, any process requiring repetitive insertion of large structured data batches is a perfect fit for COPY FROM STDIN.
COPY FROM STDIN by Example
The easiest way to understand COPY FROM STDIN is to walk through some examples:
Ingesting CSV Files
For instance, importing user profile data from a CSV:
user_id,name,email
1,John Doe,john@doe.com
2,Jane Smith,jane@smith.org
Just pipe the CSV into COPY FROM STDIN:
COPY users FROM STDIN WITH (FORMAT csv, HEADER);
Then paste or type the CSV rows, ending with a line containing only \. to finalize the input. (The HEADER option tells COPY to skip the first line of the file.)
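To script this rather than paste by hand, the rows can be serialized with Python's standard csv module and fed to psql's stdin. The sketch below only builds the COPY payload — the rows and the users table are the ones from the example above, and the final pipe into psql is left as an environment-specific step:

```python
import csv
import io

# Sample rows matching the users(user_id, name, email) example above.
rows = [
    (1, "John Doe", "john@doe.com"),
    (2, "Jane Smith", "jane@smith.org"),
]

def build_copy_payload(rows):
    """Serialize rows as CSV text suitable for piping into
    psql -c "COPY users FROM STDIN WITH (FORMAT csv)".
    The csv module handles quoting of embedded commas/quotes."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

payload = build_copy_payload(rows)
print(payload)
```

Letting the csv module do the quoting avoids hand-rolled escaping bugs when names contain commas or quote characters.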
Loading Compressed Data
COPY FROM STDIN works great with compressed data too. Here we gzip compress the CSV before importing:
gzip users.csv
zcat users.csv.gz | psql -c "COPY users FROM STDIN WITH (FORMAT csv, HEADER)"
This avoids decompressing the large CSV file on disk first.
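The same decompress-and-stream pattern can be sketched in Python with the standard gzip module — handy when the pipeline needs to filter or transform rows between decompression and the COPY. The data here is an in-memory round trip purely for illustration:

```python
import gzip
import io

def stream_gzipped_lines(raw_bytes):
    """Yield decoded lines from gzip-compressed bytes without ever
    materializing the full uncompressed file on disk."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")

# Round-trip demo: compress a small CSV in memory, then stream it back.
csv_text = "1,John Doe,john@doe.com\n2,Jane Smith,jane@smith.org\n"
compressed = gzip.compress(csv_text.encode("utf-8"))
for row in stream_gzipped_lines(compressed):
    print(row)
```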
Importing JSON
For a structured JSON file like:
{"user_id": 1, "email": "john@doe.com"}
{"user_id": 2, "email": "jane@smith.org"}
COPY itself has no JSON format, but newline-delimited JSON loads cleanly into a single jsonb column. A common trick is to use CSV mode with the quote and delimiter characters set to bytes that never appear in the data, so each line is treated as one field (here users_json is a staging table with a single jsonb column named data):
COPY users_json (data) FROM STDIN WITH (FORMAT csv, QUOTE e'\x01', DELIMITER e'\x02');
Then input the JSON lines, and PostgreSQL will parse each one into the jsonb column.
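An alternative is COPY's plain text format, where each JSON line becomes one field bound for a jsonb column — provided backslashes are doubled first, since backslash is COPY's escape character in text format. A minimal sketch of that escaping step (the staging-table side is assumed, not shown):

```python
import json

def escape_for_copy_text(line):
    """Escape one NDJSON line for COPY's text format: backslash is
    COPY's escape character, so it must be doubled; embedded tabs,
    newlines and carriage returns must be escaped as well."""
    return (line.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

# A record whose JSON encoding contains a literal backslash (\t escape).
record = {"user_id": 1, "email": "john@doe.com", "note": "tab\there"}
escaped = escape_for_copy_text(json.dumps(record))
print(escaped)
```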
As you can see, COPY FROM STDIN gives us an easy way to ingest data from virtually any pipe-able format.
Next let's dig deeper into recommended practices.
Best Practices for Optimized Bulk Loading
When importing large datasets, we need to minimize IO, memory, and CPU overhead to achieve peak load speed. Here are some COPY FROM STDIN tuning tips:
- Drop indexes/constraints – Adding data with active constraints or indexes increases write amplification. Drop or disable them before the COPY, then rebuild afterwards.
- Increase maintenance_work_mem – This memory setting (64MB by default) speeds up the index and constraint rebuilds that follow a bulk load. Bump it up if you can.
- Use multiple sessions – If ingesting from high-throughput sources, run multiple COPY commands over separate connections, each loading a different slice of the data.
- Compression is your friend – Compressed data reduces IO and network transfer time.
- Stage then copy – Prepare and validate the data on a separate machine, then stream it to the database server for writing; this keeps transformation load off production.
- Transaction wrap – Wrap COPY statements in a transaction to allow rollback on errors.
Also consider raising max_wal_size for the duration of large loads so checkpoints fire less often mid-import.
Following those guidelines allows scaling COPY FROM STDIN imports to gigabyte throughput levels.
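The multiple-sessions tip above needs the input split on row boundaries first. A minimal sketch of that partitioning step in Python (pure file splitting — the per-chunk psql invocations are environment-specific and omitted):

```python
def split_lines(lines, n_workers):
    """Partition rows round-robin into n_workers chunks so each
    parallel COPY session gets a similar share of the data."""
    chunks = [[] for _ in range(n_workers)]
    for i, line in enumerate(lines):
        chunks[i % n_workers].append(line)
    return chunks

# Illustrative rows; in practice these come from the staged data file.
rows = [f"{i},user{i},user{i}@example.com" for i in range(10)]
chunks = split_lines(rows, 3)
for worker_id, chunk in enumerate(chunks):
    print(worker_id, len(chunk))
```

Each chunk would then be piped into its own COPY FROM STDIN session on a separate connection.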
Comparing COPY to ETL Tools and External Tables
A common question developers have is – why use COPY over other PostgreSQL data ingestion methods? Here's a comparison:
External tables (in PostgreSQL, foreign tables backed by file_fdw) map external files to a read-only table, allowing SQL selects directly on top of data files. This simplifies access without needing to first load into a standard table.
ETL tools like Stitch offer replication features to move data from numerous sources into PostgreSQL and other data warehouses. Handy for capturing changes but require additional licensing.
COPY FROM STDIN focuses exclusively on simple, performant data loading directly into real tables. Avoiding the abstraction layer of external tables or the complexity of ETLs lets it achieve much faster bulk insert throughput in a leaner implementation.
In practice, these techniques are often combined – using COPY FROM initially then switching to change data capture through external tables or ETLs makes sense. But for that initial bulk seeding, COPY can't be beat on flexibility, simplicity and speed.
Avoiding Common Pitfalls
While COPY FROM STDIN offers great power, it can trip teams up if not used properly:
- Format errors – Forgetting a CSV or JSON setting then attempting to import badly formatted data
- Constraint violations – Importing invalid rows that violate NOT NULL or CHECK constraints, causing aborted loads
- Data type issues – Mismatched data types lead to conversion failures and unexpected truncations
- Lock contention – Attempting normal DML during large COPY operations can grind performance to a halt
Fortunately these are easy to mitigate:
- Double check format options match the actual input data structure
- Review constraints and temporarily disable ones that may fire during import, then re-enable and validate them afterwards
- Explicitly cast any columns at risk of type mismatches
- Run COPY operations during maintenance windows then re-enable constraints and indexes after
Adopting these safe COPY practices prevents the bulk of frustrations teams experience.
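Much of this checking can happen before any data touches the database. A small pre-validation pass over the CSV, assuming the three-column (integer id, text, text) layout of the earlier users example, might look like:

```python
import csv
import io

EXPECTED_COLUMNS = 3  # user_id, name, email from the earlier example

def validate_rows(csv_text):
    """Split rows into (good, bad): wrong column counts and
    non-integer user_id values are rejected before the COPY runs."""
    good, bad = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != EXPECTED_COLUMNS or not row[0].isdigit():
            bad.append(row)
        else:
            good.append(row)
    return good, bad

sample = "1,John Doe,john@doe.com\noops,Jane Smith\n2,Jane Smith,jane@smith.org\n"
good, bad = validate_rows(sample)
print(len(good), "good rows,", len(bad), "rejected")
```

Only the good rows get piped into COPY FROM STDIN; the rejects go to a quarantine file for inspection instead of aborting the whole load.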
Key Metrics and Benchmarks
Quantifying the performance impact illustrates why COPY FROM STDIN shines. Here are some headline metrics from the PostgreSQL docs and community tests:
- 5-10x faster than INSERT – COPY minimizes per-statement parsing and writes data in bulk instead of issuing constant small commits. Community tests report ingest speeds approaching a million narrow rows/sec on fast hardware.
- Far fewer writes – By batching data then writing sequentially, COPY FROM needs far fewer index/storage updates compared to row-by-row INSERTs touching indexes on each row.
- Leaner than file-based loads – Piping from stdin avoids staging an intermediate file on the server and allows direct data streaming into PostgreSQL.
- Lower CPU usage – COPY exploits sequential access, available parallelism and storage bandwidth much better, while INSERT thrashes CPU caches probing indexes constantly.
For context, a representative 50 million row import benchmark highlights the difference:
| Operation | Time |
|---|---|
| INSERT | 4h 26min |
| COPY FROM STDIN | 14min 30s |
Bottom line – For inserts up into the 100s of millions of rows, COPY FROM STDIN offers an order-of-magnitude speedup over standard approaches.
Alternative Methods for Data Ingestion
While COPY FROM STDIN hits the sweet spot for transactional style data loading, other techniques have different use cases:
- External tables – Provide SQL interface exposing data in external files instead of importing. Useful for reads without needing storage.
- Foreign data wrappers – Allow querying external data sources directly. Avoid data duplication while benefiting from PostgreSQL processing.
- Streaming replication – Tools like Kafka stream changes from databases then replicate to analytics systems. Better for incremental appends.
- EL/ETL packages – Specialized extract-load tools are popular for capturing changed data from upserts or CDC logs, though they require more coding and setup than COPY.
My recommendation – leverage COPY FROM STDIN where possible then turn to alternatives like data integration pipelines for subsequent updating. This balances performance and flexibility.
Recommendations for Effective Data Loading
Based on the tunings and comparisons covered, here is my recommended approach:
1. Prep the data files – Clean, transform and validate input outside PostgreSQL first. This ensures only compliant data gets imported.
2. Stage uncompressed – Staging the final data files uncompressed avoids compress/decompress overhead during the load itself.
3. Bulk copy with verification – Run COPY FROM STDIN, then verify success with SELECT COUNT(*). Restart or roll back on any errors.
4. Rebuild indexes/constraints – Add indexes and constraints appropriate for analytics after loading, to avoid slowing COPY down mid-ingestion.
5. Periodic refresh – Append new batches on a routine schedule via COPY FROM STDIN or by tailing file changes.
This tiered process works smoothly from initial seeding through ongoing change data capture, while minimizing production load impact.
Conclusion
As we've explored, leveraging PostgreSQL's COPY FROM STDIN capability offers big wins for ingesting large datasets into analytical databases. By streaming data directly into tables, it eclipses standard approaches like INSERT statements and external tables in both simplicity and performance.
Carefully following PostgreSQL best practices allows efficiently loading even multi-terabyte datasets for modern data warehousing needs. Just beware of potential footguns like lock contention or invalid input data.
For one-time bulk populating or migrating databases, COPY FROM STDIN is hard to beat, easily sustaining very high import throughput. This future-proofs analytical systems for sensor data, high-frequency logs and other mammoth workloads.
Give COPY FROM STDIN a try with your next big data migration or nightly ETL pipeline and see firsthand the difference it makes. Properly tuned, it can massively accelerate analytics throughput while requiring only standard PostgreSQL – no need for exotic extensions.


