Improve COPY and batch insert performance #91831
Labels: C-enhancement (solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception), T-sql-queries (SQL Queries Team), meta-issue (contains a list of several other issues)
Description
COPY/Batch insert performance master tracking bug
Batch inserts and COPY in CRDB are really slow (<1 MB/s throughput). Below is a list of things we can do to speed up batch inserts and COPY at the SQL front end. Once these are done, further speedups will require work at the KV/storage layers.
COPY:
- Make COPY skip optimizer
- Fix ParseAndRequire date parsing overhead
- copy: figure out how to reduce kv costs further #90743
- Implement vectorized insert for COPY copy: add vectorize insert support used solely by copy for now #98605
- Enable insert fast path for vectorized insert, i.e. implement simple FK checks via direct lookups where possible
- Enable vectorized insert w/ FK checks. We can't use vectorized insert if there's FK checks because vectorized support for bufferNode/scanBuffer is missing.
- Optimize maybeCondense overhead in kvcoord (https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1668549066508759)
- Optimize kv.Batch put for bulk puts (ie do smart request and result allocations). PR
- Explore using prefix compression at the KV client level for bulk puts
- Make COPY use (mostly) Datum-free columnar Batches and build column-oriented KV batches instead of row-oriented ones, i.e. put all the PK inserts first followed by each secondary index; the idea is to group the KV commands destined for the same range together in the kv.Batch.
- Optimize small copydata message pattern (already prototyped)
- Put COPY I/O on a separate goroutine from insertion execution (already prototyped)
- copy: figure out why copy doesn't scale #99320
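The "separate goroutine" item above is a classic producer/consumer split: parse copydata messages on one goroutine while another drains decoded batches into KV, so network I/O overlaps with insertion. A minimal, hypothetical Go sketch of the shape (the `rowBatch` type and `copyPipeline` function are invented for illustration; real code would decode pgwire messages and build a kv.Batch in the consumer):

```go
package main

import (
	"fmt"
	"sync"
)

// rowBatch stands in for one decoded chunk of COPY data.
type rowBatch struct {
	rows int
}

// copyPipeline decouples reading/decoding of COPY data (producer) from
// insertion (consumer) via a buffered channel, so the two stages overlap.
func copyPipeline(batches []rowBatch) int {
	ch := make(chan rowBatch, 4) // small buffer smooths bursty input
	var wg sync.WaitGroup
	total := 0

	wg.Add(1)
	go func() {
		defer wg.Done()
		// Consumer: placeholder for encoding rows and sending kv.Batches.
		for b := range ch {
			total += b.rows
		}
	}()

	// Producer: placeholder for parsing copydata messages off the wire.
	for _, b := range batches {
		ch <- b
	}
	close(ch)
	wg.Wait()
	return total
}

func main() {
	fmt.Println(copyPipeline([]rowBatch{{100}, {250}, {50}})) // prints 400
}
```

The channel buffer size is a tuning knob: too small and the stages serialize, too large and memory for decoded-but-unwritten rows grows.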
Batch insert:
- Make single-statement batch inserts use the COPY fast path (optimizer pass-through for VALUES); are there any common multi-statement batch patterns we should try to optimize as well?
- sql: make prepared insert batches use bulk inserts #99319
- Make sure secondary indexes are in separate ranges to get range parallelism; this is already done by default sometimes, see make new/empty table range splitting consistent #91087
- Use separate goroutines to prepare and insert each index and column family; should we sort indexes too, or let KV handle it?
IMPORT:
- Make import use datum free vectorized code
INDEX BACKFILL:
Does it make sense for index backfill to use the columnar encoding code? Don't see why not...
Speculative ideas, not currently being pursued:
- Can we just do an import if the table was created in the same transaction as the COPY? See sql: support COPY ... with FREEZE #85573
- Can batch inserts cheat if the table was created in the same transaction? Can we detect an empty table and use AddSST? Empty-table/same-transaction DDL tricks feel like wasted energy: most inserts won't be into an empty table and won't be in the same txn as the CREATE TABLE.
- Can we take secondary indexes offline and update them lazily outside scope of transaction? Maybe just for non-unique indexes?
Jira issue: CRDB-21448
Epic CRDB-25320