Improve COPY and batch insert performance #91831
Labels: C-enhancement (solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception), T-sql-queries (SQL Queries Team), meta-issue (contains a list of several other issues)
Description
COPY/Batch insert performance master tracking bug
Batch inserts and COPY in CRDB are really slow (<1 MB/s throughput). Below is a list of things we can do to speed up batch inserts and COPY at the SQL front end. Once these are done, further speedups will require work at the KV/storage layers.
COPY:
- Make COPY skip optimizer
- Fix ParseAndRequire date parsing overhead
- copy: figure out how to reduce kv costs further #90743
- Implement vectorized insert for COPY copy: add vectorize insert support used solely by copy for now #98605
- Enable insert fast path for vectorized insert, i.e. implement simple FK checks via direct lookups where possible
- Enable vectorized insert w/ FK checks. We can't use vectorized insert if there's FK checks because vectorized support for bufferNode/scanBuffer is missing.
- Optimize maybeCondense overhead in kvcoord (https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1668549066508759)
- Optimize kv.Batch put for bulk puts (ie do smart request and result allocations). PR
- Explore using prefix compression at the KV client level for bulk puts
- Make COPY use (mostly) Datum-free columnar Batches and build column-oriented KV batches instead of row-oriented ones, i.e. put all the PK inserts first followed by each secondary index; the idea is to group the KV commands destined for the same range together in the kv.Batch.
- Optimize small copydata message pattern (already prototyped)
- Put COPY I/O on a separate goroutine from insertion execution (already prototyped)
- copy: figure out why copy doesn't scale #99320
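The "separate goroutine" item above is a classic producer/consumer split: parse copydata messages on one goroutine while another drains decoded batches into KV, so network I/O overlaps with insertion. A minimal, hypothetical Go sketch of the shape (the `rowBatch` type and `copyPipeline` function are invented for illustration; real code would decode pgwire messages and build a kv.Batch in the consumer):

```go
package main

import (
	"fmt"
	"sync"
)

// rowBatch stands in for one decoded chunk of COPY data.
type rowBatch struct {
	rows int
}

// copyPipeline decouples reading/decoding of COPY data (producer) from
// insertion (consumer) via a buffered channel, so the two stages overlap.
func copyPipeline(batches []rowBatch) int {
	ch := make(chan rowBatch, 4) // small buffer smooths bursty input
	var wg sync.WaitGroup
	total := 0

	wg.Add(1)
	go func() {
		defer wg.Done()
		// Consumer: placeholder for encoding rows and sending kv.Batches.
		for b := range ch {
			total += b.rows
		}
	}()

	// Producer: placeholder for parsing copydata messages off the wire.
	for _, b := range batches {
		ch <- b
	}
	close(ch)
	wg.Wait()
	return total
}

func main() {
	fmt.Println(copyPipeline([]rowBatch{{100}, {250}, {50}})) // prints 400
}
```

The channel buffer size is a tuning knob: too small and the stages serialize, too large and memory for decoded-but-unwritten rows grows.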
Batch insert:
- Make single-statement batch inserts use the COPY fast path (optimizer pass-through for VALUES); are there any common multi-statement batch patterns we should try to optimize as well?
- sql: make prepared insert batches use bulk inserts #99319
- Make sure secondary indexes are in separate ranges to get range parallelism; this is already done by default sometimes, see make new/empty table range splitting consistent #91087
- Use separate goroutines to prepare and insert each index and column family; should we sort indexes too, or let KV handle it?
IMPORT:
- Make import use datum free vectorized code
INDEX BACKFILL:
Does it make sense for index backfill to use the columnar encoding code? Don't see why not...
Speculative ideas, not currently being pursued:
- Can we just do an import if the table was created in the same transaction as the COPY? See sql: support COPY ... with FREEZE #85573
- Can batch inserts cheat if the table was created in the same transaction? Can we detect an empty table and use AddSST? Empty-table/same-transaction DDL tricks feel like wasted energy: most inserts won't be into an empty table and won't be in the same txn as the CREATE TABLE.
- Can we take secondary indexes offline and update them lazily outside scope of transaction? Maybe just for non-unique indexes?
Jira issue: CRDB-21448
Epic CRDB-25320