Improve IMPORT performance for larger unsorted data sets #79615
Description
Is your feature request related to a problem? Please describe.
When attempting to run IMPORT using un-sorted Avro data, the IMPORT runs very slowly for larger data sets (e.g., > 1 TiB).
For example, running IMPORT on a 16 x 16vCPU cluster with 64 x 24 GiB un-sorted Avro files (1.5 TiB) takes about 30 hours, which averages ~2.8 MiB/node/sec.
Smaller IMPORTs are much faster per node: running IMPORT on a 10 x 16vCPU cluster with 20 x 5.7 GiB un-sorted Avro files (114 GiB) takes about 35 minutes, which averages ~17 MiB/node/sec.
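For anyone wanting to reproduce the per-node figures, here is a rough sketch of the arithmetic. The helper name `per_node_mib_per_sec` and the `replication` parameter are illustrative only and do not come from CockroachDB. Dividing the raw input size by wall time and node count gives roughly 0.9 and 5.6 MiB/node/sec; the quoted ~2.8 and ~17 figures line up only if the bytes are multiplied by 3, which would match a 3x replication factor. That factor is an assumption on my part, not something stated in this issue. Either way, the large import comes out about 6x slower per node than the small one.

```python
MIB_PER_GIB = 1024


def per_node_mib_per_sec(files, gib_per_file, nodes, seconds, replication=1):
    """Average per-node throughput in MiB/sec for an import.

    `replication` is a hypothetical multiplier for bytes written per
    logical byte; 1 means "count only the input data".
    """
    total_mib = files * gib_per_file * MIB_PER_GIB * replication
    return total_mib / seconds / nodes


# Large import: 64 x 24 GiB files, 16 nodes, about 30 hours.
large_logical = per_node_mib_per_sec(64, 24, 16, 30 * 3600)
# Small import: 20 x 5.7 GiB files, 10 nodes, about 35 minutes.
small_logical = per_node_mib_per_sec(20, 5.7, 10, 35 * 60)

# Same runs, assuming (not confirmed by the issue) 3x replication.
large_replicated = per_node_mib_per_sec(64, 24, 16, 30 * 3600, replication=3)
small_replicated = per_node_mib_per_sec(20, 5.7, 10, 35 * 60, replication=3)

# The per-node slowdown is independent of the replication assumption.
slowdown = small_logical / large_logical

print(f"large: {large_logical:.2f} logical, {large_replicated:.2f} at 3x")
print(f"small: {small_logical:.2f} logical, {small_replicated:.2f} at 3x")
print(f"per-node slowdown of the large import: {slowdown:.1f}x")
```

The slowdown ratio is the number that matters for this request: with predictable scaling, per-node throughput would stay roughly flat as the data set grows.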
Describe the solution you'd like
Ideally, there would be higher throughput and predictable scaling for larger un-sorted Avro data sets.
Describe alternatives you've considered
Sorting the Avro data within and across files was considered but does not fit into the existing workflow / datasource.
Additional context
https://github.com/cockroachlabs/support/issues/1464
Jira issue: CRDB-14945
Work in progress
- kv/bulk: parallelize sending SSTs due to range bounds #79967
- storage,kvserver: Improve SST collision checking for wide SSTs #81062
Investigations & prior art
- storage: Use prefix iteration for CheckSSTConflicts #73514 (mostly reverted below, considering resuscitating)
- storage: Use optimized SeekGE in CheckSSTConflicts #73981 (mostly reverts above)
Epic CRDB-16237