Improve IMPORT performance for larger unsorted data sets #79615

@jonstjohn

Is your feature request related to a problem? Please describe.

When attempting to run IMPORT using un-sorted Avro data, the IMPORT runs very slowly for larger data sets (e.g., > 1 TiB).

For example, running IMPORT on a 16 x 16vCPU cluster with 64 x 24 GiB un-sorted Avro files (1.5 TiB) takes about 30 hours, which averages ~2.8 MiB/node/sec.

A smaller IMPORT, such as one on a 10 x 16vCPU cluster with 20 x 5.7 GiB un-sorted Avro files (114 GiB), takes about 35 minutes, which averages ~17 MiB/node/sec.
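
The issue does not show how the per-node averages were derived. As a rough check, the sketch below recomputes them from the quoted sizes, durations, and node counts. Dividing the raw input size alone gives lower numbers than the ones quoted; multiplying by a write factor of 3 lands close to them. That factor is an assumption (it matches CockroachDB's default replication factor), and the helper name `per_node_mib_per_sec` is made up for this illustration.

```python
def per_node_mib_per_sec(total_gib, seconds, nodes, write_factor=1):
    """Average MiB per node per second for an import.

    write_factor is an assumed multiplier for bytes written per input byte
    (for example 3 if every replica's write is counted); it is not stated
    in the issue.
    """
    return total_gib * 1024 * write_factor / seconds / nodes


# Large import: 64 x 24 GiB files, about 30 hours, 16 nodes.
large_raw = per_node_mib_per_sec(64 * 24, 30 * 3600, 16)
large_x3 = per_node_mib_per_sec(64 * 24, 30 * 3600, 16, write_factor=3)

# Small import: 20 x 5.7 GiB files, about 35 minutes, 10 nodes.
small_raw = per_node_mib_per_sec(20 * 5.7, 35 * 60, 10)
small_x3 = per_node_mib_per_sec(20 * 5.7, 35 * 60, 10, write_factor=3)

print(f"large: raw={large_raw:.2f}, x3={large_x3:.2f} MiB/node/sec")
print(f"small: raw={small_raw:.2f}, x3={small_x3:.2f} MiB/node/sec")
print(f"per-node slowdown, small vs. large: {small_raw / large_raw:.1f}x")
```

Whichever write factor applies, the ratio between the two runs is unchanged: the large import moves data roughly six times more slowly per node than the small one, which is the scaling problem this issue describes.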

Describe the solution you'd like

Ideally, there would be higher throughput and predictable scaling for larger un-sorted Avro data sets.

Describe alternatives you've considered

Sorting the Avro data within and across files was considered but does not fit into the existing workflow / datasource.

Additional context

https://github.com/cockroachlabs/support/issues/1464

Jira issue: CRDB-14945


Work in progress

Investigations & prior art

Epic CRDB-16237

Metadata

Labels

C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-storage: Storage Team
