Improve Snowflake bind array insert operations using adbc_ingest() #1322

@davlee1972

Description

Problem: Bulk inserting a pyarrow table with adbc_ingest() issues a separate bind array insert operation for every record batch. Inserting 36 million rows spread across roughly 3,000 record batches (as produced by the pyarrow multithreaded CSV reader) takes 3 hours.

If you reorganize the data into 36 record batches of 1 million rows each, the same adbc_ingest() call takes only 3 minutes.

Sample code:

import pyarrow as pa

# Accumulate the reader's small record batches into ~1M-row chunks.
new_recordbatch = {column: [] for column in my_data.schema.names}
new_batches = []
rows = 0

for batch in my_data.to_batches():
    for column in my_data.schema.names:
        new_recordbatch[column].append(batch[column])
    rows += batch.num_rows
    if rows >= 1000000:
        # Flush the accumulated arrays as one large record batch.
        new_arrays = [
            pa.concat_arrays(new_recordbatch[column])
            for column in my_data.schema.names
        ]
        new_batches.append(pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))
        new_recordbatch = {column: [] for column in my_data.schema.names}
        rows = 0

# Flush any remaining rows.
if rows > 0:
    new_arrays = [
        pa.concat_arrays(new_recordbatch[column])
        for column in my_data.schema.names
    ]
    new_batches.append(pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))

new_table = pa.Table.from_batches(new_batches)

cursor = adbc_conn.cursor()
cursor.adbc_ingest(table_name="xyz", data=new_table, mode="append")
cursor.execute("commit")
