Improve Snowflake bind array inserts operations using adbc_ingest() #1322
Closed
Description
Problem: Bulk inserting a pyarrow table using adbc_ingest() issues a separate bind array insert operation for every record batch. Inserting 36 million rows spread across the 3,000 record batches produced by the pyarrow multithreaded CSV reader takes 3 hours with adbc_ingest().
If the data is reorganized into 36 record batches of 1 million rows each, the same adbc_ingest() call takes only 3 minutes.
Sample code:
import pyarrow as pa

# my_data: the pyarrow Table loaded by the multithreaded CSV reader.
# Accumulate column chunks until ~1 million rows, then emit one large RecordBatch.
new_recordbatch = {column: [] for column in my_data.schema.names}
new_batches = []
rows = 0
for batch in my_data.to_batches():
    for column in my_data.schema.names:
        new_recordbatch[column].append(batch[column])
    rows += batch.num_rows
    if rows >= 1_000_000:
        print(rows)
        new_arrays = [pa.concat_arrays(new_recordbatch[column])
                      for column in my_data.schema.names]
        new_batches.append(pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))
        new_recordbatch = {column: [] for column in my_data.schema.names}
        rows = 0
# Flush any remaining rows into a final batch.
if rows > 0:
    new_arrays = [pa.concat_arrays(new_recordbatch[column])
                  for column in my_data.schema.names]
    new_batches.append(pa.RecordBatch.from_arrays(new_arrays, schema=my_data.schema))
new_table = pa.Table.from_batches(new_batches)

cursor = adbc_conn.cursor()
cursor.adbc_ingest(table_name="xyz", data=new_table, mode="append")
cursor.execute("commit")