Conversation


@gruuya gruuya commented Jun 16, 2023

Instead of buffering the entire file in memory, stream the incoming bits to a local temp file, and then scan it when appending to the target table.

Memory implications (using a 160 MB Parquet file):

  • Load-file-in-memory approach (current state): [memory profile screenshot]
  • With streaming to disk (this PR): [memory profile screenshot]

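The streaming approach above can be sketched in plain Rust (a minimal, hypothetical illustration using only the standard library; the actual PR works with the server's async request body and temp-file handling, and `stream_to_tempfile` is an invented name):

```rust
use std::fs::File;
use std::io::{Read, Write};

/// Hypothetical sketch: copy an incoming byte stream to a temp file in
/// fixed-size chunks, so peak memory stays at the buffer size (8 KiB here)
/// instead of the full upload size. Returns the temp path and byte count.
fn stream_to_tempfile<R: Read>(mut incoming: R) -> std::io::Result<(std::path::PathBuf, u64)> {
    let path = std::env::temp_dir().join("upload.parquet");
    let mut out = File::create(&path)?;
    let mut buf = [0u8; 8192];
    let mut total = 0u64;
    loop {
        let n = incoming.read(&mut buf)?;
        if n == 0 {
            break; // end of stream
        }
        out.write_all(&buf[..n])?;
        total += n as u64;
    }
    Ok((path, total))
}

fn main() -> std::io::Result<()> {
    // Simulate a 1 MB incoming payload with an in-memory reader;
    // only the 8 KiB buffer is ever resident beyond the source itself.
    let payload = vec![42u8; 1_000_000];
    let (path, written) = stream_to_tempfile(&payload[..])?;
    println!("wrote {} bytes to {}", written, path.display());
    std::fs::remove_file(path)?;
    Ok(())
}
```

Once the bytes are on disk, the file can be scanned with a regular Parquet reader when appending to the target table, rather than parsing an in-memory buffer.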
In addition, bump DataFusion to a post-26 version to pick up schema coercion in the Parquet reader, thus closing #179. Also add some tests demonstrating the new flexibility of the upload endpoint (implicit type casting, column skipping, etc.).
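To illustrate what the schema coercion enables (a hedged, standalone sketch, not DataFusion's actual code; `coerce_to_i64` is an invented name): an uploaded Parquet file whose column has a narrower physical type, e.g. INT32, can now be appended to a target BIGINT (i64) column via an implicit widening cast instead of being rejected for a schema mismatch.

```rust
/// Hypothetical illustration of implicit type casting during ingest:
/// widen INT32 parquet values to the target table's i64 column type.
fn coerce_to_i64(col: &[i32]) -> Vec<i64> {
    col.iter().map(|&v| v as i64).collect()
}

fn main() {
    // An INT32 column from the uploaded file...
    let parquet_col: Vec<i32> = vec![1, 2, 3];
    // ...is cast to match the table's BIGINT column on append.
    let target_col = coerce_to_i64(&parquet_col);
    assert_eq!(target_col, vec![1i64, 2, 3]);
    println!("coerced {} values", target_col.len());
}
```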

Stream the incoming file bits to a local temp file instead of buffering it all in memory.
@gruuya gruuya requested a review from mildbyte June 16, 2023 12:39
@gruuya gruuya force-pushed the upload-refinements branch from 5a1ef15 to d1dab6f Compare June 16, 2023 20:41
@gruuya gruuya force-pushed the upload-refinements branch from d1dab6f to c9fb93f Compare June 16, 2023 20:43
@gruuya gruuya merged commit b6fa348 into main Jun 19, 2023
@gruuya gruuya deleted the upload-refinements branch June 19, 2023 07:29


Development

Successfully merging this pull request may close these issues.

Improve Parquet upload experience (schema coercion, error reporting)
