Skip to content

fix(go/adbc/driver/snowflake): Records dropped on ingestion when empty batch is present#1866

Merged
zeroshade merged 8 commits intoapache:mainfrom
joellubi:sf-ingest-dropped-rows
May 21, 2024
Merged

fix(go/adbc/driver/snowflake): Records dropped on ingestion when empty batch is present#1866
zeroshade merged 8 commits intoapache:mainfrom
joellubi:sf-ingest-dropped-rows

Conversation

@joellubi
Copy link
Copy Markdown
Member

@joellubi joellubi commented May 15, 2024

Reproduces and fixes: #1847

Parquet files with empty row groups are valid per the spec, but Snowflake does not currently handle them properly. To mitigate this we buffer writes to the parquet file so that a row group is not written until some amount of data has been received.

The CheckedAllocator was enabled for all tests as part of this fix, which detected a leak in the BufferWriter that was fixed in: apache/arrow#41698.

There was an unrelated test failure that surfaced once the CheckedAllocator was enabled which had to do with casting decimals of certain precision. The fix is included in this PR as well.

@zeroshade
Copy link
Copy Markdown
Member

If you change line 344 of bulk_ingestion.go from pqWriter.Write(rec) to pqWriter.WriteBuffered(rec) that will fix the issue by eliminating the empty row group. That seems to be a sufficient fix for our end here.

@joellubi joellubi changed the title fix(go/driver/snowflake): Records dropped on ingestion when empty batch is present fix(go/adbc/driver/snowflake): Records dropped on ingestion when empty batch is present May 16, 2024

"github.com/apache/arrow-adbc/go/adbc"
"github.com/apache/arrow-adbc/go/adbc/driver/internal"
"github.com/apache/arrow-adbc/go/adbc/driver/snowflake"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already imported as driver on the line below this, that's why the CI is failing

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zeroshade. I had fixed that locally but have been dealing with another issue causing tests to fail. The bufferedwriter was tripping the checkedallocator test whereas the regular writer previously wasn't.

I was able to plug the original leak and have this all working but it will require a separate PR to arrow which I will push up. I did slightly expand scope here however because as a part of the testing I enabled the checked allocator for all snowflake tests and have been chasing down a few unrelated issues as well. I'll push up some commits so you can see exactly what I mean.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for chasing all this down

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most tests should pass once apache/arrow#41698 is merged and pulled in.

There is still just one test remaining with a memory leak, which is TestNonIntDecimalLowPrecision. It seems rely on compute kernels to cast decimal values to floats. This involves buffer preallocation in some cases which doesn't appear to get released properly. I haven't gotten to the bottom of this yet but if anyone else is more familiar with it or has the time, I could use a second set of eyes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like all the tests passed on the last run. Do you still need a second set of eyes or is this ready?

Copy link
Copy Markdown
Member Author

@joellubi joellubi May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, running locally all the failures cleared up except for the one I mentioned above. I took a look and it seems Github Actions is skipping the snowflake tests because the SNOWFLAKE_URI is not defined.

I'll have some time to look at it today but a second set of eyes would still be a great help.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I found the issue and pushed up the fix. We were missing a Release on a fairly narrow code path for certain precisions. All snowflake tests pass locally for me now.

@joellubi joellubi marked this pull request as ready for review May 21, 2024 10:53
}

func (suite *SnowflakeTests) SetupSuite() {
func (suite *SnowflakeTests) SetupTest() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the switch from SetupSuite (once for all the tests) to SetupTest (once for each test)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows all tests to use the SetupDriver and TeardownDriver methods defined on the DriverQuirks. This came to my attention because only the tests in validation.go which were using Setup/TeardownDriver caught the memory leak. This allows us to have the same test setup for the driver tests as well. Without this, the driver was getting created with one checked allocator and each test would create another one, preventing a full view of memory allocations.

Copy link
Copy Markdown
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looks good to me, just a few nitpicks.

@zeroshade zeroshade merged commit 75e3927 into apache:main May 21, 2024
@lidavidm lidavidm added this to the ADBC Libraries 13 milestone Jun 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

adbc_ingest() is dropping rows in Snowflake

3 participants