fix(parquet/pqarrow): fix definition levels with non-nullable lists #325
Merged
zeroshade merged 2 commits into apache:main on Mar 24, 2025
lidavidm (Member) approved these changes on Mar 22, 2025 and left a comment:
Is this the kind of thing where we should try to push out a new release ASAP to avoid "bad" files being written?
Is there a way to "fix" a bad file?
Author (Member) replied:
There isn't really a way to fix a "bad" file other than rewriting it, unfortunately. It might be worth a patch release to fix this.
Rationale for this change
This is related to apache/arrow#38503. During testing in apache/iceberg-go#357, it was found that a Parquet file written by pqarrow was incompatible with pyarrow. After some digging I determined the cause, so we should fix it and finally put #38503 to bed.
What changes are included in this PR?
This PR fixes the computation of definition levels when writing from Arrow data with lists of non-nullable elements. Previously we were basing `nullableInParent` on whether the type was a list or a map, assuming that the element was always nullable in a list. This, of course, is incorrect: we need to actually check the nullability of the list element to compute the correct definition level. This way we don't produce incongruous levels that are larger than what the max definition level should be.
Are these changes tested?
Yes, I've updated the corresponding test, which wasn't actually testing a non-nullable element until now.
Are there any user-facing changes?
None beyond the fix itself: files written from Arrow lists with non-nullable elements now get correct definition levels.
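To make the definition-level issue described under "What changes are included in this PR?" concrete, here is a minimal sketch in Go. It is not the pqarrow implementation: `listField`, `maxDefLevel`, and `levelAssumingNullableElem` are hypothetical names used only for illustration, and the sketch assumes Parquet's standard three-level list encoding, where a nullable list adds one level, the repeated group adds one, and a nullable element adds one more.

```go
package main

import "fmt"

// listField is a hypothetical, simplified stand-in for an Arrow list field.
type listField struct {
	listNullable bool // is the list itself nullable?
	elemNullable bool // is the list element nullable?
}

// maxDefLevel returns the maximum definition level of the leaf column,
// assuming the three-level Parquet list encoding.
func maxDefLevel(f listField) int {
	level := 1 // the repeated group always contributes one level
	if f.listNullable {
		level++
	}
	if f.elemNullable {
		level++
	}
	return level
}

// levelAssumingNullableElem mimics the incorrect assumption described
// above: the element is treated as nullable regardless of its real type.
func levelAssumingNullableElem(f listField) int {
	f.elemNullable = true
	return maxDefLevel(f)
}

func main() {
	// A nullable list whose elements are non-nullable.
	f := listField{listNullable: true, elemNullable: false}
	fmt.Println("max definition level:", maxDefLevel(f))
	fmt.Println("level under the old assumption:", levelAssumingNullableElem(f))
}
```

In this sketch, the level computed under the old assumption exceeds the schema's maximum whenever the element is non-nullable. That is the kind of incongruity the PR description refers to. When the element really is nullable, both computations agree.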