GH-45073: Update encryption test files to correct invalid repetition levels #65
Conversation
Have you changed the compression type? It seems that all files have at least 80% size reduction.
No, they were using Snappy before, and I changed the writing code to keep using Snappy. I'm not sure why there's a size reduction; I'll see if I can figure out why.
It looks like the size difference is caused by this change, which stopped writing column metadata at the end of chunks: apache/arrow#43428. If I revert that commit then the file sizes match exactly.
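One way to see where the bytes went is to compare the footer of the old and new files: a Parquet file ends with the serialized footer metadata, a 4-byte little-endian length word, and a magic marker (`PAR1` for plaintext footers, `PARE` for encrypted footers). The sketch below is a hypothetical stdlib-only helper illustrating that layout, not code from the PR:

```python
import struct

FOOTER_MAGIC = b"PAR1"            # plaintext-footer Parquet files
ENCRYPTED_FOOTER_MAGIC = b"PARE"  # files using encrypted-footer mode

def footer_metadata_length(data: bytes) -> int:
    """Return the length of the serialized footer metadata.

    A Parquet file ends with: <footer metadata> <4-byte LE length> <magic>.
    """
    magic = data[-4:]
    if magic not in (FOOTER_MAGIC, ENCRYPTED_FOOTER_MAGIC):
        raise ValueError("not a Parquet file")
    (length,) = struct.unpack("<I", data[-8:-4])
    return length

# Toy buffer standing in for a real file: header magic, fake footer
# metadata, length word, trailing magic.
fake_metadata = b"\x15\x00" * 10
buf = (FOOTER_MAGIC + fake_metadata
       + struct.pack("<I", len(fake_metadata)) + FOOTER_MAGIC)
print(footer_metadata_length(buf))  # → 20
```

Comparing this footer length (and each column chunk's compressed size) between the old and new files would show whether the reduction comes from metadata rather than from a change in page compression.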
Thanks for the investigation! That makes sense.
BTW, these files are used by parquet-java: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryptionOptions.java. We need to make sure that these tests won't fail after migrating to the new files. |
I tested updating parquet-java to reference my fork of parquet-testing with the new commit SHA and it looks like it worked fine: https://github.com/adamreeve/parquet-java/actions/runs/12407983545. Relevant lines from the log:
Would you mind changing a "corrupt" file to the bad_file?
Also cc @ggershinsky
Just to confirm I've understood what you mean, do you mean add a file to the
You're right, let's move forward.
OK, thanks. I'm happy to add a
I think so; we can do it separately.
As a result of apache/arrow#45073, the test files in the parquet-testing repo were updated (apache/parquet-testing#65). This meant the corresponding tests here also had to be updated, similar to apache/arrow#45074.
See apache/arrow#45073. This fixes the encryption test files generated by Arrow C++ so that the repetition levels are correct. I tried to avoid changing other properties of these files, such as the compression and encodings used.
After applying the change to fix the repetition level (apache/arrow#45074), I generated the files by running the parquet-encryption-test tests with the following changes: