GH-35746: [Parquet][C++][Python] Switch default Parquet version to 2.6#36137
GH-35746: [Parquet][C++][Python] Switch default Parquet version to 2.6#36137jorisvandenbossche merged 9 commits intoapache:mainfrom
Conversation
|
|
|
Taking a quick look at some of the Python failures:
|
|
For C++, the test failures are in and I suppose those will also be updated to reflect the new defaults (nanoseconds are preserved), or updated to test both old default of 2.4 and new of 2.6 |
|
@jorisvandenbossche This was so helpful! Yeah, with the |
jorisvandenbossche
left a comment
There was a problem hiding this comment.
The python changes all look good!
@mapleFU could you have a look at the C++ test changes?
At the moment it is mostly updating the tests for the new default, not fully sure how well the explicit versions are covered in other tests.
| _check_roundtrip(table, expected, version=ver, coerce_timestamps='us') | ||
|
|
||
| # TODO: after pyarrow allows coerce_timestamps='ns', tests like the | ||
| # following should pass ... |
There was a problem hiding this comment.
Could also be left for a follow-up, but we could now allow this (in _create_arrow_writer_properties in _parquet.pyx we would need to update the cython bindings to pass it through, I think on C++ side it's already implemented)
| arrow_writer_props.store_schema(); | ||
| if (inner_type->id() == ::arrow::Type::UINT32) { | ||
| writer_props.version(ParquetVersion::PARQUET_2_4); | ||
| writer_props.version(ParquetVersion::PARQUET_2_6); |
There was a problem hiding this comment.
The default writer_props created above with WriterProperties::Builder(); will now use 2_6 by default, so potentially this could be removed alltogether
| -1}, | ||
| {"timestamp(nanosecond, CET)", ::arrow::timestamp(::arrow::TimeUnit::NANO, "CET"), | ||
| LogicalType::Timestamp(true, LogicalType::TimeUnit::MICROS), ParquetType::INT64, | ||
| LogicalType::Timestamp(true, LogicalType::TimeUnit::NANOS), ParquetType::INT64, |
There was a problem hiding this comment.
Should we extract NANOS, and test with both 2_4 and 2_6? Since 2.4 might still be used for some time
|
The test in C++ looks ok. But I'm afraid that we might leak some test under 2.4, so maybe we can separate and test both 2.4 and 2.6? |
|
|
|
@github-actions crossbow submit -g integration |
|
Revision: 932de52 Submitted crossbow builds: ursacomputing/crossbow @ actions-34c5b677fb |
| field("ts:us", t_us), field("ts:ns", t_ns)}); | ||
| auto input_table = Table::Make(input_schema, {a_s, a_ms, a_us, a_ns}); | ||
|
|
||
| auto parquet_version_1_properties = ::parquet::default_writer_properties(); |
Co-authored-by: mwish <1506118561@qq.com>
I think this failure is not related, the AsofJoinTest regularly fails on main as well. See #36482 |
|
There might still be a few cases where the test coverage for all options can be improved (see some of the comments), but if needed, let's do that in a follow-up PR. |
|
Thanks @anjakefala! |
|
Conbench analyzed the 6 benchmark runs on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Change the default parquet version to 2.6.
Discussed in:
ML
Issue
Closes: [Parquet][C++][Python] Bump the default format version from 2.4 -> 2.6 #35746