Skip to content

ARROW-5889: [C++][Parquet] Add property to indicate origin from converted type to TimestampLogicalType#4856

Closed
tpboudreau wants to merge 1 commit intoapache:masterfrom
tpboudreau:ARROW-5889
Closed

ARROW-5889: [C++][Parquet] Add property to indicate origin from converted type to TimestampLogicalType#4856
tpboudreau wants to merge 1 commit intoapache:masterfrom
tpboudreau:ARROW-5889

Conversation

@tpboudreau
Copy link
Copy Markdown
Contributor

This patch:

  • adds a new boolean member isFromConvertedType() to the TimestampLogicalType class that is set to "true" if the LogicalType was created from a converted type of TIMESTAMP_MILLIS or TIMESTAMP_MICROS
  • prevents writing the TimestampLogicalType in the Parquet schema if this new property is true
  • changes the Arrow reader to ignore the isAdjustedToUTC() property of the TimestampLogicalType if the type annotation came from a converted type

Copy link
Copy Markdown
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Also tested it locally and can confirm this now properly reads older files in pyarrow.

@tpboudreau
Copy link
Copy Markdown
Contributor Author

Thanks @jorisvandenbossche for testing it out. When you say it works on older files, did you use the files attached to the Jira issue by the reporter? I've been considering adding a test using those as fixtures. Thanks again.

@jorisvandenbossche
Copy link
Copy Markdown
Member

Yes, I checked with those files (as well as something I wrote myself with an older version). I certainly think we should add such tests reading older files (I just opened https://issues.apache.org/jira/browse/ARROW-5915 for that).
But, if we do it like that, we should probably write a script that generates such files (so that it is reproducible) instead of adding the files attached to the issue. And also, we should discuss how to do a more fully integrated forward+backwards compatibility testing (although IMO that does not prevent adding some basic back compat tests)

@tpboudreau
Copy link
Copy Markdown
Contributor Author

Great. Thanks for testing the reporter's files.

Thanks also for opening that issue for compatibility testing -- I generally agree with you -- I'll comment further on that Jira issue when I get the chance.

@wesm
Copy link
Copy Markdown
Member

wesm commented Jul 12, 2019

@tpboudreau this needs to be rebased now, let me know if you need help

@tpboudreau
Copy link
Copy Markdown
Contributor Author

Rebased and RFR (I'll monitor the TravisCI build after it kicks off).

@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #4856 into master will decrease coverage by 11.39%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4856       +/-   ##
===========================================
- Coverage   87.08%   75.68%    -11.4%     
===========================================
  Files         883       58      -825     
  Lines      116941     3735   -113206     
  Branches     1418        0     -1418     
===========================================
- Hits       101837     2827    -99010     
+ Misses      14742      908    -13834     
+ Partials      362        0      -362
Impacted Files Coverage Δ
cpp/src/arrow/csv/chunker-test.cc
cpp/src/parquet/column_page.h
cpp/src/parquet/bloom_filter-test.cc
cpp/src/arrow/array/builder_decimal.cc
go/arrow/memory/memory_avx2_amd64.go
cpp/src/arrow/io/test-common.h
cpp/src/arrow/util/int-util-test.cc
cpp/src/parquet/column_scanner.cc
...p/src/gandiva/precompiled/epoch_time_point_test.cc
cpp/src/parquet/thrift.h
... and 810 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3e70357...7b51bba. Read the comment docs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't true anymore -- I'll remove this comment and the is_serialized method

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, GitHub was hiding files from me so this is indeed what is going on in this patch. Will take a closer look

Copy link
Copy Markdown
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I see that the is_serialized override just allows us to round-trip the metadata faithfully. If a file is read to Arrow, then written back, then the logical type will be written (I think)

@wesm wesm closed this in 99b9702 Jul 13, 2019
kszucs pushed a commit that referenced this pull request Jul 16, 2019
…rted type to TimestampLogicalType

This patch:
 - adds a new boolean member isFromConvertedType() to the TimestampLogicalType class that is set to "true" if the LogicalType was created from a converted type of TIMESTAMP_MILLIS or TIMESTAMP_MICROS
- prevents writing the TimestampLogicalType in the Parquet schema if this new property is true
- changes the Arrow reader to ignore the isAdjustedToUTC() property of the TimestampLogicalType if the type annotation came from a converted type

Author: TP Boudreau <tpboudreau@gmail.com>

Closes #4856 from tpboudreau/ARROW-5889 and squashes the following commits:

458f063 <TP Boudreau> Add property showing converted type origin to TimestampLogicalType
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants