Conversation

@clee704 (Contributor) commented May 8, 2024

Rationale for this change

The Parquet standard allows reading and writing key-value metadata in ColumnChunkMetaData, but there is no way to do that with Parquet C++.

What changes are included in this PR?

Support reading and writing key-value metadata in ColumnChunkMetaData with the Parquet C++ reader/writer, and support reading it with pyarrow.parquet.

Are these changes tested?

Yes, unit tests are added

Are there any user-facing changes?

Yes.

  • Users can read or write key-value metadata for column chunks with Parquet C++.
  • Users can read key-value metadata for column chunks with PyArrow.
  • parquet-reader tool prints key-value metadata in column chunks when --print-key-value-metadata option is used.

@github-actions bot commented May 8, 2024

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}


@clee704 clee704 changed the title Support reading/writing key-value metadata from/to ColumnMetaData [GH-41579] Support reading/writing key-value metadata from/to ColumnMetaData May 8, 2024
@clee704 clee704 changed the title [GH-41579] Support reading/writing key-value metadata from/to ColumnMetaData GH-41579: [C++] Support reading/writing key-value metadata from/to ColumnMetaData May 8, 2024
@clee704 clee704 changed the title GH-41579: [C++] Support reading/writing key-value metadata from/to ColumnMetaData GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnMetaData May 8, 2024
@github-actions bot commented May 8, 2024

⚠️ GitHub issue #41579 has been automatically assigned in GitHub to PR creator.

@clee704 clee704 changed the title GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnMetaData GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnChunkMetaData May 8, 2024
@clee704 clee704 marked this pull request as ready for review May 8, 2024 05:00
@clee704 clee704 requested a review from wgtmac as a code owner May 8, 2024 05:00
@mapleFU mapleFU changed the title GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnChunkMetaData GH-41579: [C++][Python][Parquet] Support reading/writing key-value metadata from/to ColumnChunkMetaData May 8, 2024
@mapleFU (Member) left a comment

The general idea LGTM.

cc @pitrou @emkornfield @wgtmac

@github-actions github-actions bot added the 'awaiting committer review' label and removed the 'awaiting review' label May 12, 2024
@mapleFU mapleFU requested a review from pitrou May 12, 2024 05:46
@wgtmac (Member) commented May 12, 2024

I saw a related issue: https://issues.apache.org/jira/browse/PARQUET-1648. It seems that parquet-mr does not use it yet.

@mapleFU (Member) commented May 12, 2024

It seems that parquet-mr does not use it yet.

Can parquet-mr read that?

@mapleFU (Member) commented May 12, 2024

🤔 At least this patch is OK; the field is already in the Thrift definition that other implementations share, so it doesn't break the standard.

@wgtmac (Member) commented May 12, 2024

I agree. It should not block us from implementing this.

@pitrou (Member) commented May 13, 2024

@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR?

@clee704 (Contributor, Author) commented May 29, 2024

@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR?

Created apache/parquet-testing#49

@mapleFU (Member) commented Aug 4, 2024

Will check the Cython build:

 Error compiling Cython file:
  ------------------------------------------------------------
  ...
          return self.metadata.GetColumnIndexLocation().has_value()

      @property
      def metadata(self):
          """Additional metadata as key value pairs (dict[bytes, bytes])."""
          wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
                    ^
  ------------------------------------------------------------

  pyarrow/_parquet.pyx:514:18: undeclared name not builtin: pyarrow_wrap_metadata

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          return self.metadata.GetColumnIndexLocation().has_value()

      @property
      def metadata(self):
          """Additional metadata as key value pairs (dict[bytes, bytes])."""
          wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
                                                                          ^
  ------------------------------------------------------------

  pyarrow/_parquet.pyx:514:72: Cannot convert 'shared_ptr[const CKeyValueMetadata]' to Python object
  [17/136] Compiling Cython CXX source for _dataset...
  performance hint: pyarrow/_dataset.pyx:1947:35: Exception check after calling 'GetResultValue[shared_ptr[CRandomAccessFile]]' will always require the GIL to be acquired. Declare 'GetResultValue[shared_ptr[CRandomAccessFile]]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: pyarrow/_dataset.pyx:3343:41: Exception check after calling 'GetResultValue[shared_ptr[CRecordBatch]]' will always require the GIL to be acquired. Declare 'GetResultValue[shared_ptr[CRecordBatch]]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: pyarrow/_dataset.pyx:3387:34: Exception check after calling 'GetResultValue[CTaggedRecordBatch]' will always require the GIL to be acquired. Declare 'GetResultValue[CTaggedRecordBatch]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: pyarrow/_dataset.pyx:3784:42: Exception check after calling 'GetResultValue[CTaggedRecordBatchIterator]' will always require the GIL to be acquired. Declare 'GetResultValue[CTaggedRecordBatchIterator]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  [18/136] Building CXX object CMakeFiles/arrow_python.dir/pyarrow/src/arrow/python/datetime.cc.o

@mapleFU mapleFU requested a review from pitrou August 6, 2024 02:32
@mapleFU (Member) commented Aug 7, 2024

@wgtmac @pitrou Do you have any more comments here? Let's move forward now.

@mapleFU (Member) commented Aug 13, 2024

Will merge this week if there are no negative comments.

@wgtmac (Member) left a comment

+1 on the C++ side

@mapleFU (Member) commented Aug 15, 2024

@pitrou Would you mind taking a look at this? I'm planning to merge this PR this week if there are no negative comments.


TEST_F(TestInt32Writer, WriteKeyValueMetadata) {
  auto writer = this->BuildWriter();
  writer->AddKeyValueMetadata(KeyValueMetadata::Make({"hello"}, {"world"}));
A reviewer (Member) commented:

Can you add a test where AddKeyValueMetadata is called twice?

}

template <typename Metadata>
std::shared_ptr<KeyValueMetadata> CopyKeyValueMetadata(const Metadata& source) {
A reviewer (Member) commented:

Perhaps call this FromThriftKeyValueMetadata?

@pitrou pitrou self-requested a review August 15, 2024 14:34
@pitrou (Member) commented Aug 15, 2024

Sorry, I approved, but the two comments should probably be addressed before merging this.

@mapleFU (Member) commented Aug 15, 2024

Thanks all, merged!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 2767dc5.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

Comment on lines +31 to +37
@pytest.fixture(scope='module')
def parquet_test_datadir():
    result = os.environ.get('PARQUET_TEST_DATA')
    if not result:
        raise RuntimeError('Please point the PARQUET_TEST_DATA environment '
                           'variable to the test data directory')
    return pathlib.Path(result)
A reviewer (Member) commented:

This is the first time we introduce a pyarrow test that requires this to be set up. Do we want to require that strictly, or should we skip the test if the env variable is not set?

In any case this caused a failure in one of the nightly crossbow builds which doesn't have this env variable set (python-emscripten, https://github.com/ursacomputing/crossbow/actions/runs/10463078918/job/28974494645#step:7:15654)

A reviewer (Member) commented:

Both are ok for me, but I prefer to read that 🤔
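The pattern under discussion — require the environment variable strictly, or resolve it leniently so a caller can skip — can be reduced to a stdlib-only sketch (the helper name and the `strict` flag are illustrative, not the code that was merged):

```python
import os
import pathlib


def parquet_test_datadir(strict=False):
    """Resolve the parquet-testing data directory from PARQUET_TEST_DATA.

    With strict=True, behave like the fixture above and raise when the
    variable is unset; otherwise return None so the caller can skip.
    """
    value = os.environ.get("PARQUET_TEST_DATA")
    if not value:
        if strict:
            raise RuntimeError(
                "Please point the PARQUET_TEST_DATA environment "
                "variable to the test data directory")
        return None
    return pathlib.Path(value)


os.environ.pop("PARQUET_TEST_DATA", None)
lenient = parquet_test_datadir()   # None -> a test would call pytest.skip()
os.environ["PARQUET_TEST_DATA"] = "/opt/parquet-testing/data"
resolved = parquet_test_datadir()  # Path to the configured directory
```

In a pytest fixture, the lenient branch would typically call `pytest.skip(...)` instead of returning None, which is what the emscripten follow-up ended up doing for this test.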

@github-actions github-actions bot added the 'awaiting changes' label Aug 20, 2024
jorisvandenbossche added a commit that referenced this pull request Aug 22, 2024
…43786)

### Rationale for this change

Starting with #41580, the pyarrow tests now also rely on a file in the parquet-testing submodule. And the path to that directory is controlled by `PARQUET_TEST_DATA`, which appears to be set wrongly in the wheel test scripts, causing all wheel builds to fail at the moment.
 
* GitHub Issue: #43785

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
kou pushed a commit that referenced this pull request Sep 18, 2024
…on emscripten (#43906)

### Rationale for this change

The following PR:
- #41580

Made mandatory for a test the requirement to have `PARQUET_TEST_DATA` env defined.

This is currently not available from `python_test_emscripten.sh` as we require to mount the filesystem for both Node and ChromeDriver.

### What changes are included in this PR?

Skip the test that requires `PARQUET_TEST_DATA` for emscripten.

### Are these changes tested?

Via archery

### Are there any user-facing changes?

No
* GitHub Issue: #43905
* GitHub Issue: #43868

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>