GH-39780: [Python][Parquet] Support hashing for FileMetaData and ParquetSchema#39781

Merged
pitrou merged 3 commits into apache:main from milesgranger:milesgranger/hash-for-filemetadata
Feb 12, 2024
Conversation

@milesgranger
Contributor

@milesgranger milesgranger commented Jan 24, 2024

I think the hash, especially for `FileMetaData`, could be better; maybe just use the return of `__repr__`, even though that won't include row group info?

Rationale for this change

Helpful for dependent projects.

What changes are included in this PR?

Implement `__hash__` for `ParquetSchema` and `FileMetaData`.

Are these changes tested?

Yes

Are there any user-facing changes?

Supports hashing metadata:

```python
In [1]: import pyarrow.parquet as pq

In [2]: f = pq.ParquetFile('test.parquet')

In [3]: hash(f.metadata)
Out[3]: 4816453453708427907

In [4]: hash(f.metadata.schema)
Out[4]: 2300988959078172540
```

@github-actions

⚠️ GitHub issue #39780 has been automatically assigned in GitHub to PR creator.

@AlenkaF
Member

AlenkaF commented Jan 25, 2024

Hi Miles! 👋
The CI failures have just been fixed, so they should go away if you rebase.

Will look at the PR shortly 👍

@milesgranger milesgranger force-pushed the milesgranger/hash-for-filemetadata branch from 14d64a1 to 55f2808 on January 25, 2024 12:29
@AlenkaF
Member

AlenkaF commented Jan 25, 2024

> I think the hash, especially for FileMetaData could be better, maybe just use return of `__repr__`, even though that won't include row group info?

I am not sure if row group info is necessary; I do not have much experience with `hash()`. But naively I would think using `__repr__` would be enough. @raulcd what do you think?

Member

I think it would be simpler, and probably faster, to serialize the metadata using `self._metadata.WriteTo` and then hash the resulting buffer contents.

Member

But then you probably want to make `pa.Buffer` hashable as well:

```python
>>> stream = pa.BufferOutputStream()
>>> stream.write(b"xxx")
3
>>> stream.getvalue()
<pyarrow.Buffer address=0x7f84af208000 size=3 is_cpu=True is_mutable=True>
>>> stream.getvalue() == b"xxx"
True
>>> hash(stream.getvalue())
Traceback (most recent call last):
  Cell In[9], line 1
    hash(stream.getvalue())
TypeError: unhashable type: 'pyarrow.lib.Buffer'
```

Member

Note that it can be as simple as taking the hash of a bytes copy:

```python
>>> hash(stream.getvalue().to_pybytes())
5574314894425565867
```

Member

But, even then, this approach will not be very performant, whereas a hash is supposed to be quick to compute:

```python
>>> md = pq.read_metadata("/home/antoine/arrow/dev/python/lineitem-snappy.pq")
>>> md
<pyarrow._parquet.FileMetaData object at 0x7f87befc19e0>
  created_by: parquet-cpp-arrow version 15.0.0-SNAPSHOT
  num_columns: 16
  num_rows: 2568534
  num_row_groups: 21
  format_version: 2.6
  serialized_size: 32440
>>> md.__reduce__()
(<cyfunction _reconstruct_filemetadata at 0x7f87c02fea80>,
 (<pyarrow.Buffer address=0x7f88082086c0 size=32440 is_cpu=True is_mutable=True>,))
>>> hash(md.__reduce__()[1][0].to_pybytes())
8779236492631648618
>>> %timeit hash(md.__reduce__()[1][0].to_pybytes())
93.6 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

It will be far worse with this PR's approach, since generating the dict is extremely costly:

```python
>>> %timeit md.to_dict()
2.95 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

(note: milliseconds, as opposed to microseconds above)

Member

As a general principle, a `__hash__` does not need to take all contents exhaustively into account if doing so is too expensive. It's ok to limit ourselves to a smaller, faster subset. For example, only the schema:

```python
>>> %timeit hash(repr(md.schema))
6.59 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

You can also convert to an Arrow schema:

```python
>>> %timeit md.schema.to_arrow_schema()
4.94 µs ± 8.25 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

Unfortunately, `pa.Schema.__hash__` is currently too slow:

```python
>>> s = md.schema.to_arrow_schema()
>>> %timeit hash(s)
12.7 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

This is because `pa.Schema.__hash__` takes the slow Python way, while the C++ Schema has a useful and fast fingerprint method (the same comment applies to `pa.Field` and `pa.DataType`).

Of course you don't have to do all this here, but a `__hash__` function that takes milliseconds to execute is definitely not a good idea.
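The "smaller, faster subset" rule can be illustrated with a plain-Python stand-in (the class and its fields are hypothetical, not pyarrow code):

```python
# Hypothetical stand-in for file metadata: hash only a cheap, stable
# subset of fields instead of the full (expensive) contents.
class MetaLike:
    def __init__(self, schema_repr, num_rows, num_row_groups, created_by):
        self.schema_repr = schema_repr
        self.num_rows = num_rows
        self.num_row_groups = num_row_groups
        self.created_by = created_by

    def __eq__(self, other):
        return isinstance(other, MetaLike) and self.__dict__ == other.__dict__

    def __hash__(self):
        # Correctness only requires that equal objects hash equally.
        # Unequal objects may collide, so omitting expensive fields
        # (e.g. per-row-group statistics) is a valid trade-off.
        return hash((self.schema_repr, self.num_rows, self.num_row_groups))


a = MetaLike("x: int32", 100, 2, "parquet-cpp-arrow")
b = MetaLike("x: int32", 100, 2, "parquet-cpp-arrow")
assert a == b and hash(a) == hash(b)
```

The hash deliberately ignores `created_by`: two objects differing only there would collide, which is allowed, while the hash stays cheap to compute.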

Contributor Author

@milesgranger milesgranger Feb 12, 2024

Thanks for confirming my suspicion that the current approach was abysmal; I just wasn't sure what belongs in there. At least the schema, but `FileMetaData` seemed like a superset of the schema in some ways, so it ought to include a few other things; I therefore added a few more attributes in the follow-up. Please feel free to suggest any changes there. :)

Timings I have for 4505938 are:

```python
In [1]: import pyarrow.parquet as pq

In [2]: meta = pq.read_metadata("../../lineitem-stats-no-index.snappy.parquet")

In [3]: schema = meta.schema.to_arrow_schema()

In [4]: %timeit hash(meta)
4.34 µs ± 12.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit hash(schema)
8.34 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Feb 6, 2024
Member

Two things:

1. `object.__hash__` hashes by identity, which is precisely what you want to avoid here:

```python
>>> s = md.schema.to_arrow_schema()
>>> s2 = md.schema.to_arrow_schema()
>>> object.__hash__(s)
8763812462659
>>> object.__hash__(s2)
8763812583391
```

2. `frombytes` is not necessary; you can just hash a bytes object directly, which should be slightly faster:

```python
>>> s.to_string()
'l_orderkey: int32\nl_partkey: int32\nl_suppkey: int32\nl_linenumber: int32\nl_quantity: double\nl_extendedprice: double\nl_discount: double\nl_tax: double\nl_returnflag: string\nl_linestatus: string\nl_shipdate: timestamp[us]\nl_commitdate: timestamp[us]\nl_receiptdate: timestamp[us]\nl_shipinstruct: string\nl_shipmode: string\nl_comment: string'
>>> b = s.to_string().encode()
>>> %timeit hash(b)
56.4 ns ± 3.83 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> %timeit hash(b.decode())
242 ns ± 2.15 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
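Both points can be reproduced without pyarrow (plain-Python illustration):

```python
# Point 1: object.__hash__ is identity-based, so two objects that are
# equal by content still hash differently.
class Box:
    def __init__(self, v):
        self.v = v

    def __eq__(self, other):
        return isinstance(other, Box) and self.v == other.v


a, b = Box(1), Box(1)
assert a == b
assert object.__hash__(a) != object.__hash__(b)  # distinct identities

# Point 2: bytes hash by value directly; no decode/encode round trip
# is needed, which is why hashing bytes beats hashing the str.
data = "l_orderkey: int32".encode()
assert hash(data) == hash(b"l_orderkey: int32")
```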

Contributor Author

Addressed in 4505938

Member

Ok, but should also check that two equal but distinct instances also hash equally.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 4505938, thanks.

Member

@pitrou pitrou left a comment

LGTM except that you need to rebase IMHO

Member

It seems this is a spurious change; can you update the submodules on your branch?

Member

Also, can you rebase on latest git main?

Contributor Author

Done and done. 👍

@milesgranger milesgranger force-pushed the milesgranger/hash-for-filemetadata branch from 4505938 to 85451e8 on February 12, 2024 12:20
@milesgranger milesgranger force-pushed the milesgranger/hash-for-filemetadata branch from db43e9d to 4fbe7a0 on February 12, 2024 12:35
@pitrou pitrou merged commit 56d1ec1 into apache:main Feb 12, 2024
@pitrou pitrou removed the "awaiting committer review" label on Feb 12, 2024
@milesgranger milesgranger deleted the milesgranger/hash-for-filemetadata branch February 12, 2024 13:29
@pitrou
Member

pitrou commented Feb 12, 2024

Thank you @milesgranger !

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 56d1ec1.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 17 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024