GH-39780: [Python][Parquet] Support hashing for FileMetaData and ParquetSchema#39781

Merged
pitrou merged 3 commits into apache:main from milesgranger:milesgranger/hash-for-filemetadata
Feb 12, 2024
Conversation

@milesgranger
Contributor

@milesgranger milesgranger commented Jan 24, 2024

I think the hash, especially for `FileMetaData`, could be better; maybe just use the return of `__repr__`, even though that won't include row group info?

Rationale for this change

Helpful for dependent projects.

What changes are included in this PR?

Implement `__hash__` for `ParquetSchema` and `FileMetaData`.

Are these changes tested?

Yes

Are there any user-facing changes?

Supports hashing metadata:

```python
In [1]: import pyarrow.parquet as pq

In [2]: f = pq.ParquetFile('test.parquet')

In [3]: hash(f.metadata)
Out[3]: 4816453453708427907

In [4]: hash(f.metadata.schema)
Out[4]: 2300988959078172540
```

@github-actions

⚠️ GitHub issue #39780 has been automatically assigned in GitHub to PR creator.

@AlenkaF
Member

AlenkaF commented Jan 25, 2024

Hi Miles! 👋
The CI failures have just been fixed, so they should go away if you rebase.

Will look at the PR shortly 👍

@milesgranger milesgranger force-pushed the milesgranger/hash-for-filemetadata branch from 14d64a1 to 55f2808 on January 25, 2024 12:29
@AlenkaF
Member

AlenkaF commented Jan 25, 2024

> I think the hash, especially for FileMetaData could be better, maybe just use return of `__repr__`, even though that won't include row group info?

I am not sure if row group info is necessary; I do not have much experience with `hash()`. But naively I would think using `__repr__` would be enough. @raulcd what do you think?

Member

I think it would be simpler, and probably faster, to serialize the metadata using `self._metadata.WriteTo` and then hash the resulting buffer contents.

Member

But then you probably want to make `pa.Buffer` hashable as well:

```python
>>> stream = pa.BufferOutputStream()
>>> stream.write(b"xxx")
3
>>> stream.getvalue()
<pyarrow.Buffer address=0x7f84af208000 size=3 is_cpu=True is_mutable=True>
>>> stream.getvalue() == b"xxx"
True
>>> hash(stream.getvalue())
Traceback (most recent call last):
  Cell In[9], line 1
    hash(stream.getvalue())
TypeError: unhashable type: 'pyarrow.lib.Buffer'
```

Member

Note that it can be as simple as taking the hash of a bytes copy:

```python
>>> hash(stream.getvalue().to_pybytes())
5574314894425565867
```

Member

But, even then, this approach will not be very performant, whereas a hash is supposed to be quick to compute:

```python
>>> md = pq.read_metadata("/home/antoine/arrow/dev/python/lineitem-snappy.pq")
>>> md
<pyarrow._parquet.FileMetaData object at 0x7f87befc19e0>
  created_by: parquet-cpp-arrow version 15.0.0-SNAPSHOT
  num_columns: 16
  num_rows: 2568534
  num_row_groups: 21
  format_version: 2.6
  serialized_size: 32440
>>> md.__reduce__()
(<cyfunction _reconstruct_filemetadata at 0x7f87c02fea80>,
 (<pyarrow.Buffer address=0x7f88082086c0 size=32440 is_cpu=True is_mutable=True>,))
>>> hash(md.__reduce__()[1][0].to_pybytes())
8779236492631648618
>>> %timeit hash(md.__reduce__()[1][0].to_pybytes())
93.6 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```

It will be far worse with this PR's approach, since generating the dict is extremely costly:

```python
>>> %timeit md.to_dict()
2.95 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

(note: milliseconds, as opposed to microseconds above)

Member

As a general principle, a `__hash__` does not need to take all contents exhaustively into account if doing so is too expensive. It's ok to limit ourselves to a smaller, faster subset. For example, only the schema:

```python
>>> %timeit hash(repr(md.schema))
6.59 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

You can also convert to an Arrow schema:

```python
>>> %timeit md.schema.to_arrow_schema()
4.94 µs ± 8.25 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

Unfortunately, `pa.Schema.__hash__` is currently too slow:

```python
>>> s = md.schema.to_arrow_schema()
>>> %timeit hash(s)
12.7 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

This is because `pa.Schema.__hash__` takes the slow Python way, while the C++ Schema has a useful and fast fingerprint method (the same comment applies to `pa.Field` and `pa.DataType`).

Of course you don't have to do all this here, but a `__hash__` function that takes milliseconds to execute is definitely not a good idea.
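The "smaller, faster subset" rule can be illustrated with a plain-Python stand-in (the class and its fields are hypothetical, not pyarrow code):

```python
# Hypothetical stand-in for file metadata: hash only a cheap, stable
# subset of fields instead of the full (expensive) contents.
class MetaLike:
    def __init__(self, schema_repr, num_rows, num_row_groups, created_by):
        self.schema_repr = schema_repr
        self.num_rows = num_rows
        self.num_row_groups = num_row_groups
        self.created_by = created_by

    def __eq__(self, other):
        return isinstance(other, MetaLike) and self.__dict__ == other.__dict__

    def __hash__(self):
        # Correctness only requires that equal objects hash equally.
        # Unequal objects may collide, so omitting expensive fields
        # (e.g. per-row-group statistics) is a valid trade-off.
        return hash((self.schema_repr, self.num_rows, self.num_row_groups))


a = MetaLike("x: int32", 100, 2, "parquet-cpp-arrow")
b = MetaLike("x: int32", 100, 2, "parquet-cpp-arrow")
assert a == b and hash(a) == hash(b)
```

The hash deliberately ignores `created_by`: two objects differing only there would collide, which is allowed, while the hash stays cheap to compute.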

Contributor Author

@milesgranger milesgranger Feb 12, 2024

Thanks for confirming my suspicion that the current approach was abysmal; I just wasn't sure what belongs in there. At least the schema, but `FileMetaData` seemed like a superset of the schema in some ways, so it ought to include a few other things; I therefore added a few more attributes in the follow-up. Please feel free to suggest any changes there. :)

Timings I have for 4505938 are:

```python
In [1]: import pyarrow.parquet as pq

In [2]: meta = pq.read_metadata("../../lineitem-stats-no-index.snappy.parquet")

In [3]: schema = meta.schema.to_arrow_schema()

In [4]: %timeit hash(meta)
4.34 µs ± 12.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit hash(schema)
8.34 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Feb 6, 2024
Member

Two things:

1. `object.__hash__` hashes by identity, which is precisely what you want to avoid here:

```python
>>> s = md.schema.to_arrow_schema()
>>> s2 = md.schema.to_arrow_schema()
>>> object.__hash__(s)
8763812462659
>>> object.__hash__(s2)
8763812583391
```

2. `frombytes` is not necessary; you can just hash a bytes object directly, which should be slightly faster:

```python
>>> s.to_string()
'l_orderkey: int32\nl_partkey: int32\nl_suppkey: int32\nl_linenumber: int32\nl_quantity: double\nl_extendedprice: double\nl_discount: double\nl_tax: double\nl_returnflag: string\nl_linestatus: string\nl_shipdate: timestamp[us]\nl_commitdate: timestamp[us]\nl_receiptdate: timestamp[us]\nl_shipinstruct: string\nl_shipmode: string\nl_comment: string'
>>> b = s.to_string().encode()
>>> %timeit hash(b)
56.4 ns ± 3.83 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> %timeit hash(b.decode())
242 ns ± 2.15 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
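Both points can be reproduced without pyarrow (plain-Python illustration):

```python
# Point 1: object.__hash__ is identity-based, so two objects that are
# equal by content still hash differently.
class Box:
    def __init__(self, v):
        self.v = v

    def __eq__(self, other):
        return isinstance(other, Box) and self.v == other.v


a, b = Box(1), Box(1)
assert a == b
assert object.__hash__(a) != object.__hash__(b)  # distinct identities

# Point 2: bytes hash by value directly; no decode/encode round trip
# is needed, which is why hashing bytes beats hashing the str.
data = "l_orderkey: int32".encode()
assert hash(data) == hash(b"l_orderkey: int32")
```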

Contributor Author

Addressed in 4505938

Member

Ok, but should also check that two equal but distinct instances also hash equally.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 4505938, thanks.

Member

@pitrou pitrou left a comment

LGTM except that you need to rebase IMHO

Member

It seems this is a spurious change; can you update the submodules on your branch?

Member

Also, can you rebase on latest git main?

Contributor Author

Done and done. 👍

@milesgranger milesgranger force-pushed the milesgranger/hash-for-filemetadata branch from 4505938 to 85451e8 on February 12, 2024 12:20
@milesgranger milesgranger force-pushed the milesgranger/hash-for-filemetadata branch from db43e9d to 4fbe7a0 on February 12, 2024 12:35
@pitrou pitrou merged commit 56d1ec1 into apache:main Feb 12, 2024
@pitrou pitrou removed the "awaiting committer review" label on Feb 12, 2024
@milesgranger milesgranger deleted the milesgranger/hash-for-filemetadata branch February 12, 2024 13:29
@pitrou
Member

pitrou commented Feb 12, 2024

Thank you @milesgranger !

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 56d1ec1.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 17 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024