GH-39780: [Python][Parquet] Support hashing for FileMetaData and ParquetSchema #39781
Conversation
Hi Miles! 👋 Will look at the PR shortly 👍
Force-pushed 14d64a1 to 55f2808
I am not sure if row group info is necessary - do not have much experience with
python/pyarrow/_parquet.pyx (Outdated)
I think it would be simpler, and probably faster, to serialize the metadata using self._metadata.WriteTo and then hash the resulting buffer contents.
But then you probably want to make pa.Buffer hashable as well:
>>> stream = pa.BufferOutputStream()
>>> stream.write(b"xxx")
3
>>> stream.getvalue()
<pyarrow.Buffer address=0x7f84af208000 size=3 is_cpu=True is_mutable=True>
>>> stream.getvalue() == b"xxx"
True
>>> hash(stream.getvalue())
Traceback (most recent call last):
Cell In[9], line 1
hash(stream.getvalue())
TypeError: unhashable type: 'pyarrow.lib.Buffer'
Note that it can be as simple as taking the hash of a bytes copy:
>>> hash(stream.getvalue().to_pybytes())
5574314894425565867
But, even then, this approach will not be very performant while a hash is supposed to compute quickly:
>>> md = pq.read_metadata("/home/antoine/arrow/dev/python/lineitem-snappy.pq")
>>> md
<pyarrow._parquet.FileMetaData object at 0x7f87befc19e0>
created_by: parquet-cpp-arrow version 15.0.0-SNAPSHOT
num_columns: 16
num_rows: 2568534
num_row_groups: 21
format_version: 2.6
serialized_size: 32440
>>> md.__reduce__()
(<cyfunction _reconstruct_filemetadata at 0x7f87c02fea80>,
(<pyarrow.Buffer address=0x7f88082086c0 size=32440 is_cpu=True is_mutable=True>,))
>>> hash(md.__reduce__()[1][0].to_pybytes())
8779236492631648618
>>> %timeit hash(md.__reduce__()[1][0].to_pybytes())
93.6 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
It will be far worse with this PR's approach, since generating the dict is extremely costly:
>>> %timeit md.to_dict()
2.95 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(note: milliseconds, as opposed to microseconds above)
As a general principle, a __hash__ does not need to take all the contents exhaustively into account if doing so is too expensive. It's ok to limit ourselves to a smaller, faster subset, for example only the schema:
>>> %timeit hash(repr(md.schema))
6.59 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
You can also convert to an Arrow schema:
>>> %timeit md.schema.to_arrow_schema()
4.94 µs ± 8.25 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Unfortunately, pa.Schema.__hash__ is too slow currently:
>>> s = md.schema.to_arrow_schema()
>>> %timeit hash(s)
12.7 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
This is because pa.Schema.__hash__ takes the slow Python way, while the C++ Schema has a useful and fast fingerprint method. (Same comment for pa.Field and pa.DataType.)
Of course you don't have to do all this here, but a __hash__ function that takes milliseconds to execute is definitely not a good idea.
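The contract behind that advice can be sketched with a toy class (hypothetical, not the pyarrow implementation): the hashed fields may be a cheap subset of the contents, but they must be a subset of what `__eq__` compares, so that equal objects always hash equally.

```python
class MetadataLike:
    """Toy stand-in for FileMetaData, illustrating subset-based hashing."""

    def __init__(self, schema_str, num_rows, row_groups):
        self.schema_str = schema_str
        self.num_rows = num_rows
        self.row_groups = row_groups  # potentially large; excluded from hash

    def __eq__(self, other):
        # Equality compares everything, including the expensive part.
        return (self.schema_str == other.schema_str
                and self.num_rows == other.num_rows
                and self.row_groups == other.row_groups)

    def __hash__(self):
        # Hash only a cheap subset of the fields used by __eq__.
        # Equal objects still hash equally, because the hashed fields
        # are among the compared fields.
        return hash((self.schema_str, self.num_rows))
```

Two objects that differ only in the excluded `row_groups` field will collide, which is allowed: hash equality never has to imply object equality.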
Thanks for confirming my suspicion that the current approach was abysmal. I just wasn't sure what belongs in there: at least the schema, but since FileMetaData is in some sense a superset of the schema, it ought to include a few other things, so I added a few more attributes in the follow-up; please feel free to suggest any changes there. :)
Timings I have for 4505938 are:
In [1]: import pyarrow.parquet as pq
In [2]: meta = pq.read_metadata("../../lineitem-stats-no-index.snappy.parquet")
In [3]: schema = meta.schema.to_arrow_schema()
In [4]: %timeit hash(meta)
4.34 µs ± 12.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [5]: %timeit hash(schema)
8.34 µs ± 31.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
python/pyarrow/_parquet.pyx (Outdated)
Two things:
1. object.__hash__ hashes by identity, which is precisely what you want to avoid here:
>>> s = md.schema.to_arrow_schema()
>>> s2 = md.schema.to_arrow_schema()
>>> object.__hash__(s)
8763812462659
>>> object.__hash__(s2)
8763812583391
2. frombytes is not necessary; you can just hash a bytes object directly, which should be slightly faster:
>>> s.to_string()
'l_orderkey: int32\nl_partkey: int32\nl_suppkey: int32\nl_linenumber: int32\nl_quantity: double\nl_extendedprice: double\nl_discount: double\nl_tax: double\nl_returnflag: string\nl_linestatus: string\nl_shipdate: timestamp[us]\nl_commitdate: timestamp[us]\nl_receiptdate: timestamp[us]\nl_shipinstruct: string\nl_shipmode: string\nl_comment: string'
>>> b = s.to_string().encode()
>>> %timeit hash(b)
56.4 ns ± 3.83 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
>>> %timeit hash(b.decode())
242 ns ± 2.15 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Ok, but should also check that two equal but distinct instances also hash equally.
pitrou left a comment
LGTM except that you need to rebase IMHO
cpp/submodules/parquet-testing (Outdated)
It seems this is a spurious change; can you update the submodules on your branch?
Also, can you rebase on latest git main?
Done and done. 👍
Force-pushed 4505938 to 85451e8
Force-pushed db43e9d to 4fbe7a0
Thank you @milesgranger !
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 56d1ec1. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 17 possible false positives for unstable benchmarks that are known to sometimes produce them.
GH-39780: [Python][Parquet] Support hashing for FileMetaData and ParquetSchema (apache#39781)

I think the hash, especially for `FileMetaData`, could be better; maybe just use the return of `__repr__`, even though that won't include row group info?

### Rationale for this change
Helpful for dependent projects.

### What changes are included in this PR?
Implement `__hash__` for `ParquetSchema` and `FileMetaData`.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
Supports hashing metadata:

```python
In [1]: import pyarrow.parquet as pq
In [2]: f = pq.ParquetFile('test.parquet')
In [3]: hash(f.metadata)
Out[3]: 4816453453708427907
In [4]: hash(f.metadata.schema)
Out[4]: 2300988959078172540
```

* Closes: apache#39780

Authored-by: Miles Granger <miles59923@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>