GH-41909: [C++] Add arrow::ArrayStatistics#43273
Conversation
|
Unity builds seems to be failing to link on the |
cpp/src/arrow/array/statistics.h
Outdated
There was a problem hiding this comment.
If it's meant to be usable with Parquet, then we also need to represent string/binary values (Parquet can store max/min stats of BYTE_ARRAY columns).
There was a problem hiding this comment.
You're right.
It's a discussion point. See also: #42133 (comment)
We didn't have a good idea for string/binary values. But I have an idea now. How about adding both of std::string and std::string_view for them? If we can store min/max values, we can use std::string_view to avoid copy of them. Otherwise, we can use std::string for them.
cpp/src/arrow/array/statistics.h
Outdated
There was a problem hiding this comment.
Why "Buffer"? I would call it ValueType or similar.
There was a problem hiding this comment.
In the first PoC implementation, we have TypedArrayStatistics<T>::{min,max}(). I wanted to use min/max for them instead of instance variables. So I added Buffer/_buffer suffix for instance variables and their type.
But TypedArrayStatistics<T> from the first implementation. See also: #42133 (comment)
So we can remove Buffer/_buffer suffix for instance variables and their type.
I've renamed ElementBufferType to ValueType and {min,max}_buffer to {min,max}.
cpp/src/arrow/array/statistics.h
Outdated
There was a problem hiding this comment.
This is superfluous as the default member visibility is public in structs.
It's caused by not inlining some symbols in |
|
If nobody objects this, I'll merge this in this week. |
|
I didn't fully catch the things before, how does this able to react with C interface? Does |
The "react with C interface" refers the "[DISCUSS] Statistics through the C data interface" thread https://lists.apache.org/thread/z0jz2bnv61j7c6lbk7lympdrs49f69cx , right? We will not use metadata for it. We will send an array that uses the following schema separately for it: https://lists.apache.org/thread/8j7sc0lwl3f9c77srpvkt83r43ktwwfr So we don't store |
|
No more comments? |
cpp/src/arrow/array/statistics.h
Outdated
There was a problem hiding this comment.
A naive question: how does this ordering being defined?
Parquet has this definition https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L953-L1001 . The current definition would be great but problem would raise when float is in?
There was a problem hiding this comment.
Oh, I missed it...
It seems that enum class Order {kUndefined, kTypeDefinedOrder}; is suitable for Parquet but it's not enough generic...
Anyway, I'll add Float16, float and double to ValueType. I missed them too.
There was a problem hiding this comment.
Yeah I think there're not some differences on int/unsigned int. Float might able to define a IEEE754 total ordered float like NAN - -inf - numbers - inf like: https://doc.rust-lang.org/std/primitive.f32.html#method.total_cmp
There was a problem hiding this comment.
Thanks for additional information.
Can we defer the ordering as a separated issue and discuss it on the issue?
There was a problem hiding this comment.
Looks good to me since it's still unstable 🤔
cpp/src/arrow/array/statistics.h
Outdated
There was a problem hiding this comment.
If not set, it means "might not exact"?
There was a problem hiding this comment.
Well, if it's unknown, then it's not exact. The third state is not useful IMHO.
There was a problem hiding this comment.
Hmm. It may be exact.
is_max_value_exact in the Parquet's definition is optional:
arrow/cpp/src/parquet/parquet.thrift
Lines 279 to 282 in d078d5c
We may need std::optional to support Parquet statistics. Can we keep this std::optional for now?
There was a problem hiding this comment.
I think the reason it's optional in Parquet is simply for compatibility with older files. @wgtmac Is it right?
There was a problem hiding this comment.
Well, if it's unknown, then it's not exact. The third state is not useful IMHO.
I agree with @pitrou. We can simply mark it as not exact if this field is missing.
See apacheGH-42133 how to use this for Apache Parquet statistics.
2c2e697 to
2c60782
Compare
|
LGTM |
|
I'll merge this. |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 39af73f. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
| std::variant<bool, int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, | ||
| uint64_t, util::Float16, float, double, std::string, std::string_view>; |
There was a problem hiding this comment.
This is too much IMO. uint64_t and int64_t is enough to cover all integers. And double is enough to cover all floating pointing statistics. string_view confuses things a lot because this is supposed to have value semantics and not pointer semantics.
| ArrayStatistics() = default; | ||
| ~ArrayStatistics() = default; |
There was a problem hiding this comment.
Are these required at all? Perhaps for ARROW_EXPORT?
There was a problem hiding this comment.
Ah, they may not be needed. I didn't care about them.
Let's try it: #43579
Rationale for this change
We're discussion API on the mailing list https://lists.apache.org/thread/kcpyq9npnh346pw90ljwbg0wxq6hwxxh and GH-41909.
If we have
arrow::ArrayStatistics, we can attach statistics read from Apache Parquet toarrow::Arrays.This only includes
arrow::ArrayStatistics. See GH-42133 how to usearrow::ArrayStatiticsfor Apache Parquet's statistics.What changes are included in this PR?
This only adds
arrow::ArrayStatisticsand its tests.Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.