GH-38503: [Go][Parquet] Style improvement for using ArrowColumnWriter#38581
GH-38503: [Go][Parquet] Style improvement for using ArrowColumnWriter#38581zeroshade merged 1 commit intoapache:mainfrom
Conversation
|
|
|
I've updated the sample here. Also, we may need to check the type for writer. Any advice is welcomed. |
|
Thank you for looking into this and finding the issue, @mapleFU. My assumption was that the last argument to the Maybe the question for @zeroshade is whether the tests are correct (last arg is logical/Arrow col index) or the source is correct (last arg is physical/Parquet col index). This has been a source of uncertainty for me when |
|
Sorry for late replying because I'm playing games :-) Schema is an important part when using parquet, because parquet only store leaf-nodes. You can also refer to the code here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h#L142-L153 The code above is for reader, but the reason is same. The left is the "arrow column number", the right is "parquet leaf column number", here we need a mapping for them. So, in case above, the parquet "real" schema would be: You can also read some comments in https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h . It would helps a lot.
|
|
Ping @zeroshade |
|
Sorry for the delay here, I've been away at a conference all last week. I've been catching up on my notifications. Reading through the code a bunch, it looks like the intent is that the passed in column index (originally named So it looks like you've got the right interpretation here I believe. |
|
@zeroshade - I agree that the implementation of the What was confusing to me and caused uncertainty was that the arrow/go/parquet/pqarrow/encode_arrow_test.go Lines 148 to 152 in bff5fb9 So while I agree that the changes here improve the name of that last argument and make it easier to get the leaf index with the newly exported |
|
I was envisioning a change something like this: main...tschaub:arrow:arrow-col |
|
Hmm, originally the intent when i wrote that was essentially sort of a "destination column index" type thing, which is why it referred to the physical parquet index to start at. It also simplified the code in general. Typically most people would likely not be creating an ArrowColumnWriter directly but rather would be just using the Personally, I prefer the usage of the last arg being the physical parquet column to start writing at, which seems more intuitive to me in general if you look at it in terms of being a "destination to start writing at" which makes the ArrowColumnWriter more versatile and able to be used regardless of what the arrow schema was (at least in my mind). What do you think @mapleFU of @tschaub's comments? I'm fine with being outvoted on this if others find it more intuitive for it to be the opposite. You get to be a tie breaker 😄 |
This is not the most intuitive API. Generally This API is much more easily, it's used in
|
|
@mapleFU Do you think it would make more sense to simply change it and no longer expose the The Go has an equivalent to |
|
Yeah, I think Also, I think we'd better adding some type checking for parquet writer. I think some Array level checking will not harm the performance |
|
Updated, fell free to edit it. |
|
@mapleFU In order to make the |
|
Hmmm I know you meaning but I think maybe separate or close this patch is better? Since already some other modules using |
I also think this makes sense. I was only using the |
|
@zeroshade I found there're some tests using ArrowColumnWriter in other module, like encoding, should I also remove them or how to fix them? |
|
Feel free to cherry pick if this is what you had in mind: main...tschaub:arrow:inernal-arrow-column-writer I don't have |
|
Nice, let matt decide that! |
|
Nice! Thanks both of you! This looks good to me as is, so I'm gonna merge this and then go review #38727 |
|
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit dfdebdd. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…Writer (apache#38581) ### Rationale for this change Currently, `ArrowColumnWriter` seems not having bug. But the usage is confusing. For nested type, `ArrowColumnWriter` should considering the logic below: ``` /// 0 foo.bar /// foo.bar.baz 0 /// foo.bar.baz2 1 /// foo.qux 2 /// 1 foo2 3 /// 2 foo3 4 ``` The left column is the column in root of `arrow::Schema`, the parquet itself only stores Leaf node, so, the column id for parquet is list at right. In the `ArrowColumnWriter`, the final argument is the LeafIdx in parquet, so, writer should considering using `leafIdx`. Also, it need a `LeafCount` API for getting the leaf-count here. ### What changes are included in this PR? Style enhancement for `LeafCount`, `leafIdx` and usage for `ArrowColumnWriter` ### Are these changes tested? no ### Are there any user-facing changes? no * Closes: apache#38503 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…pache#38727) This makes it so the Arrow column writer is not exported from the `pqarrow` package. This follows up on comments from apache#38581. * Closes: apache#38503 Authored-by: Tim Schaub <tim@planet.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Rationale for this change
Currently,
ArrowColumnWriterseems not having bug. But the usage is confusing. For nested type,ArrowColumnWritershould considering the logic below:The left column is the column in root of
arrow::Schema, the parquet itself only stores Leaf node,so, the column id for parquet is list at right.
In the
ArrowColumnWriter, the final argument is the LeafIdx in parquet, so, writer should consideringusing
leafIdx. Also, it need aLeafCountAPI for getting the leaf-count here.What changes are included in this PR?
Style enhancement for
LeafCount,leafIdxand usage forArrowColumnWriterAre these changes tested?
no
Are there any user-facing changes?
no