GH-38503: [Go][Parquet] Style improvement for using ArrowColumnWriter by mapleFU · Pull Request #38581 · apache/arrow

mapleFU · 2023-11-04T09:32:58Z

Rationale for this change

Currently, ArrowColumnWriter does not appear to have a bug, but its usage is confusing. For nested types, callers of ArrowColumnWriter need to take the following mapping into account:

  /// 0 foo.bar
  ///       foo.bar.baz           0
  ///       foo.bar.baz2          1
  ///   foo.qux                   2
  /// 1 foo2                      3
  /// 2 foo3                      4

The left column is the index of the column at the root of the arrow::Schema. Parquet itself only stores leaf nodes,
so the column ids for Parquet are listed on the right.

In ArrowColumnWriter, the final argument is the leaf index in Parquet, so callers should
pass a leafIdx. It also needs a LeafCount API for getting the number of leaves.
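As a rough illustration of the mapping above, the following sketch computes the leaf count and the first leaf index for each top-level column. The `Field` type and the helpers `leafCount`, `leafStarts`, and `exampleSchema` are hypothetical names made up for this example; they are not part of the arrow or pqarrow packages.

```go
package main

import "fmt"

// Field is a hypothetical stand-in for a schema node (not arrow.Field).
// A field without children is treated as a leaf.
type Field struct {
	Name     string
	Children []Field
}

// leafCount returns how many Parquet leaf columns a field expands to.
func leafCount(f Field) int {
	if len(f.Children) == 0 {
		return 1
	}
	n := 0
	for _, c := range f.Children {
		n += leafCount(c)
	}
	return n
}

// leafStarts returns, for each top-level field, the index of its first leaf column.
func leafStarts(schema []Field) []int {
	starts := make([]int, len(schema))
	next := 0
	for i, f := range schema {
		starts[i] = next
		next += leafCount(f)
	}
	return starts
}

// exampleSchema mirrors the foo / foo2 / foo3 layout shown above.
func exampleSchema() []Field {
	return []Field{
		{Name: "foo", Children: []Field{
			{Name: "bar", Children: []Field{{Name: "baz"}, {Name: "baz2"}}},
			{Name: "qux"},
		}},
		{Name: "foo2"},
		{Name: "foo3"},
	}
}

func main() {
	schema := exampleSchema()
	for i, start := range leafStarts(schema) {
		fmt.Printf("arrow column %d (%s): first leaf %d, leaf count %d\n",
			i, schema[i].Name, start, leafCount(schema[i]))
	}
}
```

In this model, foo starts at leaf 0 and spans three leaves, so foo2 starts at leaf 3 and foo3 at leaf 4, matching the right-hand column of the diagram.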

What changes are included in this PR?

Style enhancement for LeafCount, leafIdx and usage for ArrowColumnWriter

Are these changes tested?

no

Are there any user-facing changes?

no

Closes: [Go][Parquet] Trouble using the C++ reader to read a Parquet file written with the Go writer #38503

github-actions · 2023-11-04T09:33:24Z

⚠️ GitHub issue #38503 has been automatically assigned in GitHub to PR creator.

mapleFU · 2023-11-04T09:33:50Z

@zeroshade @tschaub

I've updated the sample here. We may also need to check the types passed to the writer. Any advice is welcome.

tschaub · 2023-11-04T10:50:07Z

Thank you for looking into this and finding the issue, @mapleFU.

My assumption was that the last argument to the pqarrow.NewArrowColumnWriter function was the Arrow column index instead of the Parquet column index (or logical instead of physical, not sure which terms to use). And the current tests support this assumption.

Maybe the question for @zeroshade is whether the tests are correct (last arg is logical/Arrow col index) or the source is correct (last arg is physical/Parquet col index).

This has been a source of uncertainty for me when col is used in the documentation. I look forward to better understanding which interpretation is meant (or is field a better term to differentiate?).

mapleFU · 2023-11-04T16:55:41Z

Sorry for the late reply, I was playing games :-)

The schema is an important part of using Parquet, because Parquet only stores leaf nodes. You can also refer to the code here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h#L142-L153

  /// \brief Read the given columns into a Table
  ///
  /// The indicated column indices are relative to the internal representation
  /// of the parquet table. For instance :
  /// 0 foo.bar
  ///       foo.bar.baz           0
  ///       foo.bar.baz2          1
  ///   foo.qux                   2
  /// 1 foo2                      3
  /// 2 foo3                      4
  ///
  /// i=0 will read foo.bar.baz, i=1 will read only foo.bar.baz2 and so on.
  /// Only leaf fields have indices; foo itself doesn't have an index.
  /// To get the index for a particular leaf field, one can use
  /// manifest().schema_fields to get the top level fields, and then walk the
  /// tree to identify the relevant leaf fields and access its column_index.
  /// To get the total number of leaf fields, use FileMetadata.num_columns().

The code above is for the reader, but the reasoning is the same. The left is the "arrow column number" and the right is the "parquet leaf column number"; we need a mapping between them.

So, in the case above, the "real" Parquet schema would be:

foo.bar.baz
foo.bar.baz2
foo.qux
foo2
foo3

You can also read the comments in https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader.h . They help a lot.

ArrowColumnWriter needs to map the Arrow-level writer (like the writer for foo) to the underlying leaf writers (the writers for foo.bar.baz, foo.bar.baz2, and foo.qux). In this case, the foo writer will write 3 leaf nodes, so the next ArrowColumnWriter should be given 3 as its leaf index.

mapleFU · 2023-11-08T08:41:49Z

Ping @zeroshade
Would you mind taking a look? Or would another approach be better here?

zeroshade · 2023-11-13T16:57:55Z

Sorry for the delay here, I've been away at a conference all last week. I've been catching up on my notifications.

Reading through the code a bunch, it looks like the intent is that the passed in column index (originally named col, renamed to leafColIdx in this PR) should be the index of the first physical/parquet column index that will be used by the writer. For example, if you look inside the ArrowColumnWriter.Write method, it uses colIdx + leafIdx to tell the BufferedRowGroupWriter which physical/parquet column to write. This is also verified by looking at FileWriter.WriteColumnChunked which contains fw.colIdx += acw.leafCount, showing that we bump the column index we are constructing the column writer with by the number of leaf columns we found.

So it looks like you've got the right interpretation here I believe.
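
The bookkeeping described above (writing to colIdx + leafIdx, then bumping the starting index by the leaf count) can be modeled roughly as follows. This is a simplified sketch, not the pqarrow implementation: `columnWriter` and `writeAll` are hypothetical names, and a map stands in for the row group writer.

```go
package main

import "fmt"

// columnWriter is a hypothetical model of a writer for one Arrow column.
type columnWriter struct {
	leafColIdx int      // first physical (Parquet) column this writer targets
	leaves     []string // leaf names that belong to this Arrow column
}

// write records which physical column each leaf lands in: leafColIdx + leafIdx.
func (w columnWriter) write(dst map[int]string) {
	for leafIdx, name := range w.leaves {
		dst[w.leafColIdx+leafIdx] = name
	}
}

// writeAll mimics the file-writer side: it keeps a running physical column index
// and bumps it by the number of leaves each Arrow column produced.
func writeAll(columns [][]string) map[int]string {
	dst := make(map[int]string)
	colIdx := 0
	for _, leaves := range columns {
		w := columnWriter{leafColIdx: colIdx, leaves: leaves}
		w.write(dst)
		colIdx += len(w.leaves)
	}
	return dst
}

func main() {
	columns := [][]string{
		{"foo.bar.baz", "foo.bar.baz2", "foo.qux"}, // Arrow column 0
		{"foo2"}, // Arrow column 1
		{"foo3"}, // Arrow column 2
	}
	dst := writeAll(columns)
	for i := 0; i < len(dst); i++ {
		fmt.Println(i, dst[i])
	}
}
```

In this model, constructing the writer for foo2 with the Arrow column index (1) instead of the running leaf index (3) would target physical column 1, which already belongs to foo.bar.baz2.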

zeroshade

LGTM

tschaub · 2023-11-13T23:11:52Z

@zeroshade - I agree that the implementation of the pqarrow.NewArrowColumnWriter function expects the last argument to be the physical/Parquet column index.

What was confusing to me and caused uncertainty was that the pqarrow package is mostly about working with Arrow, and the name of the function NewArrowColumnWriter suggests this is about Arrow columns. Then I looked at the tests for confirmation and saw that they use the Arrow table column index in calling this function:

arrow/go/parquet/pqarrow/encode_arrow_test.go

Lines 148 to 152 in bff5fb9

    
  for i := int64(0); i < tbl.NumCols(); i++ {
      acw, err := pqarrow.NewArrowColumnWriter(tbl.Column(int(i)).Data(), 0, tbl.NumRows(), manifest, srgw, int(i))
      require.NoError(t, err)
      require.NoError(t, acw.Write(ctx))
  }

So while I agree that the changes here improve the name of that last argument and make it easier to get the leaf index with the newly exported arrowColumnWriter.LeafCount() method, I'm not sure this makes for the most intuitive API. Since the function is called with an Arrow column chunk and the number of Arrow rows, wouldn't it make sense to call it with the Arrow column index as well (as done in the tests above)?

tschaub · 2023-11-14T00:08:23Z

I was envisioning a change something like this: main...tschaub:arrow:arrow-col

zeroshade · 2023-11-14T01:22:42Z

Hmm, originally the intent when I wrote that was essentially a "destination column index" type of thing, which is why it referred to the physical parquet index to start at. It also simplified the code in general. Typically most people would likely not be creating an ArrowColumnWriter directly but rather would be just using the pqarrow.Writer itself and writing the records as is (with the ArrowColumnWriter being used internally in most situations).

Personally, I prefer the usage of the last arg being the physical parquet column to start writing at, which seems more intuitive to me in general if you look at it in terms of being a "destination to start writing at" which makes the ArrowColumnWriter more versatile and able to be used regardless of what the arrow schema was (at least in my mind).

What do you think @mapleFU of @tschaub's comments? I'm fine with being outvoted on this if others find it more intuitive for it to be the opposite. You get to be a tie breaker 😄

mapleFU · 2023-11-14T03:50:29Z

> So while I agree that the changes here improve the name of that last argument and make it easier to get the leaf index with the newly exported arrowColumnWriter.LeafCount() method, I'm not sure this makes for the most intuitive API. Since the function is called with an Arrow column chunk and the number of Arrow rows, wouldn't it make sense to call it with the Arrow column index as well (as done in the tests above)?

This is not the most intuitive API. Generally, the parquet/arrow module needs to convert an Arrow record batch into Parquet leaves, and this needs to be done inside parquet/arrow:

func (fw *FileWriter) WriteColumnChunked(data *arrow.Chunked, offset, size int64) error {
	acw, err := NewArrowColumnWriter(data, offset, size, fw.manifest, fw.rgw, fw.colIdx)
	if err != nil {
		return err
	}
	fw.colIdx += acw.leafCount
	return acw.Write(fw.ctx)
}

This API is much easier to use; it's used in WriteTable. Also, C++ has a WriteRecordBatch, which we use in our in-house Parquet writer. I think it's easy to port and use.

ArrowColumnWriter is necessary for writing Parquet, but I think it's a bit of a hack to use it as an exported API. Users always need to maintain the offsets, array types, etc.

zeroshade · 2023-11-14T15:46:24Z

@mapleFU Do you think it would make more sense to simply change it and no longer expose the ArrowColumnWriter directly, and direct users to using the FileWriter apis instead?

Go has an equivalent to WriteRecordBatch: the Write(rec arrow.Record) error method, which conforms to the array.RecordWriter interface.

mapleFU · 2023-11-14T15:50:08Z

Yeah, I think ArrowColumnWriter is a bit "internal" because its users need to understand the mapping... It's OK for non-nested types, but with nested types it would be very hard to understand.

Also, I think we'd better add some type checking to the Parquet writer. I think some array-level checking will not harm performance.

zeroshade · 2023-11-14T16:32:36Z

@mapleFU would you wanna take a stab at updating this PR accordingly?

@tschaub thoughts on the conversation so far?

mapleFU · 2023-11-14T16:54:46Z

Updated, feel free to edit it.

zeroshade · 2023-11-14T20:26:02Z

@mapleFU In order to make the ArrowColumnWriter no longer exported, you need to change the first character to be lowercase

mapleFU · 2023-11-15T02:51:46Z

Hmm, I know what you mean, but I think it may be better to do that in a separate patch, or to close this one, since some other modules already use NewArrowColumnWriter. I don't know if it's OK to just make it private directly.

tschaub · 2023-11-15T03:30:39Z

> @mapleFU Do you think it would make more sense to simply change it and no longer expose the ArrowColumnWriter directly, and direct users to using the FileWriter apis instead?

I also think this makes sense. I was only using the ArrowColumnWriter after digging around the tests. But I have since switched to using a fileWriter.WriteColumnChunked() with a pqarrow.FileWriter. I think this is in line with what is being suggested here.

mapleFU · 2023-11-15T04:02:03Z

@zeroshade I found there are some tests using ArrowColumnWriter in other modules, like encoding. Should I remove them as well, or how should I fix them?

tschaub · 2023-11-15T04:06:00Z

Feel free to cherry pick if this is what you had in mind: main...tschaub:arrow:inernal-arrow-column-writer

I don't have PARQUET_TEST_DATA set up, but other tests pass.

mapleFU · 2023-11-15T04:42:35Z

Nice, let Matt decide that!

zeroshade · 2023-11-15T15:38:15Z

Nice! Thanks both of you! This looks good to me as is, so I'm gonna merge this and then go review #38727

This makes it so the Arrow column writer is not exported from the `pqarrow` package. This follows up on comments from #38581. * Closes: #38503 Authored-by: Tim Schaub <tim@planet.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>

conbench-apache-arrow · 2023-11-15T19:03:08Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit dfdebdd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

…Writer (apache#38581) * Closes: apache#38503 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>


Update: style improvement for go util

4a67007

mapleFU requested a review from zeroshade as a code owner November 4, 2023 09:32

github-actions bot added the Component: Go and awaiting review labels Nov 4, 2023

zeroshade approved these changes Nov 13, 2023

github-actions bot added the awaiting merge label and removed the awaiting review label Nov 13, 2023

tschaub mentioned this pull request Nov 15, 2023

GH-38503: [Go][Parquet] Make the arrow column writer internal #38727

Merged

zeroshade merged commit dfdebdd into apache:main Nov 15, 2023

zeroshade removed the awaiting merge label Nov 15, 2023

mapleFU deleted the go-parquet/improve-style branch November 15, 2023 15:41
