util/parquet: add crdb metadata to parquet files#103131
util/parquet: add crdb metadata to parquet files#103131craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
238de49 to
351524e
Compare
This change adds an option to the writer which allows the caller to write arbitrary kv metadata to parquet files. This is useful for testing purposes. Informs: cockroachdb#99028 Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071 Release note: None
f0b170a to
3ba03db
Compare
jayshrivastava
left a comment
There was a problem hiding this comment.
For more context.Context, see https://github.com/jayshrivastava/cockroach/tree/parquet-cdc, which integrates the new parquet writer with changefeeds using ReadFile and custom metadata.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/util/parquet/writer.go
Outdated
|
|
||
| // deserializeIntArray deserializes an integer sting in the format "23 2 32 43 | ||
| // 32" to an array of ints. | ||
| func deserializeIntArray(s string) ([]uint32, error) { |
There was a problem hiding this comment.
are these test only functions? If so.. maybe move to testutils?
pkg/util/parquet/writer.go
Outdated
| // TODO(#99028): maxRowGroupSize should be a configuration Option, along with | ||
| // compression schemes, allocator, batch size, page size etc | ||
| func NewWriter(sch *SchemaDefinition, sink io.Writer, opts ...Option) (*Writer, error) { | ||
| schMeta, err := MakeSchemaMetadata(sch) |
There was a problem hiding this comment.
should we always add this metdata? Or should it be only added in tests?
There was a problem hiding this comment.
Good point. I'll make it a test only thing (probably by adding a WithCRDBReaderMeta option). It can take an enum for any additional meta (ex. primary key cols which are needed in changefeed tests).
Previously, the test utils used to read parquet files would require the writer as an argument. The main reason the writer was required is that the writer contained crdb-specific type information which could be used to decode raw data until crdb datums. With this change, the writer is updated to write this crdb-specific type information to the parquet file in its metadata. The reader is updated to the read type information from the file metadata. There is a new test utility function `ReadFile(parquetFile string)` which can be used to read all datums from a parquet file without providing any additional type information. The function also returns metadata since it is possible for users of the `Writer` to write arbitrary metadata and such users may need this metadata in testing. Informs: cockroachdb#99028 Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071 Release note: None
3ba03db to
7a36a77
Compare
|
|
||
| // ReadFileAndVerifyDatums asserts that a parquet file's metadata matches the | ||
| // metadata from the writer and its data matches writtenDatums. | ||
| func ReadFileAndVerifyDatums( |
There was a problem hiding this comment.
move this function to testutils?
|
bors r=miretskiy |
|
Build failed: |
|
bors retry |
|
Build succeeded: |
util/parquet: add option to write kv metadata to files
This change adds an option to the writer which allows the caller
to write arbitrary kv metadata to parquet files. This is useful
for testing purposes.
Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None
util/parquet: remove dependency on writer to read parquet files
Previously, the test utils used to read parquet files
would require the writer as an argument. The main reason
the writer was required is that the writer contained
crdb-specific type information which could be used to
decode raw data until crdb datums.
With this change, the writer is updated to write this
crdb-specific type information to the parquet file in
its metadata. The reader is updated to the read type
information from the file metadata. There is a new
test utility function
ReadFile(parquetFile string)which can be used to read all datums from a parquet
file without providing any additional type information.
The function also returns metadata since it is possible
for users of the
Writerto write arbitrary metadataand such users may need this metadata in testing.
Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None