changefeedccl: add full support for the parquet format#104528
changefeedccl: add full support for the parquet format#104528craig[bot] merged 7 commits intocockroachdb:masterfrom
Conversation
10f2156 to
7e56e82
Compare
Previously, the option `key_in_value` was disallowed with `format=parquet`. This change allows these settings to be used together. Note that `key_in_value` is enabled by default with `cloudstorage` sinks and `format=parquet` is only allowed with cloudstorage sinks, so `key_in_value` is enabled for parquet by default. Informs: cockroachdb#103129 Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
When using `format=parquet`, an additional column is produced to indicate the type of operation corresponding to the row: create, update, or delete. This change adds coverage for this in unit testing. Additionally, the test modified in this change is made more simple by reducing the number of rows and different types because this complexity is unnecessary as all types are tested within the util/parquet package already. Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None Epic: None
Previously, the test utilities in `util/parquet` would not reconstruct tuples read from files with their labels. This change updates the package to do so. This is required for testing in users of this package such as CDC. Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
cc9914b to
883706a
Compare
883706a to
fab7b69
Compare
miretskiy
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r2, 2 of 2 files at r3, 2 of 5 files at r4, 1 of 1 files at r5, 4 of 4 files at r7, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
pkg/ccl/changefeedccl/parquet.go line 74 at r7 (raw file):
const parquetOptUpdatedTimestampColName = metaSentinel + changefeedbase.OptUpdatedTimestamps const parquetOptMVCCTimestampColName = metaSentinel + changefeedbase.OptMVCCTimestamps const parquetOptDiffColName = metaSentinel + changefeedbase.OptDiff
should it be "prev" instead of diff? Or "_before" -- which is pretty standard name too?
pkg/ccl/changefeedccl/parquet.go line 174 at r7 (raw file):
return nil }) valueBuilder, err := json.NewFixedKeysObjectBuilder(keys)
you definitely want to construct this value builder once.
pkg/ccl/changefeedccl/parquet.go line 180 at r7 (raw file):
if err := prevRow.ForEachColumn().Datum(func(d tree.Datum, col cdcevent.ResultColumn) error { j, err := tree.AsJSON(d, sessiondatapb.DataConversionConfig{}, time.UTC)
Why do we need to convert each datum to JSON? Couldn't we just valueBuilder.Set(col, d)?
fab7b69 to
37a91ee
Compare
This change adds support for the `diff` changefeed options when using `format=parquet`. Enabling `diff` also adds support for CDC Transformations with parquet. Informs: cockroachdb#103129 Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
This change adds support for the `end_time` changefeed options when using `format=parquet`. No significant code changes were needed to enable this feature. Closes: cockroachdb#103129 Closes: cockroachdb#99028 Epic: CRDB-27372 Release note (enterprise change): Changefeeds now officially support the parquet format at specificiation version 2.6. It is only usable with the cloudstorage sink. The syntax to use parquet is like the following: `CREATE CHANGEFEED FOR foo INTO `...` WITH format=parquet` It supports all standard changefeed options and features including CDC transformations, except it does not support the `topic_in_value` option.
Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
This change forces all tests, including tests for `diff` and `end_time` to run with the `cloudstorage` sink and `format=parquet` where possible. Informs: cockroachdb#103129 Informs: cockroachdb#99028 Epic: CRDB-27372 Release note: None
37a91ee to
d5d1c88
Compare
jayshrivastava
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/ccl/changefeedccl/parquet.go line 74 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
should it be "prev" instead of diff? Or "_before" -- which is pretty standard name too?
Renamed to "before".
pkg/ccl/changefeedccl/parquet.go line 174 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
you definitely want to construct this value builder once.
Done.
pkg/ccl/changefeedccl/parquet.go line 180 at r7 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Why do we need to convert each datum to JSON? Couldn't we just valueBuilder.Set(col, d)?
Looks like tree.Datum does not implement json.JSON. We do a similar thing in the JSON encoder
cockroach/pkg/ccl/changefeedccl/encoder_json.go
Lines 212 to 216 in 07101ad
miretskiy
left a comment
There was a problem hiding this comment.
Reviewed 1 of 6 files at r8, 1 of 1 files at r9, 4 of 4 files at r11.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)
|
bors r=miretskiy |
|
Build succeeded: |
changefeedccl: support key_in_value with parquet format
Previously, the option
key_in_valuewas disallowed withformat=parquet. This change allows these settings to beused together. Note that
key_in_valueis enabled bydefault with
cloudstoragesinks andformat=parquetisonly allowed with cloudstorage sinks, so
key_in_valueisenabled for parquet by default.
Informs: #103129
Informs: #99028
Epic: CRDB-27372
Release note: None
changefeedccl: add test coverage for parquet event types
When using
format=parquet, an additional column is produced toindicate the type of operation corresponding to the row: create,
update, or delete. This change adds coverage for this in unit
testing.
Additionally, the test modified in this change is made more simple
by reducing the number of rows and different types because this
complexity is unnecessary as all types are tested within the
util/parquet package already.
Informs: #99028
Epic: CRDB-27372
Release note: None
Epic: None
util/parquet: support tuple labels in util/parquet testutils
Previously, the test utilities in
util/parquetwould not reconstructtuples read from files with their labels. This change updates the
package to do so. This is required for testing in users of this
package such as CDC.
Informs: #99028
Epic: CRDB-27372
Release note: None
changefeedccl: support diff option with parquet format
This change adds support for the
diffchangefeedoptions when using
format=parquet. Enablingdiffalso addssupport for CDC Transformations with parquet.
Informs: #103129
Informs: #99028
Epic: CRDB-27372
Release note: None
changefeedccl: support end_time option with parquet format
This change adds support for the
end_timechangefeedoptions when using
format=parquet. No significant codechanges were needed to enable this feature.
Closes: #103129
Closes: #99028
Epic: CRDB-27372
Release note (enterprise change): Changefeeds now officially
support the parquet format at specificiation version 2.6.
It is only usable with the cloudstorage sink.
The syntax to use parquet is like the following:
CREATE CHANGEFEED FOR foo INTO...WITH format=parquetIt supports all standard changefeed options and features
including CDC transformations, except it does not support the
topic_in_valueoption.changefeedccl: use parquet with 50% probability in nemeses test
Informs: #99028
Epic: CRDB-27372
Release note: None
do not merge: force parquet cloud storage tests
This change forces all tests, including tests for
diffandend_timeto run with the
cloudstoragesink andformat=parquetwhere possible.Informs: #103129
Informs: #99028
Epic: CRDB-27372
Release note: None