backupccl: add more exportable crdb data types to exportParquet #75890

Merged

craig[bot] merged 1 commit into cockroachdb:master from msbutler:butler-parquet-types on Feb 10, 2022
Conversation


@msbutler msbutler commented Feb 2, 2022

Previously, EXPORT PARQUET could only export tables with a small subset of CRDB
data types. This PR adds support for all CRDB data types that Avro
changefeeds support. Specifically, data types in the OID and Void families are
excluded: the former represents internal database objects and the latter cannot
be parsed. Further CRDB data types can be supported on an as-needed basis.

This PR also adds a test that validates that EXPORT PARQUET properly encodes
data in arbitrary CRDB tables with supported data types.

Fixes #67710

Release note (sql change): EXPORT PARQUET now supports all data types that Avro
changefeeds support. The data type conversion from CRDB to Parquet is listed
below.

To learn more about Parquet data representation,
see https://github.com/apache/parquet-format

| CRDB Type Family | Parquet Type | Parquet Logical Type |
| --- | --- | --- |
| Bool | boolean | nil |
| String | byte array | string |
| Collated String | byte array | string |
| INet | byte array | string |
| JSON | byte array | json |
| Int | int64 | nil |
| Float | float64 | nil |
| Decimal | byte array | decimal (note: scale and precision data are preserved in the parquet file) |
| Uuid | fixed length byte array (16 bytes) | nil |
| Bytes | byte array | nil |
| Bit | byte array | nil |
| Enum | byte array | Enum |
| Box2d | byte array | string |
| Geography | byte array | nil |
| Geometry | byte array | nil |
| Date | byte array | string |
| Time | int64 | time (note: microseconds since midnight) |
| TimeTz | byte array | string |
| Interval | byte array | string (specifically represented as ISO8601) |
| Timestamp | byte array | string |
| TimestampTz | byte array | string |

An Array is encoded as a repeated field, with each array value encoded by the
pattern described above.

@cockroach-teamcity
Member

This change is Reviewable

msbutler force-pushed the butler-parquet-types branch from a03bd4e to 97ea4de on February 2, 2022 22:12

@stevendanna stevendanna left a comment


I did a first pass through this and left some nits. I'll try to revisit this this afternoon and finish looking over the individual encoders.

Comment on lines +267 to +268
// TODO (butler): make fine grain encoding of Int and Floats round trippable. For example,
// converting float32 to 64 in the float decoder isn't easily possible.
Collaborator

The future versions of us who go to tackle this TODO may benefit from a bit more detail on the problems you ran into.

dtStr := string(x.([]byte))
d, dependsOnCtx, err := tree.ParseDTimestamp(nil, dtStr, time.Microsecond)
if dependsOnCtx {
return nil, fmt.Errorf("TimestampTZ depends on context")
Collaborator

Are we sure we want these to be errors? I only ask because they aren't in some of the other decoders.

Collaborator Author

Are you looking at any other decoders aside from avro.go in changefeeds? The avro writer/reader hooked into avro.go avoids this because it can be fed time.Time types, while the parquet vendor cannot.

Collaborator

I was looking at the csv reader which I believe calls

func ParseDatumStringAs(t *types.T, s string, evalCtx *tree.EvalContext) (tree.Datum, error) {
which appears to ignore the dependsOnCtx return, but I'll admit I only took a cursory glance. Since this is currently only used in tests, this isn't a major issue either way.

Collaborator

@stevendanna stevendanna left a comment

Nice work getting all of these types added! I'm not an expert on our type representations, so I can't claim they are all correct, but nothing in particular jumped out at me. I've left a few more nitpicks.

Previously, EXPORT PARQUET could only export tables with a small subset of CRDB
data types. This PR adds support for all CRDB data types that Avro
changefeeds support. Specifically, data types in the OID and Void families are
excluded: the former represents internal database objects and the latter cannot
be parsed. Further CRDB data types can be supported on an as-needed basis.

This PR also adds a test that validates that EXPORT PARQUET properly encodes
data in arbitrary CRDB tables with supported data types.

Fixes cockroachdb#67710

Release note (sql change): EXPORT PARQUET now supports all data types that Avro
changefeeds support. The data type conversion from CRDB to Parquet is listed
below. To maintain backward compatibility with older parquet readers,
parquet converted types are also annotated.

To learn more about Parquet data representation,
see https://github.com/apache/parquet-format

| CRDB Type Family | Parquet Type | Parquet Logical Type | Parquet Converted Type |
| --- | --- | --- | --- |
| Bool | boolean | nil | nil |
| String | byte array | string | string |
| Collated String | byte array | string | string |
| INet | byte array | string | string |
| JSON | byte array | json | json |
| Int (oid.T_int8) | int64 | int64 | int64 |
| Int (oid.T_int4 or oid.T_int2) | int32 | int32 | int32 |
| Float | float64 | nil | nil |
| Decimal | byte array | decimal (note: scale and precision data are preserved in the parquet file) | decimal |
| Uuid | fixed length byte array (16 bytes) | uuid | no converted type |
| Bytes | byte array | nil | nil |
| Bit | byte array | nil | nil |
| Enum | byte array | Enum | Enum |
| Box2d | byte array | string | string |
| Geography | byte array | nil | nil |
| Geometry | byte array | nil | nil |
| Date | byte array | string | string |
| Time | int64 | time (note: microseconds since midnight) | time |
| TimeTz | byte array | string | string |
| Interval | byte array | string (specifically represented as ISO8601) | string |
| Timestamp | byte array | string | string |
| TimestampTz | byte array | string | string |

An Array is encoded as a repeated field, with each array value encoded by the
pattern described above. Logical type and converted type are both 'List'.
msbutler force-pushed the butler-parquet-types branch from 186f99f to 5fb376b on February 9, 2022 20:20
Contributor

@HonoreDB HonoreDB left a comment

Reviewed 1 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @msbutler, and @stevendanna)


HonoreDB commented Feb 9, 2022

Encoders LGTM. You're doing a lot more string conversion than in avro, which simplifies the logic, and as I understand it this output is getting bulk-compressed, so space efficiency isn't as much of a priority.


msbutler commented Feb 9, 2022

TFTRs!!

bors r=stevendanna

craig bot merged commit 0fdf716 into cockroachdb:master on Feb 10, 2022

craig bot commented Feb 10, 2022

Build succeeded:


Successfully merging this pull request may close these issues.

backupccl: add Apache Parquet format to export

4 participants