backupccl: add more exportable crdb data types to exportParquet#75890
backupccl: add more exportable crdb data types to exportParquet#75890craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
a03bd4e to
97ea4de
Compare
stevendanna
left a comment
There was a problem hiding this comment.
I did a first pass through this and left some nits. I'll try to revisit this this afternoon and finish looking over the individual encoders.
pkg/ccl/importccl/exportparquet.go
Outdated
| // TODO (butler): make fine grain encoding of Int and Floats round trippable. For example, | ||
| // converting float32 to 64 in the float decoder isn't easily possible. |
There was a problem hiding this comment.
The future versions of us who go to tackle this TODO may benefit from a bit more detail on the problems you ran into.
pkg/ccl/importccl/exportparquet.go
Outdated
| dtStr := string(x.([]byte)) | ||
| d, dependsOnCtx, err := tree.ParseDTimestamp(nil, dtStr, time.Microsecond) | ||
| if dependsOnCtx { | ||
| return nil, fmt.Errorf("TimestampTZ depends on context") |
There was a problem hiding this comment.
Are we sure we want these to be errors? I only ask because they aren't in some of the other decoders.
There was a problem hiding this comment.
are you looking at any other decoders aside from avro.go in changeefeeds? the avro writer/reader hooked into avro.go avoids this because it can be fed time.time types while the parquet vendor cannot.
There was a problem hiding this comment.
I was looking at the csv reader which I believe calls
which appears to ignore the dependsOnCtx return, but I'll admit I only took a cursory glance. Since this is currently only used in tests, this isn't a major issue either way.97ea4de to
186f99f
Compare
stevendanna
left a comment
There was a problem hiding this comment.
Nice work getting all of these types added! I'm not an expert on our type representations, so I can't claim they are all correct, but nothing particular jumped out at me. I've left a few more nitpicks.
Previously, Export Parquet could only export tables with a small subset of CRDB data types. In this PR, I add support for all CRDB data types that avro changefeeds support. Specifically, data types in the OID and Void families are excluded as the former represents internal db objects and the ladder cannot be parsed. Further CRDB data types can be supported on an as need basis. This PR also adds a test that validates that EXPORT PARQUET properly encodes data in arbitrary CRDB tables with supported data types. Fixes cockroachdb#67710 Release note (sql change): EXPORT PARQUET now supports all data types that avro change feeds supports. Below I list the data type converstion from CRDB to Parquet. To maintain backward compatibality with older parquet readers, parquet converted types were also annotated. To learn about more about parquet data representation, see https://github.com/apache/parquet-format CRDB Type Family -> Parquet Type | Parquet Logical Type | Parquet Converted Type Bool -> boolean | nil | nil String -> byte array | string | string Collated String -> byte array | string| string INet -> byte array | string | string JSON -> byte array | json | json Int (oid.T_int8) -> int64 | int64 | int64 Int (oid.T_int4 or oid.T_int2) -> int32 | int32 | int32 Float -> float64 | nil | nil Decimal -> byte array | decimal (note: scale and precision data are preserved in the parquet file) | decimal Uuid -> fixed length byte array (16 bytes) | uuid | no converted type Bytes -> byte array | nil | nil Bit -> byte array | nil | nil Enum -> byte array | Enum | Enum Box2d -> byte array | string | string Geography -> byte array | nil | nil Geometry -> byte array | nil | nil Date -> byte array | string | string Time -> int64 | time (note: microseconds since midnight) | time TimeTz -> byte array | string | string Interval -> byte array | string (specifically represented as ISO8601) | string Timestamp -> byte array | string | string TimestampTz -> byte array | string | string Array -> encoded as a repeated field and each array value gets encoded by pattern described above. Logical type and converted type are both 'List'
186f99f to
5fb376b
Compare
HonoreDB
left a comment
There was a problem hiding this comment.
Reviewed 1 of 2 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @msbutler, and @stevendanna)
|
Encoders LGTM. You're doing a lot more string conversion than in avro which simplifies the logic, and as I understand it this is getting bulk-compressed so space efficiency isn't as much of a priority. |
|
TFTRs!! bors r=stevendanna |
|
Build succeeded: |
Previously, Export Parquet could only export tables with a small subset of CRDB
data types. In this PR, I add support for all CRDB data types that avro
changefeeds support. Specifically, data types in the OID and Void families are
excluded as the former represents internal db objects and the ladder cannot be
parsed. Further CRDB data types can be supported on an as need
basis.
This PR also adds a test that validates that EXPORT PARQUET properly encodes
data in arbitrary CRDB tables with supported data types.
Fixes #67710
Release note (sql change): EXPORT PARQUET now supports all data types that avro
change feeds supports. Below I list the data type converstion from CRDB to
Parquet.
To learn about more about parquet data representation,
see https://github.com/apache/parquet-format
CRDB Type Family -> Parquet Type | Parquet Logical Type
Bool -> boolean | nil
String -> byte array | string
Collated String -> byte array | string
INet -> byte array | string
JSON -> byte array | json
Int -> int64 | nil
Float -> float64 | nil
Decimal -> byte array | decimal (note: scale and precision data are preserved
in the parquet file)
Uuid -> fixed length byte array (16 bytes) | nil
Bytes -> byte array | nil
Bit -> byte array | nil
Enum -> byte array | Enum
Box2d -> byte array | string
Geography -> byte array | nil
Geometry -> byte array | nil
Date -> byte array | string
Time -> int64 | time (note: microseconds since midnight)
TimeTz -> byte array | string
Interval -> byte array | string (specifically represented as ISO8601)
Timestamp -> byte array | string
TimestampTz -> byte array | string
Array -> encoded as a repeated field and each array value gets encoded by
pattern described above.