Add support for DECIMAL logical type in fixed-length byte arrays when parsing Parquet files#6369
Conversation
oeyh
left a comment
There was a problem hiding this comment.
Thanks for the fix! Were you able to verify the changes with export and stream data?
|
|
||
| if (value instanceof Map) { | ||
| Object data = ((Map<?, ?>)value).get(BYTES_KEY); | ||
| if (data instanceof byte[]) { |
There was a problem hiding this comment.
Is data always of byte[] type? Do we need an else block here?
There was a problem hiding this comment.
Did not encounter the Map use case. In my tests the values were an ArrayList type
There was a problem hiding this comment.
I think we should still have the else condition to log this case if it ever happens. Otherwise you will get the exception on line 139, which would make it seem that we didn't find the right type.
Yes, I was able to verify the changes. |
| Arguments.of(MySQLDataType.BIT, "bit_col", Map.of("bytes", new byte[]{ 1, 2, 3, 4 }), new BigInteger("16909060")), // Direct BigInteger interprets the bytes in big-endian order. | ||
|
|
||
| //DECIMAL tests represented as byte arrays | ||
| Arguments.of(MySQLDataType.DECIMAL, "decimal_col", new ArrayList<>(List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)), new BigDecimal("1339673755198158349044581307228491536")), |
There was a problem hiding this comment.
Could be worth testing negative and fractional values as well, converted to BigDecimal?
|
|
||
| private Number handleByteArray(final Object value) { | ||
| if (value instanceof byte[]) { | ||
| return new BigDecimal(new BigInteger((byte[]) value)); |
There was a problem hiding this comment.
Each of these conditions at some point calls this: new BigDecimal(new BigInteger(someBytes). Refactor this out into a method so that changes to one affect others.
|
|
||
| if (value instanceof Map) { | ||
| Object data = ((Map<?, ?>)value).get(BYTES_KEY); | ||
| if (data instanceof byte[]) { |
There was a problem hiding this comment.
I think we should still have the else condition to log this case if it ever happens. Otherwise you will get the exception on line 139, which would make it seem that we didn't find the right type.
|
Previous implementation failed to apply the correct decimal scale during deserialization, resulting in incorrect numeric values. The fixed length byte array when passed directly to MYSQL Numeric handler loses the scale information. This change extracts the scale from the Avro schema's Decimal logical type and applies it when converting fixed byte arrays to BigDecimal, ensuring accurate decimal representation in the output JSON when reading from the Parquet files. |
| byte[] tokenBytes = new byte[] { 34, 92, 13, 10, 9}; | ||
| Schema tokenSchema = new Schema.Parser().parse( | ||
| "{ \"type\": \"record\", \"name\": \"MyRecord\", \"fields\": [" + | ||
| "{\"name\": \"token\", \"type\": {\"type\":\"fixed\",\"name\":\"TokenFixed\",\"size\":4}}" + |
There was a problem hiding this comment.
Will make this change.
| if (decimalScale != null) { | ||
| BigInteger unscaledValue = new BigInteger(bytes); | ||
| BigDecimal decimal = new BigDecimal(unscaledValue, decimalScale); | ||
| buffer.append(decimal.doubleValue()); |
There was a problem hiding this comment.
Downstream Json parser may treat it as Double but I suggest we preserve the precision here with decimal.toPlainString() or decimal.toString()
It will be Number |
Signed-off-by: Divyansh Bokadia <dbokadia@amazon.com>
…arch-project#6369) Signed-off-by: Divyansh Bokadia <dbokadia@amazon.com>
…arch-project#6369) Signed-off-by: Divyansh Bokadia <dbokadia@amazon.com>
…arch-project#6369) Signed-off-by: Divyansh Bokadia <dbokadia@amazon.com> Signed-off-by: Simon ELBAZ <elbazsimon9@gmail.com>
…arch-project#6369) Signed-off-by: Divyansh Bokadia <dbokadia@amazon.com>
…arch-project#6369) Signed-off-by: Divyansh Bokadia <dbokadia@amazon.com>
Description
Adds proper deserialization of DECIMAL logical types stored as GenericFixed in Parquet files. Converts fixed byte arrays to BigDecimal using the schema's scale, then serializes as numeric values. Includes fallback handling for non-decimal fixed arrays as JSON byte objects.
This resolves issues where mysql decimal columns with large precision were not handled.
Issues Resolved
Resolves #6339
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.