Skip to content

Add support to provide thrift codec for connector specific fields#25242

Merged
shangm2 merged 1 commit into
prestodb:masterfrom
shangm2:thrift_split
Aug 4, 2025
Merged

Add support to provide thrift codec for connector specific fields#25242
shangm2 merged 1 commit into
prestodb:masterfrom
shangm2:thrift_split

Conversation

@shangm2

@shangm2 shangm2 commented Jun 3, 2025

Copy link
Copy Markdown
Contributor

Description

TODO: Clean up the json field #25671

To keep the trunk cleaner and also move this pr forward, we had a offline discussion and agree to update this pr with the following items:

  1. Keep ConnectorSplit and ConnectorTransactionHandle in ConnectorCodecProvider.
  2. Remove Hive implementation, in particular HiveSplit and HiveTransactionHandle removing mixed JSON/Thrift support
  3. Make ExecutionWriterTarget a Union and add support in ConnectorCodecProvider for related handles .
  4. Deprecate MetadataUpdate.

The original description:

  1. Add thrift support for split and transaction handle
  2. but only activated if the feature toggle is on and a proper connector specific codec is provided
  3. We will be using byte array for serialization for now and will iterate on the interface definition within SPI to avoid unnecessary allocations for better performance.
  4. Instructions on how to use the thrift .idl file can be found in the rfc.
Screenshot 2025-07-16 at 07 58 01 5. cpp side changes: #25595

Motivation and Context

prestodb/rfcs#38

Impact

Test Plan

  1. passed verifier run

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Improve efficiency of coordinator by supporting thrift codec for connector-specific data.

@shangm2 shangm2 requested a review from a team as a code owner June 3, 2025 16:35
@shangm2 shangm2 requested a review from jaystarshot June 3, 2025 16:35
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Jun 3, 2025
@shangm2 shangm2 changed the title Gradually migrate hive split fields to thrift but might still have some fields as json [NRFR] Gradually migrate hive split fields to thrift but might still have some fields as json Jun 3, 2025
@shangm2 shangm2 force-pushed the thrift_split branch 2 times, most recently from 58c0a15 to 0326cee Compare June 3, 2025 22:50
@shangm2 shangm2 force-pushed the thrift_split branch 2 times, most recently from 58be3d3 to 8b437b7 Compare June 4, 2025 06:33
@shangm2 shangm2 changed the title [NRFR] Gradually migrate hive split fields to thrift but might still have some fields as json Gradually migrate hive split fields to thrift but might still have some fields as json Jun 4, 2025
@shangm2 shangm2 changed the title Gradually migrate hive split fields to thrift but might still have some fields as json Add thrift codec for hive split Jun 4, 2025
Comment thread presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitCodec.java Outdated
Comment thread presto-spi/src/main/java/com/facebook/presto/spi/ConnectorSpecificCodec.java Outdated
Comment thread presto-hive/src/main/java/com/facebook/presto/hive/HiveSplitCodec.java Outdated
Comment thread presto-spi/src/main/java/com/facebook/presto/spi/SplitContext.java
@tdcmeehan tdcmeehan self-assigned this Jun 9, 2025
@shangm2 shangm2 force-pushed the thrift_split branch 2 times, most recently from c4c8759 to fe0490a Compare June 11, 2025 16:58
@shangm2 shangm2 force-pushed the thrift_split branch 3 times, most recently from 03c06c1 to 4dd32d9 Compare July 16, 2025 16:57
@tdcmeehan

Copy link
Copy Markdown
Contributor

This could work for simple java types but for complicated fields, we wont have the json codec until creating a instance.

Can you explain this a bit more? I see as an example TableWriteInfoCodec has a JsonCodec wired into its constructor, isn't this what's needed? Couldn't we do something along the lines of what is found in UuidToLeachSalzBinaryEncodingThriftCodec--its Thrift type is binary, and what we do specially is we convert the ByteBuffer contents and feed it into a JsonCodec, as found in e.g. TableWriteInfoCodec. Yes you would need static methods with @FromThrift and @ToThrift, but everything should be in that binary payload and easily convertible to the JsonCodec.

I totally agree with not adding cost to the project. For those connector-specific fields with json serde, we don't recommend fully converting it to thrift for now so we believe they are not tech debt. It might be confusing to have a json field in a thrift payload but they are all connector-specific and the serde framework this pr adds does give the freedom to use the appropriate serde for those fields in different scenarios.

Maybe I'm misunderstanding. From what I see in the code, I only see splits and transaction handles actually participating in the serialization. Other things like table write handles do not, and they are converted into JSON using the old method. I don't think it's OK to leave the framework to omit some connector-specific fields--consider the connector author in C++ needing to JSON serialize some fields, Thrift serialize others. I'm OK to wait for the better solution, which is that all connector specific fields may be serialized by the connector, rather than to have to potentially migrate at a later point in time after production dependencies on this halfway functionality have already been added.

@tdcmeehan

Copy link
Copy Markdown
Contributor

But once again, I think as long as it's connector-dictated and not Presto engine-dictated, that's fine. If we can add the rest of the fields to ConnectorCodecProvider, but provide default implementations which serialize and deserialize as JSON, that's... messy, but fine.

@shangm2

shangm2 commented Jul 17, 2025

Copy link
Copy Markdown
Contributor Author

But once again, I think as long as it's connector-dictated and not Presto engine-dictated, that's fine. If we can add the rest of the fields to ConnectorCodecProvider, but provide default implementations which serialize and deserialize as JSON, that's... messy, but fine.

The custom codec is being registered to thrift codec manager and during serde, thrift codec will automatically discover them including the thrift serde for split, transacation handle and all their fields recursively. If you are fine with adding custom serde in ConnectorCodecProvider manually, why not just registering the custom codec for the specific connector and let drift recursively find them.

@shangm2

shangm2 commented Jul 17, 2025

Copy link
Copy Markdown
Contributor Author

This could work for simple java types but for complicated fields, we wont have the json codec until creating a instance.

Can you explain this a bit more? I see as an example TableWriteInfoCodec has a JsonCodec wired into its constructor, isn't this what's needed? Couldn't we do something along the lines of what is found in UuidToLeachSalzBinaryEncodingThriftCodec--its Thrift type is binary, and what we do specially is we convert the ByteBuffer contents and feed it into a JsonCodec, as found in e.g. TableWriteInfoCodec. Yes you would need static methods with @FromThrift and @ToThrift, but everything should be in that binary payload and easily convertible to the JsonCodec.

I totally agree with not adding cost to the project. For those connector-specific fields with json serde, we don't recommend fully converting it to thrift for now so we believe they are not tech debt. It might be confusing to have a json field in a thrift payload but they are all connector-specific and the serde framework this pr adds does give the freedom to use the appropriate serde for those fields in different scenarios.

Maybe I'm misunderstanding. From what I see in the code, I only see splits and transaction handles actually participating in the serialization. Other things like table write handles do not, and they are converted into JSON using the old method. I don't think it's OK to leave the framework to omit some connector-specific fields--consider the connector author in C++ needing to JSON serialize some fields, Thrift serialize others. I'm OK to wait for the better solution, which is that all connector specific fields may be serialized by the connector, rather than to have to potentially migrate at a later point in time after production dependencies on this halfway functionality have already been added.

The json codec is not available in a static method. But let me give the static method another try.

@shangm2

shangm2 commented Jul 17, 2025

Copy link
Copy Markdown
Contributor Author

If it is fine to you, I can remove all hive related changes and only have thrift annotation for presto-main and presto-main-base.

@shangm2 shangm2 changed the title Add thrift codec for split and transaction handle from hive and $remote Add support to provide thrift codec for connector specific fields Jul 17, 2025
@shangm2

shangm2 commented Jul 23, 2025

Copy link
Copy Markdown
Contributor Author

Hey @tdcmeehan Feel free to give this pr another review. Thanks.

@tdcmeehan

Copy link
Copy Markdown
Contributor

Appreciate the follow through, thanks @shangm2!

@facebook-github-bot

Copy link
Copy Markdown
Collaborator

@shangm2 has imported this pull request. If you are a Meta employee, you can view this in D76874534.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants