feat: Add binary serialization codecs for CLP connector.#158
Conversation
Implement ConnectorCodecProvider with DataOutputStream-based binary codecs for all 5 CLP handle types (ColumnHandle, Split, TableHandle, TableLayoutHandle, TransactionHandle). This enables the C++ Velox worker to deserialize handles via customSerializedValue without depending on the Presto JSON protocol type system.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds connector codecs and a ConnectorCodecProvider to CLP; ClpConnector now accepts a TypeManager and exposes the provider. New deterministic binary codecs handle serialization/deserialization for table, layout, column, split, and transaction handles; a utility for UTF‑8 length‑prefixed strings and tests were added. Changes
Sequence DiagramsequenceDiagram
participant Client
participant ClpConnector as ClpConnector
participant Provider as ClpConnectorCodecProvider
participant TypeManager as TypeManager
participant Codec as ConnectorCodec
Client->>ClpConnector: getConnectorCodecProvider()
ClpConnector-->>Client: Provider (holds TypeManager)
Client->>Provider: getColumnHandleCodec()
Provider->>TypeManager: (injected) used to construct ClpColumnHandleCodec
Provider-->>Client: Optional<ConnectorCodec>
Client->>Codec: serialize(handle)
Codec-->>Client: byte[]
Client->>Codec: deserialize(byte[])
Codec->>TypeManager: resolve Type (column codec)
Codec-->>Client: handle
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java`:
- Around line 65-67: ClpColumnHandleCodec currently parses the TypeSignature and
calls typeManager.getType(...) but does not handle a null return, producing an
invalid ClpColumnHandle; add a null check after Type type =
typeManager.getType(typeSignature) in ClpColumnHandleCodec and if type is null
throw an explicit exception (e.g., IllegalArgumentException or a
PrestoException) with a clear message including the typeSignatureString and
columnName so callers fail fast instead of creating a bad ClpColumnHandle.
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java`:
- Around line 56-60: The deserialization reads an ordinal with
ClpSplit.SplitType.values()[in.readInt()] which can throw
ArrayIndexOutOfBoundsException for invalid/future ordinals; change it in
ClpSplitCodec to read the int into a local variable (e.g., int ordinal =
in.readInt()), check that ordinal is within 0 and
ClpSplit.SplitType.values().length - 1, and if out of range throw a clear
IOException/IllegalArgumentException describing the bad ordinal and payload
source; only after validation map ordinal to ClpSplit.SplitType and proceed to
construct the new ClpSplit(path, type, kqlQuery).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 34620496-6316-4af9-8844-12a0f454ef9c
📒 Files selected for processing (9)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConnector.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpCodecProvider.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableLayoutHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTransactionHandleCodec.javapresto-clp/src/test/java/com/facebook/presto/plugin/clp/codec/TestClpCodecByteFixtures.javapresto-clp/src/test/java/com/facebook/presto/plugin/clp/codec/TestClpCodecRoundtrip.java
…ch style Rename ClpCodecProvider → ClpConnectorCodecProvider to follow Tim's TpchConnectorCodecProvider naming convention. Consolidate all 5 codec classes into inner static classes within a single file. Adopt Tim's variable naming style (cast to type-specific variable) and error handling pattern (UncheckedIOException with plain try blocks). Follows the approach proposed in prestodb#26650.
Move each codec class out of ClpConnectorCodecProvider into its own file. ClpConnectorCodecProvider now only provides the connector-level factory methods. Each codec still follows Tim's style from prestodb#26650 (cast to type-specific variable, UncheckedIOException, plain try).
…remove byte fixtures - Delete TestClpCodecByteFixtures.java - Rename TestClpCodecRoundtrip → TestClpConnectorCodecProvider - Change kql in testSplitRoundtripArchiveWithKql to LEVEL:ERROR - Remove testTableLayoutHandleMinimalRoundtrip - Remove testCodecProviderReturnsAllCodecs - Remove testEmptyStringRoundtrip - Remove testEmptyOptionalRoundtrip
There was a problem hiding this comment.
♻️ Duplicate comments (2)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java (1)
52-63:⚠️ Potential issue | 🟠 MajorBounds-check the split ordinal before indexing the enum.
SplitType.values()[...]will throw on bad or future ordinals, which makes corrupted payloads fail with an opaque exception.Proposed fix
- ClpSplit.SplitType type = ClpSplit.SplitType.values()[in.readInt()]; + int ordinal = in.readInt(); + ClpSplit.SplitType[] values = ClpSplit.SplitType.values(); + if (ordinal < 0 || ordinal >= values.length) { + throw new IllegalArgumentException("Unknown ClpSplit.SplitType ordinal: " + ordinal); + } + ClpSplit.SplitType type = values[ordinal];🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java` around lines 52 - 63, In ClpSplitCodec.deserialize, guard against invalid ordinals before indexing ClpSplit.SplitType.values(): read the int ordinal into a local variable, check it is within 0 (inclusive) and ClpSplit.SplitType.values().length (exclusive), and if not throw a descriptive IOException (or return a clear failure) instead of directly indexing; then map the validated ordinal to ClpSplit.SplitType and continue building the ClpSplit (refer to method deserialize and enum ClpSplit.SplitType and the ClpSplit constructor).presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java (1)
59-69:⚠️ Potential issue | 🟠 MajorGuard the resolved type before constructing the handle.
TypeManager#getType(...)can returnnull; passing that through here would create an invalidClpColumnHandleand shift the failure to a later call site.Proposed fix
- Type type = typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)); - return new ClpColumnHandle(columnName, originalColumnName, type); + Type type = typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)); + if (type == null) { + throw new IllegalArgumentException("Unknown CLP column type: " + typeSignature); + } + return new ClpColumnHandle(columnName, originalColumnName, type);Read-only verification
#!/bin/bash set -euo pipefail sed -n '1,220p' presto-common/src/main/java/com/facebook/presto/common/type/TypeManager.java rg -n -C 2 '\bgetType\s*\(' \ presto-common/src/main/java/com/facebook/presto/common/type/TypeManager.java \ presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java` around lines 59 - 69, In ClpColumnHandleCodec.deserialize, guard against TypeManager#getType returning null before constructing a ClpColumnHandle: after resolving Type type = typeManager.getType(...), check if type == null and if so throw a clear IOException (or other appropriate checked exception declared by deserialize) with a message that includes the parsed typeSignature; otherwise proceed to return new ClpColumnHandle(...). This ensures deserialize validates the resolved type and fails fast instead of creating an invalid ClpColumnHandle.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java`:
- Around line 59-69: In ClpColumnHandleCodec.deserialize, guard against
TypeManager#getType returning null before constructing a ClpColumnHandle: after
resolving Type type = typeManager.getType(...), check if type == null and if so
throw a clear IOException (or other appropriate checked exception declared by
deserialize) with a message that includes the parsed typeSignature; otherwise
proceed to return new ClpColumnHandle(...). This ensures deserialize validates
the resolved type and fails fast instead of creating an invalid ClpColumnHandle.
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java`:
- Around line 52-63: In ClpSplitCodec.deserialize, guard against invalid
ordinals before indexing ClpSplit.SplitType.values(): read the int ordinal into
a local variable, check it is within 0 (inclusive) and
ClpSplit.SplitType.values().length (exclusive), and if not throw a descriptive
IOException (or return a clear failure) instead of directly indexing; then map
the validated ordinal to ClpSplit.SplitType and continue building the ClpSplit
(refer to method deserialize and enum ClpSplit.SplitType and the ClpSplit
constructor).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c3bdf0b5-fc82-4e80-973d-5c6b3a5d88f6
📒 Files selected for processing (8)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/ClpConnector.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpConnectorCodecProvider.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableLayoutHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTransactionHandleCodec.javapresto-clp/src/test/java/com/facebook/presto/plugin/clp/codec/TestClpConnectorCodecProvider.java
…HandleCodec Add wire format comments to each codec class cross-referencing the C++ ClpConnectorProtocol::deserialize() counterpart. Remove redundant comments from ClpTransactionHandleCodec.
Implement binary deserialization for all 5 CLP connector handle types (ClpColumnHandle, ClpSplit, ClpTableHandle, ClpTableLayoutHandle, ClpTransactionHandle) by inheriting ConnectorProtocol and overriding deserialize(). Reads binary bytes (Java DataOutputStream big-endian wire format from #158) directly into existing protocol structs using std::istringstream and folly::Endian::big. Follows Tim's TpchConnectorProtocol pattern from prestodb#26650.
gibber9809
left a comment
There was a problem hiding this comment.
Nice work! Besides the other comments I left, I'm also wondering if you think there's any benefit to making the deserialization code stricter? In particular, right now it seems like all of the deserialize methods will allow and silently drop extra bytes that follow valid encoded data, e.g. <bytes for valid encoding><extra unexpected bytes>.
| ByteArrayOutputStream byteOut = new ByteArrayOutputStream(); | ||
| DataOutputStream out = new DataOutputStream(byteOut); | ||
| out.writeUTF(columnHandle.getColumnName()); | ||
| out.writeUTF(columnHandle.getOriginalColumnName()); |
There was a problem hiding this comment.
Per java's docs the string content from writeUTF isn't actually UTF-8, and in particular uses a non-standard encoding for \u0000 and only supports encoding 4-byte characters as surrogate pairs.
To make it easier to work with this encoded data in clp-s it might be worth considering just writing our own function to read/write strings as real UTF-8.
There was a problem hiding this comment.
emmmm, I am a bit hesitant on this probably because I am not clear with the context. I am sure you are correct on Java's writeUTF not being UTF-8, but what would the consequence be? Everything we serialize here will be eventually deserialized at Velox worker, which is way before we perform data extraction with clp-s.
Using kql, the field in ClpTableLayoutHandle as an example, it will be a native C++ string when passing it to clp-s, rather than a byte buffer.
There was a problem hiding this comment.
I think I understand the confusion. Unlike other languages, c++ doesn't really have a well-defined encoding for std::string. Typically UTF-8 or ASCII will be used, but internally it's just treated as an array of bytes, so it's up to the programmer to guarantee a specific encoding.
Since clp-s works under the assumption that (most) strings are valid UTF-8, some string operations, such as string comparisons, won't work as expected when that's not the case.
In this instance, that would come up as some column names that contain characters writeUTF encodes in a non-standard way not being searchable.
Of course we could follow the alternative approach of converting the non-standard UTF-8 into regular UTF-8 on the velox side, but that's more complicated, bug-prone, and probably slower than just using UTF-8 in the first place.
There was a problem hiding this comment.
I see what you mean now, thanks for the clear explanation.
Introduced a new class CodecUtils which writes UTF encoding of Java string as
byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
According to spec, it is the standard UTF 8 format. Also explained the problem in the comment of CodecUtils class definition now.
Extract writeTableHandle/readTableHandle so ClpTableLayoutHandleCodec can share the same DataOutputStream/DataInputStream instead of duplicating the table handle serialization fields.
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java (1)
73-74:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast when the type signature cannot be resolved.
At Line 73,
typeManager.getType(...)can returnnull; Line 74 then builds aClpColumnHandlewith an invalid type. Please validate and throw an explicit exception with context (columnName,typeSignature) before constructing the handle.Proposed fix
String columnName = in.readUTF(); String originalColumnName = in.readUTF(); String typeSignature = in.readUTF(); - Type type = typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)); + Type type = typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)); + if (type == null) { + throw new IllegalArgumentException("Unknown CLP column type: " + typeSignature + " for column: " + columnName); + } return new ClpColumnHandle(columnName, originalColumnName, type);#!/bin/bash set -euo pipefail # Verify TypeManager#getType contract and ClpColumnHandle nullability expectations. fd -i 'TypeManager.java' --exec sed -n '1,220p' {} fd -i 'ClpColumnHandle.java' --exec sed -n '1,220p' {}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java` around lines 73 - 74, Type resolution may return null so before constructing the ClpColumnHandle you must validate the resolved Type and throw a clear exception including columnName and typeSignature; update the code around Type type = typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)) to check if type == null and throw an IllegalArgumentException (or PrestoException) with a message like "Unable to resolve type for column <columnName>: <typeSignature>" instead of returning a ClpColumnHandle with a null type; ensure you reference TypeManager.getType, TypeSignature.parseTypeSignature, ClpColumnHandle, columnName and typeSignature in the message for debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java`:
- Line 41: In ClpSplitCodec, avoid the direct cast "ClpSplit split = (ClpSplit)
handle;" — first check that the ConnectorSplit (handle) is an instance of
ClpSplit (e.g., using instanceof) and if not throw an IllegalArgumentException
with a clear message that includes the actual handle.getClass().getName() and
expected ClpSplit; then safely cast to ClpSplit and proceed. This ensures
ClpSplitCodec fails fast with a descriptive error instead of a
ClassCastException.
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableHandleCodec.java`:
- Around line 51-56: The deserialize method in ClpTableHandleCodec returns
immediately after calling readTableHandle(in) without verifying the input stream
was fully consumed, allowing trailing bytes to be ignored; update
ConnectorTableHandle deserialize(byte[] bytes) to wrap the DataInputStream, call
readTableHandle(in), then check in.read() (or in.available()) to detect any
remaining bytes and throw a descriptive IOException/IllegalArgumentException if
there are trailing bytes, ensuring protocol drift is caught; apply the same
trailing-byte validation pattern to ClpColumnHandleCodec, ClpSplitCodec, and
ClpTableLayoutHandleCodec deserialize methods.
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableLayoutHandleCodec.java`:
- Around line 70-84: The deserialize method in ClpTableLayoutHandleCodec
currently stops after reading fields (readTableHandle, kqlQuery, metadataSql)
but doesn't validate there are no remaining bytes, allowing truncated/extra-data
payloads to pass; update ClpTableLayoutHandleCodec.deserialize to check that the
DataInputStream has been fully consumed (e.g. verify in.read() returns -1 or
available() == 0) after constructing the ClpTableLayoutHandle and throw an
IOException/IllegalArgumentException if trailing bytes remain; apply the same
pattern to ClpTableHandleCodec.deserialize and add unit tests that pass
malformed payloads with extra or dangling bytes to ensure the validation
triggers.
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTransactionHandleCodec.java`:
- Around line 27-35: The serialize and deserialize methods in
ClpTransactionHandleCodec must enforce invariants: update
serialize(ConnectorTransactionHandle handle) to null-check handle and verify
handle == ClpTransactionHandle.INSTANCE, throwing an IllegalArgumentException
(or similar) if not, and update deserialize(byte[] bytes) to null-check bytes
and verify bytes.length == 0, throwing on mismatch; reference the methods
serialize and deserialize and the ClpTransactionHandle.INSTANCE constant when
making these checks so cross-connector or payload-mismatch errors fail fast.
---
Duplicate comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java`:
- Around line 73-74: Type resolution may return null so before constructing the
ClpColumnHandle you must validate the resolved Type and throw a clear exception
including columnName and typeSignature; update the code around Type type =
typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)) to check if
type == null and throw an IllegalArgumentException (or PrestoException) with a
message like "Unable to resolve type for column <columnName>: <typeSignature>"
instead of returning a ClpColumnHandle with a null type; ensure you reference
TypeManager.getType, TypeSignature.parseTypeSignature, ClpColumnHandle,
columnName and typeSignature in the message for debugging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b44c27e4-2578-4187-b8bf-461a2679c776
📒 Files selected for processing (5)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableLayoutHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTransactionHandleCodec.java
Java codec is serializing the data, |
Add instanceof checks before casting in serialize() for clearer error messages on type mismatch. Add trailing-byte checks in deserialize() to reject unexpected extra data after valid encoded fields.
Java's writeUTF uses Modified UTF-8 which encodes \0 and supplementary characters differently from standard UTF-8. Since clp-s assumes real UTF-8 for string comparisons, using writeUTF would break searchability for strings containing non-ASCII content. Add CodecUtils with writeUtf8String/readUtf8String helpers that use StandardCharsets.UTF_8, keeping the same wire format (2-byte BE length prefix + encoded bytes).
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java (1)
81-83:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast when
TypeManagercannot resolve the type signature.On Line 81, if
typeManager.getType(...)returnsnull, Line 82 creates an invalidClpColumnHandlethat fails later and less clearly.Proposed fix
- Type type = typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)); + Type type = typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)); + if (type == null) { + throw new IOException("Unknown CLP column type signature: " + typeSignature); + } return new ClpColumnHandle(columnName, originalColumnName, type);#!/bin/bash set -euo pipefail # Verify TypeManager contract and current codec behaviour from source. fd -i "TypeManager.java" --exec sed -n '1,220p' {} fd -i "ClpColumnHandleCodec.java" --exec sed -n '1,220p' {} fd -i "ClpColumnHandle.java" --exec sed -n '1,140p' {} # Find other call sites guarding/null-checking getType results for precedent. rg -n -C2 'getType\(TypeSignature\.parseTypeSignature' --iglob '*.java'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java` around lines 81 - 83, The deserialization code in ClpColumnHandleCodec calls TypeManager.getType(TypeSignature.parseTypeSignature(typeSignature)) and can return null, producing an invalid ClpColumnHandle; update the code in ClpColumnHandleCodec (the block that calls TypeManager.getType / TypeSignature.parseTypeSignature and constructs new ClpColumnHandle) to check the returned Type for null and immediately throw an IllegalArgumentException (or PrestoException if preferred) with a clear message containing the typeSignature and columnName; this makes the failure fail-fast and avoids constructing an invalid ClpColumnHandle.presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java (1)
71-71:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate split-type ordinal before indexing enum values.
On Line 71, direct enum indexing can throw
ArrayIndexOutOfBoundsExceptionfor malformed or drifted payloads; validate bounds and throw a clear deserialization error.Proposed fix
- ClpSplit.SplitType type = ClpSplit.SplitType.values()[in.readInt()]; + int ordinal = in.readInt(); + ClpSplit.SplitType[] values = ClpSplit.SplitType.values(); + if (ordinal < 0 || ordinal >= values.length) { + throw new IOException("Unknown ClpSplit.SplitType ordinal: " + ordinal); + } + ClpSplit.SplitType type = values[ordinal];🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java` at line 71, Validate the ordinal read from the stream before indexing into the enum: read the int into a local (e.g., "int ordinal = in.readInt()"), check that ordinal >= 0 && ordinal < ClpSplit.SplitType.values().length, and if out of bounds throw a clear deserialization exception (e.g., IOException or a Presto-friendly exception) that includes the invalid ordinal; only then assign ClpSplit.SplitType type = ClpSplit.SplitType.values()[ordinal]. This protects ClpSplit.SplitType lookup from ArrayIndexOutOfBoundsException for malformed or drifted payloads.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableHandleCodec.java`:
- Around line 43-45: The type-check exception message can throw NPE when handle
is null; update the IllegalArgumentException construction in ClpTableHandleCodec
to use a null-safe class name (e.g., use handle == null ? "null" :
handle.getClass().getName()) instead of directly calling
handle.getClass().getName(); apply the same change to the analogous checks in
ClpSplitCodec, ClpTableLayoutHandleCodec, and ClpColumnHandleCodec so every
type-mismatch log uses the null-safe pattern referencing the local variable
handle.
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/CodecUtils.java`:
- Around line 37-42: The writeUtf8String method currently writes a 2-byte length
with out.writeShort(bytes.length) which will wrap for lengths >65535; add an
explicit bounds check in writeUtf8String that computes byte[] bytes =
value.getBytes(StandardCharsets.UTF_8) and if bytes.length > 0xFFFF throw an
IOException (or IllegalArgumentException) with a clear message rejecting strings
longer than 65535 bytes before calling out.writeShort and out.write; reference
the writeUtf8String method, the bytes variable, and the out.writeShort call when
making the change.
---
Duplicate comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java`:
- Around line 81-83: The deserialization code in ClpColumnHandleCodec calls
TypeManager.getType(TypeSignature.parseTypeSignature(typeSignature)) and can
return null, producing an invalid ClpColumnHandle; update the code in
ClpColumnHandleCodec (the block that calls TypeManager.getType /
TypeSignature.parseTypeSignature and constructs new ClpColumnHandle) to check
the returned Type for null and immediately throw an IllegalArgumentException (or
PrestoException if preferred) with a clear message containing the typeSignature
and columnName; this makes the failure fail-fast and avoids constructing an
invalid ClpColumnHandle.
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java`:
- Line 71: Validate the ordinal read from the stream before indexing into the
enum: read the int into a local (e.g., "int ordinal = in.readInt()"), check that
ordinal >= 0 && ordinal < ClpSplit.SplitType.values().length, and if out of
bounds throw a clear deserialization exception (e.g., IOException or a
Presto-friendly exception) that includes the invalid ordinal; only then assign
ClpSplit.SplitType type = ClpSplit.SplitType.values()[ordinal]. This protects
ClpSplit.SplitType lookup from ArrayIndexOutOfBoundsException for malformed or
drifted payloads.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b6ea899a-0ef4-46c9-9ba0-a91db53f56cf
📒 Files selected for processing (6)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableLayoutHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTransactionHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/CodecUtils.java
Implement binary deserialization for all 5 CLP connector handle types (ClpColumnHandle, ClpSplit, ClpTableHandle, ClpTableLayoutHandle, ClpTransactionHandle) by inheriting ConnectorProtocol and overriding deserialize(). Reads binary bytes (Java DataOutputStream big-endian wire format from #158) directly into existing protocol structs using std::istringstream and folly::Endian::big. Follows Tim's TpchConnectorProtocol pattern from prestodb#26650.
Implement binary deserialization for all 5 CLP connector handle types (ClpColumnHandle, ClpSplit, ClpTableHandle, ClpTableLayoutHandle, ClpTransactionHandle) by inheriting ConnectorProtocol and overriding deserialize(). Reads binary bytes (Java DataOutputStream big-endian wire format from #158) directly into existing protocol structs using std::istringstream and folly::Endian::big. Follows Tim's TpchConnectorProtocol pattern from prestodb#26650.
Implement binary deserialization for all 5 CLP connector handle types (ClpColumnHandle, ClpSplit, ClpTableHandle, ClpTableLayoutHandle, ClpTransactionHandle) by inheriting ConnectorProtocol and overriding deserialize(). Reads binary bytes (Java DataOutputStream big-endian wire format from #158) directly into existing protocol structs using std::istringstream and folly::Endian::big. Follows Tim's TpchConnectorProtocol pattern from prestodb#26650.
Implement binary deserialization for all 5 CLP connector handle types (ClpColumnHandle, ClpSplit, ClpTableHandle, ClpTableLayoutHandle, ClpTransactionHandle) by inheriting ConnectorProtocol and overriding deserialize(). Reads binary bytes (Java DataOutputStream big-endian wire format from #158) directly into existing protocol structs using std::istringstream and folly::Endian::big. Follows Tim's TpchConnectorProtocol pattern from prestodb#26650.
Implement binary deserialization for all 5 CLP connector handle types (ClpColumnHandle, ClpSplit, ClpTableHandle, ClpTableLayoutHandle, ClpTransactionHandle) by inheriting ConnectorProtocol and overriding deserialize(). Reads binary bytes (Java DataOutputStream big-endian wire format from #158) directly into existing protocol structs using std::istringstream and folly::Endian::big. Follows Tim's TpchConnectorProtocol pattern from prestodb#26650.
gibber9809
left a comment
There was a problem hiding this comment.
Minor nit for docstring, but otherwise LGTM!
…CodecUtils.java Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
gibber9809
left a comment
There was a problem hiding this comment.
LGTM! For PR title maybe just add a period at the end to match our convention, but otherwise looks good.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java (1)
82-83:⚠️ Potential issue | 🟠 Major | ⚡ Quick winGuard unknown type signatures before constructing
ClpColumnHandle.
typeManager.getType(...)can returnnull; on Line 83 this creates an invalid handle and defers failure to later paths. Fail fast with an explicit exception when type resolution fails.Proposed fix
- Type type = typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)); - return new ClpColumnHandle(columnName, originalColumnName, type); + TypeSignature parsedTypeSignature = TypeSignature.parseTypeSignature(typeSignature); + Type type = typeManager.getType(parsedTypeSignature); + if (type == null) { + throw new IllegalArgumentException("Unknown CLP column type signature: " + typeSignature + + " for column: " + columnName); + } + return new ClpColumnHandle(columnName, originalColumnName, type);#!/bin/bash set -euo pipefail # Verify TypeManager contract (null possible) and current codec behaviour. sed -n '1,60p' presto-common/src/main/java/com/facebook/presto/common/type/TypeManager.java sed -n '70,95p' presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java # Check whether tests already cover unknown type-signature rejection. rg -n --iglob '*TestClpConnectorCodecProvider.java' 'unknown|invalid|type signature|null.*type|deserialize' presto-clp/src/test/java🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java` around lines 82 - 83, The code currently calls typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)) and immediately constructs a ClpColumnHandle, but getType can return null; modify the ClpColumnHandleCodec decoding path so after resolving Type type = typeManager.getType(...) you explicitly check for null and throw a clear IllegalArgumentException (or PrestoException if preferred) including the unresolved typeSignature and columnName, instead of returning a handle with a null type; ensure the check is placed before the ClpColumnHandle constructor so callers fail fast.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java`:
- Line 52: The codec currently writes and reads ClpSplit.SplitType using
enum.ordinal() (in ClpSplitCodec at the write/read sites), which couples the
wire format to enum order; add explicit encoder/decoder methods (e.g.,
encodeSplitType(ClpSplit.SplitType) and decodeSplitType(int)) that map each
SplitType to a stable integer ID (e.g., ARCHIVE -> 0, IR -> 1) and throw on
unknown values, then replace the direct uses of split.getType().ordinal() and
the inverse ordinal-based read with calls to these new methods so the wire
protocol is stable across enum reordering.
---
Duplicate comments:
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.java`:
- Around line 82-83: The code currently calls
typeManager.getType(TypeSignature.parseTypeSignature(typeSignature)) and
immediately constructs a ClpColumnHandle, but getType can return null; modify
the ClpColumnHandleCodec decoding path so after resolving Type type =
typeManager.getType(...) you explicitly check for null and throw a clear
IllegalArgumentException (or PrestoException if preferred) including the
unresolved typeSignature and columnName, instead of returning a handle with a
null type; ensure the check is placed before the ClpColumnHandle constructor so
callers fail fast.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d309cbf5-8c90-4ac2-a83c-69824bcc2d90
📒 Files selected for processing (5)
presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpColumnHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpTableLayoutHandleCodec.javapresto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/CodecUtils.java
| ByteArrayOutputStream byteOut = new ByteArrayOutputStream(); | ||
| DataOutputStream out = new DataOutputStream(byteOut); | ||
| writeUtf8String(split.getPath(), out); | ||
| out.writeInt(split.getType().ordinal()); |
There was a problem hiding this comment.
Use explicit split-type wire IDs instead of enum ordinals.
Line 52 and Line 72 currently bind the wire protocol to enum declaration order. Reordering/inserting enum values can silently break cross-version decode semantics.
Proposed fix
- out.writeInt(split.getType().ordinal());
+ out.writeInt(encodeSplitType(split.getType()));
...
- ClpSplit.SplitType type = ClpSplit.SplitType.values()[in.readInt()];
+ ClpSplit.SplitType type = decodeSplitType(in.readInt());private static int encodeSplitType(ClpSplit.SplitType type)
{
switch (type) {
case ARCHIVE:
return 0;
case IR:
return 1;
default:
throw new IllegalArgumentException("Unsupported ClpSplit.SplitType: " + type);
}
}
private static ClpSplit.SplitType decodeSplitType(int id)
{
switch (id) {
case 0:
return ClpSplit.SplitType.ARCHIVE;
case 1:
return ClpSplit.SplitType.IR;
default:
throw new IllegalArgumentException("Unknown ClpSplit.SplitType id: " + id);
}
}Also applies to: 72-72
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@presto-clp/src/main/java/com/facebook/presto/plugin/clp/codec/ClpSplitCodec.java`
at line 52, The codec currently writes and reads ClpSplit.SplitType using
enum.ordinal() (in ClpSplitCodec at the write/read sites), which couples the
wire format to enum order; add explicit encoder/decoder methods (e.g.,
encodeSplitType(ClpSplit.SplitType) and decodeSplitType(int)) that map each
SplitType to a stable integer ID (e.g., ARCHIVE -> 0, IR -> 1) and throw on
unknown values, then replace the direct uses of split.getType().ordinal() and
the inverse ordinal-based read with calls to these new methods so the wire
protocol is stable across enum reordering.
b0b318f
into
y-scope:release-0.297-edge-10-clp-connector
Summary
Add binary serialization codecs for all handle types required by CLP Connector (ColumnHandle, Split, TableHandle, TableLayoutHandle, TransactionHandle) using the
ConnectorCodecProviderSPI introduced in prestodb#26257.Follows the approach proposed by Tim in prestodb#26650 for the Tpch connector.
Background & Motivation
When the Presto coordinator plans a query, it creates connector-specific handles and sends them to workers. Currently, each handle is serialized to JSON via Jackson. On the native Velox worker side, these JSON handles are deserialized into C++ structs through auto-generated code —Presto provides a build-time component called Presto Native Worker Protocol Code Generation, which generates C++ struct definitions and their
from_json()/to_json()functions from YAML schema files (presto_protocol_*.yml) using a chevron template engine. These YAML schemas live in the Presto presto_protocol directory, reference the Java handle classes in the Presto source root, and are tightly coupled with Presto's protocol code infrastructure.This works fine as long as the connector lives inside the Presto repo, since any change to the handles is as simple as regenerating the protocol files using the chevron engine. But we are migrating the CLP connector into a standalone Velox
.soplugin repo, and this codegen pipeline becomes a maintenance burden.For example, say we want to add a field to
ClpColumnHandle.java. With the current JSON approach, we must:.hand.cppfilesThe alternative is to copy over the entire Presto Presto Native Worker Protocol Code Generation pipeline within our Velox plugin. This lets us run the codegen in our own repo without forking Presto, but then any upstream change to the protocol must be ported manually.
Either approach costs us greatly in maintenance.
Solution: Binary Serialization
Binary serialization eliminates the dependency on Presto Native Worker Protocol Code Generation entirely. Instead of relying on the generated
from_json()/to_json(), a field calledcustomSerializedValueroutes the deserialization to our custom end point whose workflow is shown belox:ConnectorCodecserializes its handle to bytes into acustomSerializedValuefield in the JSON envelope sent to workers.from_json()entry point is modified to check forcustomSerializedValue(partially done at feat(connector): Add support for custom connector-provided serialization codecs prestodb/presto#26257, extended to all types at feat(native): Add customSerializedValue routing for binary connector deserialization prestodb/presto#27652). When present, from_json() routes the decoded bytes to the connector'sConnectorProtocol::deserialize()instead of parsing JSON fields through the generated code. This decouples deserialization from the protocol code generation — the plugin inherits ConnectorProtocol and implements its own deserialize() for each handle type.Now, adding a field to
ClpColumnHandle.javais simply:ClpColumnHandleCodec) to serialize/deserialize the new fieldClpConnectorProtocol::deserialize()to read the new fieldChecklist
breaking change.
Validation performed
SELECT CLP_GET_JSON_STRING() from clp.default.default limit 100customSerializedValue, and minio as the storage.use-connector-provided-serialization-codecs=true:Summary by CodeRabbit
New Features
Tests