feat: Upgrade clp submodule to y-scope/clp@67276c0; Add support for FormattedFloat and DictionaryFloat.#37
Conversation
WalkthroughUpdates CMake dependency resolution for clp and ystdlib (ystdlib switched to URL-based resolution and source target renamed). Changes timestamp conversion to round fractional seconds to microseconds before nanosecond conversion. Extends archive loader to support float-based timestamp node types. Adds tests and an expanded NDJSON fixture. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Q as Query Plan
participant L as ClpArchiveVectorLoader
participant CR as ColumnReader
participant U as ClpTimestampsUtils
participant V as Velox Vector
Q->>L: loadInternal(...)
alt Float-based timestamp node (Float / FormattedFloat / DictionaryFloat)
L->>CR: read next double value
CR-->>L: double timestampValue
L->>U: convertToVeloxTimestamp(timestampValue)
U-->>L: Timestamp(seconds, nsecs from rounded microseconds)
L->>V: append Timestamp
else Integer / DateString path
L->>CR: read & convert via integer/date path
L->>V: append Timestamp
end
L-->>Q: RowVector with timestamps
note over U: fractional seconds → rounded microseconds → converted to nanoseconds
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
CMake/resolve_dependency_modules/clp.cmake(1 hunks)CMake/resolve_dependency_modules/ystdlib.cmake(1 hunks)CMakeLists.txt(1 hunks)velox/connectors/clp/search_lib/ClpTimestampsUtils.h(1 hunks)velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp(3 hunks)velox/connectors/clp/tests/ClpConnectorTest.cpp(1 hunks)velox/connectors/clp/tests/examples/test_5.ndjson(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (1)
vector(76-78)
velox/connectors/clp/search_lib/ClpTimestampsUtils.h (1)
velox/external/date/date.h (1)
round(1277-1354)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (15)
CMake/resolve_dependency_modules/clp.cmake (1)
19-19: Upgrade target matches the submodule bump.Tag
6f46f236317fe6964719a5100ddb3318521803a8lines up with the submodule revision needed for FormattedFloat/DictionaryFloat support, so this FetchContent declaration stays consistent with the rest of the PR.CMake/resolve_dependency_modules/ystdlib.cmake (1)
16-37: Tarball-based ystdlib wiring looks solid.Switching to
velox_resolve_dependency_urlwith an explicit SHA keeps the dependency reproducible and aligns the naming withvelox_set_source(ystdlib). Nicely done.CMakeLists.txt (1)
547-549: ystdlib hook-up matches the new dependency module.Using
velox_set_source(ystdlib)andvelox_resolve_dependency(ystdlib)keeps the top-level wiring consistent with the updated FetchContent module, so the CLP connector will continue to pick up the bundled library.velox/connectors/clp/tests/examples/test_5.ndjson (1)
1-49: Great coverage from the new fixture.This NDJSON mix exercises the float/timestamp permutations we need for FormattedFloat and DictionaryFloat validation, so it should give the tests meaningful breadth.
velox/connectors/clp/tests/ClpConnectorTest.cpp (7)
476-523: LGTM! Well-structured test with clear documentation.The test correctly validates timestamp and float handling without pushdown, with appropriate filtering and ordering. The expected values align with microsecond precision documented in the PR objectives.
583-639: LGTM! Test correctly validates FormattedFloat type handling.The test appropriately filters for specific float values (0.0 and scientific notation) and validates the expected timestamps and float values.
641-696: LGTM! Pushdown variant correctly mirrors the no-pushdown test.The test appropriately validates that KQL pushdown produces the same results as the no-pushdown variant for FormattedFloat types.
698-731: LGTM! DictionaryFloat test validates narrow range filtering.The test correctly filters float values in a narrow range (1.999999 to 2.000001) to validate DictionaryFloat type handling.
733-766: LGTM! Pushdown variant correctly mirrors the no-pushdown test.The test appropriately validates that KQL pushdown produces the same results for DictionaryFloat types.
768-806: LGTM! Hybrid test validates mixed FormattedFloat and DictionaryFloat handling.The test correctly filters for a value (1.0) that exercises both float types and validates the expected results.
808-846: LGTM! Pushdown variant correctly mirrors the hybrid no-pushdown test.The test appropriately validates that KQL pushdown produces the same results when both FormattedFloat and DictionaryFloat types are present.
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (4)
72-73: LGTM! Correctly extends switch to handle new float types.The addition of
FormattedFloatandDictionaryFloatcases properly extends the supported timestamp node types, following the same pattern as the existingFloattype.
97-110: LGTM! Correctly implements timestamp population for new float types.The implementation correctly mirrors the existing
Floathandling pattern, using the appropriate column reader types and converting double values to Velox timestamps.
223-232: LGTM! Correctly extends dynamic dispatch for new float types.The dynamic_cast checks properly route FormattedFloat and DictionaryFloat column readers to their respective timestamp population methods, following the established pattern for other types.
256-271: Remove explicit instantiation suggestion for FormattedFloat and DictionaryFloat. These types are implicitly instantiated via the existing calls in the switch statement, and CI passes without linker errors.Likely an incorrect or invalid review comment.
gibber9809
left a comment
There was a problem hiding this comment.
Just a few comments, but both important.
| OVERRIDE_FIND_PACKAGE EXCLUDE_FROM_ALL SYSTEM) | ||
|
|
||
| FetchContent_Populate(ystdlib_cpp) | ||
| set(CMAKE_POSITION_INDEPENDENT_CODE ON) |
There was a problem hiding this comment.
Doesn't this turn PIC on for the whole project? Is there a way that we can just set PIC for ystdlib specifically, if that's needed?
There was a problem hiding this comment.
Yea, nice catch. Let me try if I can remove it or use some target-specific way to configure it.
| @@ -0,0 +1,48 @@ | |||
| { "timestamp": 1746003005.123456, "floatValue": 0.007 } | |||
| { "timestamp": 1.746003005123457e15, "floatValue": -0.007 } | |||
There was a problem hiding this comment.
I think it might be worthwhile to narrow down tests for timestamp floats to only formats that we'll support.
Generally when float literals are timestamps we've seen them written as <epoch seconds>.<sub-second-component>, where the sub-second component is up to 9 digits.
We probably won't support any ways of writing timestamps as floats that are super unexpected, such as writing them in scientific notation, because it is difficult to be sure that we're interpreting them correctly.
Because of that, I'd probably go through and remove test cases for cases we don't intend to support, which is:
- scientific notation floats; and in the initial implementation maybe
- negative floats
We probably will add support for parsing negative floats as timestamps in the new scheme, but I'm not sure if it will make the initial implementation or not.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
CMake/resolve_dependency_modules/clp.cmake(1 hunks)velox/connectors/clp/tests/ClpConnectorTest.cpp(1 hunks)velox/connectors/clp/tests/examples/test_5.ndjson(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (10)
velox/connectors/clp/tests/examples/test_5.ndjson (2)
1-6: Verify timestamp precision limitations are tested correctly.Lines 1-6 use floating-point timestamps with varying fractional precision. Line 1 includes 9 fractional digits (
1746003005.123456789), which appears to test nanosecond precision. However, the PR description states that "due to IEEE 754 rounding, floating-point timestamps are supported only up to microsecond precision."The test expectations in
ClpConnectorTest.cpp(lines 507-511) show timestamps with microsecond precision (e.g.,Timestamp(1746003005, 124000000)), suggesting rounding occurs. Ensure the test validates that nanosecond digits are correctly rounded to microseconds, not that they're preserved exactly.
7-37: Consider alignment with previous feedback on scientific notation.Based on past review comments, there was concern about supporting timestamps written as floats in scientific notation (e.g.,
1.746003005E9). While the current test data uses ISO-8601 strings for timestamps (lines 7-37) and reserves scientific notation for thefloatValuefield, this aligns well with the stated constraint.However, the extensive coverage of scientific notation edge cases (lines 7-37) for
floatValueincludes cases like:
- Multiple zero padding in exponents (
E+0009,E-0000)- Extremely large/small magnitudes
- Mixed case notation (
e9vsE9)Ensure these edge cases are intentionally supported by the FormattedFloat/DictionaryFloat implementation and not just test artifacts.
velox/connectors/clp/tests/ClpConnectorTest.cpp (8)
476-523: LGTM!The test correctly validates floating-point timestamp parsing and filtering behaviour without pushdown. The expected results properly reflect microsecond precision rounding for timestamps and include appropriate float value representations.
525-567: LGTM!The test appropriately validates pushdown behaviour with float timestamp comparisons in KQL. The comment explaining the difference in results due to KQL's timestamp interpretation is helpful for maintainability.
569-621: LGTM!The test correctly validates FormattedFloat parsing and filtering. The expected results appropriately match the various zero representations (including signed zeros and different exponent notations) from the test data.
623-674: LGTM!The test validates that FormattedFloat filtering produces consistent results with and without pushdown, demonstrating correct KQL query translation.
676-709: LGTM!The test correctly validates DictionaryFloat parsing and range filtering. The narrow filter range effectively tests boundary conditions for dictionary-encoded float values.
711-744: LGTM!The test validates consistent DictionaryFloat filtering behaviour with pushdown enabled.
746-784: LGTM!The test effectively validates that FormattedFloat and DictionaryFloat types can coexist and be filtered correctly within the same dataset. This is crucial for verifying the hybrid encoding scenario.
786-824: LGTM!The test confirms that hybrid FormattedFloat/DictionaryFloat filtering produces consistent results with pushdown enabled, completing comprehensive coverage of the new float type support.
| clp | ||
| GIT_REPOSITORY https://github.com/y-scope/clp.git | ||
| GIT_TAG bfd4f60ffe9c5d69618cc8416ec6729c76ee9862) | ||
| GIT_TAG 67276c09acbd48dd502454288f40072c44628726) |
There was a problem hiding this comment.
🧩 Analysis chain
Verify the correct clp submodule commit hash.
The GIT_TAG 67276c09acbd48dd502454288f40072c44628726 doesn't match the commit hashes mentioned in the PR objectives (6f46f23 in the title and 62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9 in the description). This inconsistency could indicate that the PR description is outdated or the wrong commit was used.
Please run the following script to verify the commit history and confirm the correct target commit:
🏁 Script executed:
#!/bin/bash
# Verify the clp submodule commit hash and its relationship to the commits mentioned in the PR
# Check if the commits mentioned in the PR exist and are related
cd "$(git rev-parse --show-toplevel)" || exit 1
# Try to find the clp submodule
if [ -d "clp" ]; then
cd clp || exit 1
echo "=== Checking commit 67276c09acbd48dd502454288f40072c44628726 ==="
git log --oneline -1 67276c09acbd48dd502454288f40072c44628726 2>/dev/null || echo "Commit not found"
echo -e "\n=== Checking commit 6f46f23 ==="
git log --oneline -1 6f46f23 2>/dev/null || echo "Commit not found"
echo -e "\n=== Checking commit 62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9 ==="
git log --oneline -1 62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9 2>/dev/null || echo "Commit not found"
echo -e "\n=== Recent commits on main branch ==="
git log --oneline -10
fiLength of output: 0
🏁 Script executed:
#!/usr/bin/env bash
file="CMake/resolve_dependency_modules/clp.cmake"
echo "=== Extracting GIT_REPOSITORY ==="
grep -n "GIT_REPOSITORY" "$file" | sed -E 's/.*GIT_REPOSITORY[[:space:]]+"([^"]+)".*/\1/'
echo -e "\n=== Extracting GIT_TAG ==="
grep -n "GIT_TAG" "$file" | sed -E 's/.*GIT_TAG[[:space:]]+([^)\r\n]+).*/\1/'
repo_url=$(grep -m1 "GIT_REPOSITORY" "$file" | sed -E 's/.*"([^"]+)".*/\1/')
for commit in \
67276c09acbd48dd502454288f40072c44628726 \
6f46f23 \
62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9; do
echo -e "\n=== Checking commit $commit on $repo_url ==="
git ls-remote "$repo_url" "$commit" || echo "Commit not found"
doneLength of output: 712
🏁 Script executed:
#!/usr/bin/env bash
file="CMake/resolve_dependency_modules/clp.cmake"
# Extract repository URL (handles unquoted values)
repo_url=$(grep -E 'GIT_REPOSITORY' "$file" | head -1 | sed -E 's/.*GIT_REPOSITORY[[:space:]]+([^[:space:]]+).*/\1/')
echo "Repository URL: $repo_url"
# Extract configured GIT_TAG
configured_tag=$(grep -E 'GIT_TAG' "$file" | sed -E 's/.*GIT_TAG[[:space:]]+([^)[[:space:]]+).*/\1/')
echo "Configured GIT_TAG: $configured_tag"
# Commits to verify
commits=(
"67276c09acbd48dd502454288f40072c44628726"
"6f46f23"
"62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9"
)
for commit in "${commits[@]}"; do
echo -e "\n=== Checking commit $commit ==="
git ls-remote "$repo_url" "$commit" || echo "Not found"
doneLength of output: 276
🏁 Script executed:
#!/usr/bin/env bash
file="CMake/resolve_dependency_modules/clp.cmake"
# Extract repository URL
repo_url=$(grep -E 'GIT_REPOSITORY' "$file" | head -1 | sed -E 's/.*GIT_REPOSITORY[[:space:]]+([^[:space:]]+).*/\1/')
echo "Repository URL: $repo_url"
# Extract configured GIT_TAG
configured_tag=$(grep -E 'GIT_TAG' "$file" | sed -E 's/.*GIT_TAG[[:space:]]+([^)[[:space:]]+).*/\1/')
echo "Configured GIT_TAG: $configured_tag"
# Commits to verify
commits=(
"67276c09acbd48dd502454288f40072c44628726"
"6f46f23"
"62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9"
)
for commit in "${commits[@]}"; do
echo -e "\n=== Checking commit $commit ==="
out=$(git ls-remote "$repo_url" "$commit")
if [ -z "$out" ]; then
echo "Not found"
else
echo "$out"
fi
doneLength of output: 306
Fix incorrect GIT_TAG in CMake/resolve_dependency_modules/clp.cmake
The configured tag 67276c09acbd48dd502454288f40072c44628726 (and the hashes 6f46f23, 62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9) aren’t found in https://github.com/y-scope/clp.git. Update GIT_TAG to a SHA confirmed by git ls-remote on that repo.
🤖 Prompt for AI Agents
In CMake/resolve_dependency_modules/clp.cmake around line 19, the GIT_TAG value
is set to a commit SHA that does not exist in
https://github.com/y-scope/clp.git; run git ls-remote
https://github.com/y-scope/clp.git to find a valid commit SHA (or use an
existing tag/branch), then replace the incorrect GIT_TAG value with that
confirmed SHA (ensure you copy the full 40-char SHA) so the fetch works
reliably.
|
Opened an issue to track a known problem which will be addressed very soon. Then we should make corresponding changes in Velox repo: #39 |
gibber9809
left a comment
There was a problem hiding this comment.
LGTM. For the PR title how about:
feat: Upgrade clp submodule to y-scope/clp@67276c0; Add support for `FormattedFloat` and `DictionaryFloat`.
FormattedFloat and DictionaryFloat.FormattedFloat and DictionaryFloat.
Description
This PR upgrades the clp submodule to y-scope/clp@67276c0.
This PR also makes some necessary changes to support the newly added types
FormattedFloatandDictionaryFloat. The timestamp columns can also be float but there could be some error induced by IEEE 754 rounding, so if the timestamp is of float numbers, then we only support up to microseconds precision. In the future, the timestamp will be stored as integer with granularity of nanoseconds, so that we won't have such issues.This PR also adds unit tests to test the new types.
After this PR gets merged, we should merge the related Presto PR: y-scope/presto#73
Checklist
breaking change.
Validation performed
Passed the CI.
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests