Skip to content

feat: Upgrade clp submodule to y-scope/clp@67276c0; Add support for FormattedFloat and DictionaryFloat.#37

Merged
anlowee merged 4 commits into
y-scope:presto-0.293-clp-connectorfrom
anlowee:xwei/add-formatted-float
Oct 3, 2025
Merged

feat: Upgrade clp submodule to y-scope/clp@67276c0; Add support for FormattedFloat and DictionaryFloat.#37
anlowee merged 4 commits into
y-scope:presto-0.293-clp-connectorfrom
anlowee:xwei/add-formatted-float

Conversation

@anlowee

@anlowee anlowee commented Sep 30, 2025

Copy link
Copy Markdown

Description

This PR upgrades the clp submodule to y-scope/clp@67276c0.

This PR also makes some necessary changes to support the newly added types FormattedFloat and DictionaryFloat. The timestamp columns can also be float but there could be some error induced by IEEE 754 rounding, so if the timestamp is of float numbers, then we only support up to microseconds precision. In the future, the timestamp will be stored as integer with granularity of nanoseconds, so that we won't have such issues.

This PR also adds unit tests to test the new types.

After this PR gets merged, we should merge the related Presto PR: y-scope/presto#73

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Passed the CI.

Summary by CodeRabbit

  • New Features

    • Expanded timestamp support to additional float-based formats, improving parsing coverage.
  • Bug Fixes

    • More accurate timestamp conversion using rounded microsecond precision, reducing fractional-second errors.
  • Chores

    • Updated dependency sources and resolution (including switching one dependency to URL-based fetching and adjusting source resolution) for improved build stability.
  • Tests

    • Added comprehensive tests and extended sample data covering diverse timestamp and float representations, with and without pushdown.

@coderabbitai

coderabbitai Bot commented Sep 30, 2025

Copy link
Copy Markdown

Walkthrough

Updates CMake dependency resolution for clp and ystdlib (ystdlib switched to URL-based resolution and source target renamed). Changes timestamp conversion to round fractional seconds to microseconds before nanosecond conversion. Extends archive loader to support float-based timestamp node types. Adds tests and an expanded NDJSON fixture.

Changes

Cohort / File(s) Summary of Changes
CMake: clp dependency pin
CMake/resolve_dependency_modules/clp.cmake
Updated FetchContent GIT_TAG for CLP to commit 67276c09acbd48dd502454288f40072c44628726.
CMake: ystdlib sourcing & top-level usage
CMake/resolve_dependency_modules/ystdlib.cmake, CMakeLists.txt
Replaced ystdlib Git-tag FetchContent with URL-based resolution (introduced VELOX_YSTDLIB_BUILD_VERSION, VELOX_YSTDLIB_BUILD_SHA256_CHECKSUM, VELOX_YSTDLIB_SOURCE_URL), switched to velox_resolve_dependency_url(YSTDLIB) and FetchContent_MakeAvailable(ystdlib); set ystdlib_BUILD_TESTING OFF; renamed source usage from ystdlib_cpp bundling to velox_set_source(ystdlib) and velox_resolve_dependency(ystdlib).
Timestamp conversion utility
velox/connectors/clp/search_lib/ClpTimestampsUtils.h
convertToVeloxTimestamp now derives microseconds from the fractional seconds with rounding (addresses IEEE‑754 cases) and converts those to nanoseconds for the returned Timestamp.
Archive loader: float timestamp support
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp
Added handling for FormattedFloat and DictionaryFloat node types in populateTimestampData and loadInternal, routing them through float-based readers and converting values to Velox timestamps akin to existing Float handling.
Tests: connector coverage
velox/connectors/clp/tests/ClpConnectorTest.cpp
Added unit tests covering TIMESTAMP with DOUBLE across FloatTimestamp, FormattedFloat, DictionaryFloat, and Hybrid types; includes pushdown and no-pushdown scenarios and formatting cases.
Test data fixture
velox/connectors/clp/tests/examples/test_5.ndjson
Extended NDJSON fixture with additional entries covering diverse numeric representations (tiny/large magnitudes, varied scientific notation, positive/negative zero variants) and ISO‑8601 timestamps.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Q as Query Plan
  participant L as ClpArchiveVectorLoader
  participant CR as ColumnReader
  participant U as ClpTimestampsUtils
  participant V as Velox Vector

  Q->>L: loadInternal(...)
  alt Float-based timestamp node (Float / FormattedFloat / DictionaryFloat)
    L->>CR: read next double value
    CR-->>L: double timestampValue
    L->>U: convertToVeloxTimestamp(timestampValue)
    U-->>L: Timestamp(seconds, nsecs from rounded microseconds)
    L->>V: append Timestamp
  else Integer / DateString path
    L->>CR: read & convert via integer/date path
    L->>V: append Timestamp
  end
  L-->>Q: RowVector with timestamps
  note over U: fractional seconds → rounded microseconds → converted to nanoseconds
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title accurately reflects the primary changes by noting the CLP submodule upgrade to a specific commit and the addition of support for the new FormattedFloat and DictionaryFloat types, and it uses concise, descriptive phrasing that aligns with the pull request objectives. It avoids vague terms and clearly signals the main user-facing enhancements in a single, readable sentence.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 12521a1 and 0dcd63a.

📒 Files selected for processing (7)
  • CMake/resolve_dependency_modules/clp.cmake (1 hunks)
  • CMake/resolve_dependency_modules/ystdlib.cmake (1 hunks)
  • CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/ClpTimestampsUtils.h (1 hunks)
  • velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (3 hunks)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (1 hunks)
  • velox/connectors/clp/tests/examples/test_5.ndjson (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (1)
velox/connectors/clp/search_lib/archive/ClpQueryRunner.h (1)
  • vector (76-78)
velox/connectors/clp/search_lib/ClpTimestampsUtils.h (1)
velox/external/date/date.h (1)
  • round (1277-1354)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (15)
CMake/resolve_dependency_modules/clp.cmake (1)

19-19: Upgrade target matches the submodule bump.

Tag 6f46f236317fe6964719a5100ddb3318521803a8 lines up with the submodule revision needed for FormattedFloat/DictionaryFloat support, so this FetchContent declaration stays consistent with the rest of the PR.

CMake/resolve_dependency_modules/ystdlib.cmake (1)

16-37: Tarball-based ystdlib wiring looks solid.

Switching to velox_resolve_dependency_url with an explicit SHA keeps the dependency reproducible and aligns the naming with velox_set_source(ystdlib). Nicely done.

CMakeLists.txt (1)

547-549: ystdlib hook-up matches the new dependency module.

Using velox_set_source(ystdlib) and velox_resolve_dependency(ystdlib) keeps the top-level wiring consistent with the updated FetchContent module, so the CLP connector will continue to pick up the bundled library.

velox/connectors/clp/tests/examples/test_5.ndjson (1)

1-49: Great coverage from the new fixture.

This NDJSON mix exercises the float/timestamp permutations we need for FormattedFloat and DictionaryFloat validation, so it should give the tests meaningful breadth.

velox/connectors/clp/tests/ClpConnectorTest.cpp (7)

476-523: LGTM! Well-structured test with clear documentation.

The test correctly validates timestamp and float handling without pushdown, with appropriate filtering and ordering. The expected values align with microsecond precision documented in the PR objectives.


583-639: LGTM! Test correctly validates FormattedFloat type handling.

The test appropriately filters for specific float values (0.0 and scientific notation) and validates the expected timestamps and float values.


641-696: LGTM! Pushdown variant correctly mirrors the no-pushdown test.

The test appropriately validates that KQL pushdown produces the same results as the no-pushdown variant for FormattedFloat types.


698-731: LGTM! DictionaryFloat test validates narrow range filtering.

The test correctly filters float values in a narrow range (1.999999 to 2.000001) to validate DictionaryFloat type handling.


733-766: LGTM! Pushdown variant correctly mirrors the no-pushdown test.

The test appropriately validates that KQL pushdown produces the same results for DictionaryFloat types.


768-806: LGTM! Hybrid test validates mixed FormattedFloat and DictionaryFloat handling.

The test correctly filters for a value (1.0) that exercises both float types and validates the expected results.


808-846: LGTM! Pushdown variant correctly mirrors the hybrid no-pushdown test.

The test appropriately validates that KQL pushdown produces the same results when both FormattedFloat and DictionaryFloat types are present.

velox/connectors/clp/search_lib/archive/ClpArchiveVectorLoader.cpp (4)

72-73: LGTM! Correctly extends switch to handle new float types.

The addition of FormattedFloat and DictionaryFloat cases properly extends the supported timestamp node types, following the same pattern as the existing Float type.


97-110: LGTM! Correctly implements timestamp population for new float types.

The implementation correctly mirrors the existing Float handling pattern, using the appropriate column reader types and converting double values to Velox timestamps.


223-232: LGTM! Correctly extends dynamic dispatch for new float types.

The dynamic_cast checks properly route FormattedFloat and DictionaryFloat column readers to their respective timestamp population methods, following the established pattern for other types.


256-271: Remove explicit instantiation suggestion for FormattedFloat and DictionaryFloat. These types are implicitly instantiated via the existing calls in the switch statement, and CI passes without linker errors.

Likely an incorrect or invalid review comment.

Comment thread velox/connectors/clp/search_lib/ClpTimestampsUtils.h
Comment thread velox/connectors/clp/tests/ClpConnectorTest.cpp

@gibber9809 gibber9809 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments, but both important.

OVERRIDE_FIND_PACKAGE EXCLUDE_FROM_ALL SYSTEM)

FetchContent_Populate(ystdlib_cpp)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this turn PIC on for the whole project? Is there a way that we can just set PIC for ystdlib specifically, if that's needed?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, nice catch. Let me try if I can remove it or use some target-specific way to configure it.

@@ -0,0 +1,48 @@
{ "timestamp": 1746003005.123456, "floatValue": 0.007 }
{ "timestamp": 1.746003005123457e15, "floatValue": -0.007 }

@gibber9809 gibber9809 Oct 1, 2025

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be worthwhile to narrow down tests for timestamp floats to only formats that we'll support.

Generally when float literals are timestamps we've seen them written as <epoch seconds>.<sub-second-component>, where the sub-second component is up to 9 digits.

We probably won't support any ways of writing timestamps as floats that are super unexpected, such as writing them in scientific notation, because it is difficult to be sure that we're interpreting them correctly.

Because of that, I'd probably go through and remove test cases for cases we don't intend to support, which is:

  • scientific notation floats; and in the initial implementation maybe
  • negative floats

We probably will add support for parsing negative floats as timestamps in the new scheme, but I'm not sure if it will make the initial implementation or not.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e912b7a and 146ff27.

📒 Files selected for processing (1)
  • CMake/resolve_dependency_modules/ystdlib.cmake (1 hunks)

Comment thread CMake/resolve_dependency_modules/ystdlib.cmake

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 146ff27 and 97907e9.

📒 Files selected for processing (3)
  • CMake/resolve_dependency_modules/clp.cmake (1 hunks)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (1 hunks)
  • velox/connectors/clp/tests/examples/test_5.ndjson (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build with GCC / Ubuntu release with resolve_dependency
🔇 Additional comments (10)
velox/connectors/clp/tests/examples/test_5.ndjson (2)

1-6: Verify timestamp precision limitations are tested correctly.

Lines 1-6 use floating-point timestamps with varying fractional precision. Line 1 includes 9 fractional digits (1746003005.123456789), which appears to test nanosecond precision. However, the PR description states that "due to IEEE 754 rounding, floating-point timestamps are supported only up to microsecond precision."

The test expectations in ClpConnectorTest.cpp (lines 507-511) show timestamps with microsecond precision (e.g., Timestamp(1746003005, 124000000)), suggesting rounding occurs. Ensure the test validates that nanosecond digits are correctly rounded to microseconds, not that they're preserved exactly.


7-37: Consider alignment with previous feedback on scientific notation.

Based on past review comments, there was concern about supporting timestamps written as floats in scientific notation (e.g., 1.746003005E9). While the current test data uses ISO-8601 strings for timestamps (lines 7-37) and reserves scientific notation for the floatValue field, this aligns well with the stated constraint.

However, the extensive coverage of scientific notation edge cases (lines 7-37) for floatValue includes cases like:

  • Multiple zero padding in exponents (E+0009, E-0000)
  • Extremely large/small magnitudes
  • Mixed case notation (e9 vs E9)

Ensure these edge cases are intentionally supported by the FormattedFloat/DictionaryFloat implementation and not just test artifacts.

velox/connectors/clp/tests/ClpConnectorTest.cpp (8)

476-523: LGTM!

The test correctly validates floating-point timestamp parsing and filtering behaviour without pushdown. The expected results properly reflect microsecond precision rounding for timestamps and include appropriate float value representations.


525-567: LGTM!

The test appropriately validates pushdown behaviour with float timestamp comparisons in KQL. The comment explaining the difference in results due to KQL's timestamp interpretation is helpful for maintainability.


569-621: LGTM!

The test correctly validates FormattedFloat parsing and filtering. The expected results appropriately match the various zero representations (including signed zeros and different exponent notations) from the test data.


623-674: LGTM!

The test validates that FormattedFloat filtering produces consistent results with and without pushdown, demonstrating correct KQL query translation.


676-709: LGTM!

The test correctly validates DictionaryFloat parsing and range filtering. The narrow filter range effectively tests boundary conditions for dictionary-encoded float values.


711-744: LGTM!

The test validates consistent DictionaryFloat filtering behaviour with pushdown enabled.


746-784: LGTM!

The test effectively validates that FormattedFloat and DictionaryFloat types can coexist and be filtered correctly within the same dataset. This is crucial for verifying the hybrid encoding scenario.


786-824: LGTM!

The test confirms that hybrid FormattedFloat/DictionaryFloat filtering produces consistent results with pushdown enabled, completing comprehensive coverage of the new float type support.

clp
GIT_REPOSITORY https://github.com/y-scope/clp.git
GIT_TAG bfd4f60ffe9c5d69618cc8416ec6729c76ee9862)
GIT_TAG 67276c09acbd48dd502454288f40072c44628726)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the correct clp submodule commit hash.

The GIT_TAG 67276c09acbd48dd502454288f40072c44628726 doesn't match the commit hashes mentioned in the PR objectives (6f46f23 in the title and 62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9 in the description). This inconsistency could indicate that the PR description is outdated or the wrong commit was used.

Please run the following script to verify the commit history and confirm the correct target commit:


🏁 Script executed:

#!/bin/bash
# Verify the clp submodule commit hash and its relationship to the commits mentioned in the PR

# Check if the commits mentioned in the PR exist and are related
cd "$(git rev-parse --show-toplevel)" || exit 1

# Try to find the clp submodule
if [ -d "clp" ]; then
    cd clp || exit 1
    echo "=== Checking commit 67276c09acbd48dd502454288f40072c44628726 ==="
    git log --oneline -1 67276c09acbd48dd502454288f40072c44628726 2>/dev/null || echo "Commit not found"
    
    echo -e "\n=== Checking commit 6f46f23 ==="
    git log --oneline -1 6f46f23 2>/dev/null || echo "Commit not found"
    
    echo -e "\n=== Checking commit 62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9 ==="
    git log --oneline -1 62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9 2>/dev/null || echo "Commit not found"
    
    echo -e "\n=== Recent commits on main branch ==="
    git log --oneline -10
fi

Length of output: 0


🏁 Script executed:

#!/usr/bin/env bash
file="CMake/resolve_dependency_modules/clp.cmake"

echo "=== Extracting GIT_REPOSITORY ==="
grep -n "GIT_REPOSITORY" "$file" | sed -E 's/.*GIT_REPOSITORY[[:space:]]+"([^"]+)".*/\1/'

echo -e "\n=== Extracting GIT_TAG ==="
grep -n "GIT_TAG" "$file" | sed -E 's/.*GIT_TAG[[:space:]]+([^)\r\n]+).*/\1/'

repo_url=$(grep -m1 "GIT_REPOSITORY" "$file" | sed -E 's/.*"([^"]+)".*/\1/')
for commit in \
  67276c09acbd48dd502454288f40072c44628726 \
  6f46f23 \
  62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9; do
  echo -e "\n=== Checking commit $commit on $repo_url ==="
  git ls-remote "$repo_url" "$commit" || echo "Commit not found"
done

Length of output: 712


🏁 Script executed:

#!/usr/bin/env bash
file="CMake/resolve_dependency_modules/clp.cmake"

# Extract repository URL (handles unquoted values)
repo_url=$(grep -E 'GIT_REPOSITORY' "$file" | head -1 | sed -E 's/.*GIT_REPOSITORY[[:space:]]+([^[:space:]]+).*/\1/')
echo "Repository URL: $repo_url"

# Extract configured GIT_TAG
configured_tag=$(grep -E 'GIT_TAG' "$file" | sed -E 's/.*GIT_TAG[[:space:]]+([^)[[:space:]]+).*/\1/')
echo "Configured GIT_TAG: $configured_tag"

# Commits to verify
commits=(
  "67276c09acbd48dd502454288f40072c44628726"
  "6f46f23"
  "62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9"
)

for commit in "${commits[@]}"; do
  echo -e "\n=== Checking commit $commit ==="
  git ls-remote "$repo_url" "$commit" || echo "Not found"
done

Length of output: 276


🏁 Script executed:

#!/usr/bin/env bash
file="CMake/resolve_dependency_modules/clp.cmake"

# Extract repository URL
repo_url=$(grep -E 'GIT_REPOSITORY' "$file" | head -1 | sed -E 's/.*GIT_REPOSITORY[[:space:]]+([^[:space:]]+).*/\1/')
echo "Repository URL: $repo_url"

# Extract configured GIT_TAG
configured_tag=$(grep -E 'GIT_TAG' "$file" | sed -E 's/.*GIT_TAG[[:space:]]+([^)[[:space:]]+).*/\1/')
echo "Configured GIT_TAG: $configured_tag"

# Commits to verify
commits=(
  "67276c09acbd48dd502454288f40072c44628726"
  "6f46f23"
  "62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9"
)

for commit in "${commits[@]}"; do
  echo -e "\n=== Checking commit $commit ==="
  out=$(git ls-remote "$repo_url" "$commit")
  if [ -z "$out" ]; then
    echo "Not found"
  else
    echo "$out"
  fi
done

Length of output: 306


Fix incorrect GIT_TAG in CMake/resolve_dependency_modules/clp.cmake
The configured tag 67276c09acbd48dd502454288f40072c44628726 (and the hashes 6f46f23, 62bea59c01516f7cb15a08a6ec6d2ad557d1d1a9) aren’t found in https://github.com/y-scope/clp.git. Update GIT_TAG to a SHA confirmed by git ls-remote on that repo.

🤖 Prompt for AI Agents
In CMake/resolve_dependency_modules/clp.cmake around line 19, the GIT_TAG value
is set to a commit SHA that does not exist in
https://github.com/y-scope/clp.git; run git ls-remote
https://github.com/y-scope/clp.git to find a valid commit SHA (or use an
existing tag/branch), then replace the incorrect GIT_TAG value with that
confirmed SHA (ensure you copy the full 40-char SHA) so the fetch works
reliably.

@anlowee

anlowee commented Oct 3, 2025

Copy link
Copy Markdown
Author

Opened an issue to track a known problem which will be addressed very soon. Then we should make corresponding changes in Velox repo: #39

@anlowee anlowee requested a review from gibber9809 October 3, 2025 18:46

@gibber9809 gibber9809 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. For the PR title how about:

feat: Upgrade clp submodule to y-scope/clp@67276c0; Add support for `FormattedFloat` and `DictionaryFloat`.

@anlowee anlowee changed the title feat: Upgrade clp submodule to y-scope/clp@6f46f23 and support FormattedFloat and DictionaryFloat. feat: Upgrade clp submodule to y-scope/clp@67276c0; Add support for FormattedFloat and DictionaryFloat. Oct 3, 2025
@anlowee anlowee merged commit 1604a4d into y-scope:presto-0.293-clp-connector Oct 3, 2025
10 checks passed
@anlowee anlowee deleted the xwei/add-formatted-float branch October 3, 2025 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants