perf: Optimize basic numeric upcast#15458
perf: Optimize basic numeric upcast#15458rui-mo wants to merge 1 commit intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
|
cc: @zhouyuan |
bc58ca2 to
bd6771f
Compare
| @@ -3992,5 +3992,43 @@ TEST_F(CastExprTest, timeToTimestampCast) { | |||
| assertEqualVectors(expected, result); | |||
| } | |||
| } | |||
|
|
|||
| TEST_F(CastExprTest, integeralUpcast) { | |||
There was a problem hiding this comment.
Same about tests, I think it's important to also check
- tinyint => integer
- tinyint => bigint
- smallint => bigint
There was a problem hiding this comment.
Added all relevant cases in the test, thanks.
| @@ -47,6 +47,50 @@ const tz::TimeZone* getTimeZoneFromConfig(const core::QueryConfig& config) { | |||
| return nullptr; | |||
| } | |||
|
|
|||
| bool isIntegralType(const TypePtr& type) { | |||
There was a problem hiding this comment.
I think this optimization also should work for hugeint
There was a problem hiding this comment.
Hugeint is used to represent the decimal type. Casting an integer type to a decimal type requires rescaling to match the target scale, which is a different operation and needs special handling. Therefore, I excluded decimal from this optimization.
There was a problem hiding this comment.
Hugeint isn't only decimal. I think hugeint also just int128_t that can be.
See example int128 https://clickhouse.com/docs/sql-reference/data-types/int-uint
|
@rui-mo Can you explain it to me? You mentioned in PR description that hotspot is But at the same time you wrote that this PR optimizes only when rows are all selected.
But in such case shouldn't be Because Velox have such code template <typename Callable>
inline void SelectivityVector::applyToSelected(Callable func) const {
if (isAllSelected()) {
const auto end = end_;
for (vector_size_t row = begin_; row < end; ++row) {
func(row);
}
} else {
bits::forEachSetBit(bits_.data(), begin_, end_, func);
}
} |
bd6771f to
0eca183
Compare
0eca183 to
4133b2c
Compare
|
This optimization maybe similar to following velox/velox/connectors/hive/HivePartitionFunction.cpp Lines 119 to 123 in 4056e41 |
|
We may need to optimize applyToSelected for all the functions |
| return false; | ||
| } | ||
|
|
||
| #define VELOX_DYNAMIC_BASIC_NUMERIC_TEMPLATE_TYPE_DISPATCH( \ |
There was a problem hiding this comment.
Could we use VELOX_DYNAMIC_SCALAR_TYPE_DISPATCH and check std::is_arithmetic_v
There was a problem hiding this comment.
I changed to VELOX_DYNAMIC_SCALAR_TEMPLATE_TYPE_DISPATCH, thanks.
4133b2c to
a1b248a
Compare
There was a problem hiding this comment.
Can you please explain why do we see speedup?
Is it because there's no overflow checks? Or because something else?
and drops the try-catch .
Is it because of this? Maybe we can make this optimization more generic?
I think we can create issue and mention in this issue that this code can be dropped when more generic approach will be implemented
@MBkkt Sorry for the confusion. This PR improves performance in both scenarios. When not all rows are selected (for example, when null values are present and are deselected before casting), the performance hotspot is as follows.
And when all rows are selected, the hotspot is as follows.
Although Velox already applies the for-loop optimization you mentioned above when all rows are selected, I will include both nullable and non-nullable benchmark results in the PR description to verify that both scenarios see performance improvements. |
|
@rui-mo Thanks, I understand now. I think it will be useful to move this optimization to lambda that supplied by CastExpr to |
|
I think The only thing need is to avoid try catch when it is not necessary. |
|
@MBkkt @Yuhta Thanks for sharing your insights. I ran a few experiments to investigate where the performance improvement comes from. For CAST(integer AS bigint), the original performance is as follows: After removing the try-catch from I further removed the try-catch block from I then used perf record to identify the performance hotspot and found the results below, which show that the
In summary, the performance gains are likely attributed to:
|
a1b248a to
6ad6ade
Compare
600fb40 to
04e7cbb
Compare
MBkkt
left a comment
There was a problem hiding this comment.
Overall looks good to me.
I think more general approach to how avoid "try catch"/etc overhead can be developed separately
| isIntegralType(fromType) && isBasicNumericType(toType) && | ||
| ((fromType->cppSizeInBytes() < toType->cppSizeInBytes()) || | ||
| (fromType == INTEGER() && toType == REAL()) || | ||
| (fromType == BIGINT() && toType == REAL()) || | ||
| (fromType == BIGINT() && toType == DOUBLE()) || | ||
| (fromType == HUGEINT() && toType == REAL()) || | ||
| (fromType == HUGEINT() && toType == DOUBLE()))) { |
There was a problem hiding this comment.
real to double also should be ok?
There was a problem hiding this comment.
Nice catch, thanks. I added its support as well as tests and benchmarks.
I opened #15506 for the discussion of avoiding "try catch" in cast.
04e7cbb to
4068c77
Compare
4068c77 to
3b02501
Compare
3b02501 to
8f7b574
Compare
bb061ed to
0c1a034
Compare
0c1a034 to
d5754a2
Compare
Summary: When the row size is large (e.g., around 300,000,000), casting from a narrower integer type to a wider one—such as cast(integer as bigint)—can become time- consuming. This PR optimizes the numeric upcast by performing the cast directly on the raw values within loops, and drops the try-catch used for potential error handing. Since upcasts guarantee that the source value fits within the target type, overflow handling is unnecessary in this case. The performance gains are likely attributed to: 1) Eliminating try-catch blocks when error handling is unnecessary. 2) Improved auto-vectorization and lower function call overhead after replacing `valueAt` and `set` with direct access. 3) Avoiding overflow checks. Optimized conversions include: ``` CAST(tinyint AS smallint) CAST(tinyint AS integer) CAST(tinyint AS bigint) CAST(tinyint AS real) CAST(tinyint AS double) CAST(smallint AS integer) CAST(smallint AS bigint) CAST(smallint AS real) CAST(smallint AS double) CAST(integer AS bigint) CAST(integer AS real) CAST(integer AS double) CAST(bigint AS real) CAST(bigint AS double) CAST(real AS double) ``` Before: ``` ============================================================================ [...]hmarks/ExpressionBenchmarkBuilder.cpp relative time/iter iters/s ============================================================================ numeric_upcast##cast_tinyint_nullable_as_smalli 57.89s 17.27m numeric_upcast##cast_tinyint_as_smallint 1.09min 15.32m numeric_upcast##cast_tinyint_nullable_as_intege 59.07s 16.93m numeric_upcast##cast_tinyint_as_integer 1.09min 15.25m numeric_upcast##cast_tinyint_nullable_as_bigint 1.03min 16.12m numeric_upcast##cast_tinyint_as_bigint 1.13min 14.71m numeric_upcast##cast_tinyint_nullable_as_real 1.87min 8.90m numeric_upcast##cast_tinyint_as_real 2.16min 7.71m numeric_upcast##cast_tinyint_nullable_as_double 1.79min 9.29m numeric_upcast##cast_tinyint_as_double 2.06min 8.10m numeric_upcast##cast_smallint_nullable_as_integ 59.30s 16.86m numeric_upcast##cast_smallint_as_integer 1.11min 15.01m numeric_upcast##cast_smallint_nullable_as_bigin 1.02min 16.29m numeric_upcast##cast_smallint_as_bigint 1.14min 14.59m numeric_upcast##cast_smallint_nullable_as_real 1.99min 8.37m numeric_upcast##cast_smallint_as_real 2.29min 7.26m numeric_upcast##cast_smallint_nullable_as_doubl 1.80min 9.28m numeric_upcast##cast_smallint_as_double 2.03min 8.23m numeric_upcast##cast_integer_nullable_as_bigint 1.03min 16.24m numeric_upcast##cast_integer_as_bigint 1.12min 14.89m numeric_upcast##cast_integer_nullable_as_real 1.40min 11.88m numeric_upcast##cast_integer_as_real 1.64min 10.15m numeric_upcast##cast_integer_nullable_as_double 1.44min 11.56m numeric_upcast##cast_integer_as_double 1.65min 10.09m numeric_upcast##cast_bigint_nullable_as_real 1.41min 11.78m numeric_upcast##cast_bigint_as_real 1.65min 10.12m numeric_upcast##cast_bigint_nullable_as_double 1.46min 11.44m numeric_upcast##cast_bigint_as_double 1.65min 10.09m numeric_upcast##cast_real_nullable_as_double 1.43min 11.64m numeric_upcast##cast_real_as_double 1.69min 9.85m ---------------------------------------------------------------------------- ``` After: ``` ============================================================================ [...]hmarks/ExpressionBenchmarkBuilder.cpp relative time/iter iters/s ============================================================================ numeric_upcast##cast_tinyint_nullable_as_smalli 15.12s 66.12m numeric_upcast##cast_tinyint_as_smallint 931.11ms 1.07 numeric_upcast##cast_tinyint_nullable_as_intege 16.61s 60.22m numeric_upcast##cast_tinyint_as_integer 2.21s 451.83m numeric_upcast##cast_tinyint_nullable_as_bigint 19.33s 51.73m numeric_upcast##cast_tinyint_as_bigint 4.32s 231.37m numeric_upcast##cast_tinyint_nullable_as_real 16.50s 60.62m numeric_upcast##cast_tinyint_as_real 2.83s 353.33m numeric_upcast##cast_tinyint_nullable_as_double 19.13s 52.26m numeric_upcast##cast_tinyint_as_double 8.07s 123.97m numeric_upcast##cast_smallint_nullable_as_integ 18.96s 52.76m numeric_upcast##cast_smallint_as_integer 1.78s 561.26m numeric_upcast##cast_smallint_nullable_as_bigin 18.72s 53.41m numeric_upcast##cast_smallint_as_bigint 4.38s 228.11m numeric_upcast##cast_smallint_nullable_as_real 16.34s 61.19m numeric_upcast##cast_smallint_as_real 1.71s 583.54m numeric_upcast##cast_smallint_nullable_as_doubl 18.65s 53.62m numeric_upcast##cast_smallint_as_double 2.85s 351.36m numeric_upcast##cast_integer_nullable_as_bigint 18.43s 54.26m numeric_upcast##cast_integer_as_bigint 3.47s 288.11m numeric_upcast##cast_integer_nullable_as_real 16.58s 60.32m numeric_upcast##cast_integer_as_real 2.41s 414.34m numeric_upcast##cast_integer_nullable_as_double 18.53s 53.97m numeric_upcast##cast_integer_as_double 3.34s 299.25m numeric_upcast##cast_bigint_nullable_as_real 16.82s 59.45m numeric_upcast##cast_bigint_as_real 4.23s 236.48m numeric_upcast##cast_bigint_nullable_as_double 19.15s 52.22m numeric_upcast##cast_bigint_as_double 4.56s 219.08m numeric_upcast##cast_real_nullable_as_double 18.35s 54.50m numeric_upcast##cast_real_as_double 3.43s 291.53m ---------------------------------------------------------------------------- ``` Replace: #15458 Pull Request resolved: #16967 Reviewed By: peterenescu Differential Revision: D99139839 Pulled By: bikramSingh91 fbshipit-source-id: ffb57fcd6bcab15e32b72deba2f53c2aea8ba102
Summary: When the row size is large (e.g., around 300,000,000), casting from a narrower integer type to a wider one—such as cast(integer as bigint)—can become time- consuming. This PR optimizes the numeric upcast by performing the cast directly on the raw values within loops, and drops the try-catch used for potential error handing. Since upcasts guarantee that the source value fits within the target type, overflow handling is unnecessary in this case. The performance gains are likely attributed to: 1) Eliminating try-catch blocks when error handling is unnecessary. 2) Improved auto-vectorization and lower function call overhead after replacing `valueAt` and `set` with direct access. 3) Avoiding overflow checks. Optimized conversions include: ``` CAST(tinyint AS smallint) CAST(tinyint AS integer) CAST(tinyint AS bigint) CAST(tinyint AS real) CAST(tinyint AS double) CAST(smallint AS integer) CAST(smallint AS bigint) CAST(smallint AS real) CAST(smallint AS double) CAST(integer AS bigint) CAST(integer AS real) CAST(integer AS double) CAST(bigint AS real) CAST(bigint AS double) CAST(real AS double) ``` Before: ``` ============================================================================ [...]hmarks/ExpressionBenchmarkBuilder.cpp relative time/iter iters/s ============================================================================ numeric_upcast##cast_tinyint_nullable_as_smalli 57.89s 17.27m numeric_upcast##cast_tinyint_as_smallint 1.09min 15.32m numeric_upcast##cast_tinyint_nullable_as_intege 59.07s 16.93m numeric_upcast##cast_tinyint_as_integer 1.09min 15.25m numeric_upcast##cast_tinyint_nullable_as_bigint 1.03min 16.12m numeric_upcast##cast_tinyint_as_bigint 1.13min 14.71m numeric_upcast##cast_tinyint_nullable_as_real 1.87min 8.90m numeric_upcast##cast_tinyint_as_real 2.16min 7.71m numeric_upcast##cast_tinyint_nullable_as_double 1.79min 9.29m numeric_upcast##cast_tinyint_as_double 2.06min 8.10m numeric_upcast##cast_smallint_nullable_as_integ 59.30s 16.86m numeric_upcast##cast_smallint_as_integer 1.11min 15.01m numeric_upcast##cast_smallint_nullable_as_bigin 1.02min 16.29m numeric_upcast##cast_smallint_as_bigint 1.14min 14.59m numeric_upcast##cast_smallint_nullable_as_real 1.99min 8.37m numeric_upcast##cast_smallint_as_real 2.29min 7.26m numeric_upcast##cast_smallint_nullable_as_doubl 1.80min 9.28m numeric_upcast##cast_smallint_as_double 2.03min 8.23m numeric_upcast##cast_integer_nullable_as_bigint 1.03min 16.24m numeric_upcast##cast_integer_as_bigint 1.12min 14.89m numeric_upcast##cast_integer_nullable_as_real 1.40min 11.88m numeric_upcast##cast_integer_as_real 1.64min 10.15m numeric_upcast##cast_integer_nullable_as_double 1.44min 11.56m numeric_upcast##cast_integer_as_double 1.65min 10.09m numeric_upcast##cast_bigint_nullable_as_real 1.41min 11.78m numeric_upcast##cast_bigint_as_real 1.65min 10.12m numeric_upcast##cast_bigint_nullable_as_double 1.46min 11.44m numeric_upcast##cast_bigint_as_double 1.65min 10.09m numeric_upcast##cast_real_nullable_as_double 1.43min 11.64m numeric_upcast##cast_real_as_double 1.69min 9.85m ---------------------------------------------------------------------------- ``` After: ``` ============================================================================ [...]hmarks/ExpressionBenchmarkBuilder.cpp relative time/iter iters/s ============================================================================ numeric_upcast##cast_tinyint_nullable_as_smalli 15.12s 66.12m numeric_upcast##cast_tinyint_as_smallint 931.11ms 1.07 numeric_upcast##cast_tinyint_nullable_as_intege 16.61s 60.22m numeric_upcast##cast_tinyint_as_integer 2.21s 451.83m numeric_upcast##cast_tinyint_nullable_as_bigint 19.33s 51.73m numeric_upcast##cast_tinyint_as_bigint 4.32s 231.37m numeric_upcast##cast_tinyint_nullable_as_real 16.50s 60.62m numeric_upcast##cast_tinyint_as_real 2.83s 353.33m numeric_upcast##cast_tinyint_nullable_as_double 19.13s 52.26m numeric_upcast##cast_tinyint_as_double 8.07s 123.97m numeric_upcast##cast_smallint_nullable_as_integ 18.96s 52.76m numeric_upcast##cast_smallint_as_integer 1.78s 561.26m numeric_upcast##cast_smallint_nullable_as_bigin 18.72s 53.41m numeric_upcast##cast_smallint_as_bigint 4.38s 228.11m numeric_upcast##cast_smallint_nullable_as_real 16.34s 61.19m numeric_upcast##cast_smallint_as_real 1.71s 583.54m numeric_upcast##cast_smallint_nullable_as_doubl 18.65s 53.62m numeric_upcast##cast_smallint_as_double 2.85s 351.36m numeric_upcast##cast_integer_nullable_as_bigint 18.43s 54.26m numeric_upcast##cast_integer_as_bigint 3.47s 288.11m numeric_upcast##cast_integer_nullable_as_real 16.58s 60.32m numeric_upcast##cast_integer_as_real 2.41s 414.34m numeric_upcast##cast_integer_nullable_as_double 18.53s 53.97m numeric_upcast##cast_integer_as_double 3.34s 299.25m numeric_upcast##cast_bigint_nullable_as_real 16.82s 59.45m numeric_upcast##cast_bigint_as_real 4.23s 236.48m numeric_upcast##cast_bigint_nullable_as_double 19.15s 52.22m numeric_upcast##cast_bigint_as_double 4.56s 219.08m numeric_upcast##cast_real_nullable_as_double 18.35s 54.50m numeric_upcast##cast_real_as_double 3.43s 291.53m ---------------------------------------------------------------------------- ``` Replace: facebookincubator#15458 Pull Request resolved: facebookincubator#16967 Reviewed By: peterenescu Differential Revision: D99139839 Pulled By: bikramSingh91 fbshipit-source-id: ffb57fcd6bcab15e32b72deba2f53c2aea8ba102



When the row size is large (e.g., around 300,000,000), casting from a narrower
integer type to a wider one—such as cast(integer as bigint)—can become time-
consuming.
This PR optimizes the numeric upcast by performing the cast directly on the
raw values within loops, and drops the try-catch used for potential error handing.
Since upcasts guarantee that the source value fits within the target type, overflow
handling is unnecessary in this case.
The performance gains are likely attributed to:
valueAtandsetwith direct access.Optimized conversions include:
Before:
After: