perf(overflow): Speed up overflow checks for small integers #303

Merged
zeroshade merged 2 commits into apache:main from cbandy:fast-overflow
Mar 10, 2025

Conversation

@cbandy
Contributor

@cbandy cbandy commented Mar 6, 2025

Rationale for this change

We can check the bounds of the arguments to avoid an int64 division. The Mul and Mul64 functions perform better on the systems I tested.

This is an older machine:

goos: linux
goarch: amd64
pkg: github.com/apache/arrow-go/v18/internal/utils
cpu: Intel(R) Core(TM) i5-7Y57 CPU @ 1.20GHz
                                  │ bench_amd64_old.txt │         bench_amd64_new.txt         │
                                  │       sec/op        │    sec/op     vs base               │
Add/int/8192_+_8192-4                      1.990n ± 14%   2.001n ± 18%        ~ (p=1.000 n=6)
Add/int/MaxInt16_+_1-4                     2.248n ±  1%   2.271n ±  4%   +1.07% (p=0.035 n=6)
Add/int/MaxInt16_+_5-4                     2.215n ±  2%   2.270n ±  3%   +2.48% (p=0.013 n=6)
Add/int/MaxInt16_+_MaxInt16-4              2.293n ±  4%   2.331n ±  9%        ~ (p=0.195 n=6)
Add/int/MaxInt32_+_1-4                     2.339n ± 16%   2.345n ± 67%        ~ (p=0.937 n=6)
Add/int/MaxInt32_+_5-4                     2.363n ±  4%   2.330n ±  2%        ~ (p=0.331 n=6)
Add/int/MaxInt32_+_MaxInt32-4              2.402n ± 23%   2.343n ±  3%        ~ (p=0.310 n=6)
Add/int/MaxInt_+_MaxInt-4                  2.334n ±  3%   2.364n ±  2%        ~ (p=0.180 n=6)
Mul/int/8192_×_8192-4                     14.335n ±  3%   3.578n ±  5%  -75.04% (p=0.002 n=6)
Mul/int/MaxInt16_×_1-4                    14.015n ± 90%   3.718n ± 20%  -73.47% (p=0.002 n=6)
Mul/int/MaxInt16_×_5-4                    14.480n ± 25%   3.647n ±  3%  -74.82% (p=0.002 n=6)
Mul/int/MaxInt16_×_MaxInt16-4             14.450n ± 30%   3.627n ± 35%  -74.90% (p=0.002 n=6)
Mul/int/MaxInt32_×_1-4                    14.725n ±  2%   3.597n ± 12%  -75.57% (p=0.002 n=6)
Mul/int/MaxInt32_×_5-4                    17.485n ± 31%   3.667n ±  5%  -79.03% (p=0.002 n=6)
Mul/int/MaxInt32_×_MaxInt32-4             14.385n ± 31%   5.704n ± 48%  -60.35% (p=0.002 n=6)
Mul/int/MaxInt_×_MaxInt-4                  14.34n ±  3%   15.68n ±  4%   +9.34% (p=0.002 n=6)
Mul64/int64/8192_×_8192-4                 14.495n ±  6%   3.730n ±  4%  -74.27% (p=0.002 n=6)
Mul64/int64/MaxInt16_×_1-4                14.300n ±  3%   3.612n ±  1%  -74.74% (p=0.002 n=6)
Mul64/int64/MaxInt16_×_5-4                14.020n ± 20%   3.657n ±  3%  -73.91% (p=0.002 n=6)
Mul64/int64/MaxInt16_×_MaxInt16-4         14.110n ±  2%   3.611n ± 12%  -74.41% (p=0.002 n=6)
Mul64/int64/MaxInt32_×_1-4                14.330n ±  2%   3.609n ±  4%  -74.82% (p=0.002 n=6)
Mul64/int64/MaxInt32_×_5-4                14.250n ± 14%   3.670n ±  2%  -74.25% (p=0.002 n=6)
Mul64/int64/MaxInt32_×_MaxInt32-4         14.070n ±  2%   3.571n ±  3%  -74.62% (p=0.002 n=6)
Mul64/int64/MaxInt_×_MaxInt-4              14.17n ±  2%   15.77n ±  3%  +11.25% (p=0.002 n=6)
geomean                                    7.807n         3.583n        -54.10%

The following is from inside a docker.io/i386/ubuntu container on that same machine, since I don't have any 32-bit hardware.

goos: linux
goarch: 386
pkg: github.com/apache/arrow-go/v18/internal/utils
cpu: Intel(R) Core(TM) i5-7Y57 CPU @ 1.20GHz
                                                        │ bench_386_old.txt │          bench_386_new.txt          │
                                                        │      sec/op       │    sec/op     vs base               │
Add/int/8192_+_8192-4                                          3.113n ± 22%   2.848n ± 17%        ~ (p=0.180 n=6)
Add/int/MaxInt16_+_1-4                                         3.344n ±  4%   3.294n ± 18%        ~ (p=0.818 n=6)
Add/int/MaxInt16_+_5-4                                         3.420n ±  7%   3.411n ±  5%        ~ (p=0.937 n=6)
Add/int/MaxInt16_+_MaxInt16-4                                  3.360n ±  3%   3.402n ± 30%        ~ (p=0.240 n=6)
Add/int/MaxInt_+_1-4                                           3.329n ±  2%   3.433n ±  3%   +3.09% (p=0.039 n=6)
Add/int/MaxInt_+_5-4                                           3.452n ±  5%   3.436n ±  2%        ~ (p=0.937 n=6)
Add/int/MaxInt_+_MaxInt-4                                      3.353n ±  6%   3.454n ±  3%        ~ (p=0.132 n=6)
Add/int/MaxInt_+_MaxInt#01-4                                   3.346n ± 13%   3.488n ±  3%        ~ (p=0.132 n=6)
Mul/int/8192_×_8192-4                                         11.250n ±  2%   6.238n ±  6%  -44.55% (p=0.002 n=6)
Mul/int/MaxInt16_×_1-4                                        11.565n ±  4%   6.261n ±  1%  -45.86% (p=0.002 n=6)
Mul/int/MaxInt16_×_5-4                                        11.300n ±  3%   6.505n ±  3%  -42.44% (p=0.002 n=6)
Mul/int/MaxInt16_×_MaxInt16-4                                 11.370n ±  4%   6.394n ± 40%  -43.76% (p=0.002 n=6)
Mul/int/MaxInt_×_1-4                                          11.245n ±  3%   5.798n ±  2%  -48.44% (p=0.002 n=6)
Mul/int/MaxInt_×_5-4                                          11.210n ± 18%   9.941n ±  2%  -11.32% (p=0.002 n=6)
Mul/int/MaxInt_×_MaxInt-4                                     10.975n ±  4%   9.895n ±  2%   -9.84% (p=0.002 n=6)
Mul/int/MaxInt_×_MaxInt#01-4                                   11.40n ± 18%   10.17n ±  2%  -10.83% (p=0.002 n=6)
Mul64/int64/8192_×_8192-4                                      21.74n ± 70%   23.75n ±  2%        ~ (p=0.065 n=6)
Mul64/int64/MaxInt16_×_1-4                                     22.37n ± 28%   23.87n ± 12%        ~ (p=0.065 n=6)
Mul64/int64/MaxInt16_×_5-4                                     22.09n ±  2%   23.96n ± 23%   +8.44% (p=0.002 n=6)
Mul64/int64/MaxInt16_×_MaxInt16-4                              21.06n ±  3%   24.27n ±  3%  +15.24% (p=0.002 n=6)
Mul64/int64/MaxInt_×_1-4                                       21.42n ±  8%   23.91n ± 64%  +11.62% (p=0.002 n=6)
Mul64/int64/MaxInt_×_5-4                                       29.87n ±  4%   24.16n ± 22%  -19.09% (p=0.009 n=6)
Mul64/int64/MaxInt_×_MaxInt-4                                  28.10n ±  2%   24.40n ±  2%  -13.17% (p=0.002 n=6)
Mul64/int64/9223372036854775807_×_9223372036854775807-4        23.22n ±  6%   31.90n ± 35%  +37.38% (p=0.002 n=6)
geomean                                                        9.609n         8.523n        -11.30%

What changes are included in this PR?

A new generic Add function in internal/utils returns the sum of any two integers and whether or not it overflowed.

New Mul and Mul64 functions in internal/utils return the product of two int or int64, respectively, and whether or not it overflowed. These functions perform better on the systems I tested.

Are these changes tested?

Yes.

Are there any user-facing changes?

No effect on the API. This drops one dependency.

@cbandy cbandy requested a review from zeroshade as a code owner March 6, 2025 01:29
@cbandy
Contributor Author

cbandy commented Mar 6, 2025

Sorry, I missed some hunks in my first commit. All fixed.

@cbandy
Contributor Author

cbandy commented Mar 7, 2025

It occurred to me today that this is better as two commits. The benchmarks are now committed, and one can try both implementations by switching between commits.

  1. The first commit adds the new functions, their tests, and benchmarks for the current implementation.
  2. The second commit uses the new functions throughout.

cbandy added 2 commits March 6, 2025 20:46
We can check the bounds of the arguments to avoid an int64 division.

This adds benchmarks for the existing implementation, new functions,
and tests, but does not change any behavior.

Signed-off-by: Chris Bandy <bandy.chris@gmail.com>
These have proved faster in the aggregate and allow us to drop one
dependency.

Signed-off-by: Chris Bandy <bandy.chris@gmail.com>
@zeroshade
Member

Thanks for this @cbandy! I'll take a closer look at it this afternoon!

Member

@zeroshade zeroshade left a comment

Looks good to me, @cbandy. I'm seeing similar performance benefits when I test it myself locally, though I am curious about the small number of cases where it is slower. I also want this in so I can start a release, and the few regressions don't seem significant enough to be an issue, so I'm going to merge this.

It would be awesome if you were able to figure out if those cases that were slower were just noise or indicative of something that might be an issue.

Thanks again for this!

@zeroshade zeroshade merged commit 2c14a75 into apache:main Mar 10, 2025
23 checks passed