-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-8969: [C++] Reduce binary size of kernels/scalar_compare.cc.o by reusing more kernels between types, operators #7461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@ursabot benchmark --benchmark_filter=Greater 18e559b |
|
|
@ursabot benchmark --benchmark-filter=Greater 18e559b |
|
AMD64 Ubuntu 18.04 C++ Benchmark (#112653) builder has been succeeded. Revision: 53671af32c338fcca1edc40732c4c5fd1ad7585e ==================================== ================== ================== ========
benchmark baseline contender change
==================================== ================== ================== ========
GreaterArrayArrayString/32768/2 53.657m items/sec 94.257m items/sec 75.665%
- GreaterArrayArrayInt64/32768/2 363.591m items/sec 337.022m items/sec -7.307%
GreaterArrayScalarInt64/32768/100 598.659m items/sec 701.841m items/sec 17.236%
GreaterArrayScalarString/32768/0 341.707m items/sec 463.243m items/sec 35.567%
GreaterArrayArrayString/32768/100 54.363m items/sec 75.156m items/sec 38.249%
- GreaterArrayArrayInt64/32768/1 364.615m items/sec 337.861m items/sec -7.338%
- GreaterArrayArrayInt64/32768/0 365.779m items/sec 338.697m items/sec -7.404%
GreaterArrayArrayString/32768/10000 54.734m items/sec 75.788m items/sec 38.465%
GreaterArrayScalarString/32768/1 364.271m items/sec 569.761m items/sec 56.411%
GreaterArrayScalarInt64/32768/2 596.674m items/sec 681.189m items/sec 14.164%
- GreaterArrayArrayInt64/32768/10000 364.433m items/sec 338.039m items/sec -7.242%
GreaterArrayArrayString/32768/1 127.275m items/sec 195.144m items/sec 53.325%
GreaterArrayScalarInt64/32768/10 614.042m items/sec 697.835m items/sec 13.646%
GreaterArrayScalarInt64/32768/1 613.585m items/sec 692.598m items/sec 12.877%
- GreaterArrayArrayInt64/32768/100 360.385m items/sec 337.343m items/sec -6.394%
GreaterArrayScalarString/32768/100 331.392m items/sec 447.519m items/sec 35.042%
GreaterArrayScalarInt64/32768/0 624.860m items/sec 705.244m items/sec 12.864%
- GreaterArrayArrayInt64/32768/10 363.073m items/sec 337.156m items/sec -7.138%
GreaterArrayScalarInt64/32768/10000 608.171m items/sec 678.010m items/sec 11.483%
GreaterArrayScalarString/32768/10 256.765m items/sec 352.403m items/sec 37.247%
GreaterArrayScalarString/32768/10000 335.759m items/sec 461.788m items/sec 37.535%
GreaterArrayArrayString/32768/0 54.130m items/sec 75.873m items/sec 40.168%
GreaterArrayScalarString/32768/2 130.823m items/sec 194.535m items/sec 48.701%
GreaterArrayArrayString/32768/10 51.190m items/sec 71.220m items/sec 39.127%
==================================== ================== ================== ======== |
|
Not sure why some of the Int64 benchmarks are slower (AFAIK nothing changed about these so it may be noise) but the string comparisons got a lot faster. I had to engage in some slight shenanigans to make temporal Scalar types have a common base type so that we can use e.g. |
|
There weren't any C++ unit tests for comparisons of primitive types so I addressed that, and also added comparisons for Time and Duration types (which on account of this patch are basically "free") |
|
@ursabot benchmark --benchmark-filter=Greater 18e559b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou anecdotally it appears this is meaningfully faster than the prior version which used GetView
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably depend on the compiler? Logically, it's quite similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the proof is in the benchmarks. I can start writing microbenchmarks for these internal utilities to help us avoid having arguments about them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think micro-benchmarks are useful here, as actual performance will depend on overall optimization of the calling function by the compiler.
In any case, I don't dispute the numbers. But it also seems the improvements would be mostly compiler-dependent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right well AFAICT this change is responsible for the 40-50% performance improvement. So I'm calling it out because there are other places where we use the GetView method that may therefore be performing suboptimally (even if only on some compilers). If it were only 5% then it would be inconclusive.
|
AMD64 Ubuntu 18.04 C++ Benchmark (#112703) builder has been succeeded. Revision: 301ffa539e634f2c464ca072cd5c543f1407f1f7 ==================================== ================== ================== ========
benchmark baseline contender change
==================================== ================== ================== ========
GreaterArrayArrayString/32768/2 53.694m items/sec 93.892m items/sec 74.864%
GreaterArrayScalarInt64/32768/2 616.025m items/sec 681.971m items/sec 10.705%
GreaterArrayScalarString/32768/2 132.267m items/sec 216.365m items/sec 63.583%
- GreaterArrayArrayInt64/32768/10000 365.572m items/sec 339.713m items/sec -7.074%
GreaterArrayArrayString/32768/100 53.635m items/sec 74.222m items/sec 38.384%
- GreaterArrayArrayInt64/32768/100 365.263m items/sec 339.865m items/sec -6.953%
GreaterArrayScalarString/32768/10000 341.705m items/sec 547.337m items/sec 60.178%
- GreaterArrayArrayInt64/32768/1 365.608m items/sec 340.598m items/sec -6.841%
GreaterArrayArrayString/32768/0 55.046m items/sec 74.164m items/sec 34.733%
GreaterArrayScalarInt64/32768/10000 551.861m items/sec 679.554m items/sec 23.139%
GreaterArrayScalarString/32768/1 360.301m items/sec 724.313m items/sec 101.030%
GreaterArrayScalarInt64/32768/10 618.148m items/sec 669.015m items/sec 8.229%
- GreaterArrayArrayInt64/32768/2 365.552m items/sec 340.107m items/sec -6.961%
GreaterArrayScalarString/32768/100 332.135m items/sec 529.815m items/sec 59.518%
- GreaterArrayArrayInt64/32768/0 367.549m items/sec 341.912m items/sec -6.975%
- GreaterArrayArrayInt64/32768/10 365.734m items/sec 339.651m items/sec -7.131%
GreaterArrayScalarInt64/32768/100 604.015m items/sec 667.332m items/sec 10.483%
GreaterArrayScalarString/32768/10 258.597m items/sec 411.304m items/sec 59.052%
GreaterArrayScalarInt64/32768/0 622.436m items/sec 671.837m items/sec 7.937%
GreaterArrayArrayString/32768/1 130.420m items/sec 196.548m items/sec 50.704%
GreaterArrayScalarInt64/32768/1 622.602m items/sec 699.200m items/sec 12.303%
GreaterArrayArrayString/32768/10 51.206m items/sec 70.119m items/sec 36.935%
GreaterArrayScalarString/32768/0 337.556m items/sec 549.035m items/sec 62.650%
GreaterArrayArrayString/32768/10000 54.218m items/sec 74.885m items/sec 38.117%
==================================== ================== ================== ======== |
|
Ah! It's because Greater is now implemented using Less. Let me switch things around so things are based on Greater/GreaterEqual instead |
|
@ursabot benchmark --benchmark-filter=Greater 18e559b |
|
Not to beat a dead horse about ARROW-9155, but the turnaround time for simple benchmarks isn't great |
|
To help follow along it would be handy if references to JIRA could be autolinked: That would save the lurker from having to search for it or the poster having to bother about pasting an actual URL. |
|
Well we have these bot comments, is it not sufficient? #7461 (comment) |
|
AMD64 Ubuntu 18.04 C++ Benchmark (#112729) builder has been succeeded. Revision: 74caaae25e3bd95d57f3f6d9b835c2610639ab41 ==================================== ================== ================== ========
benchmark baseline contender change
==================================== ================== ================== ========
GreaterArrayScalarString/32768/2 130.267m items/sec 224.418m items/sec 72.276%
GreaterArrayScalarString/32768/10 261.021m items/sec 498.491m items/sec 90.977%
- GreaterArrayArrayInt64/32768/2 361.493m items/sec 335.383m items/sec -7.223%
GreaterArrayScalarInt64/32768/10 565.966m items/sec 690.931m items/sec 22.080%
GreaterArrayScalarInt64/32768/100 555.331m items/sec 668.238m items/sec 20.331%
- GreaterArrayArrayInt64/32768/10000 364.755m items/sec 338.493m items/sec -7.200%
- GreaterArrayArrayInt64/32768/1 362.806m items/sec 336.071m items/sec -7.369%
GreaterArrayScalarInt64/32768/1 573.242m items/sec 704.449m items/sec 22.889%
GreaterArrayArrayString/32768/10000 54.921m items/sec 73.848m items/sec 34.463%
GreaterArrayArrayString/32768/2 52.950m items/sec 92.643m items/sec 74.964%
GreaterArrayScalarString/32768/1 361.437m items/sec 653.553m items/sec 80.821%
GreaterArrayScalarInt64/32768/0 555.044m items/sec 701.556m items/sec 26.396%
GreaterArrayScalarInt64/32768/10000 575.762m items/sec 682.164m items/sec 18.480%
- GreaterArrayArrayInt64/32768/100 361.216m items/sec 338.015m items/sec -6.423%
GreaterArrayScalarString/32768/0 339.144m items/sec 707.998m items/sec 108.760%
GreaterArrayArrayString/32768/1 127.597m items/sec 193.087m items/sec 51.326%
GreaterArrayArrayString/32768/10 50.585m items/sec 69.612m items/sec 37.613%
GreaterArrayArrayString/32768/100 54.563m items/sec 73.054m items/sec 33.889%
GreaterArrayScalarInt64/32768/2 558.941m items/sec 697.482m items/sec 24.786%
GreaterArrayScalarString/32768/100 331.214m items/sec 669.210m items/sec 102.048%
- GreaterArrayArrayInt64/32768/10 361.791m items/sec 336.834m items/sec -6.898%
GreaterArrayArrayString/32768/0 54.614m items/sec 73.456m items/sec 34.499%
GreaterArrayScalarString/32768/10000 342.276m items/sec 705.190m items/sec 106.030%
- GreaterArrayArrayInt64/32768/0 363.455m items/sec 336.738m items/sec -7.351%
==================================== ================== ================== ======== |
|
Well my theory about greater/less didn't hold. The other relevant change was moving things into the anonymous namespace. It's possible that anonymous namespaces impact inlining somehow |
The bot is fine - I guess it links to whatever JIRA is listed in the title. That doesn't help if someone mentions a JIRA in a comment. If the autolink was configured for this repo that reference would have automatically been converted to It's just a very minor thing, but it does help the DX. |
|
Can you bring it up somewhere else like on the mailing list? Someone can ask Apache INFRA to set this up |
|
This can be reviewed, but I realized I can compress this further by reusing some equals/not_equal kernels, so I will quickly try to do that this morning |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only took a cursory look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably depend on the compiler? Logically, it's quite similar.
cpp/src/arrow/scalar.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this use of inheritance is kosher. This implies TimestampScalar isa Int64Scalar, which is a relationship not present in parallel hierarchies (Array, DataType, Builder) and is therefore likely to be confusing. Why not use TemporalScalar : PrimitiveScalar<T>?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int64Scalar and TimestampScalar need to have a common base type in order to both be unboxed by templated code that uses Int64Type in the kernel codegen. So PrimitiveScalar<TimestampType> won't work.
Finish paring down compare, implement needed simplification of kernel generators Benchmark improvements
…r storage siblings
|
I'm revamping the documentation about these codegen functions which I'm dubbing "Generator-Dispatchers" (GDs) for short. I'll add "Generate" to their name. Stay tuned |
|
+1. I'm going to merge this to help avoid conflicts caused by the stuff I just renamed. I welcome further comments and I will work to address them in follow ups |
|
I gave up on trying to have e.g. a common "64-bit" kernel for Equals/NotEquals Int64/UInt64/Timestamp/etc. The sticking point is scalar unboxing. We might need to fashion a common internal base class for primitive types and give them a virtual method that returns their data as |
|
I just requested JIRA reference autolinking https://issues.apache.org/jira/browse/INFRA-20450 |
With clang-8 on Linux this unit is now 654KB down from 1257KB.
A few strategies:
On the latter matter, I had to rewrite some of the codegen_internal.h routines where they assumed they would receive the specific logical type as the compile-time parameter (possibly causing a conflict when trying to put a BinaryArray box around a StringArray's ArrayData). This makes kernel execution against string data faster in the places where
ArrayIteratoris used. Will run benchmarksAs an aside, in general these kernel code generators will likely have lower and lower level code over time.