
Conversation

@zagto (Contributor) commented Apr 22, 2022

I see big improvements in the ExecuteScalarExpressionOverhead benchmarks with this change, especially in ref_only_expression. For example:

before:
ExecuteScalarExpressionOverhead/ref_only_expression/rows_per_batch:1000000/real_time/threads:1         35.4 ns         35.4 ns     19577007 batches_per_second=28.2395M/s rows_per_second=28.2395T/s
ExecuteScalarExpressionOverhead/ref_only_expression/rows_per_batch:1000000/real_time/threads:16        49.8 ns          788 ns     14280992 batches_per_second=20.0734M/s rows_per_second=20.0734T/s

after:
ExecuteScalarExpressionOverhead/ref_only_expression/rows_per_batch:1000000/real_time/threads:1         27.6 ns         27.5 ns     25090317 batches_per_second=36.2832M/s rows_per_second=36.2832T/s
ExecuteScalarExpressionOverhead/ref_only_expression/rows_per_batch:1000000/real_time/threads:16        4.26 ns         67.2 ns    184745728 batches_per_second=235.006M/s rows_per_second=235.006T/s

The overhead of the small-batch-size, multithreaded benchmarks is also reduced, for example in complex_expression:

before:
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1000/real_time/threads:1          3723682 ns      3721326 ns          191 batches_per_second=268.551k/s rows_per_second=268.551M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1000/real_time/threads:16         1153070 ns     18365265 ns          624 batches_per_second=867.25k/s rows_per_second=867.25M/s

after:
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1000/real_time/threads:1          3543745 ns      3541909 ns          197 batches_per_second=282.187k/s rows_per_second=282.187M/s
ExecuteScalarExpressionOverhead/complex_expression/rows_per_batch:1000/real_time/threads:16          841776 ns     13395266 ns          944 batches_per_second=1.18796M/s rows_per_second=1.18796G/s
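For context on how these numbers are produced: the output format above is Google Benchmark's, where the `threads:N` / `real_time` suffixes and the `rows_per_batch` argument are standard features. The sketch below is a minimal, self-contained stand-in (`ExpressionOverhead` and `RunExpressionOnBatch` are placeholder names, not the actual Arrow benchmark, which calls the real expression engine and reports batches_per_second / rows_per_second as custom counters):

```cpp
#include <cstdint>

#include <benchmark/benchmark.h>

// Placeholder for "execute a pre-bound scalar expression on a pre-built
// batch"; the real benchmark calls into Arrow's expression machinery here.
static int64_t RunExpressionOnBatch(int64_t rows) { return rows; }

static void ExpressionOverhead(benchmark::State& state) {
  const int64_t rows_per_batch = state.range(0);
  for (auto _ : state) {
    benchmark::DoNotOptimize(RunExpressionOnBatch(rows_per_batch));
  }
  // A simplified stand-in for the per-row throughput counters in the output.
  state.SetItemsProcessed(state.iterations() * rows_per_batch);
}

// UseRealTime() and multiple Threads() registrations produce the
// ".../real_time/threads:1" and ".../real_time/threads:16" variants above.
BENCHMARK(ExpressionOverhead)
    ->ArgName("rows_per_batch")
    ->Arg(1000000)
    ->UseRealTime()
    ->Threads(1)
    ->Threads(16);

BENCHMARK_MAIN();
```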

@lidavidm (Member)

Note there are some lint failures (though, the git version means a lot of pipelines are broken)

@zagto (Contributor, Author) commented Apr 22, 2022

> Note there are some lint failures (though, the git version means a lot of pipelines are broken)

Oops, you're right. Should be fixed now.

@lidavidm (Member)

CI should pass if you rebase here

@kszucs kszucs force-pushed the datatype-performace-expression-type branch from 9c10b3f to 9e3a56d on April 22, 2022 16:30
@zagto zagto force-pushed the datatype-performace-expression-type branch from 9e3a56d to de6c654 on April 22, 2022 16:31
@lidavidm lidavidm closed this in e3b84b0 Apr 22, 2022
@lidavidm (Member)

Thanks for catching this!

@ursabot commented Apr 27, 2022

Benchmark runs are scheduled for baseline = e1e782a and contender = e3b84b0. e3b84b0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️12.78% ⬆️2.63%] ursa-i9-9960x
[Finished ⬇️0.55% ⬆️0.42%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] e3b84b08 ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/596
[Failed] e3b84b08 test-mac-arm: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/584
[Failed] e3b84b08 ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/583
[Finished] e3b84b08 ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/594
[Finished] e1e782a4 ec2-t3-xlarge-us-east-2: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/595
[Failed] e1e782a4 test-mac-arm: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/583
[Failed] e1e782a4 ursa-i9-9960x: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/582
[Finished] e1e782a4 ursa-thinkcentre-m75q: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/593
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot commented Apr 27, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@zagto (Contributor, Author) commented Apr 27, 2022

Looks like the "high level of regressions" is a temporary glitch.

@jorisvandenbossche (Member)

Now that a few more days have passed and a longer timeline is visible, it doesn't seem to be a temporary change: https://conbench.ursa.dev/compare/runs/8b8a13ad985c4adfbf68518fb9fca07d...d0967da307ff4694b471d4f131c5e2b8/

@zagto (Contributor, Author) commented May 5, 2022

Interesting. It seems like this commit (which looks unrelated to me) made the graph temporarily go back to normal, making it look like a glitch. That is probably what I saw when writing the above comment.

For example here:
https://conbench.ursa.dev/compare/benchmarks/3c980569f8be4d85a8d54e0ffb93e93c...ed4d1a2fb6a346478f3467cbe308c895/

I guess I'll see if I can reproduce any of that.

@zagto (Contributor, Author) commented May 6, 2022

I have now run one of the most-affected benchmarks - tpch, arrow, parquet, memory_map=False, TPCH-15, scale_factor=1, R - and saw no such difference on my machine (provided R picked up the changed C++ library correctly; I'm not very familiar with it).

Going through the benchmark runner results a bit more, I noticed that all the extreme permanent changes are in R TPCH benchmarks. For example, this and this are the worst-affected Python and JS benchmarks. Some TPCH benchmarks also see an extreme improvement. Overall this gives me the impression that the performance change may be unrelated to this commit.

I wonder if it is possible to run the baseline again on conbench to isolate this?

output-TPCH15.txt

cyb70289 pushed a commit that referenced this pull request May 31, 2022
…rove performance (#13222)

According to conbench there was a slight regression on #12957. Poking around a bit, it seems that a static local variable is implemented using some kind of global lock (__cxa_guard_acquire / __cxa_guard_release). On the other hand, constructing an empty shared_ptr is cheap (two pointers are set to 0). So if we care about performance here, we probably don't want `static`. This may be what is causing the conbench issue.

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
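
To make the static-vs-empty-shared_ptr point above concrete, here is a minimal standalone sketch (hypothetical names, not the actual Arrow code) contrasting a function-local static whose initializer is not a constant expression, which the compiler protects with __cxa_guard_acquire / __cxa_guard_release plus a guard check on every call, against returning a default-constructed (empty) std::shared_ptr, which only zero-initializes its two internal pointers:

```cpp
#include <memory>
#include <string>

struct DataType {
  explicit DataType(std::string n) : name(std::move(n)) {}
  std::string name;
};

// Variant A: function-local static with dynamic initialization. On
// Itanium-ABI compilers the first call runs the initializer under
// __cxa_guard_acquire / __cxa_guard_release, and every later call still
// checks the guard flag before returning the cached value.
const std::shared_ptr<DataType>& CachedType() {
  static std::shared_ptr<DataType> cached = std::make_shared<DataType>("int32");
  return cached;
}

// Variant B: return an empty shared_ptr by value. The default constructor
// only sets the object and control-block pointers to null: no guard, no lock.
std::shared_ptr<DataType> EmptyType() { return {}; }
```

On a hot path that is hit once per batch, the per-call guard check in variant A can show up in profiles, which matches the reasoning in the commit message above.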
