ARROW-11989: [C++][Python] Improve ChunkedArray's complexity for the access of elements #12055
Conversation
Here are some preliminary measurements showing an improvement for a particular example (1024 chunks each with 1024 values and using an incremental search). The following are still missing in this PR:
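As a rough illustration of the kind of measurement described above, here is a small pyarrow micro-benchmark (not the author's actual benchmark code; the sizes match the 1024 × 1024 example, everything else is illustrative):

```python
import time
import pyarrow as pa

# Build a ChunkedArray of 1024 chunks with 1024 values each (~1M elements).
chunks = [pa.array(range(i * 1024, (i + 1) * 1024)) for i in range(1024)]
chunked = pa.chunked_array(chunks)

# Incremental (sequential) access: each chunked[i] must first resolve which
# chunk owns logical index i before reading the value (GetScalar in C++).
start = time.perf_counter()
for i in range(len(chunked)):
    chunked[i]
elapsed = time.perf_counter() - start
print(f"sequential element access: {elapsed:.3f} s")
```

With a linear scan over the chunks, this access pattern degrades as the number of chunks grows; the point of this PR is to keep each lookup cheap regardless of the chunk count.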
I think we should simply factor out and reuse the chunk resolver from

@pitrou Thanks! I had no idea of the existence of that chunk resolver.

Sorry, I should have mentioned it on the JIRA.

@edponce Are you willing to push this forward?

Hi @eelxpeng, I will work on this today. Apologies for the delay.
Force-pushed 7eec701 to afd4022
Force-pushed 45167f3 to 95a3d30
Verified the speed improvement on the example provided above. Thanks for the good work! @edponce I have one question: does the change revise the Arrow file and save the necessary information in it, or does it leave the Arrow file structure untouched and just resolve the index with a more efficient search? I ask because I experienced longer access times for an existing large multi-chunk Arrow file with the change.
pitrou left a comment:
Thanks for the update @edponce, here are some comments.
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for baseline = 93b192c and contender = 95a3d30e5c78cea6ea40748e268ceb270b3092a8. Results will be available as each benchmark for each run completes.
Force-pushed 95a3d30 to adaf18b
Force-pushed b137a83 to a711336
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for baseline = 5592cf7 and contender = a711336db6a3095a4be320adceaf147826c2bdaf. Results will be available as each benchmark for each run completes.
Force-pushed a711336 to 7ac8aef
Force-pushed 7ac8aef to dda5d95
There are 2 main solutions for the type of
Based on this info, I favor the
Force-pushed dda5d95 to 5f709cc
Force-pushed b68ea2a to 6515bdc
@pitrou This PR is ready for final review. Thanks for all the great reviews.
Force-pushed beeafb7 to 9f7d8a6
Force-pushed 9f7d8a6 to 76b25ff
@ursabot please benchmark lang=C++

Benchmark runs are scheduled for baseline = 5d5cceb and contender = 76b25ff. Results will be available as each benchmark for each run completes.
pitrou left a comment:
+1, will just wait for benchmark results
Benchmark runs are scheduled for baseline = 7dc8683 and contender = 4e49aa8. 4e49aa8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Not sure why. I have a large Arrow file with 115545630 rows and 488 chunks. Even with the fix in this PR, the time to randomly access 1000 rows is 2102 seconds, while if I re-process the Arrow file into 115545630 rows and 1 chunk, the time is 72 seconds. The difference is huge.
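For reference, the access pattern described above presumably looks something like the following sketch (the file name and column are hypothetical; reading a memory-mapped IPC file preserves the file's original chunking):

```python
import random

import pyarrow as pa
from pyarrow import ipc

# Hypothetical path to a large Arrow IPC file with many record batches (chunks).
with pa.memory_map("large_file.arrow", "r") as source:
    table = ipc.open_file(source).read_all()

column = table.column(0)  # a ChunkedArray with one chunk per record batch
indices = random.sample(range(table.num_rows), 1000)

# Each column[i] resolves the owning chunk, then reads a single value.
values = [column[i].as_py() for i in indices]
```

On a remote filesystem such as EFS, each random access can also trigger I/O for pages that are not yet resident, a cost that is independent of the chunk-resolution work discussed in this PR.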
Hi @eelxpeng, the implementation in this PR caches the most recently used chunk, so it is expected to be faster with 1 chunk because every lookup is a cache hit. Randomly accessed chunks fall back to a binary search, and the caching adds only a very small overhead on top of that. How do your results compare to Arrow before this patch was merged?
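To make the caching-plus-binary-search behavior concrete, here is a minimal Python sketch of the idea (the real implementation is in Arrow's C++ code; the class and method names below are purely illustrative):

```python
import bisect

class ChunkResolverSketch:
    """Illustrative model of chunk resolution with a cached last hit."""

    def __init__(self, chunk_lengths):
        # offsets[i] is the logical index of the first element of chunk i;
        # the final entry is the total length (a sentinel for the range check).
        self.offsets = [0]
        for length in chunk_lengths:
            self.offsets.append(self.offsets[-1] + length)
        self.cached_chunk = 0  # chunk that satisfied the previous lookup

    def resolve(self, index):
        """Return (chunk_index, index_in_chunk) for a logical element index."""
        lo = self.cached_chunk
        if self.offsets[lo] <= index < self.offsets[lo + 1]:
            return lo, index - self.offsets[lo]  # cache hit: O(1)
        # Cache miss: binary search over the chunk offsets, O(log n_chunks).
        chunk = bisect.bisect_right(self.offsets, index) - 1
        self.cached_chunk = chunk
        return chunk, index - self.offsets[chunk]

resolver = ChunkResolverSketch([1024] * 1024)
print(resolver.resolve(1500))   # -> (1, 476), via binary search
print(resolver.resolve(1501))   # -> (1, 477), cache hit on chunk 1
```

With a single chunk, every lookup after the first is a cache hit, which matches the "perfect caching" behavior described above; with many chunks and random indices, most lookups pay the O(log n) binary search instead.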
I can set up benchmarks this week to further investigate performance with varying chunk sizes.

This sounds really huge and unrelated to this PR. Are you accessing the file over local storage or a remote filesystem?
@edponce With the previous Arrow version (7.0.0), before this patch, the time is 1510 seconds for the 488-chunk file.
@pitrou How is it not related to this PR? In this case, 1 chunk is significantly faster than 488 chunks. I'm accessing via EFS (https://aws.amazon.com/efs/).
It's not related because those runtimes are incompatible with the CPU time needed for calculating chunk indices: a binary search over 488 chunks takes at most about 9 comparisons, so resolving 1000 indices should cost microseconds, not thousands of seconds. That said, it's worth opening an issue for this.
Improves the search time for finding a chunk in a ChunkedArray by using a binary search, O(log n) for random access. Chunks are resolved when invoking GetScalar() (C++) and the index operator (Python).