-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15300: [C++] Update Skyhook for async dataset interfaces #12123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@github-actions crossbow submit -g test-skyhook-integration |
|
|
|
@github-actions crossbow submit test-skyhook-integration |
|
|
Revision: d196ca92142e0cc2343d08b8983064040051da6e Submitted crossbow builds: ursacomputing/crossbow @ actions-1386
|
|
@github-actions crossbow submit test-skyhook-integration |
|
Revision: b997f13cdbb718725116dfcf3a6df4e0371f13c9 Submitted crossbow builds: ursacomputing/crossbow @ actions-1387
|
|
@JayjeetAtGithub I'm trying to update this since we removed an API for 7.0.0, do you have any idea about the failure here? https://github.com/ursacomputing/crossbow/runs/4778154378?check_suite_focus=true (I'm also having trouble with it locally, as it just hangs in TestSkyhookCLS.SelectEntireDataset, but that seems to be a problem with my own setup.) |
|
Hmm, I can get it running locally now, interestingly that crash doesn't happen all the time. Under gdb, it appears this happens during thread shutdown…during which there's not much info left about what's going on, unfortunately. AddressSanitizer does report a heap-use-after-free after test exit. I compared to the commit prior to 2e8b836 and saw that it also has a heap-use-after-free, so it seems unrelated to what's going on here. (Or possibly, it's just that the changes here are just exposing it.) |
|
Fiddling with the library linkage appears to fix everything…I did this since I noticed the symbol being reported as a use-after-free was actually being defined in both the test executable and libparquet, which seemed wrong (and consequently it was getting freed twice). Now to figure out what the right way to link things is… |
|
@github-actions crossbow submit test-skyhook-integration |
|
Revision: 819469a Submitted crossbow builds: ursacomputing/crossbow @ actions-1388
|
|
Ok, a little CMake tweak appears to fix it! Should be ready for review now. |
|
@lidavidm Thanks a lot for working on this. Sorry for the late reply. Let me take a look. |
|
Looks like the tests are passing. Please let me know if I can help with something. Thanks. |
|
Ah, yes, I managed to fix the tests later on. Does the implementation look reasonable to you @JayjeetAtGithub? It's mostly the same code ported to the new interfaces. |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting to this!
| // TODO: investigate "true" async version | ||
|
|
||
| // Make sure client-side filtering and projection is turned off. | ||
| file->apply_compute = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This flag was only ever checked in the synchronous path. Now that it is gone this flag is somewhat ineffective. A more accurate thing to do would be to attach the filter as a guarantee to all batches produced by this format.
Skipping scanner projection might still be nice I suppose but its more niche and a little trickier on the async / exec plan path. For example in AsyncScanner::ScanBatchesUnorderedAsync if the user is using Skyhook then instead of...
auto exprs = scan_options_->projection.call()->arguments;
auto names = checked_cast<const compute::MakeStructOptions*>(
scan_options_->projection.call()->options.get())
->field_names;
...we would need the same names but the exprs should all be field_refs onto the names themselves. For example, if the user asked for expr=call("multiply", {field_ref("a"), literal(2)}); and name="a_times_two" then we would pass that expr/name down to Skyhook and in the augmented_project node we could pass expr=field_ref("a_times_two") and name="a_times_two".
I think we can defer this to a follow-up PR if needed. The issue already existed. It will only affect users that are using custom projection (which isn't well documented yet and we haven't seen much of).
Also, the double-filter-if-running-async behavior already existed and is harmless except for a minor hit to performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good point. I've gone and removed the flag. We can file a follow up for this case.
| const std::shared_ptr<arrow::dataset::ScanOptions>& options, | ||
| const std::shared_ptr<arrow::dataset::FileFragment>& file) const { | ||
| /// Make sure client-side filtering and projection is turned off. | ||
| // TODO: investigate "true" async version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be more specific what a true async version would do? Since this is basically making a network request and waiting for it to finish I don't know that we need to do much more than you've already done here. The only thing I could think of would be maybe transferring to the CPU thread pool before deserializing the table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I don't think there's much else we can do unless Ceph itself exposes an async interface for this.
|
@github-actions crossbow submit test-skyhook-integration |
|
Revision: 72d8216 Submitted crossbow builds: ursacomputing/crossbow @ actions-1392
|
|
Benchmark runs are scheduled for baseline = d1fb8e7 and contender = 2ec4e99. 2ec4e99 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
After the synchronous dataset interfaces were removed in ARROW-13554, we need to update Skyhook to use the async interfaces. For now, we just use the same synchronous path as before, but on an I/O thread.