Skip to content

Conversation

@lidavidm
Copy link
Member

@lidavidm lidavidm commented Jan 11, 2022

After the synchronous dataset interfaces were removed in ARROW-13554, we need to update Skyhook to use the async interfaces. For now, we just use the same synchronous path as before, but on an I/O thread.

@lidavidm
Copy link
Member Author

@github-actions crossbow submit -g test-skyhook-integration

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm
Copy link
Member Author

@github-actions crossbow submit test-skyhook-integration

@github-actions
Copy link

Invalid group(s) {'test-skyhook-integration'}. Must be one of {'verify-rc', 'verify-rc-binaries', 'cpp', 'linux', 'verify-rc-jars', 'verify-rc-source-linux', 'verify-rc-wheels', 'c-glib', 'verify-rc-source-macos', 'fuzz', 'conda', 'verify-rc-source', 'linux-arm64', 'r', 'ruby', 'example', 'example-cpp', 'homebrew', 'nightly', 'linux-amd64', 'packaging', 'integration', 'test', 'python', 'vcpkg', 'wheel'}
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/1683172564```

@github-actions
Copy link

Revision: d196ca92142e0cc2343d08b8983064040051da6e

Submitted crossbow builds: ursacomputing/crossbow @ actions-1386

Task Status
test-skyhook-integration Github Actions

@lidavidm
Copy link
Member Author

@github-actions crossbow submit test-skyhook-integration

@github-actions
Copy link

Revision: b997f13cdbb718725116dfcf3a6df4e0371f13c9

Submitted crossbow builds: ursacomputing/crossbow @ actions-1387

Task Status
test-skyhook-integration Github Actions

@lidavidm
Copy link
Member Author

@JayjeetAtGithub I'm trying to update this since we removed an API for 7.0.0, do you have any idea about the failure here? https://github.com/ursacomputing/crossbow/runs/4778154378?check_suite_focus=true

Running main() from /build/cpp/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from TestSkyhookCLS
[ RUN      ] TestSkyhookCLS.SelectEntireDataset
[       OK ] TestSkyhookCLS.SelectEntireDataset (287 ms)
[ RUN      ] TestSkyhookCLS.SelectFewRows
[       OK ] TestSkyhookCLS.SelectFewRows (98 ms)
[ RUN      ] TestSkyhookCLS.SelectFewColumns
[       OK ] TestSkyhookCLS.SelectFewColumns (120 ms)
[ RUN      ] TestSkyhookCLS.SelectRowsAndColumnsOnPartitionKey
[       OK ] TestSkyhookCLS.SelectRowsAndColumnsOnPartitionKey (96 ms)
[ RUN      ] TestSkyhookCLS.SelectRowsAndColumnsOnlyOnPartitionKey
[       OK ] TestSkyhookCLS.SelectRowsAndColumnsOnlyOnPartitionKey (84 ms)
[----------] 5 tests from TestSkyhookCLS (687 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (688 ms total)
[  PASSED  ] 5 tests.
malloc_consolidate(): invalid chunk size
/arrow/ci/scripts/integration_skyhook.sh: line 135: 16922 Aborted                 (core dumped) debug/skyhook-cls-test

(I'm also having trouble with it locally, as it just hangs in TestSkyhookCLS.SelectEntireDataset, but that seems to be a problem with my own setup.)

@lidavidm
Copy link
Member Author

Hmm, I can get it running locally now, interestingly that crash doesn't happen all the time. Under gdb, it appears this happens during thread shutdown…during which there's not much info left about what's going on, unfortunately.

AddressSanitizer does report a heap-use-after-free after test exit. I compared to the commit prior to 2e8b836 and saw that it also has a heap-use-after-free, so it seems unrelated to what's going on here. (Or possibly, it's just that the changes here are just exposing it.)

@lidavidm
Copy link
Member Author

Fiddling with the library linkage appears to fix everything…I did this since I noticed the symbol being reported as a use-after-free was actually being defined in both the test executable and libparquet, which seemed wrong (and consequently it was getting freed twice). Now to figure out what the right way to link things is…

@lidavidm
Copy link
Member Author

@github-actions crossbow submit test-skyhook-integration

@github-actions
Copy link

Revision: 819469a

Submitted crossbow builds: ursacomputing/crossbow @ actions-1388

Task Status
test-skyhook-integration Github Actions

@lidavidm lidavidm marked this pull request as ready for review January 11, 2022 20:47
@lidavidm
Copy link
Member Author

Ok, a little CMake tweak appears to fix it! Should be ready for review now.

@JayjeetAtGithub
Copy link
Contributor

@lidavidm Thanks a lot for working on this. Sorry for the late reply. Let me take a look.

@JayjeetAtGithub
Copy link
Contributor

Looks like the tests are passing. Please let me know if I can help with something. Thanks.

@lidavidm
Copy link
Member Author

Ah, yes, I managed to fix the tests later on. Does the implementation look reasonable to you @JayjeetAtGithub? It's mostly the same code ported to the new interfaces.

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting to this!

// TODO: investigate "true" async version

// Make sure client-side filtering and projection is turned off.
file->apply_compute = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This flag was only ever checked in the synchronous path. Now that it is gone this flag is somewhat ineffective. A more accurate thing to do would be to attach the filter as a guarantee to all batches produced by this format.

Skipping scanner projection might still be nice I suppose but its more niche and a little trickier on the async / exec plan path. For example in AsyncScanner::ScanBatchesUnorderedAsync if the user is using Skyhook then instead of...

  auto exprs = scan_options_->projection.call()->arguments;
  auto names = checked_cast<const compute::MakeStructOptions*>(
                   scan_options_->projection.call()->options.get())
                   ->field_names;

...we would need the same names but the exprs should all be field_refs onto the names themselves. For example, if the user asked for expr=call("multiply", {field_ref("a"), literal(2)}); and name="a_times_two" then we would pass that expr/name down to Skyhook and in the augmented_project node we could pass expr=field_ref("a_times_two") and name="a_times_two".

I think we can defer this to a follow-up PR if needed. The issue already existed. It will only affect users that are using custom projection (which isn't well documented yet and we haven't seen much of).

Also, the double-filter-if-running-async behavior already existed and is harmless except for a minor hit to performance.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point. I've gone and removed the flag. We can file a follow up for this case.

const std::shared_ptr<arrow::dataset::ScanOptions>& options,
const std::shared_ptr<arrow::dataset::FileFragment>& file) const {
/// Make sure client-side filtering and projection is turned off.
// TODO: investigate "true" async version
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more specific what a true async version would do? Since this is basically making a network request and waiting for it to finish I don't know that we need to do much more than you've already done here. The only thing I could think of would be maybe transferring to the CPU thread pool before deserializing the table.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I don't think there's much else we can do unless Ceph itself exposes an async interface for this.

@lidavidm
Copy link
Member Author

@github-actions crossbow submit test-skyhook-integration

@github-actions
Copy link

Revision: 72d8216

Submitted crossbow builds: ursacomputing/crossbow @ actions-1392

Task Status
test-skyhook-integration Github Actions

@lidavidm lidavidm closed this in 2ec4e99 Jan 12, 2022
@lidavidm lidavidm deleted the arrow-15300 branch January 12, 2022 21:56
@ursabot
Copy link

ursabot commented Jan 12, 2022

Benchmark runs are scheduled for baseline = d1fb8e7 and contender = 2ec4e99. 2ec4e99 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.9% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.31% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants