Conversation

@rok
Member

rok commented Jul 21, 2022

This is to resolve ARROW-16818.

@pitrou
Member

pitrou commented Jul 21, 2022

Also cc @wjones127

@wjones127
Member

wjones127 commented Jul 21, 2022

Based on my experience with the R docs, here are a few specific things we should note:

  • The default value for retry_time_limit is 15 minutes, but users may often wish to lower this to more like 15 seconds. This is especially important to note because of https://issues.apache.org/jira/browse/ARROW-17020
  • To connect to public buckets, you must pass anonymous=True if you haven't configured credentials elsewhere. The easiest way to configure credentials is with the CLI command: gcloud auth application-default login. See https://issues.apache.org/jira/browse/ARROW-17069
  • Unlike S3FileSystem, GcsFileSystem will only return directories in GetFileInfo if there are special directory markers created by Arrow. Basically this means that if a directory structure was created by some mechanism other than Arrow's GcsFileSystem (as is the case in our voltrondata-labs-datasets bucket), the "directories" will be invisible. Users are therefore encouraged to always list with recursive=True in GcsFileSystem. See https://issues.apache.org/jira/browse/ARROW-17020
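
A minimal Python sketch of the three points above, assuming a pyarrow build with GCS support (ARROW_GCS=ON). The helper names gcs_options and list_bucket are hypothetical and exist only for illustration; the bucket name is the one mentioned above. The pyarrow import is deferred so the snippet still loads where pyarrow is not installed:

```python
from datetime import timedelta


def gcs_options(retry_seconds=15, anonymous=True):
    """Keyword arguments for pyarrow.fs.GcsFileSystem (hypothetical helper)."""
    return {
        # Needed for public buckets when no credentials are configured.
        "anonymous": anonymous,
        # Lower the 15-minute default to something closer to 15 seconds.
        "retry_time_limit": timedelta(seconds=retry_seconds),
    }


def list_bucket(path="voltrondata-labs-datasets"):
    """List everything under `path` recursively (hypothetical helper)."""
    from pyarrow import fs  # deferred: requires pyarrow built with GCS support

    gcs = fs.GcsFileSystem(**gcs_options())
    # recursive=True so files under unmarked "directories" still show up.
    return gcs.get_file_info(fs.FileSelector(path, recursive=True))
```

Calling list_bucket() needs network access; gcs_options() can be inspected offline.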

@rok
Member Author

rok commented Jul 22, 2022

Thanks for the suggestions @wjones127! I've added the retry limit, anonymous connection, and recursive parameters to the example, but didn't comment on them specifically.

@emkornfield already wrote the docs; this PR mostly just enables building them, with the exception of adding this example:
image
to https://arrow.apache.org/docs/python/filesystems.html

This now renders:
image
compared to: https://arrow.apache.org/docs/dev/search.html?q=gcsfilesystem

And:
image
compared to: https://arrow.apache.org/docs/dev/python/generated/pyarrow.fs.GcsFileSystem.html

@rok
Member Author

rok commented Jul 22, 2022

Please note: while I was able to build with ARROW_GCS=ON, I get the following error when running import pyarrow:

ImportError: dlopen(/Users/rok/Documents/repos/arrow/python/pyarrow/lib.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace (__ZN4absl12lts_2022062310FormatTimeENS0_11string_viewENS0_4TimeENS0_8TimeZoneE)

This looks like another abseil issue, but it is probably specific to macOS or to my laptop. Has anyone else experienced this? I was able to build the docs locally on a nightly build, so it's probably just me.
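
To read the mangled name in the ImportError above, here is a small Python sketch that decodes only the leading nested name of an Itanium-ABI symbol. It is not a full demangler (a tool such as c++filt does that properly), and the function name nested_name is hypothetical:

```python
import re


def nested_name(mangled):
    """Decode the leading `_ZN ... E` nested name of an Itanium-ABI symbol."""
    # macOS prepends one extra underscore to every symbol, so strip them all.
    body = mangled.lstrip("_")
    if not body.startswith("ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos, parts = 2, []
    # Each component is encoded as <decimal length><identifier>.
    while pos < len(body) and body[pos].isdigit():
        digits = re.match(r"\d+", body[pos:]).group()
        start = pos + len(digits)
        parts.append(body[start:start + int(digits)])
        pos = start + int(digits)
    return "::".join(parts)


symbol = ("__ZN4absl12lts_2022062310FormatTime"
          "ENS0_11string_viewENS0_4TimeENS0_8TimeZoneE")
print(nested_name(symbol))  # absl::lts_20220623::FormatTime
```

The argument types after the first E are ignored here; only the qualified function name is recovered.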

rok marked this pull request as ready for review July 22, 2022 04:34
rok requested review from emkornfield and ianmcook July 22, 2022 04:34
@AlenkaF
Member

AlenkaF commented Jul 22, 2022

I also had issues building with GCS on; it already fails on the C++ side (M1). I will keep trying.
As for the changes in this PR, nothing seems to be missing, so I'm giving it a +1.

Co-authored-by: Ian Cook <ianmcook@gmail.com>
@rok
Member Author

rok commented Jul 22, 2022

@coryan @emkornfield do you want to verify that the abseil issue is limited to macOS before the release is cut (Monday)?

AlenkaF merged commit 32016b1 into apache:master Jul 22, 2022
@ursabot

ursabot commented Jul 23, 2022

Benchmark runs are scheduled for baseline = 38b956f and contender = 32016b1. 32016b1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.24% ⬆️0.0%] test-mac-arm
[Failed ⬇️12.7% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.39% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 32016b1a ec2-t3-xlarge-us-east-2
[Failed] 32016b1a test-mac-arm
[Failed] 32016b1a ursa-i9-9960x
[Finished] 32016b1a ursa-thinkcentre-m75q
[Failed] 38b956f4 ec2-t3-xlarge-us-east-2
[Failed] 38b956f4 test-mac-arm
[Failed] 38b956f4 ursa-i9-9960x
[Finished] 38b956f4 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented Jul 23, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@emkornfield
Contributor

emkornfield commented Jul 23, 2022

I think @lidavidm might have done some upgrade of gRPC/absl after the GCS check-in. I was pretty sure we ran nightlies on Mac, but maybe those weren't on M1.

@lidavidm
Member

@rok what was the cmake command, and was abseil/gcs/etc. bundled, from conda, from homebrew…?

@wjones127
Member

@rok if you are using Homebrew, you might need to set CMAKE_CXX_STANDARD=17.
See #13407.

@rok
Member Author

rok commented Jul 23, 2022

I'm out at the moment, will check later today. Thanks for the input!

@rok
Member Author

rok commented Jul 23, 2022

I'm using the following cmake:

cmake \
	-GNinja \
	-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
	-DCMAKE_INSTALL_LIBDIR=lib \
	-DARROW_FLIGHT=ON \
	-DARROW_GANDIVA=ON \
	-DARROW_GCS=ON \
	-DARROW_INSTALL_NAME_RPATH=OFF \
	-DARROW_JEMALLOC=ON \
	-DARROW_MIMALLOC=ON \
	-DARROW_ORC=ON \
	-DARROW_PARQUET=ON \
	-DARROW_PLASMA=ON \
	-DARROW_PROTOBUF_USE_SHARED=ON \
	-DARROW_BUILD_TESTS=ON \
	-DARROW_PYTHON=ON \
	-DARROW_S3=ON \
	-DARROW_WITH_BROTLI=ON \
	-DARROW_WITH_BZ2=ON \
	-DARROW_WITH_LZ4=ON \
	-DARROW_WITH_SNAPPY=ON \
	-DARROW_WITH_UTF8PROC=ON \
	-DARROW_WITH_ZLIB=ON \
	-DARROW_WITH_ZSTD=ON \
	..

All my dependencies (abseil/gcs/..) are installed via Homebrew (brew update && brew bundle --file=cpp/Brewfile). I also recently updated both macOS (M1, macOS 12.5 (21G72), Darwin 21.6.0) and Homebrew, and the issue started afterwards.

If I run cmake with -DARROW_BUILD_TESTS=ON, I get this at link time:

Undefined symbols for architecture arm64:
  "absl::lts_20220623::FormatTime(absl::lts_20220623::string_view, absl::lts_20220623::Time, absl::lts_20220623::TimeZone)", referenced from:
      arrow::fs::(anonymous namespace)::GcsIntegrationTest_OpenInputStreamReadMetadata_Test::TestBody() in gcsfs_test.cc.o
  "absl::lts_20220623::optional_internal::throw_bad_optional_access()", referenced from:
      arrow::fs::(anonymous namespace)::GcsFileSystem_ToEncryptionKey_Test::TestBody() in gcsfs_test.cc.o
      arrow::fs::(anonymous namespace)::GcsFileSystem_ToObjectMetadata_Test::TestBody() in gcsfs_test.cc.o
      arrow::fs::(anonymous namespace)::GcsFileSystem_ObjectMetadataRoundtrip_Test::TestBody() in gcsfs_test.cc.o
  "absl::lts_20220623::ParseTime(absl::lts_20220623::string_view, absl::lts_20220623::string_view, absl::lts_20220623::Time*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)", referenced from:
      arrow::fs::(anonymous namespace)::GcsFileSystem_ObjectMetadataRoundtrip_Test::TestBody() in gcsfs_test.cc.o
ld: symbol(s) not found for architecture arm64

Otherwise the build completes and things only fail when I import Arrow in Python or R.
I'll keep trying, as I don't have a working build at the moment.

@lidavidm
Member

I suppose if we can figure out which abseil library is supposed to have that symbol, we can add it to the list of abseil libraries to link against in ThirdpartyToolchain.cmake. I don't have a Mac, and I don't exactly want to screw around with the build machines to debug this (since I usually SSH into those for reproducing Mac issues), so it'll be a little hard for me to help here 🙁
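
One way to find which abseil library defines a given symbol is to scan each library with nm. The Python sketch below assumes nm is on PATH and accepts the -g and --defined-only flags (flag spelling may differ between nm implementations); the function names and the example directory are hypothetical:

```python
import subprocess
from pathlib import Path


def defines_symbol(nm_output, needle):
    """True if any '<address> <type> <name>' line of nm output names `needle`."""
    for line in nm_output.splitlines():
        fields = line.split()
        # Archive member headers and blank lines do not have three fields.
        if len(fields) == 3 and needle in fields[2]:
            return True
    return False


def libraries_defining(needle, lib_dir):
    """Names of libabsl_* files in `lib_dir` whose defined symbols match."""
    hits = []
    for lib in sorted(Path(lib_dir).glob("libabsl_*")):
        result = subprocess.run(
            ["nm", "-g", "--defined-only", str(lib)],
            capture_output=True, text=True, check=False,
        )
        if defines_symbol(result.stdout, needle):
            hits.append(lib.name)
    return hits
```

For example, libraries_defining("FormatTime", "/opt/homebrew/lib") would list the candidates on a Homebrew install, assuming that is where the abseil libraries live.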

@rok
Member Author

rok commented Jul 23, 2022

@lidavidm thanks for thinking about it though! 😄
And yeah, I'll play with abseil linking a bit.

@rok
Member Author

rok commented Jul 24, 2022

I'm still not able to build GCS + tests; however, GCS without tests builds fine and gives me a working build.
The suggestion from #13407 didn't work, and I'm not sure why.

This is the cmake (GCS + tests) that's erroring:

cmake \
	-GNinja \
	-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
	-DCMAKE_INSTALL_LIBDIR=lib \
	-DARROW_PYTHON=ON \
	-DARROW_COMPUTE=ON \
	-DARROW_FILESYSTEM=ON \
	-DARROW_CSV=ON \
	-DARROW_GCS=ON \
	-DARROW_INSTALL_NAME_RPATH=OFF \
	-DARROW_BUILD_TESTS=ON \
	-DCMAKE_CXX_STANDARD=17 \
	..

Error from building GCS + tests:

Undefined symbols for architecture arm64:
  "absl::lts_20220623::FormatTime(std::__1::basic_string_view<char, std::__1::char_traits<char> >, absl::lts_20220623::Time, absl::lts_20220623::TimeZone)", referenced from:
      arrow::fs::(anonymous namespace)::GcsIntegrationTest_OpenInputStreamReadMetadata_Test::TestBody() in gcsfs_test.cc.o
  "absl::lts_20220623::FromChrono(std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000l> > > const&)", referenced from:
      arrow::fs::(anonymous namespace)::GcsIntegrationTest_OpenInputStreamReadMetadata_Test::TestBody() in gcsfs_test.cc.o
  "absl::lts_20220623::RFC3339_full", referenced from:
      arrow::fs::(anonymous namespace)::GcsFileSystem_ObjectMetadataRoundtrip_Test::TestBody() in gcsfs_test.cc.o
      arrow::fs::(anonymous namespace)::GcsIntegrationTest_OpenInputStreamReadMetadata_Test::TestBody() in gcsfs_test.cc.o
  "absl::lts_20220623::time_internal::cctz::utc_time_zone()", referenced from:
      arrow::fs::(anonymous namespace)::GcsIntegrationTest_OpenInputStreamReadMetadata_Test::TestBody() in gcsfs_test.cc.o
  "absl::lts_20220623::ToDoubleSeconds(absl::lts_20220623::Duration)", referenced from:
      arrow::fs::(anonymous namespace)::GcsFileSystem_ObjectMetadataRoundtrip_Test::TestBody() in gcsfs_test.cc.o
  "absl::lts_20220623::Duration::operator-=(absl::lts_20220623::Duration)", referenced from:
      arrow::fs::(anonymous namespace)::GcsFileSystem_ObjectMetadataRoundtrip_Test::TestBody() in gcsfs_test.cc.o
  "absl::lts_20220623::ParseTime(std::__1::basic_string_view<char, std::__1::char_traits<char> >, std::__1::basic_string_view<char, std::__1::char_traits<char> >, absl::lts_20220623::Time*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)", referenced from:
      arrow::fs::(anonymous namespace)::GcsFileSystem_ObjectMetadataRoundtrip_Test::TestBody() in gcsfs_test.cc.o

I've also noticed some libtool warnings about duplicated member names while bundling:

[280/317] Bundling /Users/rok/Documents/repos/arrow/cpp/build/release/libarrow_bundled_dependencies.a
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (binary_data_as_debug_string.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(binary_data_as_debug_string.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_rest_internal.a(binary_data_as_debug_string.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (compute_engine_util.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_common.a(compute_engine_util.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(compute_engine_util.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (credentials.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_common.a(credentials.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(credentials.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (curl_handle.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_rest_internal.a(curl_handle.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(curl_handle.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (curl_handle_factory.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(curl_handle_factory.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_rest_internal.a(curl_handle_factory.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (curl_wrappers.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(curl_wrappers.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_rest_internal.a(curl_wrappers.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (escaping.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/absl_ep-install/lib/libabsl_strings_internal.a(escaping.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/absl_ep-install/lib/libabsl_strings.a(escaping.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (make_jwt_assertion.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_rest_internal.a(make_jwt_assertion.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(make_jwt_assertion.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (openssl_util.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(openssl_util.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_rest_internal.a(openssl_util.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (throw_delegate.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/absl_ep-install/lib/libabsl_throw_delegate.a(throw_delegate.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_common.a(throw_delegate.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (unified_rest_credentials.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(unified_rest_credentials.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_rest_internal.a(unified_rest_credentials.cc.o) due to use of basename, truncation and blank padding
/Library/Developer/CommandLineTools/usr/bin/libtool: warning same member name (version.cc.o) in output file used for input files: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_common.a(version.cc.o) and: /Users/rok/Documents/repos/arrow/cpp/build/google_cloud_cpp_ep-install/lib/libgoogle_cloud_cpp_storage.a(version.cc.o) due to use of basename, truncation and blank padding
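
The warnings above are easier to scan when grouped by the pair of archives that collide. This is a rough Python sketch, assuming the warning text has exactly the shape shown in this log; the function name colliding_archives is hypothetical:

```python
import re
from collections import defaultdict

# Matches one libtool "same member name" warning and captures the object
# file name plus the two archive paths that both contain it.
WARNING = re.compile(
    r"same member name \((?P<member>[^)]+)\) in output file used for "
    r"input files: (?P<first>\S+?)\([^)]+\) and: (?P<second>\S+?)\([^)]+\)"
)


def colliding_archives(log_text):
    """Map each (archive, archive) pair to the member names they share."""
    collisions = defaultdict(list)
    for match in WARNING.finditer(log_text):
        # Keep only the file names and sort them so the pair order is stable.
        pair = tuple(sorted(
            path.rsplit("/", 1)[-1] for path in match.group("first", "second")
        ))
        collisions[pair].append(match.group("member"))
    return dict(collisions)
```

Feeding it the saved build log gives one entry per archive pair, which shows at a glance which libraries are being bundled with overlapping object files.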

These can be resolved with:

diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake
index 5d1da18b7..c9bbcf3db 100644
--- a/cpp/cmake_modules/ThirdpartyToolchain.cmake
+++ b/cpp/cmake_modules/ThirdpartyToolchain.cmake
@@ -4164,7 +4164,6 @@ macro(build_google_cloud_cpp_storage)
                         absl::variant
                         nlohmann_json::nlohmann_json
                         Crc32c::crc32c
-                        CURL::libcurl
                         Threads::Threads
                         OpenSSL::SSL
                         OpenSSL::Crypto
@@ -4193,9 +4192,7 @@ macro(build_google_cloud_cpp_storage)
          absl::raw_logging_internal
          absl::spinlock_wait
          absl::strings
-         absl::strings_internal
          absl::str_format_internal
-         absl::throw_delegate
          absl::time
          absl::time_zone
          Crc32c::crc32c)

I'm not sure how much of a problem this really is, but I'll open a Jira to make it more visible.

@rok
Member Author

rok commented Jul 24, 2022
