Skip to content

Conversation

@bkietz
Copy link
Member

@bkietz bkietz commented Dec 4, 2023

Rationale for this change

Each filesystem implementation carries unique and potentially heavy dependencies, so it'd be useful to build them separately. Furthermore, one typically doesn't need all of them at the same time and building separate modules would allow them to be dynamically loaded as necessary. Finally, defining this interface allows custom filesystem implementations to be supported seamlessly.

What changes are included in this PR?

An initial sketch of a registry, with documentation as if the registry were complete to illustrate intended usage.

Are these changes tested?

A toy module is added and a single unit test too.

Are there any user-facing changes?

Users would be able to add their own filesystem implementations to the registry

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Dec 5, 2023
@pitrou
Copy link
Member

pitrou commented Dec 9, 2023

Ok, so I think we should provide the building blocks for lazy loading of filesystem libraries when a given URI scheme is requested. Something like:

// filesystem.{h,cc}
using FileSystemFactory = std::function<
    Result<std::shared_ptr<FileSystem>>(
        const Uri& uri, const io::IOContext& io_context, std::string* out_path)>;

using FileSystemLoader = std::function<Status(const std::string& scheme)>;

Status RegisterFileSystemFactory(std::vector<std::string> schemes,
                                 FileSystemFactory factory);

Status RegisterFileSystemLoader(std::vector<std::string> schemes,
                                 FileSystemLoader loader);

Result<FileSystemLoader> MakeSharedLibraryFileSystemLoader(
    std::string library_path, std::string init_function) {
  using InitFuncType = Status(*)();
  auto loader = [=](const std::string& scheme) -> Status {
    ARROW_ASSIGN_OR_RAISE(void* ptr, SharedLibrary::LookupSymbol(library_path, init_symbol));
    return reinterpret_cast<InitFuncType>(ptr)();
  };
}

Status RegisterSharedLibraryFileSystemLoader(
    std::vector<std::string> schemes,
    std::string library_path, std::string init_function) {
  ARROW_ASSIGN_OR_RAISE(auto loader,
    MakeSharedLibraryFileSystemLoader(std::move(library_path), std::move(init_function)));
  return RegisterFileSystemLoader(std::move(schemes), std::move(loader));
}

// io_util.h
class SharedLibrary {
 public:
  static Result<SharedLibrary> Open(const std::string& path);
  Result<void*> LookupSymbol(const std::string& symbol);

  static Result<void*> LookupSymbol(const std::string& path, const std::string& symbol) {
    ARROW_ASSIGN_OR_RAISE(auto lib, Open(path));
    return lib.LookupSymbol(symbol);
  }
};

And then you can use it such as:

// libarrow_s3.so

Status RegisterS3FileSystem() {
  return RegisterFileSystemFactory({"s3"}, &MakeS3FileSystem);
}

// some init code somewhere

Status InitLoadableFileSystems() {
  RETURN_NOT_OK(RegisterSharedLibraryFileSystemLoader(
      {"s3"}, "libarrow_s3.so", "RegisterS3FileSystem");
  // other filesystems here...
  return Status::OK();
}

@bkietz bkietz force-pushed the 38309-filesystem-modules branch from 2cca700 to 3e0ebd2 Compare February 1, 2024 18:44
@bkietz bkietz force-pushed the 38309-filesystem-modules branch from 3e0ebd2 to fb82438 Compare February 1, 2024 19:12
@bkietz
Copy link
Member Author

bkietz commented Feb 1, 2024

I think we should provide the building blocks for lazy loading of filesystem libraries

I don't think this presents much improvement over using direct factories. The main advantage I see with a FileSystemLoader approach is that initialization code can be run as part of loading the filesystem, including calls to EnsureS3Initialized() and dynamic loading of the library.

However EnsureS3Initialized() could equally be called by the factory, or triggered by dynamic loading at the same time as the factory is registered. As for dynamic loading itself: either the library contains one of the built-in filesystems or it is a custom implementation. For the builtin S3 filesystem, it's reasonable that libarrow.so could include the string "libarrow_fs_s3.so" in order to automatically load when an s3:// uri is encountered. For a custom filesystem libarrow.so is unaware of the name of a library containing support for custom:// uris, and can only automatically load if

  • some naming convention is adopted so that we can load libarrow_fs_custom.so
  • the name is passed explicitly to libarrow.so with SetLibraryNameForScheme(uri, libname) which is slightly more complicated than passing the name to LoadLibrary(libname)

@bkietz bkietz force-pushed the 38309-filesystem-modules branch from fb82438 to 8bff100 Compare February 1, 2024 19:38
@pitrou
Copy link
Member

pitrou commented Feb 1, 2024

For the builtin S3 filesystem, it's reasonable that libarrow.so could include the string "libarrow_fs_s3.so" in order to automatically load when an s3:// uri is encountered.

This depends on how libarrow_fs_s3.so is distributed, and which directory it is installed. It is certainly a reasonable default setting, but it can be useful to make it customizable.

@bkietz
Copy link
Member Author

bkietz commented Feb 1, 2024

This depends on how libarrow_fs_s3.so is distributed, and which directory it is installed. It is certainly a reasonable default setting, but it can be useful to make it customizable.

I agree, but again: customization will not be resident in libarrow.so, and will require instructions like "ensure X before calling FileSystemFromUri". That being the case, I don't think any instructions could be simpler than "ensure the library is loaded before calling FileSystemFromUri".

In the specific case of a python package which renames or moves libarrow_fs_s3.so, we disable autoloading and have pyarrow/__init__.py call cffi.FFI.dlopen() with the correct path.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Feb 2, 2024
@bkietz
Copy link
Member Author

bkietz commented Feb 2, 2024

In light of the lack of an obvious approach, I think I'll defer support for autoloading for the moment. The main feature of interest is the registry anyway, and adding autoloading will not be more difficult after the registry is defined.

@kou
Copy link
Member

kou commented Feb 4, 2024

It makes sense.

@bkietz bkietz force-pushed the 38309-filesystem-modules branch from 8bff100 to abaed2e Compare February 7, 2024 19:24
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 7, 2024
@bkietz bkietz force-pushed the 38309-filesystem-modules branch from abaed2e to f95dc66 Compare February 8, 2024 03:01
@bkietz bkietz marked this pull request as ready for review February 8, 2024 03:02
@bkietz bkietz force-pushed the 38309-filesystem-modules branch from 094f3a7 to 4925905 Compare March 6, 2024 21:57
@bkietz
Copy link
Member Author

bkietz commented Mar 6, 2024

@github-actions crossbow submit -g cpp -g r -g python -g wheel

@github-actions
Copy link

github-actions bot commented Mar 6, 2024

Revision: 4925905ea42d9571f65237603b0acc7547c9730f

Submitted crossbow builds: ursacomputing/crossbow @ actions-348ec81e98

Task Status
r-binary-packages GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-cpp GitHub Actions
test-cuda-python GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-debian-11-python-3-amd64 Azure
test-debian-11-python-3-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-fedora-39-python-3 Azure
test-fedora-r-clang-sanitizer Azure
test-r-arrow-backwards-compatibility GitHub Actions
test-r-depsource-bundled Azure
test-r-depsource-system GitHub Actions
test-r-dev-duckdb GitHub Actions
test-r-devdocs GitHub Actions
test-r-gcc-11 GitHub Actions
test-r-gcc-12 GitHub Actions
test-r-install-local GitHub Actions
test-r-install-local-minsizerel GitHub Actions
test-r-linux-as-cran GitHub Actions
test-r-linux-rchk GitHub Actions
test-r-linux-valgrind Azure
test-r-minimal-build Azure
test-r-offline-maximal GitHub Actions
test-r-offline-minimal Azure
test-r-rhub-debian-gcc-devel-lto-latest Azure
test-r-rhub-debian-gcc-release-custom-ccache Azure
test-r-rhub-ubuntu-gcc-release-latest Azure
test-r-rocker-r-ver-latest Azure
test-r-rstudio-r-base-4.1-opensuse153 Azure
test-r-rstudio-r-base-4.2-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-4.2-focal Azure
test-r-ubuntu-22.04 GitHub Actions
test-r-versions GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-r-sanitizer Azure
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

@pitrou
Copy link
Member

pitrou commented Mar 7, 2024

The test-ubuntu-22.04-cpp-20 build failure seems unexpected. Can you perhaps try it locally and look through the build logs?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 7, 2024
@bkietz
Copy link
Member Author

bkietz commented Mar 7, 2024

The failure is in the orc external project. Since orc doesn't depend on arrow I think this must be spurious, and the retry has succeeded.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to look very neat. The remaining comments are minor. Thank you!

@bkietz bkietz force-pushed the 38309-filesystem-modules branch from 4925905 to c5d03eb Compare March 7, 2024 16:29
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 7, 2024
@bkietz bkietz force-pushed the 38309-filesystem-modules branch from c5d03eb to 74e0605 Compare March 7, 2024 16:34
@bkietz bkietz force-pushed the 38309-filesystem-modules branch from 74e0605 to 2987f7c Compare March 7, 2024 16:48
@bkietz
Copy link
Member Author

bkietz commented Mar 7, 2024

Appveyor and macos failures seem unrelated.

@bkietz bkietz merged commit b235f83 into apache:main Mar 14, 2024
@bkietz bkietz removed the awaiting change review Awaiting change review label Mar 14, 2024
bkietz added a commit that referenced this pull request Mar 14, 2024
### Rationale for this change

Failure to rebase and build when merging #39067 (which renamed `internal::Uri` -> `util::Uri`) led to a merge conflict since #40325 added more usages of `internal::Uri`

### What changes are included in this PR?
Rename internal::Uri -> util::Uri

### Are these changes tested?
Yes

### Are there any user-facing changes?
No

* GitHub Issue: #40562

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
@jorisvandenbossche
Copy link
Member

Seems this caused some nightly failures in the minimal C++ builds, see

/usr/bin/ld: /usr/local/lib/libarrow.a(io_util.cc.o): in function `arrow::internal::LoadDynamicLibrary(char const*) [clone .localalias]':
io_util.cc:(.text+0x56e0): undefined reference to `dlopen'
/usr/bin/ld: io_util.cc:(.text+0x5721): undefined reference to `dlerror'
/usr/bin/ld: /usr/local/lib/libarrow.a(io_util.cc.o): in function `arrow::internal::GetSymbol(void*, char const*)':
io_util.cc:(.text+0x58c3): undefined reference to `dlsym'
/usr/bin/ld: io_util.cc:(.text+0x59c1): undefined reference to `dlerror'
collect2: error: ld returned 1 exit status

bkietz added a commit that referenced this pull request Mar 16, 2024
#40578)

### Rationale for this change

When linking statically, pkg-config doesn't pick up the new dependency on libdl introduced by #39067

This produces [unresolved symbol errors](#39067 (comment))

### What changes are included in this PR?

Addition of `-ldl` to `ARROW_PC_LIBS_PRIVATE` to ensure linkage to the necessary library

### Are these changes tested?

yes

### Are there any user-facing changes?

no

* GitHub Issue: #40577

Authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit b235f83.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

pitrou pushed a commit that referenced this pull request Mar 28, 2024
### Rationale for this change

Module library `arrow_filesystem_example` is introduced in #39067 for filesystem testing:
https://github.com/apache/arrow/blob/6cecbab5172b2b339277dde741bfff455646eb32/cpp/src/arrow/testing/CMakeLists.txt#L25

However when built with TSAN, linker flags such as `-fsanitize=thread` is not set, causing the link error in #40863.

### What changes are included in this PR?

Add necessary linker flags for module library.

### Are these changes tested?

Manually tested.

### Are there any user-facing changes?

None.

* GitHub Issue: #40863

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…40864)

### Rationale for this change

Module library `arrow_filesystem_example` is introduced in apache#39067 for filesystem testing:
https://github.com/apache/arrow/blob/6cecbab5172b2b339277dde741bfff455646eb32/cpp/src/arrow/testing/CMakeLists.txt#L25

However when built with TSAN, linker flags such as `-fsanitize=thread` is not set, causing the link error in apache#40863.

### What changes are included in this PR?

Add necessary linker flags for module library.

### Are these changes tested?

Manually tested.

### Are there any user-facing changes?

None.

* GitHub Issue: apache#40863

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] Allow filesystems to be implemented in separate library (+ move remote filesystems out of libarrow into their own shared libraries)

5 participants