Skip to content

Replace EnhancedFileSystem with fsspec#128

Merged
npennequin merged 3 commits intomasterfrom
fsspec
Nov 20, 2024
Merged

Replace EnhancedFileSystem with fsspec#128
npennequin merged 3 commits intomasterfrom
fsspec

Conversation

@npennequin
Copy link
Copy Markdown
Collaborator

@npennequin npennequin commented Nov 12, 2024

The current implementation of EnhancedFileSystem is based on the legacy pyarrow filesystem interface that was removed in pyarrow 16.0.0 (apache/arrow#39825).

We can entirely replace EnhancedFileSystem with fsspec. For HDFS fsspec relies on the new pyarrow filesystem interface.

Resolves #87

@npennequin npennequin force-pushed the fsspec branch 4 times, most recently from d142367 to af290d6 Compare November 13, 2024 17:01
@npennequin npennequin changed the title Replace EnhancedFileSystem/EnhancedHdfsFile with fsspec Replace EnhancedFileSystem with fsspec Nov 14, 2024
PyArrow has stopped supporting it since 13.0.0
The mocking setup for packaging.pack_spec_in_pex caused upload_spec to
copy the pex file with the same source and destination.
This doesn't fail with EnhancedFileSystem, but it raises an exception
with fsspec filesystems
The current implementation of EnhancedFileSystem is based on the legacy
pyarrow filesystem interface that was removed in pyarrow 16.0.0
(apache/arrow#39825).

We can entirely replace EnhancedFileSystem with fsspec. For HDFS fsspec
relies on the new pyarrow filesystem interface.

Behavior change note: for put, fsspec doesn't preserve file permissions

Resolves #87

remote_file = f"{temp_dir}/copied_script.sh"
fs.put(file, remote_file)
fs.chmod(remote_file, 0o755)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will probably lead to issues when users will want to retrieve a pex from hdfs and execute it, as this was handled by the preserve_acls in the EnhancedFileSystem.

Since pyarrow doesn't provide a chmod any more, I don't have a simple way to ensure that original permissions are preserved.

Another way to solve the issue would be to to the chmod locally when the pex is retrieved from hdfs by cluster_pack, but depending on the usage, this solution would be different for each use case.

for example in the spark_example.py:

#this creates an executable pex, and uploads it on hdfs (previously preserving permissions)
 package_path, _ = cluster_pack.upload_env()
 
#this adds the hdfs path of the pex in "spark.yarn.dist.files" so it is downloaded by spark executors, and
#sets PYSPARK_PYTHON to the local pex file retrieved from hdfs in the executors.
spark_config_builder.add_packaged_environment(ssb, package_path)

It seems that if we have lost the executable permission when going from local to hdfs, it won't be executable on executor's side

@jcuquemelle jcuquemelle marked this pull request as ready for review November 20, 2024 16:59
@npennequin npennequin merged commit fe938c1 into master Nov 20, 2024
npennequin added a commit that referenced this pull request Nov 26, 2024
Using `fsspec.url_to_fs` resulted in using an incorrect host value
for instantiating HDFS filesystems.

For example `fsspec.url_to_fs("viewfs://root/user/someone")` would
call `fsspec.filesystem("viewfs", host="root")`, which could cause
errors. Instead in this case we need the host to be `viewfs://root`, so
we restore most of the code of that function from before #128.

However since we're using fsspec we can generalize to all its supported
file system implementations.
npennequin added a commit that referenced this pull request Nov 27, 2024
Using `fsspec.url_to_fs` resulted in using an incorrect host value
for instantiating HDFS filesystems.

For example `fsspec.url_to_fs("viewfs://root/user/someone")` would
call `fsspec.filesystem("viewfs", host="root")`, which could cause
errors. Instead in this case we need the host to be `viewfs://root`, so
we restore most of the code of that function from before #128.

However since we're using fsspec we can generalize to all its supported
file system implementations.
npennequin added a commit that referenced this pull request Nov 27, 2024
Using `fsspec.url_to_fs` resulted in using an incorrect host value
for instantiating HDFS filesystems.

For example `fsspec.url_to_fs("viewfs://root/user/someone")` would
call `fsspec.filesystem("viewfs", host="root")`, which could cause
errors. Instead in this case we need the host to be `viewfs://root`, so
we restore most of the code of that function from before #128.

However since we're using fsspec we can generalize to all its supported
file system implementations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

use fsspec instead of custom generic filesystem

2 participants