Replace EnhancedFileSystem with fsspec#128
Merged
npennequin merged 3 commits intomasterfrom Nov 20, 2024
Merged
Conversation
d142367 to
af290d6
Compare
PyArrow has stopped supporting it since 13.0.0
The mocking setup for packaging.pack_spec_in_pex caused upload_spec to copy the pex file with the same source and destination. This doesn't fail with EnhancedFileSystem, but it raises an exception with fsspec filesystems
The current implementation of EnhancedFileSystem is based on the legacy pyarrow filesystem interface that was removed in pyarrow 16.0.0 (apache/arrow#39825). We can entirely replace EnhancedFileSystem with fsspec. For HDFS fsspec relies on the new pyarrow filesystem interface. Behavior change note: for put, fsspec doesn't preserve file permissions Resolves #87
jcuquemelle
requested changes
Nov 14, 2024
|
|
||
| remote_file = f"{temp_dir}/copied_script.sh" | ||
| fs.put(file, remote_file) | ||
| fs.chmod(remote_file, 0o755) |
Contributor
There was a problem hiding this comment.
I think this will probably lead to issues when users will want to retrieve a pex from hdfs and execute it, as this was handled by the preserve_acls in the EnhancedFileSystem.
Since pyarrow doesn't provide a chmod any more, I don't have a simple way to ensure that original permissions are preserved.
Another way to solve the issue would be to to the chmod locally when the pex is retrieved from hdfs by cluster_pack, but depending on the usage, this solution would be different for each use case.
for example in the spark_example.py:
#this creates an executable pex, and uploads it on hdfs (previously preserving permissions)
package_path, _ = cluster_pack.upload_env()
#this adds the hdfs path of the pex in "spark.yarn.dist.files" so it is downloaded by spark executors, and
#sets PYSPARK_PYTHON to the local pex file retrieved from hdfs in the executors.
spark_config_builder.add_packaged_environment(ssb, package_path)
It seems that if we have lost the executable permission when going from local to hdfs, it won't be executable on executor's side
jcuquemelle
approved these changes
Nov 20, 2024
npennequin
added a commit
that referenced
this pull request
Nov 26, 2024
Using `fsspec.url_to_fs` resulted in using an incorrect host value
for instantiating HDFS filesystems.
For example `fsspec.url_to_fs("viewfs://root/user/someone")` would
call `fsspec.filesystem("viewfs", host="root")`, which could cause
errors. Instead in this case we need the host to be `viewfs://root`, so
we restore most of the code of that function from before #128.
However since we're using fsspec we can generalize to all its supported
file system implementations.
npennequin
added a commit
that referenced
this pull request
Nov 27, 2024
Using `fsspec.url_to_fs` resulted in using an incorrect host value
for instantiating HDFS filesystems.
For example `fsspec.url_to_fs("viewfs://root/user/someone")` would
call `fsspec.filesystem("viewfs", host="root")`, which could cause
errors. Instead in this case we need the host to be `viewfs://root`, so
we restore most of the code of that function from before #128.
However since we're using fsspec we can generalize to all its supported
file system implementations.
npennequin
added a commit
that referenced
this pull request
Nov 27, 2024
Using `fsspec.url_to_fs` resulted in using an incorrect host value
for instantiating HDFS filesystems.
For example `fsspec.url_to_fs("viewfs://root/user/someone")` would
call `fsspec.filesystem("viewfs", host="root")`, which could cause
errors. Instead in this case we need the host to be `viewfs://root`, so
we restore most of the code of that function from before #128.
However since we're using fsspec we can generalize to all its supported
file system implementations.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The current implementation of
EnhancedFileSystemis based on the legacy pyarrow filesystem interface that was removed in pyarrow 16.0.0 (apache/arrow#39825).We can entirely replace
EnhancedFileSystemwith fsspec. For HDFS fsspec relies on the new pyarrow filesystem interface.Resolves #87