GH-40142: [Python] Allow FileInfo instances to be passed to dataset init (#40143)
### Rationale for this change
Closes #40142
I'm developing a new Dask integration with the pyarrow Parquet reader (see dask/dask-expr#882) and want to rely on the pyarrow Filesystem more.
Right now, we perform a list operation ourselves to get all touched files, and I would like to pass the retrieved `FileInfo` objects directly to the dataset constructor. This API is already exposed in C++, and this PR adds the necessary Python bindings.
The benefit of this API is that it removes the need to perform additional HEAD requests against remote storage.
This came up in #38389 (comment), and there has already been related work in #37857.
### What changes are included in this PR?
Python bindings for the `DatasetFactory` constructor that accepts a list/vector of `FileInfo` objects.
### Are these changes tested?
~I slightly modified the minio test setup such that the Prometheus endpoint is exposed. This can be used to assert that there haven't been any HEAD requests.~ I ended up removing this again since parsing the response is a bit brittle.
### Are there any user-facing changes?
* Closes: #40142
Lead-authored-by: fjetter <fjetter@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>