GH-46374: [Python][Docs] Adds type checks for source in read_table#46330
GH-46374: [Python][Docs] Adds type checks for source in read_table#46330Mottl wants to merge 6 commits intoapache:mainfrom
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
raulcd
left a comment
There was a problem hiding this comment.
This is tricky because if the dataset module is enabled, yes we can use a list of strings but if we've built pyarrow without dataset a list of strings won't work because instead of using ParquetDataset->source we will be using ParquetFile->source.
I don't think we want to mislead users thinking a list of strings can be used when it might not be the case. cc @AlenkaF
|
@raulcd I suppose we want to document this conditional behavior? |
That makes sense, then something in the lines: |
|
I agree that the conditional behavior is worth documenting.
This is great. I would just suggest changing "can also be used" to "can also be passed" to align more closely with the existing phrasing. One more thing to note: this will require a test that would use the dataset mark. Because of that, the PR no longer fits into the MINOR category and will need a corresponding issue to track the change. |
|
Agreed with the proposed text (as amended by Alenka). For the dataset marked test - might as well be a separate PR? |
|
Can we just |
|
I don't think we want to go in that direction—concatenation would introduce complexity that's already handled in the Dataset C++ layer (schema mismatches is a good example). Instead, we could consider adding a warning and suggest the use of the dataset module when passing files as a list (or a dict) here: arrow/python/pyarrow/parquet/core.py Line 1828 in a2e6136 |
|
Okay, close this PR as soon as you finish with the discussion. Thanks all! |
|
Have a look please |
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
|
|
1 similar comment
|
|
| source = filesystem.open_input_file(path) | ||
| # TODO test that source is not a directory or a list | ||
| if not ( | ||
| (isinstance(source, str) and not os.path.isdir(source)) |
There was a problem hiding this comment.
What if source is a folder in an S3 bucket? os.path.isdir
will return True for existing folder. Will it check through the s3 fs? That would add a network call. Can we just check if the string is a valid folder path and not if it also exists?
Rationale for this change
pyarrow.parquet.read_tableactually supportssourceparameter as a list of stringsAre these changes tested?
No, an issue to test those has been opened as this code path is currently not being tested. See:
Are there any user-facing changes?
no, only docstring