[Data] Raise future warning if invalid Parquet extensions #50092
bveeramani merged 23 commits into master
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani this didn't seem to work for me?
    "parquet.snappy",
    "snappy.parquet",
    # Gzip compression
    "parquet.gz",
    # Brotli compression
    "parquet.br",
    # Lz4 compression
    "parquet.lz4",
    # Zstd compression
    "parquet.zst",
Can you help me understand where these are coming from? It should be .snappy.parquet, for example, not the other way around.
These are the canonical file extensions for the compression formats that PyArrow supports.
Misread your comment. I've seen both. I agree that parquet.snappy is more common, but I've also seen snappy.parquet, so I included it.
How should I change this list?
@richardliaw how are your warnings configured? Ray Data emits the warning when I test it in an interactive session and with the unit test.
Interesting, well I guess in theory the code looks right. I don't have warnings configured, so not sure why it's not showing up.
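One way to check whether a warning is emitted, regardless of how warning filters happen to be configured, is to record warnings explicitly. The sketch below uses only the standard `warnings` module. `emit_future_warning` and its message are hypothetical stand-ins for the call under test, not Ray Data code.

```python
import warnings


def emit_future_warning():
    # Hypothetical stand-in for the call that is expected to warn.
    warnings.warn("The default for file_extensions will change.", FutureWarning)


def captured_categories(func):
    """Run func and return the categories of all warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" overrides filters that would hide or deduplicate warnings.
        warnings.simplefilter("always")
        func()
    return [w.category for w in caught]


categories = captured_categories(emit_future_warning)
print(FutureWarning in categories)
```

If this prints `True` but nothing shows up in a normal run, the warning is probably being suppressed or deduplicated by a filter rather than never being raised.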
tests failing
Investigating 👀
Why are these changes needed?
People often have non-Parquet files in their datasets (e.g., `_SUCCESS` or stale files). However, the default for `file_extensions` is `None`, so `read_parquet` tries reading the non-Parquet files. To avoid this issue, we'll change the default file extensions to something like `["parquet"]`. This PR adds a warning for that change.
Related issue number
Checks
git commit -s) in this PR.
scripts/format.sh to lint the changes in this PR.
method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
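The behavior described under "Why are these changes needed?" can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Ray Data's implementation: `PARQUET_EXTENSIONS`, `has_valid_extension`, and `warn_on_invalid_extensions` are hypothetical names, the warning text is invented, and the extension list is abbreviated from the one quoted in the review thread.

```python
import warnings

# Hypothetical, abbreviated list; the list in the PR is longer.
PARQUET_EXTENSIONS = ["parquet", "parquet.snappy", "snappy.parquet", "parquet.gz"]


def has_valid_extension(path, extensions=PARQUET_EXTENSIONS):
    """Return True if the path ends with one of the given extensions."""
    lowered = path.lower()
    return any(lowered.endswith("." + ext) for ext in extensions)


def warn_on_invalid_extensions(paths, file_extensions=None):
    """Return the paths that would be read.

    With an explicit filter, only matching paths are kept. With no filter
    (the current default in this sketch), every path is kept, and a
    FutureWarning is emitted if some paths do not look like Parquet files.
    """
    if file_extensions is not None:
        return [p for p in paths if has_valid_extension(p, file_extensions)]
    invalid = [p for p in paths if not has_valid_extension(p)]
    if invalid:
        warnings.warn(
            f"Found files without a Parquet extension: {invalid}. "
            "A future default may skip them.",
            FutureWarning,
            stacklevel=2,
        )
    return list(paths)
```

In this sketch, passing `file_extensions=["parquet"]` explicitly drops a file such as `_SUCCESS` and avoids the warning, which mirrors the default change the PR announces.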