Force nightly pyarrow in the upstream build#8993
Conversation
It seems the build is hanging after a pyarrow-related test, and so it might actually be identifying an issue with the latest dask / pyarrow combo.
So I can reproduce this locally: the test hangs with the above output, and I can't even interrupt it with Ctrl-C but have to close the terminal.
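As an aside, when a hang is so hard that Ctrl-C is ignored, the stdlib `faulthandler` watchdog is a handy way to see where the process is stuck; a minimal sketch (the 0.2s timeout and the sleep standing in for the hanging test body are arbitrary choices for illustration):

```python
import faulthandler
import tempfile
import time

# Arm a watchdog that dumps every thread's traceback if we are still
# running after 0.2s -- useful when the hang cannot be interrupted.
with tempfile.NamedTemporaryFile("w+", delete=False) as log:
    faulthandler.dump_traceback_later(0.2, file=log)
    time.sleep(0.5)  # stand-in for the hanging test body
    faulthandler.cancel_dump_traceback_later()
    log.seek(0)
    report = log.read()

# The dump starts with a "Timeout (...)!" line followed by per-thread
# tracebacks showing exactly where each thread is blocked.
assert "Timeout" in report
```

In a real test run you would point `file=` at stderr or a log file and inspect the traceback to find the blocking call.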
From some debugging locally, it seems that inspecting a parquet file (the `format.inspect` call below) hangs. A reproducible test for dask (using the moto server based fixtures):

```python
def test_parquet_hangs(s3, s3so):
    import s3fs  # noqa: F401

    dd = pytest.importorskip("dask.dataframe")
    pd = pytest.importorskip("pandas")
    np = pytest.importorskip("numpy")
    pytest.importorskip("pyarrow")

    # test_bucket_name is assumed to be defined in the surrounding test module
    url = "s3://%s/test.parquet" % test_bucket_name
    data = pd.DataFrame({"col": np.arange(1000, dtype=np.int64)})
    df = dd.from_pandas(data, chunksize=500)
    df.to_parquet(url, engine="pyarrow", storage_options=s3so)

    # get fsspec filesystem
    from fsspec.core import get_fs_token_paths

    fs, _, paths = get_fs_token_paths(url, mode="rb", storage_options=s3so)

    # inspecting the file with pyarrow.dataset hangs
    import pyarrow.dataset as ds
    from pyarrow.fs import _ensure_filesystem

    format = ds.ParquetFileFormat()
    filesystem = _ensure_filesystem(fs)
    format.inspect(paths[0] + "/part.0.parquet", filesystem)
```

A reproducible test for pyarrow (using the MinIO server based fixtures):

```python
@pytest.mark.parquet
@pytest.mark.s3
def test_parquet_inspect_hangs_s3(s3_server):
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq
    from pyarrow.fs import S3FileSystem, _ensure_filesystem

    host, port, access_key, secret_key = s3_server['connection']

    # create bucket + file with pyarrow
    fs = S3FileSystem(
        access_key=access_key,
        secret_key=secret_key,
        endpoint_override='{}:{}'.format(host, port),
        scheme='http',
    )
    fs.create_dir("mybucket")
    table = pa.table({'a': [1, 2, 3]})
    path = "mybucket/data.parquet"
    with fs.open_output_stream(path) as out:
        pq.write_table(table, out)

    # read using fsspec filesystem
    import s3fs

    fsspec_fs = s3fs.S3FileSystem(
        key=access_key,
        secret=secret_key,
        client_kwargs={"endpoint_url": f"http://{host}:{port}"},
    )
    assert fsspec_fs.ls("mybucket") == ['mybucket/data.parquet']

    # using the dataset file format hangs on inspect
    format = ds.ParquetFileFormat()
    filesystem = _ensure_filesystem(fsspec_fs)
    schema = format.inspect(path, filesystem)
    assert schema.equals(table.schema)
```
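For context on why this can hang at all: `_ensure_filesystem` wraps the fsspec filesystem for pyarrow's C++ layer, and s3fs is async under the hood, so this belongs to the family of bugs where a synchronous wrapper blocks on an event loop from the wrong thread and deadlocks. A minimal sketch of the general safe pattern (this is not fsspec's actual code; `fake_info` is a made-up stand-in for an async S3 metadata call): run the loop on a dedicated thread and hand coroutines over with `run_coroutine_threadsafe`.

```python
import asyncio
import threading

# Keep one event loop on a dedicated thread; synchronous callers on any
# other thread submit coroutines to it instead of running a nested loop.
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def fake_info(path):
    # made-up stand-in for an async S3 metadata call (not the s3fs API)
    await asyncio.sleep(0.01)
    return {"name": path, "size": 123}

def sync_info(path, timeout=5):
    # Safe sync-over-async: schedule the coroutine on the loop thread and
    # block only on the returned Future, never on the loop itself.
    future = asyncio.run_coroutine_threadsafe(fake_info(path), loop)
    return future.result(timeout)

info = sync_info("mybucket/data.parquet")
loop.call_soon_threadsafe(loop.stop)
assert info == {"name": "mybucket/data.parquet", "size": 123}
```

If instead the calling thread tries to drive the loop directly (or blocks while holding something the loop thread needs), you get exactly the kind of uninterruptible hang seen above.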
|
This seems to be a bug on the pyarrow side; I opened https://issues.apache.org/jira/browse/ARROW-16413
jrbourbeau
left a comment
Thanks @jorisvandenbossche for updating the CI environment and debugging this issue. Should we temporarily skip the hanging test and merge this PR in?
I have a PR open to fix this (apache/arrow#13033), so we can probably wait to merge this PR until it is fixed. In the meantime, I did add temporary skips, so we can at least check the rest of the tests on this PR.
OK, the test build is now finishing. There are still some failures because of a deprecation warning, which means that the pyarrow.dataset engine is still using the legacy ParquetDataset API in some places (xref #8243). cc @rjzamora
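The way such a leftover usage surfaces as a CI failure is by escalating deprecation warnings to errors; a minimal sketch, with a hypothetical `legacy_read` standing in for the code path that still touches the legacy API:

```python
import warnings

def legacy_read():
    # hypothetical stand-in for a code path still using a deprecated API
    warnings.warn("ParquetDataset is deprecated", DeprecationWarning, stacklevel=2)
    return "data"

# Roughly what running pytest with -W error::DeprecationWarning does:
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    try:
        legacy_read()
        caught = False
    except DeprecationWarning:
        caught = True

assert caught  # the leftover legacy call now fails loudly instead of warning
```

This is why the warning only shows up once the nightly starts emitting it: the code path itself is unchanged, but the escalated warning turns it into a test failure.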
Ah, it seems this is only done in a helper function defined in the tests itself (dask/dask/dataframe/io/tests/test_parquet.py, lines 1735 to 1746 at 4d6a5f0). That should be possible to rewrite to use …
Nice catch, I'll push a fix up for this.
Thanks Joris! We might have had a user run into this issue last week (we never determined whether it was pyarrow's or fsspec's fault). Hopefully this fixed their problem too 🤞.
Hopefully you didn't start on that yet, as I already included a commit here as well |
I didn't, thanks for letting me know :) |
jrbourbeau
left a comment
Thanks @jorisvandenbossche!
> we can probably wait to merge this PR until it is fixed

So after removing the skips and rerunning CI (say tomorrow, after a new nightly is out), this should be good to go.
It's still picking up yesterday's nightly package. We had an upload failure that left a few packages missing, among them the Linux one for Python 3.9, so we will have to retry tomorrow.
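A cheap guard against silently testing a stale build is to assert early in CI that the installed version really is a dev/nightly one; a hypothetical helper, assuming PEP 440-style `.dev` version segments (as pyarrow's nightlies use):

```python
def is_nightly(version: str) -> bool:
    # PEP 440 dev releases look like "9.0.0.dev123"; released versions
    # ("8.0.0") have no ".dev" segment. (Assumption: nightlies follow
    # this convention.)
    return ".dev" in version

# In a real CI check you would feed it importlib.metadata.version("pyarrow")
assert is_nightly("9.0.0.dev123")
assert not is_nightly("8.0.0")
```

Failing fast on this check would have flagged the stale py3.9 package immediately instead of after a full (green-but-meaningless) test run.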
This is finally passing now!
jrbourbeau
left a comment
Hooray -- thanks @jorisvandenbossche!
Similar to #8281, it's still not fully clear why CI is not automatically picking up the most recent version.