
ARROW-16728: [Python] Switch default and deprecate use_legacy_dataset=True in ParquetDataset #14052

Merged

jorisvandenbossche merged 6 commits into apache:master from jorisvandenbossche:ARROW-16728-ParquetDataset-switch-default on Dec 23, 2022

Conversation

@jorisvandenbossche
Member

No description provided.
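The PR title summarizes the change: the default of `use_legacy_dataset` is flipped and explicitly passing `use_legacy_dataset=True` is deprecated. A minimal stdlib-only sketch of that kind of deprecation pattern (the function and return values here are hypothetical stand-ins, not pyarrow's actual implementation):

```python
import warnings

def make_dataset(path, use_legacy_dataset=None):
    """Hypothetical constructor illustrating a default switch plus deprecation."""
    if use_legacy_dataset is not None:
        # Explicitly passing the keyword now raises a FutureWarning,
        # steering callers toward the new default behavior.
        warnings.warn(
            "Passing 'use_legacy_dataset' is deprecated; the new "
            "Dataset API is now used by default.",
            FutureWarning,
            stacklevel=2,
        )
    # New default: use the new implementation unless the caller
    # explicitly opted into the legacy one.
    return "legacy-dataset" if use_legacy_dataset else "new-dataset"
```

Calling `make_dataset("some/path")` silently uses the new path, while any explicit `use_legacy_dataset=` keyword warns, which matches the deprecation staging described in the title.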

@github-actions

github-actions bot commented Sep 6, 2022

@github-actions

github-actions bot commented Sep 6, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

@github-actions

Revision: 045f680

Submitted crossbow builds: ursacomputing/crossbow @ actions-fbe296648d

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-hdfs-2.9.2 Github Actions
test-conda-python-3.7-hdfs-3.2.1 Github Actions
test-conda-python-3.7-pandas-0.24 Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-spark-v3.1.2 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-v3.2.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-dask-latest Github Actions
test-conda-python-3.9-dask-master Github Actions
test-conda-python-3.9-pandas-master Github Actions
test-conda-python-3.9-spark-master Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure

@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

@github-actions

Revision: a8989b4

Submitted crossbow builds: ursacomputing/crossbow @ actions-5e75775ce9

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-hdfs-2.9.2 Github Actions
test-conda-python-3.7-hdfs-3.2.1 Github Actions
test-conda-python-3.7-pandas-1.0 Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-spark-v3.1.2 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-v3.2.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-dask-latest Github Actions
test-conda-python-3.9-dask-upstream_devel Github Actions
test-conda-python-3.9-pandas-upstream_devel Github Actions
test-conda-python-3.9-spark-master Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure

@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

@github-actions

Revision: 16e33d3

Submitted crossbow builds: ursacomputing/crossbow @ actions-6924cff3d2

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-hdfs-2.9.2 Github Actions
test-conda-python-3.7-hdfs-3.2.1 Github Actions
test-conda-python-3.7-pandas-1.0 Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-spark-v3.1.2 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-v3.2.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-dask-latest Github Actions
test-conda-python-3.9-dask-upstream_devel Github Actions
test-conda-python-3.9-pandas-upstream_devel Github Actions
test-conda-python-3.9-spark-master Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure

@jorisvandenbossche jorisvandenbossche merged commit 5a98058 into apache:master Dec 23, 2022
@ursabot

ursabot commented Dec 24, 2022

Benchmark runs are scheduled for baseline = 305026f and contender = 5a98058. 5a98058 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.17% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 5a980580 ec2-t3-xlarge-us-east-2
[Failed] 5a980580 test-mac-arm
[Finished] 5a980580 ursa-i9-9960x
[Finished] 5a980580 ursa-thinkcentre-m75q
[Finished] 305026f6 ec2-t3-xlarge-us-east-2
[Failed] 305026f6 test-mac-arm
[Finished] 305026f6 ursa-i9-9960x
[Finished] 305026f6 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jorisvandenbossche jorisvandenbossche deleted the ARROW-16728-ParquetDataset-switch-default branch December 24, 2022 08:32
@kou
Member

kou commented Dec 24, 2022

@jorisvandenbossche the test-conda-python-3.7-hdfs-* failures seem to be related:

https://github.com/ursacomputing/crossbow/actions/runs/3765628706/jobs/6401304143#step:5:9344

=================================== FAILURES ===================================
_________________ TestLibHdfs.test_read_multiple_parquet_files _________________

self = <pyarrow.tests.test_hdfs.TestLibHdfs testMethod=test_read_multiple_parquet_files>

    @pytest.mark.pandas
    @pytest.mark.parquet
    def test_read_multiple_parquet_files(self):
    
        tmpdir = pjoin(self.tmp_path, 'multi-parquet-' + guid())
    
        self.hdfs.mkdir(tmpdir)
    
        expected = self._write_multiple_hdfs_pq_files(tmpdir)
>       result = self.hdfs.read_parquet(tmpdir)

opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py:318: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/filesystem.py:227: in read_parquet
    filesystem=self)
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet/core.py:1759: in __new__
    thrift_container_size_limit=thrift_container_size_limit,
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet/core.py:2402: in __init__
    filesystem, use_mmap=memory_map)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

filesystem = <pyarrow.hdfs.HadoopFileSystem object at 0x7ff3f8702d40>
use_mmap = False, allow_legacy_filesystem = False

    def _ensure_filesystem(
        filesystem, use_mmap=False, allow_legacy_filesystem=False
    ):
        if isinstance(filesystem, FileSystem):
            return filesystem
        elif isinstance(filesystem, str):
            if use_mmap:
                raise ValueError(
                    "Specifying to use memory mapping not supported for "
                    "filesystem specified as an URI string"
                )
            return _filesystem_from_str(filesystem)
    
        # handle fsspec-compatible filesystems
        try:
            import fsspec
        except ImportError:
            pass
        else:
            if isinstance(filesystem, fsspec.AbstractFileSystem):
                if type(filesystem).__name__ == 'LocalFileSystem':
                    # In case its a simple LocalFileSystem, use native arrow one
                    return LocalFileSystem(use_mmap=use_mmap)
                return PyFileSystem(FSSpecHandler(filesystem))
    
        # map old filesystems to new ones
        import pyarrow.filesystem as legacyfs
    
        if isinstance(filesystem, legacyfs.LocalFileSystem):
            return LocalFileSystem(use_mmap=use_mmap)
        # TODO handle HDFS?
        if allow_legacy_filesystem and isinstance(filesystem, legacyfs.FileSystem):
            return filesystem
    
        raise TypeError(
            "Unrecognized filesystem: {}. `filesystem` argument must be a "
            "FileSystem instance or a valid file system URI'".format(
>               type(filesystem))
        )
E       TypeError: Unrecognized filesystem: <class 'pyarrow.hdfs.HadoopFileSystem'>. `filesystem` argument must be a FileSystem instance or a valid file system URI'
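The traceback above can be reduced to a small dispatch problem: the new code path only accepts instances of the new filesystem base class, so the legacy `pyarrow.hdfs.HadoopFileSystem` falls through to the `TypeError` unless `allow_legacy_filesystem=True` is passed. A stdlib-only sketch of that dispatch (class names are stand-ins, not pyarrow's real classes):

```python
class FileSystem:
    """Stand-in for the new pyarrow.fs.FileSystem base class."""

class LegacyHadoopFileSystem:
    """Stand-in for the legacy pyarrow.hdfs.HadoopFileSystem."""

def ensure_filesystem(filesystem, allow_legacy_filesystem=False):
    # New-style filesystems pass through unchanged.
    if isinstance(filesystem, FileSystem):
        return filesystem
    # Legacy objects are tolerated only when the caller opts in;
    # otherwise they hit the TypeError seen in the traceback.
    if allow_legacy_filesystem:
        return filesystem
    raise TypeError(
        "Unrecognized filesystem: {}".format(type(filesystem))
    )
```

Under this sketch, `ensure_filesystem(LegacyHadoopFileSystem())` raises, while passing `allow_legacy_filesystem=True` lets the legacy object through, which is the gap the follow-up fix addresses.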

EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this pull request Jan 5, 2023
…=True in ParquetDataset (apache#14052)

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Member Author

@kou sorry, not sure how I missed that failure. Somehow I was convinced that I had already inspected the HDFS builds and concluded it was an unrelated failure. Fixing this in #15269
