
ARROW-16728: [Python] Switch default and deprecate use_legacy_dataset=True in ParquetDataset #14052

Merged

jorisvandenbossche merged 6 commits into apache:master from jorisvandenbossche:ARROW-16728-ParquetDataset-switch-default on Dec 23, 2022

Conversation

@jorisvandenbossche
Member

No description provided.
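The PR title summarizes the change: the default of `use_legacy_dataset` is flipped and explicitly passing `use_legacy_dataset=True` is deprecated. A minimal stdlib-only sketch of that kind of deprecation pattern (the function and return values here are hypothetical stand-ins, not pyarrow's actual implementation):

```python
import warnings

def make_dataset(path, use_legacy_dataset=None):
    """Hypothetical constructor illustrating a default switch plus deprecation."""
    if use_legacy_dataset is not None:
        # Explicitly passing the keyword now raises a FutureWarning,
        # steering callers toward the new default behavior.
        warnings.warn(
            "Passing 'use_legacy_dataset' is deprecated; the new "
            "Dataset API is now used by default.",
            FutureWarning,
            stacklevel=2,
        )
    # New default: use the new implementation unless the caller
    # explicitly opted into the legacy one.
    return "legacy-dataset" if use_legacy_dataset else "new-dataset"
```

Calling `make_dataset("some/path")` silently uses the new path, while any explicit `use_legacy_dataset=` keyword warns, which matches the deprecation staging described in the title.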

@github-actions

github-actions bot commented Sep 6, 2022

@github-actions

github-actions bot commented Sep 6, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

@github-actions

Revision: 045f680

Submitted crossbow builds: ursacomputing/crossbow @ actions-fbe296648d

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-hdfs-2.9.2 Github Actions
test-conda-python-3.7-hdfs-3.2.1 Github Actions
test-conda-python-3.7-pandas-0.24 Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-spark-v3.1.2 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-v3.2.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-dask-latest Github Actions
test-conda-python-3.9-dask-master Github Actions
test-conda-python-3.9-pandas-master Github Actions
test-conda-python-3.9-spark-master Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure

@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

@github-actions

Revision: a8989b4

Submitted crossbow builds: ursacomputing/crossbow @ actions-5e75775ce9

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-hdfs-2.9.2 Github Actions
test-conda-python-3.7-hdfs-3.2.1 Github Actions
test-conda-python-3.7-pandas-1.0 Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-spark-v3.1.2 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-v3.2.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-dask-latest Github Actions
test-conda-python-3.9-dask-upstream_devel Github Actions
test-conda-python-3.9-pandas-upstream_devel Github Actions
test-conda-python-3.9-spark-master Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure

@jorisvandenbossche
Member Author

@github-actions crossbow submit -g python

@github-actions

Revision: 16e33d3

Submitted crossbow builds: ursacomputing/crossbow @ actions-6924cff3d2

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.7 Github Actions
test-conda-python-3.7-hdfs-2.9.2 Github Actions
test-conda-python-3.7-hdfs-3.2.1 Github Actions
test-conda-python-3.7-pandas-1.0 Github Actions
test-conda-python-3.7-pandas-latest Github Actions
test-conda-python-3.7-spark-v3.1.2 Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-hypothesis Github Actions
test-conda-python-3.8-pandas-latest Github Actions
test-conda-python-3.8-pandas-nightly Github Actions
test-conda-python-3.8-spark-v3.2.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-dask-latest Github Actions
test-conda-python-3.9-dask-upstream_devel Github Actions
test-conda-python-3.9-pandas-upstream_devel Github Actions
test-conda-python-3.9-spark-master Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure

@jorisvandenbossche jorisvandenbossche merged commit 5a98058 into apache:master Dec 23, 2022
@ursabot

ursabot commented Dec 24, 2022

Benchmark runs are scheduled for baseline = 305026f and contender = 5a98058. 5a98058 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.17% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 5a980580 ec2-t3-xlarge-us-east-2
[Failed] 5a980580 test-mac-arm
[Finished] 5a980580 ursa-i9-9960x
[Finished] 5a980580 ursa-thinkcentre-m75q
[Finished] 305026f6 ec2-t3-xlarge-us-east-2
[Failed] 305026f6 test-mac-arm
[Finished] 305026f6 ursa-i9-9960x
[Finished] 305026f6 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@jorisvandenbossche jorisvandenbossche deleted the ARROW-16728-ParquetDataset-switch-default branch December 24, 2022 08:32
@kou
Member

kou commented Dec 24, 2022

@jorisvandenbossche the test-conda-python-3.7-hdfs-* failures seem to be related:

https://github.com/ursacomputing/crossbow/actions/runs/3765628706/jobs/6401304143#step:5:9344

=================================== FAILURES ===================================
_________________ TestLibHdfs.test_read_multiple_parquet_files _________________

self = <pyarrow.tests.test_hdfs.TestLibHdfs testMethod=test_read_multiple_parquet_files>

    @pytest.mark.pandas
    @pytest.mark.parquet
    def test_read_multiple_parquet_files(self):
    
        tmpdir = pjoin(self.tmp_path, 'multi-parquet-' + guid())
    
        self.hdfs.mkdir(tmpdir)
    
        expected = self._write_multiple_hdfs_pq_files(tmpdir)
>       result = self.hdfs.read_parquet(tmpdir)

opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_hdfs.py:318: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/filesystem.py:227: in read_parquet
    filesystem=self)
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet/core.py:1759: in __new__
    thrift_container_size_limit=thrift_container_size_limit,
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/parquet/core.py:2402: in __init__
    filesystem, use_mmap=memory_map)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

filesystem = <pyarrow.hdfs.HadoopFileSystem object at 0x7ff3f8702d40>
use_mmap = False, allow_legacy_filesystem = False

    def _ensure_filesystem(
        filesystem, use_mmap=False, allow_legacy_filesystem=False
    ):
        if isinstance(filesystem, FileSystem):
            return filesystem
        elif isinstance(filesystem, str):
            if use_mmap:
                raise ValueError(
                    "Specifying to use memory mapping not supported for "
                    "filesystem specified as an URI string"
                )
            return _filesystem_from_str(filesystem)
    
        # handle fsspec-compatible filesystems
        try:
            import fsspec
        except ImportError:
            pass
        else:
            if isinstance(filesystem, fsspec.AbstractFileSystem):
                if type(filesystem).__name__ == 'LocalFileSystem':
                    # In case its a simple LocalFileSystem, use native arrow one
                    return LocalFileSystem(use_mmap=use_mmap)
                return PyFileSystem(FSSpecHandler(filesystem))
    
        # map old filesystems to new ones
        import pyarrow.filesystem as legacyfs
    
        if isinstance(filesystem, legacyfs.LocalFileSystem):
            return LocalFileSystem(use_mmap=use_mmap)
        # TODO handle HDFS?
        if allow_legacy_filesystem and isinstance(filesystem, legacyfs.FileSystem):
            return filesystem
    
        raise TypeError(
            "Unrecognized filesystem: {}. `filesystem` argument must be a "
            "FileSystem instance or a valid file system URI'".format(
>               type(filesystem))
        )
E       TypeError: Unrecognized filesystem: <class 'pyarrow.hdfs.HadoopFileSystem'>. `filesystem` argument must be a FileSystem instance or a valid file system URI'
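The traceback above can be reduced to a small dispatch problem: the new code path only accepts instances of the new filesystem base class, so the legacy `pyarrow.hdfs.HadoopFileSystem` falls through to the `TypeError` unless `allow_legacy_filesystem=True` is passed. A stdlib-only sketch of that dispatch (class names are stand-ins, not pyarrow's real classes):

```python
class FileSystem:
    """Stand-in for the new pyarrow.fs.FileSystem base class."""

class LegacyHadoopFileSystem:
    """Stand-in for the legacy pyarrow.hdfs.HadoopFileSystem."""

def ensure_filesystem(filesystem, allow_legacy_filesystem=False):
    # New-style filesystems pass through unchanged.
    if isinstance(filesystem, FileSystem):
        return filesystem
    # Legacy objects are tolerated only when the caller opts in;
    # otherwise they hit the TypeError seen in the traceback.
    if allow_legacy_filesystem:
        return filesystem
    raise TypeError(
        "Unrecognized filesystem: {}".format(type(filesystem))
    )
```

Under this sketch, `ensure_filesystem(LegacyHadoopFileSystem())` raises, while passing `allow_legacy_filesystem=True` lets the legacy object through, which is the gap the follow-up fix addresses.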

EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this pull request Jan 5, 2023
…=True in ParquetDataset (apache#14052)

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Member Author

@kou sorry, not sure how I missed that failure. Somehow I was convinced that I had already inspected the HDFS builds and concluded it was an unrelated failure. Fixing this in #15269
