
GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format #43740

Merged
amoeba merged 15 commits into apache:main from Feiyang472:user/feiyang/expose_cython_api
Oct 8, 2024

Conversation

Contributor

@Feiyang472 Feiyang472 commented Aug 17, 2024

@github-actions

⚠️ GitHub issue #43684 has been automatically assigned in GitHub to PR creator.

Member

@amoeba amoeba left a comment


I think this looks good. Returning a tuple for this makes sense to me. Can you look at my notes and also add tests?

Comment on lines 2526 to 2543
Member


Could the remainder of this be simplified with CPartitionPathFormat.wrap? I'm not very familiar with Cython so let me know.

Contributor Author


Hi @amoeba, I am fairly new to the Arrow Cython API as well, so I might be wrong. CPartitionPathFormat is just Cython's name (by Arrow convention) for the PartitionPathFormat C++ struct. The wrap and unwrap methods belong to the _Weakrefable class and convert C++ shared pointers to Cython wrappers and vice versa. C++ shared pointers cannot be handed to Python, and Cython classes cannot be handed to C++, but both can interact in Cython code, hence the wrapping. C structs of built-in types, unlike shared pointers to objects, can be automatically converted by Cython into Python dicts. Here I have chosen to return a tuple instead, because it feels more Pythonic.
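To illustrate the shape of the return value being discussed, here is a minimal pure-Python sketch (not the actual pyarrow implementation; `hive_format` is a hypothetical helper) of a format operation producing a `(directories, filename)` tuple rather than a dict:

```python
# Hypothetical sketch: hive-style formatting of partition fields into the
# (directories, filename) pair that the tuple return value would carry.
def hive_format(fields):
    """fields: mapping of partition field name -> value."""
    directories = "/".join(f"{key}={value}" for key, value in fields.items())
    filename = ""  # hive-style partitioning encodes everything in directories
    return directories, filename

dirs, fname = hive_format({"year": 2024, "month": 10})
# dirs == "year=2024/month=10", fname == ""
```

A tuple supports direct unpacking at the call site, which is one reason it can feel more natural than a dict here.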

Contributor Author


If we aim for symmetry with the existing Parse method, we could join the strings, which would probably be less ambiguous.


@amoeba amoeba changed the title GH-43684: [Integration, Parquet, Python] Python / Cython interface to C++ arrow::dataset::Partitioning::Format GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format Aug 20, 2024
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 20, 2024
@Feiyang472
Contributor Author

I will add more docs and tests over the weekend.

@Feiyang472 Feiyang472 marked this pull request as ready for review August 24, 2024 16:46
@Feiyang472 Feiyang472 force-pushed the user/feiyang/expose_cython_api branch from dd18fde to 79bc16c on August 24, 2024 17:00
@Feiyang472 Feiyang472 requested a review from amoeba August 28, 2024 06:11
@Feiyang472 Feiyang472 force-pushed the user/feiyang/expose_cython_api branch 3 times, most recently from a712916 to 79bc16c on September 2, 2024 16:09
@amoeba
Member

amoeba commented Sep 5, 2024

Hi @Feiyang472, this is looking good. Thanks for continuing to work on it. I left two notes. Also, can you fix the numpydoc issue in https://github.com/apache/arrow/actions/runs/10670333956/job/29702635856?pr=43740#step:6:9090?

Feiyang472 and others added 3 commits September 5, 2024 20:08
@amoeba
Member

amoeba commented Sep 6, 2024

@jorisvandenbossche and/or @pitrou: The changes here look good to me, would either of you like to review?

@Feiyang472
Contributor Author

Hi @amoeba, it seems we have no further comments. Is this PR ready to be approved?

Member

@amoeba amoeba left a comment


Hey @Feiyang472, sorry for dropping this.

I re-reviewed this since a second reviewer hasn't taken a look and came up with two minor changes. Once you make those and CI passes, I'll merge.

assert result.column_names == ["a", "year", "month", "day"]


@pytest.mark.parametrize(
Member


I think we should put this in the parent test_dataset.py (pyarrow/tests/test_dataset.py) since this isn't specific to ParquetDataset. Sorry I missed this earlier.

@Feiyang472 Feiyang472 requested a review from amoeba October 7, 2024 18:02
@amoeba amoeba merged commit 44fb439 into apache:main Oct 8, 2024
@amoeba amoeba removed the awaiting committer review Awaiting committer review label Oct 8, 2024
@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Oct 8, 2024
@amoeba
Member

amoeba commented Oct 8, 2024

Thanks so much @Feiyang472. I've merged this and slotted it in for the v18 release.

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 44fb439.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

@raulcd
Member

raulcd commented Oct 10, 2024

@github-actions crossbow submit example-python-minimal-*

@raulcd
Member

raulcd commented Oct 10, 2024

This PR seems to have broken tests when run against a minimal pyarrow build (i.e. no dataset support), because the @pytest.mark.parametrize arguments fail at collection time if the dataset module is not enabled:

 _ ERROR collecting miniconda-for-arrow/envs/pyarrow-3.10/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py _
miniconda-for-arrow/envs/pyarrow-3.10/lib/python3.10/site-packages/pyarrow/tests/test_dataset.py:740: in <module>
    (ds.HivePartitioning, (r"foo=A/bar=ant%20bee", ""), ("", "")),
E   AttributeError: 'NoneType' object has no attribute 'HivePartitioning'

I'll try to fix this for 18.0.0.
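The failure mode and one possible way out can be sketched in plain Python (the names below are illustrative, not the actual fix that landed in pyarrow's test suite):

```python
# parametrize argument lists are built at import/collection time, so an
# attribute lookup on a failed optional import raises before any test runs.
ds = None  # pyarrow's tests set ds = None when pyarrow.dataset cannot import

try:
    cases = [(ds.HivePartitioning, ("foo=A/bar=ant%20bee", ""), ("", ""))]
except AttributeError:
    # AttributeError: 'NoneType' object has no attribute 'HivePartitioning'
    cases = []

# One possible fix: parametrize by class *name* and resolve it lazily
# inside the test body, after checking that dataset support is available.
cases = [("HivePartitioning", ("foo=A/bar=ant%20bee", ""), ("", ""))]

def resolve(name):
    if ds is None:
        # in a real test this would be pytest.skip("dataset not enabled")
        raise RuntimeError("dataset not enabled")
    return getattr(ds, name)
```

Deferring the lookup keeps collection import-safe because only plain strings appear in the parametrize list.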

@github-actions

Revision: 1c8364b

Submitted crossbow builds: ursacomputing/crossbow @ actions-aee53b4c69

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions

@amoeba
Member

amoeba commented Oct 10, 2024

Thanks @raulcd.
