GH-43684: [Python][Dataset] Python / Cython interface to C++ arrow::dataset::Partitioning::Format#43740
Conversation
|
|
amoeba
left a comment
There was a problem hiding this comment.
I think this looks good. Returning a tuple for this makes sense to me. Can you look at my notes and also add tests?
python/pyarrow/_dataset.pyx
Outdated
There was a problem hiding this comment.
Could the remainder of this be simplified with CPartitionPathFormat.wrap? I'm not very familiar with Cython so let me know.
There was a problem hiding this comment.
Hi @amoeba , I am fairly new to arrow cython api as well, so I might be wrong. CPartitionPathFormat is just cython's name (by arrow convention) for the PartitionPathFormat C struct. The wrap and unwrap methods are methods of the _WeakRefable class, which convert cpp shared pointers to cython wrappers or vice versa. Cpp shared pointers cannot be given to python, and cython classes cannot be given to C, but both can interact in cython code, hence the wrapping. C structs of builtin types, unlike shared pointers to objects, can be automatically converted by cython into python dicts. Here I have chosen to return a tuple instead, because it feels more pythonic.
There was a problem hiding this comment.
If we aim for symmetry with the existing Parse method, we could join the strings, probably less ambiguous
|
I will add more docs and test over the weekend. |
Co-authored-by: Bryce Mecum <petridish@gmail.com>
dd18fde to
79bc16c
Compare
a712916 to
79bc16c
Compare
|
Hi @Feiyang472, this is looking good. Thanks for continuing to work on it. I left two notes and can you fix the numpydoc issue in https://github.com/apache/arrow/actions/runs/10670333956/job/29702635856?pr=43740#step:6:9090? |
Co-authored-by: Bryce Mecum <petridish@gmail.com>
|
@jorisvandenbossche and/or @pitrou: The changes here look good to me, would either of you like to review? |
|
@amoeba Hi, seems like we got no further comments, is this MR ready to be approved? |
amoeba
left a comment
There was a problem hiding this comment.
Hey @Feiyang472, sorry for dropping this.
I re-reviewed this since a second reviewer hasn't taken a look and came up with two minor changes. Once you make those and CI passes, I'll merge.
| assert result.column_names == ["a", "year", "month", "day"] | ||
|
|
||
|
|
||
| @pytest.mark.parametrize( |
There was a problem hiding this comment.
I think we should put this in the parent test_dataset.py (pyarrow/tests/test_dataset.py) since this isn't specific to ParquetDataset. Sorry I missed this earlier.
Co-authored-by: Bryce Mecum <petridish@gmail.com>
|
Thanks so much @Feiyang472. I've merged this and slotted it in for the v18 release. |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 44fb439. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
|
@github-actions crossbow submit example-python-minimal-* |
|
This PR seems to have broken tests if we run with minimal pyarrow (i.e no dataset) as pytest I'll try to fix for 18.0.0 |
|
Revision: 1c8364b Submitted crossbow builds: ursacomputing/crossbow @ actions-aee53b4c69
|
|
Thanks @raulcd. |
See
#43684