Implement QgsArrowIterator to iterate over features as batches of ArrowArray#63749
Implement QgsArrowIterator to iterate over features as batches of ArrowArray#63749nyalldawson merged 53 commits intoqgis:masterfrom
Conversation
|
Nice work! I realise my review has a lot of comments, but they are mostly just coding style/qt quirks. Can you provide some detail on how arrow_c_stream would actually be used here? Does it need to be exposed to python, or is this purely for use by c++ code compiled against the qgis libraries? It's my understanding that a client could request specific fields (As opposed to all fields) via setSchema, is that correct? If so, we'd probably want to move the logic here "up a level" and make the class use QgsFeatureRequest instead of QgsFeatureIterator. We could then fine-tune that request based on the requested fields, and ensure that we aren't wasting effort reading in any unwanted fields from the original underlying data provider. |
🪟 Windows Qt6 buildsDownload Windows Qt6 builds of this PR for testing. 🍎 MacOS Qt6 buildsDownload MacOS Qt6 builds of this PR for testing. |
I let @paleolimbot complete, but here's a few pointers to GDAL Python bindings that expose a OGR Layer with arrow_c_stream:
|
I concur. One way to do that would be to have a
method whose base implementation would use QgsArrowStream and that the OGR provider could delegate to OGR_L_GetArrowStream() if the request is simple enough to be forwarded to OGR. and a similar method at the QgsVectorLayer level. |
|
@rouault thanks! So it looks like this class needs to be exposed to sip, and then patched into something like QgsVectorLayer via a python arrow_c_stream method? |
paleolimbot
left a comment
There was a problem hiding this comment.
Nice work! I realise my review has a lot of comments, but they are mostly just coding style/qt quirks.
Thank you! I'm definitely new here 😬
Can you provide some detail on how arrow_c_stream would actually be used here? Does it need to be exposed to python, or is this purely for use by c++ code compiled against the qgis libraries?
The motivating use case is to speed up converting QGIS layers (or input to processing tools) into GeoPandas objects maintaining the fidelity of date/time/date+time fields (I may be misrepresenting @anitagraser's original request...I haven't actually tried this). GeoPandas provides GeoDataFrame.from_arrow() which accepts any object that implements __arrow_c_stream__(requested_schema=None). I'm not sure exactly what Python object should implement that method (maybe the QgsLayer?), but the idea is that this utility takes care of the "get me Arrow stuff from QGIS stuff" part.
There are also a lot of very cool things you can do with an ArrowArrayStream, which is FFI-stable and can be sent to/from R/Python/GDAL/Rust/C++. nanoarrow can also read and write the serialized IPC format which can stream the same information into another process. It's a bit like Parquet but with no statistics (designed for streaming, not querying).
It's my understanding that a client could request specific fields (As opposed to all fields) via setSchema, is that correct? If so, we'd probably want to move the logic here "up a level" and make the class use QgsFeatureRequest instead of QgsFeatureIterator. We could then fine-tune that request based on the requested fields, and ensure that we aren't wasting effort reading in any unwanted fields from the original underlying data provider.
The schema request was intended to deal with aligning names and types more so than selecting a subset of fields. I'll take another pass at this...reordering columns is probably outside the scope of what we need to do here.
|
@paleolimbot you'll need to run the pre-commit script, which will do all the work in exposing the new class to python for you 🥳 |
Adding a couple reference links: this is defined by the Arrow PyCapsule Interface, a wrapper around the underlying Arrow C Stream interface that defines a "well-known" dunder method in Python (in this case |
Co-authored-by: Nyall Dawson <nyall.dawson@gmail.com> Co-authored-by: Even Rouault <even.rouault@spatialys.com>
0c29495 to
b36dacc
Compare
|
Not sure if this should all be exposed in Python...lots of learning going on over here 🙂 . Plenty of testing yet to do but I did get at least one schema/array through! import geopandas
from nanoarrow.c_array import allocate_c_array
import qgis
from qgis.core import QgsVectorLayer
# Create a vector layer
layer = QgsVectorLayer("tests/testdata/zonalstatistics/polys.shp", "layer_name", "ogr")
schema = qgis.core.QgsArrowIterator.inferSchema(layer)
it = qgis.core.QgsArrowIterator(layer.getFeatures())
it.setSchema(schema, 1)
c_array = allocate_c_array()
schema.exportToAddress(c_array.schema._addr())
it.nextFeatures(5, c_array._addr())
print(geopandas.GeoDataFrame.from_arrow(c_array))
#> lev3_name geometry
#> 0 poly_1 MULTIPOLYGON (((100.37934 -0.96049, 100.37934 ...
#> 1 poly_2 MULTIPOLYGON (((100.37944 -0.96044, 100.37955 ...
#> 2 poly_3 MULTIPOLYGON (((100.37938 -0.96049, 100.37949 ...
print(geopandas.read_file("tests/testdata/zonalstatistics/polys.shp"))
#> lev3_name geometry
#> 0 poly_1 POLYGON ((100.37934 -0.96049, 100.37934 -0.960...
#> 1 poly_2 POLYGON ((100.37944 -0.96044, 100.37955 -0.960...
#> 2 poly_3 POLYGON ((100.37938 -0.96049, 100.37949 -0.960... |
|
In the QGIS Server build (against Qt5?) I see: I don't see this on main so I wonder if some SIP annotation in the file I added is off? DetailsI can't seem to find why the Postgres test failed (the build report seems empty?) |
maybe try to I've restarted the PostgreSQL job (we unfortunately have random failures here and there) |
|
@paleolimbot I would assume that the .sip.in file should have been modified by the inclusion of the header and should be committed to take effect |
|
I ran so perhaps with the additional include in qgsarrowiterator.h that will take care of it? I wonder if it worked by accident before because of the order in which the sip amalgamations happened. |
hum I believe I was confused. SIP working is still dark magic to me. Let's see if the build is happier... |
|
Hmm. I'll see if there are some other missing headers when I circle back to that computer. |
|
@paleolimbot it's silly, but trying fiddling with this number: https://github.com/qgis/QGIS/blob/master/cmake/SIPMacros.cmake#L43 Try dropping it by 2 at a time till the build is happy on all platforms. 🤷♂️ |
|
@paleolimbot my final review above is very pedantic, sorry -- I've only suggested these changes because you'll still need to push some changes to repair the build, and because I think it's worth informing you of best practice for Qt/QGIS dev 👍 |
Co-authored-by: Nyall Dawson <nyall.dawson@gmail.com>
Love it!
I tried 30 -> 28 🤞 |
looks like that did it. You just need to refresh the .sip.in files from the changes done in the .h (btw, did you set-up pre-commit (cf https://docs.qgis.org/3.40/en/docs/developers_guide/git.html#procedure) ? it should automatically take care of that) |
I was speaking too fast. That fixed the ogc build, but that broke the Windows one |
|
Thanks @paleolimbot -- what a brilliant first contribution to QGIS! 🥳 |
|
Thank you all for the reviews and keeping an eye on CI! |
|
@paleolimbot this is so cool! So is this #63749 (comment) the way how you'd go about moving data to geopandas now? |
It is worth trying; however, I think we are 1-2 PRs away from the ideal situation, which is would be more like: geopandas.GeoDataFrame.from_arrow(qgis_layer_object)Or maybe: geopandas.GeoDataFrame.from_arrow(qgis_layer_object.getFeaturesArrow())I will write up a GitHub issue with what's needed to get there 🙂 |
Description
As encouraged by @nyalldawson!
https://mastodon.social/@nyalld/115459416976982489
The motivation is to eliminate the need for per-feature iteration in Python to maintain the fidelity of types like Date/Time that have varing levels of support depending on the file formats available to GDAL. It also enables cool things like
SELECT * FROM myQgisLayerin DuckDB/SedonaDB/Python (if a QgsLayer implements__arrow_c_stream__()).Here I handle the "hard" path, which is handling an arbitrary iterator of features given an arbitrary destination schema. I wrote something similar for the ADBC Postgres driver and it's worked well there...allowing a "requested" schema can help align multiple layers and is part of the PyCapsule protocol that could be used to expose this in Python.
The (maybe) "easy" (easier?) path would be great to expose too, which would be the case where the layer exposes a method to iterate over arrow batches directly (notably: GDAL). This would be much faster because GDAL has a fast path to skip per-feature iteration for some drivers.
This uses nanoarrow ( https://github.com/apache/arrow-nanoarrow ) to build arrays. I did a vendor here but it's also available on vcpkg. (Full disclaimer: I'm an Arrow PMC and I wrote nanoarrow).
This is my first ever QGIS PR so I will likely have to work through a few things!