Implement QgsArrowIterator to iterate over features as batches of ArrowArray #63749

Merged
nyalldawson merged 53 commits into qgis:master from paleolimbot:layer-arrow
Nov 27, 2025

@paleolimbot
Contributor

Description

As encouraged by @nyalldawson!

https://mastodon.social/@nyalld/115459416976982489

The motivation is to eliminate the need for per-feature iteration in Python and to maintain the fidelity of types like Date/Time, which have varying levels of support depending on the file formats available to GDAL. It also enables cool things like SELECT * FROM myQgisLayer in DuckDB/SedonaDB/Python (if a QgsLayer implements __arrow_c_stream__()).

Here I handle the "hard" path, which is handling an arbitrary iterator of features given an arbitrary destination schema. I wrote something similar for the ADBC Postgres driver and it's worked well there...allowing a "requested" schema can help align multiple layers and is part of the PyCapsule protocol that could be used to expose this in Python.

The (maybe) "easy" (easier?) path would be great to expose too, which would be the case where the layer exposes a method to iterate over arrow batches directly (notably: GDAL). This would be much faster because GDAL has a fast path to skip per-feature iteration for some drivers.
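To make the "hard" path concrete, here is a minimal pure-Python sketch of turning an arbitrary feature iterator into fixed-size columnar batches for a given destination schema. It does not use QGIS or nanoarrow: `batch_features`, the dict-based features, and the list-of-names schema are hypothetical stand-ins chosen only for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def batch_features(
    features: Iterable[Dict[str, Any]],
    schema: List[str],
    batch_size: int,
) -> Iterator[Dict[str, List[Any]]]:
    """Yield columnar batches (column name -> list of values).

    `features` is any iterable of row-like dicts and `schema` lists the
    destination column names. A field missing from a row becomes None.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(features)
    while True:
        # Pull at most batch_size rows from the underlying iterator.
        rows = list(islice(it, batch_size))
        if not rows:
            return
        # Pivot the rows into one list per destination column.
        yield {name: [row.get(name) for row in rows] for name in schema}


rows = [{"id": i, "name": f"poly_{i}"} for i in range(1, 6)]
batches = list(batch_features(rows, ["id", "name", "area"], batch_size=2))
```

The destination schema drives the output, so a column the source does not have (`area` here) still appears, filled with nulls.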

This uses nanoarrow ( https://github.com/apache/arrow-nanoarrow ) to build arrays. I vendored it here but it's also available on vcpkg. (Full disclosure: I'm an Arrow PMC member and I wrote nanoarrow).

This is my first ever QGIS PR so I will likely have to work through a few things!

@github-actions github-actions bot added this to the 4.0.0 milestone Oct 30, 2025
@nyalldawson
Collaborator

@paleolimbot

Nice work! I realise my review has a lot of comments, but they are mostly just coding style/qt quirks.

Can you provide some detail on how __arrow_c_stream__ would actually be used here? Does it need to be exposed to python, or is this purely for use by c++ code compiled against the qgis libraries?

It's my understanding that a client could request specific fields (As opposed to all fields) via setSchema, is that correct? If so, we'd probably want to move the logic here "up a level" and make the class use QgsFeatureRequest instead of QgsFeatureIterator. We could then fine-tune that request based on the requested fields, and ensure that we aren't wasting effort reading in any unwanted fields from the original underlying data provider.

@github-actions
Contributor

github-actions bot commented Oct 31, 2025

🪟 Windows Qt6 builds

Download Windows Qt6 builds of this PR for testing.
(Built from commit 0c17949)

🍎 MacOS Qt6 builds

Download MacOS Qt6 builds of this PR for testing.
This installer is not signed, control+click > open the app to avoid the warning
(Built from commit 0c17949)

@rouault
Contributor

rouault commented Oct 31, 2025

Can you provide some detail on how __arrow_c_stream__ would actually be used here? Does it need to be exposed to python, or is this purely for use by c++ code compiled against the qgis libraries?

I'll let @paleolimbot complete, but here are a few pointers to GDAL Python bindings that expose an OGR Layer with __arrow_c_stream__:

@rouault
Contributor

rouault commented Oct 31, 2025

The (maybe) "easy" (easier?) path would be great to expose too, which would be the case where the layer exposes a method to iterate over arrow batches directly (notably: GDAL). This would be much faster because GDAL has a fast path to skip per-feature iteration for some drivers.

I concur. One way to do that would be to have a

virtual bool QgsVectorDataProvider::getArrowStream( struct ArrowArrayStream* stream, const QgsFeatureRequest &request = QgsFeatureRequest() ) const

method whose base implementation would use QgsArrowStream and that the OGR provider could delegate to OGR_L_GetArrowStream() if the request is simple enough to be forwarded to OGR.

and a similar method at the QgsVectorLayer level.
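The delegation idea above can be sketched in plain Python: a base provider that always works by visiting each feature, and a subclass that takes a fast path only when the request is simple enough to forward. Every name here (`Request`, `GenericProvider`, `NativeProvider`, `min_id`) is hypothetical and none of it is QGIS or OGR API; it only illustrates the fallback pattern.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]
Batch = Dict[str, List[Any]]


def to_batches(rows: List[Row], batch_size: int) -> List[Batch]:
    """Group rows into columnar batches of at most batch_size rows."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batches.append({key: [row[key] for row in chunk] for key in chunk[0]})
    return batches


class Request:
    """Hypothetical request; min_id stands in for any non-trivial filter."""

    def __init__(self, min_id: Optional[int] = None) -> None:
        self.min_id = min_id


class GenericProvider:
    """Base implementation: filters feature by feature, works for any request."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows
        self.path_used = ""

    def arrow_batches(self, request: Request, batch_size: int = 2) -> List[Batch]:
        self.path_used = "generic"
        kept = [r for r in self.rows
                if request.min_id is None or r["id"] >= request.min_id]
        return to_batches(kept, batch_size)


class NativeProvider(GenericProvider):
    """Provider whose data source can hand over ready-made batches."""

    def arrow_batches(self, request: Request, batch_size: int = 2) -> List[Batch]:
        if request.min_id is not None:
            # Request is too complex to forward: fall back to the base class.
            return super().arrow_batches(request, batch_size)
        self.path_used = "native"
        # Stand-in for the data source's own batch reader.
        return to_batches(self.rows, batch_size)


provider = NativeProvider([{"id": i} for i in range(1, 5)])
simple = provider.arrow_batches(Request())
simple_path = provider.path_used
filtered = provider.arrow_batches(Request(min_id=3))
filtered_path = provider.path_used
```

Callers see the same method either way; only the provider knows which path produced the batches.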

@nyalldawson
Collaborator

@rouault thanks! So it looks like this class needs to be exposed to sip, and then patched into something like QgsVectorLayer via a python __arrow_c_stream__ method?

Contributor Author

@paleolimbot paleolimbot left a comment

Nice work! I realise my review has a lot of comments, but they are mostly just coding style/qt quirks.

Thank you! I'm definitely new here 😬

Can you provide some detail on how __arrow_c_stream__ would actually be used here? Does it need to be exposed to python, or is this purely for use by c++ code compiled against the qgis libraries?

The motivating use case is to speed up converting QGIS layers (or input to processing tools) into GeoPandas objects maintaining the fidelity of date/time/date+time fields (I may be misrepresenting @anitagraser's original request...I haven't actually tried this). GeoPandas provides GeoDataFrame.from_arrow() which accepts any object that implements __arrow_c_stream__(requested_schema=None). I'm not sure exactly what Python object should implement that method (maybe the QgsLayer?), but the idea is that this utility takes care of the "get me Arrow stuff from QGIS stuff" part.

There are also a lot of very cool things you can do with an ArrowArrayStream, which is FFI-stable and can be sent to/from R/Python/GDAL/Rust/C++. nanoarrow can also read and write the serialized IPC format which can stream the same information into another process. It's a bit like Parquet but with no statistics (designed for streaming, not querying).

It's my understanding that a client could request specific fields (As opposed to all fields) via setSchema, is that correct? If so, we'd probably want to move the logic here "up a level" and make the class use QgsFeatureRequest instead of QgsFeatureIterator. We could then fine-tune that request based on the requested fields, and ensure that we aren't wasting effort reading in any unwanted fields from the original underlying data provider.

The schema request was intended to deal with aligning names and types more so than selecting a subset of fields. I'll take another pass at this...reordering columns is probably outside the scope of what we need to do here.
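As a rough illustration of "aligning names and types", the sketch below renames and converts the columns of a batch so that it matches a requested schema. It is pure Python and not the QgsArrowIterator implementation: `align_batch` and the `(output name, source name, converter)` schema format are assumptions made for this example.

```python
from datetime import date
from typing import Any, Callable, Dict, List, Tuple

Batch = Dict[str, List[Any]]
# Requested schema: (output name, source name, converter) per column.
RequestedSchema = List[Tuple[str, str, Callable[[Any], Any]]]


def align_batch(batch: Batch, requested: RequestedSchema) -> Batch:
    """Rename and convert columns so the batch matches the requested schema."""
    aligned: Batch = {}
    for out_name, src_name, convert in requested:
        values = batch[src_name]
        # None marks a null value and is passed through unconverted.
        aligned[out_name] = [None if v is None else convert(v) for v in values]
    return aligned


layer_a = {"NAME": ["poly_1", None], "SURVEYED": ["2025-10-30", "2025-11-27"]}
requested = [("name", "NAME", str), ("surveyed", "SURVEYED", date.fromisoformat)]
aligned = align_batch(layer_a, requested)
```

Applying the same requested schema to several layers yields batches with identical column names and value types, which is what makes them easy to combine downstream.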

@nyalldawson
Collaborator

@paleolimbot you'll need to run the pre-commit script, which will do all the work in exposing the new class to python for you 🥳

@kylebarron

GeoPandas provides GeoDataFrame.from_arrow() which accepts any object that implements __arrow_c_stream__(requested_schema=None).

Adding a couple reference links: this is defined by the Arrow PyCapsule Interface, a wrapper around the underlying Arrow C Stream interface that defines a "well-known" dunder method in Python (in this case __arrow_c_stream__) that other Arrow-aware Python packages can check for.
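For readers new to the protocol, here is a small sketch of the shape of that dunder method and of the duck-typing check a consumer can perform. It uses only the standard library: `ArrowStreamAdapter`, `exports_arrow_stream`, and `fake_export` are hypothetical names, and the string returned here is a placeholder. A real producer must return a PyCapsule named "arrow_array_stream" that wraps an ArrowArrayStream.

```python
from typing import Any, Callable, List, Optional


class ArrowStreamAdapter:
    """Hypothetical adapter that forwards the dunder call to a producer."""

    def __init__(self, export_stream: Callable[[Optional[Any]], Any]) -> None:
        self._export_stream = export_stream

    def __arrow_c_stream__(self, requested_schema: Optional[Any] = None) -> Any:
        # requested_schema is a best-effort hint; producers may ignore it.
        return self._export_stream(requested_schema)


def exports_arrow_stream(obj: Any) -> bool:
    """Check the way an Arrow-aware consumer would: look for the dunder."""
    return callable(getattr(obj, "__arrow_c_stream__", None))


calls: List[Any] = []


def fake_export(requested_schema: Optional[Any]) -> str:
    # Placeholder only: a real producer returns a PyCapsule here.
    calls.append(requested_schema)
    return "placeholder-capsule"


adapter = ArrowStreamAdapter(fake_export)
result = adapter.__arrow_c_stream__()
```

Because the check is purely structural, the consumer never needs to import or know about the producer's package.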

@paleolimbot
Contributor Author

Not sure if this should all be exposed in Python...lots of learning going on over here 🙂 . Plenty of testing yet to do but I did get at least one schema/array through!

import geopandas
from nanoarrow.c_array import allocate_c_array
import qgis
from qgis.core import QgsVectorLayer

# Create a vector layer
layer = QgsVectorLayer("tests/testdata/zonalstatistics/polys.shp", "layer_name", "ogr")
schema = qgis.core.QgsArrowIterator.inferSchema(layer)

it = qgis.core.QgsArrowIterator(layer.getFeatures())
it.setSchema(schema, 1)

c_array = allocate_c_array()
schema.exportToAddress(c_array.schema._addr())
it.nextFeatures(5, c_array._addr())

print(geopandas.GeoDataFrame.from_arrow(c_array))
#> lev3_name                                           geometry
#> 0    poly_1  MULTIPOLYGON (((100.37934 -0.96049, 100.37934 ...
#> 1    poly_2  MULTIPOLYGON (((100.37944 -0.96044, 100.37955 ...
#> 2    poly_3  MULTIPOLYGON (((100.37938 -0.96049, 100.37949 ...

print(geopandas.read_file("tests/testdata/zonalstatistics/polys.shp"))
#> lev3_name                                           geometry
#> 0    poly_1  POLYGON ((100.37934 -0.96049, 100.37934 -0.960...
#> 1    poly_2  POLYGON ((100.37944 -0.96044, 100.37955 -0.960...
#> 2    poly_3  POLYGON ((100.37938 -0.96049, 100.37949 -0.960...

@paleolimbot
Contributor Author

In the QGIS Server build (against Qt5?) I see:

 error: no member named 'QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' in the global namespace; did you mean 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase'?

I don't see this on main so I wonder if some SIP annotation in the file I added is off?

Details
/usr/src/qgis/build/python/core/sip_corepart6.cpp:111:19: error: no member named 'QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' in the global namespace; did you mean 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase'?
  111 |         return  ::QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase::nextFeature(a0);
      |                 ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                   sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase
/usr/src/qgis/build/python/core/sip_corepart6.cpp:42:7: note: 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' declared here
   42 | class sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase : public QgsAbstractFeatureIteratorFromSource<QgsVectorLayerFeatureSource>
      |       ^
/usr/src/qgis/build/python/core/sip_corepart6.cpp:156:19: error: no member named 'QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' in the global namespace; did you mean 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase'?
  156 |         return  ::QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase::isValid();
      |                 ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                   sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase
/usr/src/qgis/build/python/core/sip_corepart6.cpp:42:7: note: 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' declared here
   42 | class sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase : public QgsAbstractFeatureIteratorFromSource<QgsVectorLayerFeatureSource>
      |       ^
/usr/src/qgis/build/python/core/sip_corepart6.cpp:186:19: error: no member named 'QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' in the global namespace; did you mean 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase'?
  186 |         return  ::QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase::nextFeatureFilterExpression(a0);
      |                 ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                   sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase
/usr/src/qgis/build/python/core/sip_corepart6.cpp:42:7: note: 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' declared here
   42 | class sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase : public QgsAbstractFeatureIteratorFromSource<QgsVectorLayerFeatureSource>
      |       ^
/usr/src/qgis/build/python/core/sip_corepart6.cpp:201:19: error: no member named 'QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' in the global namespace; did you mean 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase'?
  201 |         return  ::QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase::nextFeatureFilterFids(a0);
      |                 ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                   sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase
/usr/src/qgis/build/python/core/sip_corepart6.cpp:42:7: note: 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' declared here
   42 | class sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase : public QgsAbstractFeatureIteratorFromSource<QgsVectorLayerFeatureSource>
      |       ^
/usr/src/qgis/build/python/core/sip_corepart6.cpp:216:19: error: no member named 'QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' in the global namespace; did you mean 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase'?
  216 |         return  ::QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase::prepareSimplification(a0);
      |                 ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                   sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase
/usr/src/qgis/build/python/core/sip_corepart6.cpp:42:7: note: 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' declared here
   42 | class sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase : public QgsAbstractFeatureIteratorFromSource<QgsVectorLayerFeatureSource>
      |       ^
/usr/src/qgis/build/python/core/sip_corepart6.cpp:792:8: error: no type named 'QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' in the global namespace; did you mean 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase'?
  792 |      ::QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase *sipCpp = reinterpret_cast< ::QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase *>(sipCppV);
      |      ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |        sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase
/usr/src/qgis/build/python/core/sip_corepart6.cpp:42:7: note: 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' declared here
   42 | class sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase : public QgsAbstractFeatureIteratorFromSource<QgsVectorLayerFeatureSource>
      |       ^
/usr/src/qgis/build/python/core/sip_corepart6.cpp:792:106: error: no type named 'QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' in the global namespace; did you mean 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase'?
  792 |      ::QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase *sipCpp = reinterpret_cast< ::QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase *>(sipCppV);
      |                                                                                                        ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                                                          sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase
/usr/src/qgis/build/python/core/sip_corepart6.cpp:42:7: note: 'sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase' declared here
   42 | class sipQgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase : public QgsAbstractFeatureIteratorFromSource<QgsVectorLayerFeatureSource>
      |       ^
7 errors generated.

I can't seem to find why the Postgres test failed (the build report seems empty?)

@rouault
Contributor

rouault commented Nov 26, 2025

QgsAbstractFeatureIteratorFromSourceQgsVectorLayerFeatureSourceBase

maybe try to #include "qgsvectorlayer.h" in "qgsarrowiterator.h". It is currently missing, so strictly speaking the .h isn't "standalone", which might confuse SIP

I've restarted the PostgreSQL job (we unfortunately have random failures here and there)

@jef-n jef-n added the Squash! label (Remember to squash this PR, instead of merging or rebasing) Nov 26, 2025
@rouault
Contributor

rouault commented Nov 26, 2025

@paleolimbot I would assume that the .sip.in file should have been modified by the inclusion of the header and should be committed to take effect

@paleolimbot
Contributor Author

I ran ./scripts/sipify_all.sh and nothing happened. I see

%TypeHeaderCode
#include "qgsarrowiterator.h"

so perhaps with the additional include in qgsarrowiterator.h that will take care of it? I wonder if it worked by accident before because of the order in which the sip amalgamations happened.

@rouault
Contributor

rouault commented Nov 26, 2025

and nothing happened

hum I believe I was confused. SIP working is still dark magic to me. Let's see if the build is happier...

@paleolimbot
Contributor Author

Hmm. I'll see if there are some other missing headers when I circle back to that computer.

@nyalldawson
Collaborator

@paleolimbot it's silly, but try fiddling with this number: https://github.com/qgis/QGIS/blob/master/cmake/SIPMacros.cmake#L43

Try dropping it by 2 at a time till the build is happy on all platforms. 🤷‍♂️

@nyalldawson
Collaborator

nyalldawson commented Nov 27, 2025

@paleolimbot my final review above is very pedantic, sorry -- I've only suggested these changes because you'll still need to push some changes to repair the build, and because I think it's worth informing you of best practice for Qt/QGIS dev 👍

@paleolimbot
Contributor Author

my final review above is very pedantic, sorry

Love it!

Try dropping it by 2 at a time till the build is happy on all platforms.

I tried 30 -> 28 🤞

@rouault
Contributor

rouault commented Nov 27, 2025

I tried 30 -> 28 🤞

looks like that did it. You just need to refresh the .sip.in files from the changes done in the .h (btw, did you set-up pre-commit (cf https://docs.qgis.org/3.40/en/docs/developers_guide/git.html#procedure) ? it should automatically take care of that)

@rouault
Contributor

rouault commented Nov 27, 2025

looks like that did it.

I was speaking too fast. That fixed the ogc build, but that broke the Windows one

@nyalldawson nyalldawson merged commit ec6fe9b into qgis:master Nov 27, 2025
33 checks passed
@nyalldawson
Collaborator

Thanks @paleolimbot -- what a brilliant first contribution to QGIS! 🥳

@paleolimbot
Contributor Author

Thank you all for the reviews and keeping an eye on CI!

@paleolimbot paleolimbot deleted the layer-arrow branch November 28, 2025 02:18
@martinfleis

@paleolimbot this is so cool! So is this #63749 (comment) the way how you'd go about moving data to geopandas now?

@paleolimbot
Contributor Author

So is this the way how you'd go about moving data to geopandas now?

It is worth trying; however, I think we are 1-2 PRs away from the ideal situation, which would be more like:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object)

Or maybe:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object.getFeaturesArrow())

I will write up a GitHub issue with what's needed to get there 🙂
