Improve Arrow interface to QgsLayers #64110

@paleolimbot

Feature description

After #63749 it is possible to iterate over QgsFeatures as Arrow batches! In the interest of keeping that PR reasonably scoped, several nice-to-haves on top of the core conversion logic were deferred.

  • Implementing the Arrow PyCapsule Interface (i.e., __arrow_c_stream__(self, requested_schema=None) -> PyCapsule). This might require a tiny bit of Python C code (or a dependency) since, as far as I'm aware, Python won't let you create capsules from pure Python.
  • Allowing layer data providers to skip iterating over QgsFeatures and provide their own ArrowArrayStream. Notably, GDAL/OGR can do this, and it's much faster than per-feature iteration.

From #63749 (comment), some places where GDAL exposes layers in this way:

From #63749 (comment):

One way to do that would be to have a virtual method, bool QgsVectorDataProvider::getArrowStream( struct ArrowArrayStream* stream, const QgsFeatureRequest &request = QgsFeatureRequest() ) const, whose base implementation would use QgsArrowStream. The OGR provider could override it to delegate to OGR_L_GetArrowStream() if the request is simple enough to be forwarded to OGR. A similar method would exist at the QgsVectorLayer level.
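The control flow of that proposal can be sketched in plain Python. Everything below is a hypothetical stand-in rather than QGIS API: FeatureRequest plays the role of QgsFeatureRequest, lists of dicts play the role of Arrow batches, and "simple enough to forward" is assumed to mean "no Python-side filter". The point is only the dispatch: the base class always works by iterating features, and a provider with a native stream overrides the method and falls back when it cannot honor the request.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, object]


@dataclass
class FeatureRequest:
    """Hypothetical stand-in for QgsFeatureRequest."""

    predicate: Optional[Callable[[Row], bool]] = None

    def is_simple(self) -> bool:
        # Assumption: a request without a client-side filter can be forwarded.
        return self.predicate is None


class VectorDataProvider:
    """Base provider: builds batches by iterating features one at a time."""

    def __init__(self, rows: List[Row]):
        self._rows = list(rows)
        self.path_used = None  # records which code path served the last call

    def get_features(self, request: FeatureRequest) -> Iterator[Row]:
        for row in self._rows:
            if request.predicate is None or request.predicate(row):
                yield row

    def get_arrow_stream(self, request: FeatureRequest,
                         batch_size: int = 2) -> Iterator[List[Row]]:
        self.path_used = "fallback"
        return self._batched(self.get_features(request), batch_size)

    @staticmethod
    def _batched(rows: Iterator[Row], size: int) -> Iterator[List[Row]]:
        batch: List[Row] = []
        for row in rows:
            batch.append(row)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch


class OgrLikeProvider(VectorDataProvider):
    """Provider with a native batch source, used only for simple requests."""

    def get_arrow_stream(self, request: FeatureRequest,
                         batch_size: int = 2) -> Iterator[List[Row]]:
        if request.is_simple():
            self.path_used = "native"
            # Stand-in for handing the request to the underlying library.
            return iter([list(self._rows)])
        return super().get_arrow_stream(request, batch_size)


provider = OgrLikeProvider([{"id": 1}, {"id": 2}, {"id": 3}])
list(provider.get_arrow_stream(FeatureRequest()))
print(provider.path_used)
list(provider.get_arrow_stream(FeatureRequest(lambda r: r["id"] > 1)))
print(provider.path_used)
```

Keeping the fallback in the base class means callers never need to know which path served them; only the batch boundaries may differ between the two.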

The final state after this improvement would be a compact way for Arrow Python consumers like GeoPandas to ergonomically consume a layer. Maybe:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object)

Or maybe:

geopandas.GeoDataFrame.from_arrow(qgis_layer_object.getArrowStream())
