Skip to content

Using STRtree in multithreading causes segfaults #361

@jorisvandenbossche

Description

@jorisvandenbossche

While implementing and testing a spatial join in dask-geopandas (geopandas/dask-geopandas#54), I ran into a dying kernel (segfault) once in a while. And I was now able to boil it down into a small script that for me segfaults in at least 1 out of 2 runs.

I further reduced that script to not use dask but a plain ThreadPoolExecutor, which still reproduces the issue, see below (probably I could even further boil it down to just use pygeos.STRtree instead of calling geopandas.sjoin).

Running the script gives (circa 1 out of 2 runs):

$ python test_parallel_sjoin.py 
double free or corruption (out)
Aborted (core dumped)

Running it inside gdb gives the following backtrace:

Details
(gdb) run
Starting program: /home/joris/miniconda3/envs/geo-dev/bin/python test_parallel_sjoin.py
...
double free or corruption (out)

Thread 9 "python" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f75e9ffb700 (LWP 1371190)]
0x00007f761a32618b in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f761a32618b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f761a305859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f761a3703ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f761a37847c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007f761a37a120 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007f7603cdc78b in void std::vector<geos::index::strtree::AbstractNode*, std::allocator<geos::index::strtree::AbstractNode*> >::_M_realloc_insert<geos::index::strtree::AbstractNode* const&>(__gnu_cxx::__normal_iterator<geos::index::strtree::AbstractNode**, std::vector<geos::index::strtree::AbstractNode*, std::allocator<geos::index::strtree::AbstractNode*> > >, geos::index::strtree::AbstractNode* const&) ()
   from /home/joris/miniconda3/envs/geo-dev/lib/./libgeos-3.8.1.so
#6  0x00007f7603cdddd1 in geos::index::strtree::STRtree::createNode(int) () from /home/joris/miniconda3/envs/geo-dev/lib/./libgeos-3.8.1.so
#7  0x00007f7603cd735a in geos::index::strtree::AbstractSTRtree::createParentBoundables(std::vector<geos::index::strtree::Boundable*, std::allocator<geos::index::strtree::Boundable*> >*, int) ()
   from /home/joris/miniconda3/envs/geo-dev/lib/./libgeos-3.8.1.so
#8  0x00007f7603cdd79e in geos::index::strtree::STRtree::createParentBoundablesFromVerticalSlice(std::vector<geos::index::strtree::Boundable*, std::allocator<geos::index::strtree::Boundable*> >*, int) ()
   from /home/joris/miniconda3/envs/geo-dev/lib/./libgeos-3.8.1.so
#9  0x00007f7603cde1bf in geos::index::strtree::STRtree::createParentBoundablesFromVerticalSlices(std::vector<std::vector<geos::index::strtree::Boundable*, std::allocator<geos::index::strtree::Boundable*> >*, std::allocator<std::vector<geos::index::strtree::Boundable*, std::allocator<geos::index::strtree::Boundable*> >*> >*, int) () from /home/joris/miniconda3/envs/geo-dev/lib/./libgeos-3.8.1.so
#10 0x00007f7603cde426 in geos::index::strtree::STRtree::createParentBoundables(std::vector<geos::index::strtree::Boundable*, std::allocator<geos::index::strtree::Boundable*> >*, int) ()
   from /home/joris/miniconda3/envs/geo-dev/lib/./libgeos-3.8.1.so
#11 0x00007f7603cd68e0 in geos::index::strtree::AbstractSTRtree::createHigherLevels(std::vector<geos::index::strtree::Boundable*, std::allocator<geos::index::strtree::Boundable*> >*, int) ()
   from /home/joris/miniconda3/envs/geo-dev/lib/./libgeos-3.8.1.so
#12 0x00007f7603cd6881 in geos::index::strtree::AbstractSTRtree::build() () from /home/joris/miniconda3/envs/geo-dev/lib/./libgeos-3.8.1.so
#13 0x00007f7603cd6de3 in geos::index::strtree::AbstractSTRtree::query(void const*, geos::index::ItemVisitor&) () from /home/joris/miniconda3/envs/geo-dev/lib/./libgeos-3.8.1.so
#14 0x00007f7603dc712a in GEOSSTRtree_query_r () from /home/joris/miniconda3/envs/geo-dev/lib/libgeos_c.so
#15 0x00007f7603b95edf in STRtree_query_bulk (self=0x7f75fd7ee180, args=<optimized out>) at src/strtree.c:488
#16 0x0000562ba1a993bc in method_vectorcall_VARARGS (func=<optimized out>, args=0x7f75f9ba3a88, nargsf=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/descrobject.c:300
#17 0x0000562ba1b06e09 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f75f9ba3a88, callable=<method_descriptor at remote 0x7f7603bd8310>)
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Include/cpython/abstract.h:127
#18 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x562ba6193f30) at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Python/ceval.c:4987
#19 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Python/ceval.c:3486
#20 0x0000562ba1ab1cd4 in _PyEval_EvalCodeWithName (_co=<code at remote 0x7f7603b08d40>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=0x0, 
    kwargs=0x7f75f9ba4bc0, kwcount=<optimized out>, kwstep=1, defs=0x7f7603b28268, defcount=1, kwdefs=0x0, closure=0x0, name='query_bulk', qualname='STRtree.query_bulk')
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Python/ceval.c:4298
#21 0x0000562ba1ab2bb5 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f75f9ba4ba8, nargsf=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/call.c:435
#22 0x0000562ba1a8ed65 in _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7f75f9ba4ba8, callable=<function at remote 0x7f7603b2b280>)
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Include/cpython/abstract.h:127
#23 method_vectorcall (method=<optimized out>, args=0x7f75f9ba4bb0, nargsf=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/classobject.c:60
#24 0x0000562ba1b0b518 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f75f9ba4bb0, callable=<method at remote 0x7f75fda7d600>)
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Include/cpython/abstract.h:127
#25 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x562ba6193f30) at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Python/ceval.c:4987
#26 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Python/ceval.c:3469
#27 0x0000562ba1ab2049 in _PyEval_EvalCodeWithName (_co=<code at remote 0x7f76038fe2f0>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kwnames=0x7f75fdadf3d8, 
    kwargs=0x7f75d8001440, kwcount=<optimized out>, kwstep=1, defs=0x7f76038fb518, defcount=2, kwdefs=0x0, closure=(<cell at remote 0x7f76038f6a00>,), name='query_bulk', 
    qualname='PyGEOSSTRTreeIndex.query_bulk') at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Python/ceval.c:4298
#28 0x0000562ba1ab2c58 in _PyFunction_Vectorcall (func=<optimized out>, stack=0x7f75d8001430, nargsf=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/call.c:435
#29 0x0000562ba1a8ed65 in _PyObject_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, args=0x7f75d8001430, callable=<function at remote 0x7f7603900700>)
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Include/cpython/abstract.h:127
#30 method_vectorcall (method=<optimized out>, args=0x7f75d8001438, nargsf=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python_1587713251558/work/Objects/classobject.c:60
#31 0x0000562ba1b07cce in _PyObject_Vectorcall (kwnames=('predicate', 'sort'), nargsf=<optimized out>, args=<optimized out>, callable=<method at remote 0x7f75fda7d300>)

where it seems that the memory corruption comes from inside creating the STRtree (AbstractSTRtree::build()), but I didn't yet further investigate (probably could use better tools, or a debug build of GEOS, etc).

Currently I am reproducing this with an environment based on conda-forge with Python 3.8, GEOS 3.8.1 and pygeos master (but also see it in an environment with GEOS 3.9.1 on Python 3.9, but I don't have gdb installed there, and the segfault message there is "free(): invalid pointer").

The test_parallel_sjoin.py script (it is using the "states and provinces" from NaturalEarth; I am not able to reproduce it with the smaller countries dataset included with geopandas):

from concurrent.futures import ThreadPoolExecutor
import itertools

import numpy as np
import pandas as pd
import geopandas


## Create data

# polygon data

gdf = geopandas.read_file("ne_10m_admin_1_states_provinces.zip")
# gdf = geopandas.read_file(geopandas.datasets.get_path("naturalearth_lowres"))
# for purposes of the demo, take a subset of the columns
gdf = gdf[["name", "adm0_a3", "type_en", "geometry"]]
# and drop Antartica
gdf = gdf[gdf["adm0_a3"] != "ATA"]
# and convert to Mercator (don't do this at home for global area calculation!)
gdf = gdf.to_crs("EPSG:3857")

# point data (random points in bbox of polygon layer)

minx, miny, maxx, maxy = gdf.total_bounds
N = 1_000_000
df_points = geopandas.GeoDataFrame(
    {"id": np.arange(N)},
    geometry=geopandas.points_from_xy(minx + maxx * np.random.random(N), miny + maxy * np.random.random(N)),
    crs=gdf.crs
)

## Slice parts of the dataframes -> 4x4 => 16 combinations

n = int(len(gdf) / 4)
gdf_parts = [gdf[:n], gdf[n:2*n], gdf[2*n:3*n], gdf[3*n:]]

n = int(len(df_points) / 4)
df_points_parts = [df_points[:n], df_points[n:2*n], df_points[2*n:3*n], df_points[3*n:]]


## The function calling geopandas.sjoin / pygeos.STRtree in parallel

def thread_func(idxs):
    i, j = idxs
    left = df_points_parts[i]
    right = gdf_parts[j]
    return geopandas.sjoin(left, right, op="within")


def main():
    with ThreadPoolExecutor() as pool:
        res = list(pool.map(thread_func, itertools.product(range(4), range(4))))
        print(pd.concat(res))


if __name__ == "__main__":
    main()

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions