Description
I've got a somewhat complicated procedure that I need to parallelize across cores. I need to do a bunch of geographical checking, comparing ~700k lat/long points against a set of geographical boundaries and gauging each point's distance from those boundaries. I have pure Python code that does this, but I can't parallelize it with the standard multiprocessing library because the shapely objects can't be serialized. I've got some dask code that implements this, but it just hangs when I run it.
I can see multiple cores fire up, but only one is working at a time (0% CPU showing on all but one core). In this post (http://bit.ly/1U8JhmJ), I read @mrocklin's comments and wondered whether my use case just doesn't lend itself to multiprocessing with dask, because it's pretty memory inefficient. To multiprocess, the geographical boundaries have to be farmed out to each of the cores, and I thought that with dask I might be able to avoid that by taking advantage of shared memory (otherwise it's terribly memory inefficient). But I think that with the GIL, a copy of the geographical boundaries still has to be farmed out to each core. Am I off base here, or does this sound right?
I had come to the conclusion that I would just need to move to a lower-level language to get around the GIL, but I talked to @chdoig today and she suggested I at least bring this up and let you know what I'm trying to use dask for.
On a less related note, cool product! While I haven't been able to get it to do what I want, I think this is an awesome tool.