Integrate tiered spatial join with existing SqlQueryRunner #1325
rfecher merged 2 commits into locationtech:master
Conversation
Over the weekend I had a mental breakthrough regarding how I wanted to refactor and connect all the pieces for the Spark project API. I've made good progress toward that goal and almost have everything back to a working state, with a few exceptions. I'm trying to figure out how best to apply IndexStrategies to the new GeoWaveIndexedRDD class, but I'm running into issues with serialization.
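A common cause of this kind of Spark serialization issue is a non-serializable object captured in a closure. One standard fix, sketched below, is to hold the strategy in a `transient` field and rebuild it lazily from a small serializable recipe after deserialization on each executor. The class and field names here are illustrative, not from the PR:

```java
import java.io.Serializable;

// Illustrative sketch (not the actual GeoWaveIndexedRDD code): assume the
// index strategy itself is not Serializable, so we keep it transient and
// rebuild it on the worker from a serializable configuration string.
public class IndexedRddHelper implements Serializable {
    private static final long serialVersionUID = 1L;

    // Not shipped with the closure; rebuilt lazily after deserialization.
    private transient Object strategy;
    // Small, serializable "recipe" for reconstructing the strategy.
    private final String strategyConfig;

    public IndexedRddHelper(String strategyConfig) {
        this.strategyConfig = strategyConfig;
    }

    public Object getStrategy() {
        if (strategy == null) {
            strategy = buildStrategy(strategyConfig);
        }
        return strategy;
    }

    private static Object buildStrategy(String config) {
        // Placeholder for the real index-strategy factory.
        return "strategy:" + config;
    }
}
```

This avoids serializing the heavyweight strategy object entirely; only the configuration travels with the task closure.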
107b190 to b877e8b
rfecher left a comment
minor change to consolidate code
    return hasLat && hasLon;
}

public static boolean isSpatial(
can we just delegate isSpatial(PrimaryIndex) to call isSpatial(NumericIndexStrategy), with one extra null check for the index, instead of duplicating the code?
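A minimal sketch of the consolidation being suggested, using stand-in classes since the real GeoWave type definitions aren't shown in this thread:

```java
// Hypothetical sketch of the suggested delegation: the PrimaryIndex
// overload only null-checks and forwards, rather than re-implementing
// the lat/lon field checks. The nested classes are stand-ins for the
// real GeoWave types, for illustration only.
public class IndexUtils {
    public static class NumericIndexStrategy {
        private final boolean hasLat, hasLon;
        public NumericIndexStrategy(boolean hasLat, boolean hasLon) {
            this.hasLat = hasLat;
            this.hasLon = hasLon;
        }
        boolean hasLat() { return hasLat; }
        boolean hasLon() { return hasLon; }
    }

    public static class PrimaryIndex {
        private final NumericIndexStrategy strategy;
        public PrimaryIndex(NumericIndexStrategy strategy) {
            this.strategy = strategy;
        }
        NumericIndexStrategy getIndexStrategy() { return strategy; }
    }

    public static boolean isSpatial(NumericIndexStrategy strategy) {
        if (strategy == null) {
            return false;
        }
        return strategy.hasLat() && strategy.hasLon();
    }

    // Delegates instead of duplicating the checks; the only extra logic
    // is the null check on the index itself.
    public static boolean isSpatial(PrimaryIndex index) {
        return index != null && isSpatial(index.getIndexStrategy());
    }
}
```

The same delegation pattern would apply to the isSpatialTemporal pair below.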
    return hasTime && hasLat && hasLon;
}

public static boolean isSpatialTemporal(
Ended up pulling out the logic I was working on for handling compound conditionals, after realizing how limited the "working" logic was: it couldn't hope to properly support compound conditional statements with parenthesized ordering applied to the logic, etc.
Ultimately, this integration method is fairly limited and doesn't work exactly as I theorized, because of how Spark plans the sql query execution from the string statement. I was hoping to take a query like this:
"select hail.* from hail, tornado where GeomIntersects(hail.geom,tornado.geom)"
perform the necessary spatial join beforehand (which I can successfully do), take the results and overwrite the original views with the new joined frames, then remove that where clause and feed any remaining sql commands back to the original query executor:
"select hail.* from hail, tornado"
Instead, the conditional must remain in the sql statement. I believe this means Spark will attempt to regenerate the output hashes for the join between the tables, but the generated pairs will already exist on the machine from the earlier join, so it should cache most if not all matching pairs.
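The rewrite originally hoped for above can be sketched as a simple string transformation. This is illustrative only and handles just the single-predicate shape from the example; a robust version would need Spark's actual SQL parser, since regex cannot cope with nested or compound conditions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of stripping a lone spatial predicate out of the
// WHERE clause so the remaining SQL could be handed back to the normal
// query executor. Only covers the trivial one-predicate case.
public class QueryRewrite {
    private static final Pattern WHERE_GEOM = Pattern.compile(
            "\\s+where\\s+GeomIntersects\\([^)]*\\)\\s*$",
            Pattern.CASE_INSENSITIVE);

    public static String stripSpatialWhere(String sql) {
        Matcher m = WHERE_GEOM.matcher(sql);
        // Leave the query untouched if no simple spatial predicate matches.
        return m.find() ? m.replaceFirst("") : sql;
    }
}
```

As the paragraph above notes, this approach was abandoned: the predicate has to stay in the statement, and the earlier join only helps via caching.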
Performing this has also put another potential issue/shortcoming of the spatial join logic on my radar: how to handle join operations with NOT logic present in the condition.
"select hail.* from hail, tornado where !GeomIntersects(hail.geom,tornado.geom)"
or
"select hail.* from hail, tornado where GeomIntersects(hail.geom,tornado.geom) == false"
or
"select hail.* from hail, tornado where GeomIntersects(hail.geom,tornado.geom) != true" (or the equivalent <> operator)
Attempting to perform this logic will almost certainly produce incorrect results, because the join currently assumes you want to check that the predicate's return is 'true'. This query breaks the join logic because it requires excluding anything that successfully passes the filter from the final results.
I was thinking of a potential easy workaround: an option that signifies a negative predicate test. I would run the normal algorithm, but when joining the results back to the original sets to get the final features, I would instead do a subtraction of the RDDs:
leftResults = leftOriginalRDD.subtractByKey(combinedResults)
Assuming combinedResults contains only the rowKeys you want to remove, this logic should be sound, but I'm not sure it would work with all predicates. Then there are predicates like GeomDisjoint, which technically needs a negative predicate test but should still check whether the predicate itself returns 'true'. I could use some assistance thinking about my API and how to better tie some of these operation-dependent ends together without creating a maintenance/extension nightmare for anyone adding new GeomPredicates.
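The proposed subtraction can be modeled without Spark as a toy sketch over plain maps, just to check the semantics: `subtractByKey` keeps exactly the left-side rows whose keys did NOT appear in the matched set, which is the behavior the negative predicate test needs, assuming combinedResults really does hold only the keys to remove:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model of leftOriginalRDD.subtractByKey(combinedResults) using
// plain collections (no Spark). Keys present in matchedKeys are the
// rows that passed the spatial predicate; the negative-predicate path
// keeps everything else.
public class NegativeJoinSketch {
    public static <K, V> Map<K, V> subtractByKey(
            Map<K, V> left, Set<K> matchedKeys) {
        Map<K, V> result = new HashMap<>();
        for (Map.Entry<K, V> e : left.entrySet()) {
            if (!matchedKeys.contains(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

Note this model also shows why GeomDisjoint is awkward: it wants the subtraction step (disjoint means "did not intersect"), yet the predicate evaluation itself still tests for 'true', so "negative test" and "negative predicate" are two independent axes in the API.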