Skip to content

Integrate tiered spatial join with existing SqlQueryRunner#1325

Merged
rfecher merged 2 commits intolocationtech:masterfrom
JWileczek:sql-join
Apr 25, 2018
Merged

Integrate tiered spatial join with existing SqlQueryRunner#1325
rfecher merged 2 commits intolocationtech:masterfrom
JWileczek:sql-join

Conversation

@JWileczek
Copy link
Copy Markdown
Contributor

Ended up pulling out the logic I was working on attempting to handle compound conditionals after realizing how limited the "working" logic I had was and that it couldn't hope to properly support compound conditional statements that have parenthesis order applied to the logic etc..
Ultimately, this integration method is fairly limited and doesn't work exactly as I theorized due to how Spark plans the sql query execution from the string statement. I was hoping to take a query like this:
"select hail.* from hail, tornado where GeomIntersects(hail.geom,tornado.geom)"
Perform the necessary spatial join beforehand (which I can successfully do) then take the results and overwrite the original views with the new joined frames. Afterwards, removing that where clause and feeding any remaining sql commands to be run back to the original query executor.
"select hail.* from hail, tornado"
Instead, the conditional must remain in the sql statement which I believe means spark will attempt to regenerate the output hashes for the join between the tables, but the generated pairs will already exist on the machine from the join beforehand so it should cache most if not all matching pairs.

Performing this has also put another potential issue/short-coming of the spatial join logic on my radar being how to handle join operations with NOT logic present in the condition.
"select hail.* from hail, tornado where !GeomIntersects(hail.geom,tornado.geom)"
or
"select hail.* from hail, tornado where GeomIntersects(hail.geom,tornado.geom) == false"
or
"select hail.* from hail, tornado where GeomIntersects(hail.geom,tornado.geom) !=(<>) true"
Attempting to perform this logic will almost certainly produce incorrect results because the join currently assumes you want to check that the predicates return is 'true'. This attempted query breaks the join logic because it requires you to exclude anything that successfully passes the filter from the final results.
I was thinking of a potential easy workaround for this being an option that signifies a negativePredicateTest which I would do the normal algorithm but when I go to join the results to the original sets to get the final features I would instead do a subtraction of the RDDs
leftResults = leftOriginalRDD.subtractByKey(combinedResults)
Assuming combinedResults contains only the rowKeys you want to remove this logic should be sound, but I wasn't sure it would work with all predicates? Then there are predicates like GeomDisjoint which technically needs a negative predicate test, but should still check if the predicate itself returns 'true'. I could use some assistance thinking about my API and how to better tie some of these operation dependent ends together without creating a maintenance/extension nightmare for anyone adding new GeomPredicates.

@JWileczek JWileczek requested a review from rfecher April 19, 2018 11:48
@JWileczek
Copy link
Copy Markdown
Contributor Author

Over the weekend I had a mental breakthrough moment regarding how I wanted to refactor and connect all the pieces for the Spark project API. I've made good progress toward that goal and almost have everything back to working state with a few exceptions. I'm trying to figure out how to best apply IndexStrategies to the new GeoWaveIndexedRDD class, but running into issues with Serialization.
Overall I think this API is much cleaner, easier to understand, and should be considerable more extensible than the previous. There were a couple core changes I made, primarily to the visibility of some functions within the SpatialTypeProviders that I felt were safe to make but if there are more appropriate ways to do the changes I made please let me know.
I didn't expect to make such a big overhaul in this PR, but everything clicked rather suddenly and I just started coding without considering the branch I was still working on.

@JWileczek JWileczek force-pushed the sql-join branch 2 times, most recently from 107b190 to b877e8b Compare April 23, 2018 17:12
Copy link
Copy Markdown
Contributor

@rfecher rfecher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor change to consolidate code

return hasLat && hasLon;
}

public static boolean isSpatial(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just delegate isSpatial(PrimaryIndex) to call isSpatial(NumericIndexStrategy) with one extra null check for the index instead of duplicate code?

return hasTime && hasLat && hasLon;
}

public static boolean isSpatialTemporal(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as isSpatial

@rfecher rfecher merged commit 00c9096 into locationtech:master Apr 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants