Implement the Distributed Spatial Join Algorithm by JWileczek · Pull Request #1289 · locationtech/geowave

JWileczek · 2018-03-14T06:58:48Z

Commits have not been squashed because I expect I may need to make a few changes before everything gets pulled in. Half expecting tests to fail because local testing is broken (see below).

A few pain points in this pull request that need to be addressed:

After updating with master and cleaning up my unit tests further, I went to do a final local run of them and ran into the Jackson dependency version issue. I couldn't remember whether this was addressed in the recent changes on master or whether it will be fixed after this PR.
I am currently implementing SpatialJoinRunner to abstract the join selection logic and related work into the query runner and CLI interface, but this is really work related to exposing the join to those various APIs, which will come in later user stories. Should this class be removed before the release since it isn't effectively usable or tested? I can stash it and add it in later requests when it is more fleshed out.
Semi-related to the above, there is a second join implementation using the Dataset API in Spark that has been moved into an experimental package within the project. This join works but is less efficient; however, the Dataset API is supposed to replace the RDD API in Spark, so there could be some value in pursuing this implementation further. I feel there is less pain in keeping this implementation around in code for release, because it does produce correct results; it just isn't the recommended one at this time. However, it too can be removed and added at a later point.
I am not happy with the BucketOperation/BufferOperation interface for dealing with spatial operations that require buffering indices before the join (i.e. distance). I think it could be done in a cleaner way that I don't currently see. A short discussion on the issue would help determine whether a cleaner API can be made quickly.
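
To illustrate why a distance predicate needs buffering before the join, here is a minimal plain-Java sketch. It is not the BucketOperation/BufferOperation interface from this PR: the uniform grid, the `cells` helper, and the "ix:iy" cell ids are all hypothetical stand-ins for the real tiered index. It only shows that two points within the join distance can land in different cells unless one side's envelope is expanded by that distance first.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

class BufferedCellsSketch {
    // Ids ("ix:iy") of every grid cell touched by the envelope after it is
    // expanded by `buffer` on all sides. Hypothetical helper; assumes cellSize > 0.
    public static Set<String> cells(double minX, double minY, double maxX, double maxY,
                                    double buffer, double cellSize) {
        long x0 = (long) Math.floor((minX - buffer) / cellSize);
        long x1 = (long) Math.floor((maxX + buffer) / cellSize);
        long y0 = (long) Math.floor((minY - buffer) / cellSize);
        long y1 = (long) Math.floor((maxY + buffer) / cellSize);
        Set<String> ids = new TreeSet<>();
        for (long ix = x0; ix <= x1; ix++) {
            for (long iy = y0; iy <= y1; iy++) {
                ids.add(ix + ":" + iy);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Two points 0.2 apart that straddle a cell boundary (cell size 1.0).
        Set<String> a = cells(0.9, 0.5, 0.9, 0.5, 0.0, 1.0);
        Set<String> b = cells(1.1, 0.5, 1.1, 0.5, 0.0, 1.0);
        // Same point a, but buffered by a join distance of 0.5.
        Set<String> aBuffered = cells(0.9, 0.5, 0.9, 0.5, 0.5, 1.0);

        System.out.println("unbuffered share a cell: " + !Collections.disjoint(a, b));
        System.out.println("buffered share a cell:   " + !Collections.disjoint(aBuffered, b));
    }
}
```

Buffering only one side is enough in this sketch, because any point within the distance of a geometry also lies inside that geometry's envelope expanded by the same distance.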

rfecher

see line comments

rfecher · 2018-03-16T15:04:32Z

+	}
+
+	@Override
+	public Boolean call(


this is unnecessary

rfecher · 2018-03-16T15:17:05Z

+
+		}
+		//Remove duplicates between tiers
+		//JavaPairRDD<GeoWaveInputKey, ByteArrayId> swappedResults = this.combinedResults.mapToPair(t -> t.swap()).reduceByKey((id1, id2) -> id1);


no commented out code

rfecher · 2018-03-16T15:17:17Z

+		//JavaPairRDD<GeoWaveInputKey, ByteArrayId> swappedResults = this.combinedResults.mapToPair(t -> t.swap()).reduceByKey((id1, id2) -> id1);
+		this.combinedResults = this.combinedResults.reduceByKey((id1,id2) -> id1);
+
+		//swappedResults.cache();


no commented out code
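
For context, the active `reduceByKey((id1, id2) -> id1)` line in this hunk keeps a single value per key, which is how duplicates between tiers are dropped. Below is a minimal plain-Java sketch of the same keep-one merge without any Spark dependency. The string keys and ids are made-up stand-ins for GeoWaveInputKey and ByteArrayId, and the class and method names are hypothetical. Unlike a distributed reduce, this sequential version always keeps the first value encountered.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class KeepOnePerKey {
    // Collapse (key, id) pairs to one id per key, keeping the first id seen.
    public static Map<String, String> dedupe(List<Map.Entry<String, String>> pairs) {
        return pairs.stream().collect(Collectors.toMap(
                Map.Entry::getKey,
                Map.Entry::getValue,
                (id1, id2) -> id1,        // same merge-function shape as in the diff
                LinkedHashMap::new));     // preserve encounter order of keys
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = List.of(
                Map.entry("featureA", "tier1"),
                Map.entry("featureA", "tier2"),   // duplicate key from another tier
                Map.entry("featureB", "tier2"));
        System.out.println(dedupe(pairs));
    }
}
```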

rfecher · 2018-03-16T15:18:05Z

+							}
+						}
+
+						// List<ByteArrayId> insertIds =


no commented out code

rfecher · 2018-03-16T15:18:30Z

+		//Cogroup groups on same tier ByteArrayId and pairs them into Iterable sets.
+		JavaPairRDD<ByteArrayId, Tuple2<Iterable<Tuple2<GeoWaveInputKey, Geometry>>, Iterable<Tuple2<GeoWaveInputKey, Geometry>>>> joinedTiers = leftTier.cogroup(rightTier, highestPartitionCount);
+		//Filter only the pairs that have data on both sides, bucket strategy should have been accounted for by this point.
+		//joinedTiers.cache();


no commented out code
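
The two code comments in this hunk describe a cogroup on the tier id followed by a filter that keeps only keys with data on both sides. A rough plain-Java sketch of that shape follows, with no Spark involved. The `Group` holder, the method names, and the string keys and values are assumptions made for illustration, not types from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CogroupSketch {
    // Left-side and right-side values that share one tier key.
    static final class Group {
        final List<String> left = new ArrayList<>();
        final List<String> right = new ArrayList<>();
    }

    // Group both inputs by key; a key present on only one side gets an empty list on the other.
    public static Map<String, Group> cogroup(List<Map.Entry<String, String>> left,
                                             List<Map.Entry<String, String>> right) {
        Map<String, Group> groups = new TreeMap<>();
        for (Map.Entry<String, String> e : left) {
            groups.computeIfAbsent(e.getKey(), k -> new Group()).left.add(e.getValue());
        }
        for (Map.Entry<String, String> e : right) {
            groups.computeIfAbsent(e.getKey(), k -> new Group()).right.add(e.getValue());
        }
        return groups;
    }

    // Keep only the keys that have at least one value on each side.
    public static Map<String, Group> bothSides(Map<String, Group> groups) {
        Map<String, Group> kept = new TreeMap<>(groups);
        kept.values().removeIf(g -> g.left.isEmpty() || g.right.isEmpty());
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Group> joined = bothSides(cogroup(
                List.of(Map.entry("tier1", "hailA"), Map.entry("tier2", "hailB")),
                List.of(Map.entry("tier1", "tornadoC"), Map.entry("tier3", "tornadoD"))));
        // Only tier1 has data on both sides, so only tier1 survives the filter.
        System.out.println(joined.keySet());
    }
}
```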

rfecher · 2018-03-16T15:18:45Z

+
+				GeomFunction predicate = geomPredicate.value();
+
+				//HashSet<Tuple2<GeoWaveInputKey, Geometry>> resultSet = Sets.newHashSet();


no commented out code

rfecher · 2018-03-16T15:20:24Z

-			throw new ParameterException(
-					"HDFS Base path must start with forward slash /");
-		}
+		// if (!basePath.startsWith("/")) {


no commented out code

rfecher · 2018-03-16T15:20:36Z

-			throw new ParameterException(
-					"HDFS Base path must start with forward slash /");
-		}
+		// if (!basePath.startsWith("/")) {


no commented out code

rfecher · 2018-03-16T15:22:48Z

+			LOGGER.error("Async error in join");
+			e.printStackTrace();
+		}
+		// typedJoin.join(session, hailRDD, tornadoRDD, distancePredicate,


no commented out code

rfecher · 2018-03-16T15:26:02Z

+		FeatureSerializer simpleSerializer = new FeatureSerializer();
+		PersistableSerializer persistSerializer = new PersistableSerializer();
+
+		kryo.register(


should be easy to register all persistables using PersistableFactory.getInstance().getClassIdMapping().entrySet().forEach(e -> kryo.register(e.getKey(), simpleSerializer, e.getValue())); in place of specific concrete class registration.

rfecher · 2018-03-20T15:29:57Z


 		return new CustomIdIndex(
-				XZHierarchicalIndexFactory.createFullIncrementalTieredStrategy(
+				TieredSFCIndexFactory.createFullIncrementalTieredStrategy(


doesn't this change our indexing approach for everything?

rfecher

check out the comment

JWileczek force-pushed the spatial-join branch 2 times, most recently from 063e426 to 548b200 on March 16, 2018 07:37

JWileczek added a commit that referenced this pull request Mar 16, 2018

Implement the Distributed Spatial Join Algorithm #1289

548b200

JWileczek force-pushed the spatial-join branch from 548b200 to aaf7a02 on March 16, 2018 14:18

JWileczek added a commit that referenced this pull request Mar 16, 2018

Implement the Distributed Spatial Join Algorithm (#1289)

aaf7a02

rfecher requested changes Mar 16, 2018

JWileczek force-pushed the spatial-join branch from aaf7a02 to 0497044 on March 16, 2018 17:47

JWileczek added a commit that referenced this pull request Mar 16, 2018

Implement the Distributed Spatial Join Algorithm (#1289)

0497044

JWileczek force-pushed the spatial-join branch from 0497044 to d8afe87 on March 16, 2018 21:05

JWileczek added a commit that referenced this pull request Mar 16, 2018

Implement the Distributed Spatial Join Algorithm (#1289)

d8afe87

JWileczek force-pushed the spatial-join branch from d8afe87 to 12e0d26 on March 19, 2018 15:45

JWileczek added a commit that referenced this pull request Mar 19, 2018

Implement the Distributed Spatial Join Algorithm (#1289)

12e0d26

JWileczek force-pushed the spatial-join branch from 12e0d26 to 1cd53a9 on March 19, 2018 15:50

JWileczek added a commit that referenced this pull request Mar 19, 2018

Implement the Distributed Spatial Join Algorithm (#1289)

1cd53a9

rfecher reviewed Mar 20, 2018

JWileczek force-pushed the spatial-join branch from 1cd53a9 to bac3240 on March 20, 2018 19:14

JWileczek added a commit that referenced this pull request Mar 20, 2018

Implement the Distributed Spatial Join Algorithm (#1289)

bac3240

Implement the Distributed Spatial Join Algorithm (#1289)

1491300

JWileczek force-pushed the spatial-join branch from bac3240 to 1491300 on March 22, 2018 00:55

rfecher approved these changes Mar 22, 2018

rfecher merged commit 199660d into master Mar 22, 2018

rfecher deleted the spatial-join branch March 22, 2018 14:59

