KMeans implementation using Spark ML by blastarr · Pull Request #1137 · locationtech/geowave

blastarr · 2017-07-21T05:08:35Z

No description provided.

rfecher · 2017-07-25T13:21:12Z

@@ -0,0 +1,314 @@
+package mil.nga.giat.geowave.analytic.javaspark;


any chance of adding unit tests here?

rfecher · 2017-07-25T13:22:30Z

+		}
+
+		stopwatch.stop();
+		LOGGER.warn("KMeans runner took " + stopwatch.getTimeString());


warning-level messages aren't appropriate here and aren't user friendly; can we log at debug or info instead?
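
A minimal sketch of the suggested change, assuming `java.util.logging` from the standard library as a stand-in for the project's logger. `TimingLogSketch` and `runTimed` are hypothetical names, and the PR's own `LOGGER` and `stopwatch` objects are not reproduced here:

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical illustration only; not code from this PR. */
final class TimingLogSketch {
    private static final Logger LOGGER =
        Logger.getLogger(TimingLogSketch.class.getName());

    private TimingLogSketch() {}

    /** Runs the task and reports the elapsed time at INFO rather than WARNING. */
    static long runTimed(String label, Runnable task) {
        long start = System.nanoTime();
        task.run();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        // Timing output is routine information, so it does not belong at warning level.
        LOGGER.log(Level.INFO, "{0} took {1} ms", new Object[] {label, elapsedMs});
        return elapsedMs;
    }

    public static void main(String[] args) {
        runTimed("KMeans runner", () -> {
            // The clustering work would go here.
        });
    }
}
```

The same reasoning applies to any other timing message in the runner.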

rfecher · 2017-07-25T13:25:01Z

+		"-i",
+		"--numIterations"
+	}, description = "The number of iterations to run")
+	private Integer numIterations = 20;


was 20 chosen as a default based on anything (i.e. some trial and error, experimentation, or mllib recommendations) or is it more or less arbitrary?

rfecher · 2017-07-25T13:33:16Z

+
+		// Init the algorithm
+		KMeans kmeans = new KMeans();
+		kmeans.setInitializationMode("kmeans||");


there is a "random" initialization mode, and for kmeans|| there is an initialization steps parameter which defaults to 2 - both of these defaults are probably right for the 99% case, but I'm wondering if there's justification to expose additional optional parameters for them or whether to keep it simple. Really, the only justification in my mind is if performance with kmeans|| becomes a challenge when scaling to billions/trillions ... we should test this at some point on billions of geometries

reminder - create a new issue for this

rfecher · 2017-07-25T13:33:52Z

+					"kmeans-hulls");
+
+			stopwatch.stop();
+			LOGGER.warn("KMeans hull generation took " + stopwatch.getTimeString());


again, change from warning

rfecher · 2017-07-25T17:25:14Z

+
+		if (adapterId != null) {
+			ScaledTemporalRange scaledRange = new ScaledTemporalRange();
+			String timeField = FeatureDataUtils.getFirstTimeField(


it seems there should be a couple of possible ways to set this other than using the first time field. The feature data adapter should always have a configured time field (or fields) that it uses for temporal indexing, and perhaps there could be an option to override the indexed time

actually it seems there's a disconnect between what is queried and this ScaledTemporalRange. The adapter ID doesn't have to be set (which makes sense), in which case it'll query all the objects in the table (speaking of which, there is potential for class cast exceptions if those objects aren't features, but we can just ignore that), but the features may then be ignored; it's unclear what the behavior would be. Not sure if this is a corner case we should be thinking about, but it does seem like a disconnect

rfecher · 2017-07-25T17:26:53Z

+	}, description = "Bounding box for spatial query (LL-Lat LL-Lon UR-Lat UR-Lon)")
+	private String bbox = null;
+
+	@Parameter(names = {


this only works on SimpleFeatures so this is probably more intuitively called something like "featureTypeName"

and maybe in the description we could say that this is also the GeoWave adapter ID

rfecher · 2017-07-25T17:38:20Z

+			String bboxStr ) {
+		try {
+			// Expecting bbox in "LL-Lat LL-Lon UR-Lat UR-Lon" format
+			StringTokenizer tok = new StringTokenizer(


a bit of a nit, but everywhere else we use longitude-first, and our bbox expressions through geotools CQL would be longitude-first. To be consistent with everything else I suggest longitude first. Otherwise I also kind of like the idea of just 4 individual parameters for east, west, north, and south so no one needs to second guess the order (for me I'd have to print help every time I wanted to use this, to verify order)...anyway, no big deal one way or the other, just make sure we stay with longitude first

agree. we'll change it to x1 y1 x2 y2
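
A minimal sketch of a longitude-first parser for the agreed "x1 y1 x2 y2" format, using only the standard library. `BboxParser` and `parse` are hypothetical names, not the code that was merged, and the range checks are an assumption added for illustration:

```java
import java.util.Locale;

/** Hypothetical helper, not part of GeoWave: parses "x1 y1 x2 y2" (longitude first). */
final class BboxParser {
    private BboxParser() {}

    /**
     * Returns {x1, y1, x2, y2}, where x is longitude and y is latitude.
     * Throws IllegalArgumentException on malformed or out-of-range input.
     */
    static double[] parse(String bboxStr) {
        if (bboxStr == null) {
            throw new IllegalArgumentException("bbox must not be null");
        }
        String[] tokens = bboxStr.trim().split("\\s+");
        if (tokens.length != 4) {
            throw new IllegalArgumentException(
                "Expected 4 values (x1 y1 x2 y2) but got " + tokens.length);
        }
        double[] bbox = new double[4];
        for (int i = 0; i < 4; i++) {
            try {
                bbox[i] = Double.parseDouble(tokens[i]);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("Not a number: " + tokens[i], e);
            }
            // Even positions are longitudes (x), odd positions are latitudes (y).
            double limit = (i % 2 == 0) ? 180.0 : 90.0;
            if (Double.isNaN(bbox[i]) || Math.abs(bbox[i]) > limit) {
                throw new IllegalArgumentException(
                    "Value out of range at position " + i + ": " + tokens[i]);
            }
        }
        return bbox;
    }

    public static void main(String[] args) {
        double[] b = parse("-77.5 38.5 -76.5 39.5");
        System.out.println(String.format(Locale.ROOT,
            "west=%.1f south=%.1f east=%.1f north=%.1f", b[0], b[1], b[2], b[3]));
    }
}
```

Checking the latitude range catches some latitude-first input early (any longitude beyond ±90 in a latitude slot is rejected), though it cannot detect every swapped pair.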

rfecher · 2017-07-25T17:44:40Z

+
+		while (statsIt.hasNext()) {
+			final DataStatistics stats = statsIt.next();
+			if (stats instanceof FeatureTimeRangeStatistics) {


there's a method public DataStatistics<?> getDataStatistics( ByteArrayId adapterId, ByteArrayId statisticsId, String... authorizations ) on statistics store that is intended to get you exactly the stat you want rather than this iteration...in this case there can be more than one time attribute so this can be incorrect (order of the statistics coming back is nondeterministic), but the statisticsId uses composeId with the attribute name (field ID)

FeatureTimeRangeStatistics.composeId() gets you the statistics ID that can then be used within that call to DataStatisticsStore
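
To illustrate why a keyed lookup is safer than taking the first matching statistic from an iterator, here is a rough sketch using stand-in types only. None of these classes are GeoWave's: `StatLookupSketch`, `StatsStore`, `TimeRangeStat`, and the `"TIME_RANGE#"` ID scheme are all hypothetical, and the real `DataStatisticsStore` and `composeId()` signatures may differ:

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in types for illustration; not GeoWave classes. */
final class StatLookupSketch {
    private StatLookupSketch() {}

    /** Hypothetical time-range statistic for a single attribute. */
    static final class TimeRangeStat {
        final String fieldName;
        final long minMillis;
        final long maxMillis;

        TimeRangeStat(String fieldName, long minMillis, long maxMillis) {
            this.fieldName = fieldName;
            this.minMillis = minMillis;
            this.maxMillis = maxMillis;
        }
    }

    /** Hypothetical ID scheme: statistic type plus attribute name. */
    static String composeId(String fieldName) {
        return "TIME_RANGE#" + fieldName;
    }

    /** Minimal in-memory store keyed by adapter ID and statistics ID. */
    static final class StatsStore {
        private final Map<String, TimeRangeStat> byKey = new HashMap<>();

        void put(String adapterId, TimeRangeStat stat) {
            byKey.put(adapterId + "/" + composeId(stat.fieldName), stat);
        }

        /** Exact lookup; returns null when no such statistic exists. */
        TimeRangeStat get(String adapterId, String statisticsId) {
            return byKey.get(adapterId + "/" + statisticsId);
        }
    }

    public static void main(String[] args) {
        StatsStore store = new StatsStore();
        store.put("tracks", new TimeRangeStat("startTime", 1000L, 2000L));
        store.put("tracks", new TimeRangeStat("endTime", 1500L, 9000L));

        // With two time attributes, asking by ID always returns the intended one,
        // whereas "first instance found while iterating" depends on iteration order.
        TimeRangeStat stat = store.get("tracks", composeId("endTime"));
        System.out.println(stat.fieldName + ": " + stat.minMillis + " to " + stat.maxMillis);
    }
}
```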

rfecher · 2017-07-25T17:45:35Z

+			final DataStorePluginOptions dataStorePlugin,
+			final ByteArrayId adapterId ) {
+		final DataStatisticsStore statisticsStore = dataStorePlugin.createDataStatisticsStore();
+		final CloseableIterator<DataStatistics<?>> statsIt = statisticsStore.getDataStatistics(adapterId);


there's a method public DataStatistics<?> getDataStatistics( ByteArrayId adapterId, ByteArrayId statisticsId, String... authorizations ) on statistics store that is intended to get you exactly the stat you want rather than this iteration...in this case there can be more than one geometry attribute so this can be incorrect (order of the statistics coming back is nondeterministic), but the statisticsId uses composeId with the attribute name (field ID)

rfecher · 2017-07-25T19:17:11Z

+		QueryOptions queryOptions = null;
+		if (adapterId != null) {
+			// Retrieve the adapters
+			CloseableIterator<DataAdapter<?>> adapterIt = inputDataStore.createAdapterStore().getAdapters();


you can just call setAdapterId() on QueryOptions and not need to iterate through adapters to find a match

rfecher · 2017-07-25T19:37:32Z

+		DataAdapter adapter = adapterStore.getAdapter(adapterId);
+
+		if (adapter != null && adapter instanceof GeotoolsFeatureDataAdapter) {
+			GeotoolsFeatureDataAdapter gtAdapter = (GeotoolsFeatureDataAdapter) adapter;


adapter.getTimeDescriptors()

rfecher · 2017-07-25T19:41:17Z

+
+		while (statsIt.hasNext()) {
+			final DataStatistics stats = statsIt.next();
+			if (stats instanceof FeatureTimeRangeStatistics) {


FeatureTimeRangeStatistics.composeId() gets you the statistics ID that can then be used within that call to DataStatisticsStore

rfecher requested changes Jul 25, 2017


blastarr and others added 2 commits July 28, 2017 15:15

KMeans implementation using Spark ML

474532e

merged kmeans-spark with master

7c1f557

rfecher force-pushed the kmeans-spark branch from 197c052 to 7c1f557 Compare July 28, 2017 19:21

rfecher merged commit f4a5e6e into master Jul 28, 2017

rfecher deleted the kmeans-spark branch July 28, 2017 20:24

