Skip to content

KMeans implementation using Spark ML#1137

Merged
rfecher merged 2 commits intomasterfrom
kmeans-spark
Jul 28, 2017
Merged

KMeans implementation using Spark ML#1137
rfecher merged 2 commits intomasterfrom
kmeans-spark

Conversation

@blastarr
Copy link
Copy Markdown
Contributor

No description provided.

@@ -0,0 +1,314 @@
package mil.nga.giat.geowave.analytic.javaspark;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance to make unit tests here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DELETE ME

}

stopwatch.stop();
LOGGER.warn("KMeans runner took " + stopwatch.getTimeString());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning messages aren't particularly appropriate or user friendly (can we do debug or info instead)

"-i",
"--numIterations"
}, description = "The number of iterations to run")
private Integer numIterations = 20;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was 20 chosen as a default based on anything (ie. some trial and error, experimentation, or mllib recommendations) or is it more or less random?


// Init the algorithm
KMeans kmeans = new KMeans();
kmeans.setInitializationMode("kmeans||");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a "random" initialization mode, and for kmeans|| there is initiatialization steps which defaults to 2 - both of these is probably the 99% case but I'm wondering if there's justification to have additional optional parameters for these or to keep it simple. Really, the only justification in my mind is likely if scaling to billions/trillions is a challenge re: performance with kmeans|| ... we should test this at some point on billions of geometries

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder - create a new issue for this

"kmeans-hulls");

stopwatch.stop();
LOGGER.warn("KMeans hull generation took " + stopwatch.getTimeString());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, change from warning


if (adapterId != null) {
ScaledTemporalRange scaledRange = new ScaledTemporalRange();
String timeField = FeatureDataUtils.getFirstTimeField(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems there should be a couple possible ways to set this outside of using the first time field, the feature data adapter should always have a configured time field(s) that it uses for temporal indexing, and perhaps there's an option to override the indexed time

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it seems there's a disconnect between what is queried and this ScaledTemporalRange...the adapter ID doesn't have to be set (which makes sense) and it'll query all the objects in the table (speaking of which, there is potential for class cast exceptions if those objects aren't features, but we can just ignore that) but the features are then ignored maybe, perhaps unclear what the behavior would be...not sure if this is a corner case we should be thinking about, but it does seem like a disconnect

}, description = "Bounding box for spatial query (LL-Lat LL-Lon UR-Lat UR-Lon)")
private String bbox = null;

@Parameter(names = {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this onle works on SimpleFeatures so this is probably more intuitively called something like "featureTypeName"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and maybe in the description we could say that this is also the GeoWave adapter ID

String bboxStr ) {
try {
// Expecting bbox in "LL-Lat LL-Lon UR-Lat UR-Lon" format
StringTokenizer tok = new StringTokenizer(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a bit of a nit, but really everywhere else we use longitude-first, and our bbox expressions through geotools CQL would be longitude-first. To be more consistent with everything else I suggest longitude first. Otherwise I also kind of like the idea of just 4 individual parameters for east, west, north, and south so no one needs to second guess the order (for me I'd have to print help every time I wanted to use this, to verify order)...anyways no big deal one way or the other, just make sure we stay with longitude first

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree. we'll change it to x1 y1 x2 y2


while (statsIt.hasNext()) {
final DataStatistics stats = statsIt.next();
if (stats instanceof FeatureTimeRangeStatistics) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a method public DataStatistics<?> getDataStatistics( ByteArrayId adapterId, ByteArrayId statisticsId, String... authorizations ) on statistics store that is intended to get you exactly the stat you want rather than this iteration...in this case there can be more than one time attribute so this can be incorrect (order of the statistics coming back is nondeterministic), but the statisticsId uses composeId with the attribute name (field ID)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FeatureTimeRangeStatistics.composeId() gets you the statistics ID that can then be used within that call to DataStatisticsStore

final DataStorePluginOptions dataStorePlugin,
final ByteArrayId adapterId ) {
final DataStatisticsStore statisticsStore = dataStorePlugin.createDataStatisticsStore();
final CloseableIterator<DataStatistics<?>> statsIt = statisticsStore.getDataStatistics(adapterId);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a method public DataStatistics<?> getDataStatistics( ByteArrayId adapterId, ByteArrayId statisticsId, String... authorizations ) on statistics store that is intended to get you exactly the stat you want rather than this iteration...in this case there can be more than one geometry attribute so this can be incorrect (order of the statistics coming back is nondeterministic), but the statisticsId uses composeId with the attribute name (field ID)

QueryOptions queryOptions = null;
if (adapterId != null) {
// Retrieve the adapters
CloseableIterator<DataAdapter<?>> adapterIt = inputDataStore.createAdapterStore().getAdapters();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just setAdapterId() on queryoptions and not need to iterate through adapters to find a match

DataAdapter adapter = adapterStore.getAdapter(adapterId);

if (adapter != null && adapter instanceof GeotoolsFeatureDataAdapter) {
GeotoolsFeatureDataAdapter gtAdapter = (GeotoolsFeatureDataAdapter) adapter;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adapter.getTimeDescriptors()


while (statsIt.hasNext()) {
final DataStatistics stats = statsIt.next();
if (stats instanceof FeatureTimeRangeStatistics) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FeatureTimeRangeStatistics.composeId() gets you the statistics ID that can then be used within that call to DataStatisticsStore

@rfecher rfecher merged commit f4a5e6e into master Jul 28, 2017
@rfecher rfecher deleted the kmeans-spark branch July 28, 2017 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants