Skip to content

store individual histogram statistic per partition#1339

Merged
rfecher merged 3 commits intolocationtech:masterfrom
rfecher:partition-histo-rebase
Jun 5, 2018
Merged

store individual histogram statistic per partition#1339
rfecher merged 3 commits intolocationtech:masterfrom
rfecher:partition-histo-rebase

Conversation

@rfecher
Copy link
Copy Markdown
Contributor

@rfecher rfecher commented May 31, 2018

No description provided.

@coveralls
Copy link
Copy Markdown

coveralls commented Jun 1, 2018

Coverage Status

Coverage increased (+0.3%) to 47.233% when pulling 5d61214 on rfecher:partition-histo-rebase into b181e64 on locationtech:master.

@rfecher rfecher force-pushed the partition-histo-rebase branch 4 times, most recently from b010555 to 5191962 Compare June 4, 2018 15:38
@rfecher rfecher force-pushed the partition-histo-rebase branch from 5191962 to 156d01e Compare June 4, 2018 16:36
Copy link
Copy Markdown
Contributor

@jdgarrett jdgarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor things, overall it looks good to me.

dataAdapter);
}
//this is done primarily to ensure stats merging is enabled before the distributed ingest
if (dataStoreOptions.getFactoryOptions().getStoreOptions().isPersistDataStatistics()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be moved outside of the if/else block to avoid code duplication.

@@ -0,0 +1,7 @@
package mil.nga.giat.geowave.core.store.adapter.statistics;

public interface DataStatisticsSet<T> extends
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc comment with brief explanation of the need for this class.

import net.sf.json.JSONException;
import net.sf.json.JSONObject;

public class PartitionStatistics<T> extends
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc comment with a brief explanation of this class.

else {
// otherwise we keep a map per partition to call ingested once
// per partition on each partition's individual statistic
Map<ByteArrayId, List<GeoWaveRow>> rowsPerPartition = new HashMap<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be more efficient to just call entryIngested once for every row, rather than building up this intermediate Map of Lists? It would also eliminate the need to have separate logic for 1 row vs many. I'm not sure what work the map is trying to avoid, since the histograms are cached in histogramPerPartition anyways.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was collecting rows of the same partition key to delegate the array of matching partition keys to the histogram statistic's callback rather than single row at a time (in practice, they're often going to have the same partition key). I don't really know if there's an efficiency gain for that, and I agree, it minimizes complexity to simply delegate a row at a time. I think I was thinking more in terms of maintaining the typical contract of calling the callback once per entry rather than once per row. Some stats are more concerned about entries (for example, the count counts entries, not rows) and some work on the rows themselves while some may use both in context. For example, if the individual histogram statistic decided to normalize weight contribution by the number of rows to keep each entry's contribution constant it'd be able to (so if a single entry took up 5 rows, it could add 1/5 the weight per row). That's not the way it works now, and can buy that we can just minimize some complexity here, but yeah, just trying to stick to the contract in case something changes with how we keep histograms.

@rfecher rfecher merged commit f1d32f1 into locationtech:master Jun 5, 2018
@rfecher rfecher deleted the partition-histo-rebase branch June 5, 2018 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants