store individual histogram statistic per partition #1339
rfecher merged 3 commits into locationtech:master from
Conversation
b010555 to 5191962
5191962 to 156d01e
jdgarrett left a comment
A few minor things, overall it looks good to me.
				dataAdapter);
		}
		// this is done primarily to ensure stats merging is enabled before the distributed ingest
		if (dataStoreOptions.getFactoryOptions().getStoreOptions().isPersistDataStatistics()) {
This can be moved outside of the if/else block to avoid code duplication.
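The suggested refactoring can be sketched as follows. This is a minimal, self-contained illustration of hoisting a duplicated check out of an if/else block; the names (`configure`, `StoreOptions`, the action strings) are hypothetical stand-ins, not the actual GeoWave code.

```java
// Sketch of moving a check that appears in both branches of an if/else
// to after the block, so it runs exactly once either way.
// All names here are hypothetical stand-ins for the GeoWave options classes.
public class HoistCheck {
  interface StoreOptions {
    boolean isPersistDataStatistics();
  }

  static String configure(boolean vector, StoreOptions opts) {
    StringBuilder actions = new StringBuilder();
    if (vector) {
      actions.append("initVectorAdapter;");
    } else {
      actions.append("initRasterAdapter;");
    }
    // hoisted out of both branches: no duplication, identical behavior
    if (opts.isPersistDataStatistics()) {
      actions.append("enableStatsMerging;");
    }
    return actions.toString();
  }

  public static void main(String[] args) {
    System.out.println(configure(true, () -> true));
  }
}
```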
@@ -0,0 +1,7 @@
package mil.nga.giat.geowave.core.store.adapter.statistics;

public interface DataStatisticsSet<T> extends
Add a Javadoc comment with a brief explanation of the need for this class.
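The requested documentation might look something like the sketch below. The wording is an assumption inferred from the PR title ("store individual histogram statistic per partition"), not taken from the PR; the real interface extends GeoWave's statistics types, which are elided here.

```java
// Sketch (assumed wording) of the Javadoc the reviewer is asking for.
// The real interface extends GeoWave's statistics types; elided here.

/**
 * A statistic that is internally maintained as a set of individual
 * statistics, one per partition (e.g. one histogram per partition key),
 * rather than as a single aggregate across all partitions.
 *
 * @param <T> the entry type the statistics are computed over
 */
interface DataStatisticsSet<T> {
}
```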
import net.sf.json.JSONException;
import net.sf.json.JSONObject;

public class PartitionStatistics<T> extends
Add a Javadoc comment with a brief explanation of this class.
		else {
			// otherwise we keep a map per partition to call ingested once
			// per partition on each partition's individual statistic
			Map<ByteArrayId, List<GeoWaveRow>> rowsPerPartition = new HashMap<>();
Wouldn't it be more efficient to just call entryIngested once for every row, rather than building up this intermediate Map of Lists? It would also eliminate the need to have separate logic for 1 row vs many. I'm not sure what work the map is trying to avoid, since the histograms are cached in histogramPerPartition anyways.
It was collecting rows with the same partition key so it could delegate the array of matching rows to the histogram statistic's callback, rather than one row at a time (in practice, they're often going to have the same partition key). I don't really know if there's an efficiency gain from that, and I agree it minimizes complexity to simply delegate one row at a time.

I think I was thinking more in terms of maintaining the typical contract of calling the callback once per entry rather than once per row. Some stats are more concerned with entries (for example, the count statistic counts entries, not rows), some work on the rows themselves, and some may use both in context. For example, if the individual histogram statistic decided to normalize weight contribution by the number of rows to keep each entry's contribution constant, it would be able to (so if a single entry took up 5 rows, it could add 1/5 of the weight per row). That's not the way it works now, and I can buy that we can just minimize some complexity here, but I was trying to stick to the contract in case something changes with how we keep histograms.
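The grouping strategy discussed above can be sketched as follows. `Row` is a hypothetical stand-in for `GeoWaveRow`, and `String` keys stand in for `ByteArrayId` partition keys; none of these names come from GeoWave itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the PR's per-partition grouping: collect the rows of
// an entry that share a partition key, so the statistic's callback can be
// invoked once per partition with all of that partition's rows.
class PartitionGrouping {
  static class Row {
    final String partitionKey;
    final String value;

    Row(String partitionKey, String value) {
      this.partitionKey = partitionKey;
      this.value = value;
    }
  }

  static Map<String, List<Row>> groupByPartition(List<Row> rows) {
    Map<String, List<Row>> rowsPerPartition = new HashMap<>();
    for (Row r : rows) {
      rowsPerPartition
          .computeIfAbsent(r.partitionKey, k -> new ArrayList<>())
          .add(r);
    }
    return rowsPerPartition;
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(
        new Row("p1", "a"), new Row("p1", "b"), new Row("p2", "c"));
    Map<String, List<Row>> grouped = groupByPartition(rows);
    // The simpler alternative suggested in the review is to skip this
    // grouping and invoke the callback once per row.
    System.out.println(grouped.get("p1").size()); // 2 rows share partition p1
  }
}
```

The trade-off mirrors the thread: grouping preserves a "once per entry (per partition)" callback contract, while per-row delegation removes the intermediate map and the separate single-row/many-row code paths.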
No description provided.