Hadoop speculative execution is on by default for mappers and reducers
(ref: https://hadoop.apache.org/docs/r2.5.2/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml )
Based on certain criteria, this starts duplicate attempts of the same map/reduce task. Only one attempt "wins" - the rest are killed - but if the attempts have side effects (e.g. the output format and data schema aren't idempotent) then duplicate records can result.
It's also questionable whether we want to generate extra reads/writes against Accumulo.
Might warrant more discussion, but in the ingest framework I think we probably want to disable this by default, disable it in the prototype deduplicating mappers/reducers, and document it in the input/output format examples.
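For reference, a sketch of how the framework-level default could be disabled cluster- or site-wide, using the property names from the 2.5.2 mapred-default.xml linked above:

```xml
<!-- mapred-site.xml: turn off speculative execution for all jobs -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
```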
(looks like there's a convenience method on the Job class to turn it off - I'm assuming it covers both mappers and reducers: ref: http://hadoop.apache.org/docs/r2.5.2/api/org/apache/hadoop/mapreduce/Job.html#setSpeculativeExecution(boolean) )
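A minimal sketch of using that method during job setup (the class and job names here are illustrative, not part of the actual framework):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class IngestJobSetup {
    public static Job createJob(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "ingest-prototype");
        // Disable speculative execution for this job's map and reduce tasks,
        // so no duplicate task attempts are launched with side effects.
        job.setSpeculativeExecution(false);
        return job;
    }
}
```

This sets the job-level properties rather than the cluster defaults, so it protects the ingest jobs even on clusters where speculation is left on.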