Workaround non-thread-safe use of HLL aggregators. by gianm · Pull Request #3578 · apache/druid

gianm · 2016-10-15T18:41:39Z

Despite the non-thread-safety of HyperLogLogCollector, it is actually currently used
by multiple threads during realtime indexing. HyperUniquesAggregator's "aggregate" and
"get" methods can be called simultaneously by OnheapIncrementalIndex, since its
"doAggregate" and "getMetricObjectValue" methods are not synchronized.

This means that the optimization of HyperLogLogCollector.fold in #3314 (saving and
restoring position rather than duplicating the storage buffer of the right-hand side)
could cause corruption in the face of concurrent writes.
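
To make the hazard concrete, here is a minimal sketch of why save-and-restore on a shared buffer is fragile. It is not Druid's code: the class `SharedPositionHazard` and its helpers are hypothetical, and a `Runnable` invoked mid-loop stands in for a second thread so the interleaving is deterministic. It assumes only the standard `java.nio.ByteBuffer` behavior that relative `get()` calls advance a position cursor, and that the cursor is per buffer object.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

public class SharedPositionHazard {
    /**
     * Hypothetical reader: sums the remaining bytes with relative gets and
     * restores the position afterwards (the save/restore pattern).
     * `interleaved` runs halfway through to simulate another thread.
     */
    static int sumSavingPosition(ByteBuffer buf, Runnable interleaved) {
        final int saved = buf.position();
        try {
            final int n = buf.remaining();
            int sum = 0;
            for (int i = 0; i < n; i++) {
                if (i == n / 2) {
                    interleaved.run();
                }
                sum += buf.get();
            }
            return sum;
        } finally {
            buf.position(saved);
        }
    }

    /** Reader and "other thread" share one buffer object, so they share one cursor. */
    static boolean underflowsWhenShared() {
        ByteBuffer shared = ByteBuffer.wrap(new byte[]{1, 2, 3, 4});
        try {
            sumSavingPosition(shared, () -> shared.get());
            return false;
        } catch (BufferUnderflowException e) {
            return true;
        }
    }

    /** Reader works on a duplicate, so the other party cannot move its cursor. */
    static int sumWithDuplicate() {
        ByteBuffer shared = ByteBuffer.wrap(new byte[]{1, 2, 3, 4});
        return sumSavingPosition(shared.duplicate(), () -> shared.get());
    }

    public static void main(String[] args) {
        System.out.println(underflowsWhenShared()); // true
        System.out.println(sumWithDuplicate());     // 10
    }
}
```

In the shared case the extra `get()` advances the cursor underneath the reader, which then runs off the end of the buffer. Restoring the position afterwards does not help, because the damage happens while the read is in progress.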

This patch works around the issue by duplicating the storage buffer in "get" before
returning a collector. The returned collector still shares data with the original one,
but the situation is no worse than before #3314. In the future we may want to consider
making a thread-safe version of HLLC that avoids these kinds of problems in realtime
indexing. For now, I thought it was best to make a small change that restores the old
behavior.

Fixes #3560.
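
The description's point that the returned collector "still shares data with the original one" follows from how `ByteBuffer.duplicate()` behaves. The sketch below illustrates this with plain `java.nio` only. The class `DuplicateSemantics` and the helper `snapshotView` are hypothetical names, not part of Druid.

```java
import java.nio.ByteBuffer;

public class DuplicateSemantics {
    /** Hypothetical helper: returns a view with its own position, limit and mark. */
    static ByteBuffer snapshotView(ByteBuffer storage) {
        return storage.duplicate();
    }

    public static void main(String[] args) {
        ByteBuffer storage = ByteBuffer.allocate(4);
        ByteBuffer view = snapshotView(storage);

        // Moving the view's cursor leaves the original's cursor untouched.
        view.position(3);
        System.out.println(storage.position()); // 0

        // The bytes themselves are shared: a write through the original
        // is visible through the view.
        storage.put(0, (byte) 42);
        System.out.println(view.get(0)); // 42
    }
}
```

So duplicating protects a reader's cursor from being moved by another thread, but it is not a snapshot: concurrent writes to the underlying bytes can still be observed mid-read, which is why the description calls this a workaround rather than a thread-safe design.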

drcrallen · 2016-10-15T19:32:41Z

processing/src/main/java/io/druid/query/aggregation/hyperloglog/HyperUniquesAggregator.java

  {
-    return collector;
+    // Workaround for OnheapIncrementalIndex's penchant for calling "aggregate" and "get" simultaneously.
+    return HyperLogLogCollector.makeCollector(collector.getStorageBuffer().duplicate());


HLL is already super expensive to calculate. Can you get some benchmarks on how this affects both ingestion and querying?

Sure; do you have any feedback other than that?

@drcrallen, I posted benchmarks; they look pretty similar before and after. This makes sense to me since "get" is not called very often compared to "aggregate".

gianm · 2016-10-15T20:36:40Z

Query benchmarks, including a benchmark with an extraction dim spec to force the DimExtractionTopNAlgorithm (which uses Aggregator rather than BufferAggregator):

fix-3560: topN on HLL

Benchmark                                  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt        Score       Error  Units
TopNBenchmark.queryMultiQueryableIndex                 1            750000           basic.A           10  avgt   25   118507.045 ±  4529.905  us/op
TopNBenchmark.querySingleIncrementalIndex              1            750000           basic.A           10  avgt   25  1045553.387 ± 49598.109  us/op
TopNBenchmark.querySingleQueryableIndex                1            750000           basic.A           10  avgt   25   103220.257 ±  4797.120  us/op

fix-3560: topN on HLL + extraction dimension

Benchmark                                  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt        Score       Error  Units
TopNBenchmark.queryMultiQueryableIndex                 1            750000           basic.A           10  avgt   25   104743.297 ±  4584.774  us/op
TopNBenchmark.querySingleIncrementalIndex              1            750000           basic.A           10  avgt   25  1051282.036 ± 42161.144  us/op
TopNBenchmark.querySingleQueryableIndex                1            750000           basic.A           10  avgt   25   119610.054 ±  4121.940  us/op

0.9.2: topN on HLL

Benchmark                                  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt        Score       Error  Units
TopNBenchmark.queryMultiQueryableIndex                 1            750000           basic.A           10  avgt   25   115235.941 ±  5724.145  us/op
TopNBenchmark.querySingleIncrementalIndex              1            750000           basic.A           10  avgt   25  1039008.121 ± 41863.208  us/op
TopNBenchmark.querySingleQueryableIndex                1            750000           basic.A           10  avgt   25   113842.340 ±  4576.450  us/op

0.9.2: topN on HLL + extraction dimension

Benchmark                                  (numSegments)  (rowsPerSegment)  (schemaAndQuery)  (threshold)  Mode  Cnt        Score       Error  Units
TopNBenchmark.queryMultiQueryableIndex                 1            750000           basic.A           10  avgt   25   114115.535 ± 16587.212  us/op
TopNBenchmark.querySingleIncrementalIndex              1            750000           basic.A           10  avgt   25  1047492.787 ± 43123.218  us/op
TopNBenchmark.querySingleQueryableIndex                1            750000           basic.A           10  avgt   25   114383.215 ±  4976.573  us/op

gianm · 2016-10-15T20:49:33Z

Index persist benchmarks:

fix-3560: 

Benchmark                        (rollup)  (rowsPerSegment)  (schema)  Mode  Cnt        Score       Error  Units
IndexPersistBenchmark.persistV9      true             75000     basic  avgt   25  1083560.470 ± 41714.967  us/op
IndexPersistBenchmark.persistV9     false             75000     basic  avgt   25  1118639.035 ± 39750.698  us/op

0.9.2:

Benchmark                        (rollup)  (rowsPerSegment)  (schema)  Mode  Cnt        Score       Error  Units
IndexPersistBenchmark.persistV9      true             75000     basic  avgt   25  1097836.528 ± 43535.585  us/op
IndexPersistBenchmark.persistV9     false             75000     basic  avgt   25  1161545.577 ± 36522.730  us/op

I don't think other indexing benchmarks are needed since get is only called during persisting.

fjy · 2016-10-17T14:39:45Z

👍

drcrallen · 2016-10-17T15:12:37Z

👍

drcrallen · 2016-10-17T15:12:51Z

Waiting for travis

gianm added the Bug label Oct 15, 2016

gianm added this to the 0.9.2 milestone Oct 15, 2016

gianm mentioned this pull request Oct 15, 2016

BufferUnderflowException in HyperLogLogCollector.fold #3560

Closed

gianm closed this Oct 15, 2016

gianm reopened this Oct 15, 2016

drcrallen requested changes Oct 15, 2016

fjy closed this Oct 17, 2016

fjy reopened this Oct 17, 2016

leventov approved these changes Oct 17, 2016

fjy merged commit 285516b into apache:master Oct 17, 2016

gianm mentioned this pull request Oct 17, 2016

[Backport] Workaround non-thread-safe use of HLL aggregators. #3583

Merged

gianm mentioned this pull request Feb 21, 2017

Thread safe reads for aggregators in IncrementalIndex #3956

Closed

gianm mentioned this pull request Mar 8, 2017

groupBy v2 failing intermittently with complex columns #4026

Closed

gianm mentioned this pull request Apr 24, 2017

BufferUnderflowException when using HLL in TopN query #4199

Closed

jerchung mentioned this pull request May 18, 2017

HLL BufferUnderflowException querying realtime indexing tasks #4296

Closed

seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this pull request Feb 25, 2022

apache#3578 Allow null timestamp for CTAS

da1ae1b

gianm deleted the fix-3560 branch September 23, 2022 19:22
