Thread safe reads for aggregators in IncrementalIndex #3956

Closed

nishantmonu51 wants to merge 2 commits into apache:master from nishantmonu51:test-concurrency

Conversation

@nishantmonu51 (Member)

Aggregators are NOT thread safe, and if two threads concurrently read and write the same aggregator, the reader may see absurd values because the aggregate method is not atomic.

In the case of IncrementalIndex, the writes are protected by a synchronized block but the reads are unprotected, so it's possible for queries to read absurd values from aggregator.get().

This PR adds a test that can reproduce that behavior by wrapping Aggregators inside a ThreadSafetyAssertionAggregator.

TODO: test any performance impacts.
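To make the race concrete, here is a minimal sketch (a hypothetical class, not code from this PR or from Druid) of why a non-atomic aggregate() can expose absurd intermediate values to a concurrent get():

```java
// Hypothetical illustration, not Druid code: aggregate() performs two
// dependent writes. A get() that runs between them divides the new sum by
// the old count, producing a value no single-threaded execution could yield.
class MeanAggregator {
    private long sum;
    private long count;

    void aggregate(long value) {
        sum += value; // write 1
        count++;      // write 2: a concurrent get() between the writes sees a torn state
    }

    double get() {
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

Protecting only the writes, as IncrementalIndex does, is not enough; the reader must take the same lock (or the aggregator must be made internally thread safe) for get() to see a consistent pair.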

public void aggregate()
{
  delegate1.aggregate();
  Thread.yield();
@nishantmonu51 (Member Author)

If we replace yield with Thread.sleep(1), the issue is reproduced more frequently but this slows down the test considerably as it is done for every aggregate call.

@gianm (Contributor) commented Feb 21, 2017

See also #3578

@nishantmonu51 (Member Author)

Ah, the docs for HLL say it's thread safe, so I was wondering if that might cause issues there.
If they are safe to access then this should be fine. Closing the PR for now.

@gianm (Contributor) commented Feb 21, 2017

Well, it's definitely sketchy to be calling aggregate and get concurrently. HyperLogLogCollector isn't thread safe. It's possible that you'll get bizarre values from time to time, like if the offset of an HLLC is in the process of being incremented in one thread while it's being read in another thread. So I think this PR has value.

@nishantmonu51 reopened this Feb 21, 2017
@leventov (Member) left a comment

Synchronization should be done inside aggregators, because simple aggregators could use cheaper atomics instead of intrinsic locks. If in some use cases aggregators don't need synchronization at all, we can add methods like doAggregateConcurrent() and getConcurrent().
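As a sketch of what "cheaper atomics" could look like for a simple aggregator (a hypothetical class, not Druid's actual LongSumAggregator):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a simple sum aggregator that makes itself thread safe
// with an atomic field, so the caller would not need to guard every read and
// write with an intrinsic lock.
class AtomicLongSumAggregator {
    private final AtomicLong sum = new AtomicLong();

    void aggregate(long value) {
        sum.addAndGet(value); // atomic read-modify-write, no lock
    }

    long get() {
        return sum.get(); // always sees a fully applied update, never a torn one
    }
}
```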

@leventov (Member)

Also if there is just one writing thread, synchronization inside simple aggregators (long/double/float) is not needed at all.
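A sketch of the single-writer case (hypothetical class; note that a volatile field is still useful here, since without it 64-bit writes may be torn and updates may not become visible to query threads):

```java
// Hypothetical sketch: one ingest thread writes, many query threads read.
// No lock or CAS loop is needed; volatile guarantees the 64-bit write is
// untorn and visible to readers.
class SingleWriterLongSum {
    private volatile long sum;

    // Must be called from exactly one writer thread.
    void aggregate(long value) {
        sum = sum + value; // plain read-modify-write is safe with a single writer
    }

    // Safe to call from any thread.
    long get() {
        return sum;
    }
}
```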

@gianm (Contributor) commented Jul 19, 2018

I was just looking at this issue again after the conversations on the mailing list about sketch synchronization: https://lists.apache.org/thread.html/9899aa790a7eb561ab66f47b35c8f66ffe695432719251351339521a@%3Cdev.druid.apache.org%3E

I was wondering, does it make more sense for thread-safety here to be handled systematically (at the IncrementalIndex) or for each aggregator to be thread safe? Currently we take different approaches: the sketch aggregators endeavor to be thread-safe on their own. The primitive aggregators don't bother to even try, and they're probably fine, since they're primitives. HyperLogLogAggregator tries a little bit -- it at least makes sure the different calls use different buffer objects -- but I bet it has a bug where "get" could potentially read something weird and corrupt in some rare situations. (Like if the offset is being updated while a "get" is going on.)

@leventov (Member)

Dealing with this issue systematically means taking the most conservative approach (synchronization), while some aggregators could definitely do better (lock-free).

@Eshcar commented Jul 22, 2018

Synchronization should be done inside aggregators, because simple aggregators could use cheaper atomics instead of intrinsic locks.

I agree. This is what we hope to do now in sketches-core, add concurrent (thread-safe) sketches that use lightweight synchronization. The first step is adding concurrent theta sketch, which can be followed by a concurrent union implementation. Later additional concurrent sketches can be added to the library (we already have an implementation of a concurrent quantile sketch).

@himanshug (Contributor) commented Jul 24, 2018

In general, I agree with @leventov here because different aggregators can handle concurrency with varying degree of efficiency.
Unless, of course, there is a systematic way to do things that takes care of above e.g. introducing "boolean isThreadSafe()" method or something like that on Aggregator and then based on the answer, handle things correctly in IncrementalIndex. Then Aggregators can make the choice.
Or else, I think aggregators not handling it properly are just buggy and should be fixed. Maybe update the aggregator doc with some blurbs on thread safety requirements.

That said, we need synchronization only for the realtime indexing code path, and historical nodes pay the penalty of thread safety unnecessarily. If we could do something systematic to change the two code paths in a way that allows historicals to avoid paying for thread safety, that would be good.

@gianm (Contributor) commented Jan 18, 2019

The isThreadSafe() method sounds like an interesting approach. Or: maybe an asThreadSafe() method on Aggregator / BufferAggregator that returns a thread-safe clone of the aggregator. Some might return this and some might return a new impl that synchronizes stuff.

The idea would be to avoid using synchronization for cases where it's not necessary, like usage of aggregators by query engines.
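A sketch of how the asThreadSafe() idea might look (SimpleAggregator and PlainLongSum are illustrative names, not Druid's actual Aggregator interface):

```java
// Hypothetical sketch of an asThreadSafe() method: the default wraps the
// aggregator so aggregate() and get() are mutually excluded; implementations
// that are already thread safe can override it to return `this`.
interface SimpleAggregator {
    void aggregate(long value);
    long get();

    default SimpleAggregator asThreadSafe() {
        SimpleAggregator delegate = this;
        return new SimpleAggregator() {
            @Override
            public synchronized void aggregate(long value) {
                delegate.aggregate(value);
            }

            @Override
            public synchronized long get() {
                return delegate.get();
            }
        };
    }
}

// Plain, unsynchronized implementation for use where no locking is needed.
class PlainLongSum implements SimpleAggregator {
    private long sum;

    @Override
    public void aggregate(long value) {
        sum += value;
    }

    @Override
    public long get() {
        return sum;
    }
}
```

Under this sketch, query engines would use the plain aggregator directly, while the realtime path would call asThreadSafe() once per aggregator, so only the code path that needs protection pays for it.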

stale bot commented Mar 19, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Mar 19, 2019
stale bot commented Mar 26, 2019

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

stale bot closed this Mar 26, 2019
stale bot commented Aug 27, 2019

This pull request/issue is no longer marked as stale.

stale bot removed the stale label Aug 27, 2019
stale bot commented Oct 26, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Oct 26, 2019
stale bot commented Nov 23, 2019

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
