Thread safe reads for aggregators in IncrementalIndex #3956

Closed

nishantmonu51 wants to merge 2 commits into apache:master from nishantmonu51:test-concurrency

Conversation

@nishantmonu51 (Member)

Aggregators are NOT thread safe, and if two threads concurrently read and write the same aggregator, the reader may see absurd values because the aggregate method is not atomic.

In the case of IncrementalIndex, the writes are protected by a synchronized block but the reads are unprotected, so it's possible for queries to read absurd values from aggregator.get().

This PR adds a test that can reproduce that behavior by wrapping Aggregators inside a ThreadSafetyAssertionAggregator.

TODO: test any performance impacts.
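To make the race concrete, here is a minimal sketch (a hypothetical class, not code from this PR or from Druid) of why a non-atomic aggregate() can expose absurd intermediate values to a concurrent get():

```java
// Hypothetical illustration, not Druid code: aggregate() performs two
// dependent writes. A get() that runs between them divides the new sum by
// the old count, producing a value no single-threaded execution could yield.
class MeanAggregator {
    private long sum;
    private long count;

    void aggregate(long value) {
        sum += value; // write 1
        count++;      // write 2: a concurrent get() between the writes sees a torn state
    }

    double get() {
        return count == 0 ? 0.0 : (double) sum / count;
    }
}
```

Protecting only the writes, as IncrementalIndex does, is not enough; the reader must take the same lock (or the aggregator must be made internally thread safe) for get() to see a consistent pair.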

public void aggregate()
{
  delegate1.aggregate();
  Thread.yield();
@nishantmonu51 (Member Author)

If we replace yield with Thread.sleep(1), the issue is reproduced more frequently but this slows down the test considerably as it is done for every aggregate call.

@gianm (Contributor) commented Feb 21, 2017

See also #3578

@nishantmonu51 (Member Author)

Ah, the docs for HLL say it's thread safe, so I was wondering if that might cause issues there.
If they are safe to access then this should be fine. Closing the PR for now.

@gianm (Contributor) commented Feb 21, 2017

Well, it's definitely sketchy to be calling aggregate and get concurrently. HyperLogLogCollector isn't thread safe. It's possible that you'll get bizarre values from time to time, like if the offset of an HLLC is in the process of being incremented in one thread while it's being read in another thread. So I think this PR has value.

@nishantmonu51 reopened this Feb 21, 2017
@leventov (Member) left a comment

Synchronization should be done inside aggregators, because simple aggregators could use cheaper atomics instead of intrinsic locks. If in some use cases aggregators don't need synchronization at all, we can add methods like doAggregateConcurrent() and getConcurrent().
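As a sketch of what "cheaper atomics" could look like for a simple aggregator (a hypothetical class, not Druid's actual LongSumAggregator):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a simple sum aggregator that makes itself thread safe
// with an atomic field, so the caller would not need to guard every read and
// write with an intrinsic lock.
class AtomicLongSumAggregator {
    private final AtomicLong sum = new AtomicLong();

    void aggregate(long value) {
        sum.addAndGet(value); // atomic read-modify-write, no lock
    }

    long get() {
        return sum.get(); // always sees a fully applied update, never a torn one
    }
}
```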

@leventov (Member)

Also if there is just one writing thread, synchronization inside simple aggregators (long/double/float) is not needed at all.
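A sketch of the single-writer case (hypothetical class; note that a volatile field is still useful here, since without it 64-bit writes may be torn and updates may not become visible to query threads):

```java
// Hypothetical sketch: one ingest thread writes, many query threads read.
// No lock or CAS loop is needed; volatile guarantees the 64-bit write is
// untorn and visible to readers.
class SingleWriterLongSum {
    private volatile long sum;

    // Must be called from exactly one writer thread.
    void aggregate(long value) {
        sum = sum + value; // plain read-modify-write is safe with a single writer
    }

    // Safe to call from any thread.
    long get() {
        return sum;
    }
}
```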

@gianm (Contributor) commented Jul 19, 2018

I was just looking at this issue again after the conversations on the mailing list about sketch synchronization: https://lists.apache.org/thread.html/9899aa790a7eb561ab66f47b35c8f66ffe695432719251351339521a@%3Cdev.druid.apache.org%3E

I was wondering, does it make more sense for thread-safety here to be handled systematically (at the IncrementalIndex) or for each aggregator to be thread safe? Currently we take different approaches: the sketch aggregators endeavor to be thread-safe on their own. The primitive aggregators don't bother to even try, and they're probably fine, since they're primitives. HyperLogLogAggregator tries a little bit -- it at least makes sure the different calls use different buffer objects -- but I bet it has a bug where "get" could potentially read something weird and corrupt in some rare situations. (Like if the offset is being updated while a "get" is going on.)

@leventov (Member)

Dealing with this issue systematically means taking the most conservative approach (synchronization), while some aggregators could definitely do better (lock-free).

@Eshcar commented Jul 22, 2018

Synchronization should be done inside aggregators, because simple aggregators could use cheaper atomics instead of intrinsic locks.

I agree. This is what we hope to do now in sketches-core, add concurrent (thread-safe) sketches that use lightweight synchronization. The first step is adding concurrent theta sketch, which can be followed by a concurrent union implementation. Later additional concurrent sketches can be added to the library (we already have an implementation of a concurrent quantile sketch).

@himanshug (Contributor) commented Jul 24, 2018

In general, I agree with @leventov here because different aggregators can handle concurrency with varying degree of efficiency.
Unless, of course, there is a systematic way to do things that takes care of above e.g. introducing "boolean isThreadSafe()" method or something like that on Aggregator and then based on the answer, handle things correctly in IncrementalIndex. Then Aggregators can make the choice.
Or else, I think aggregators not handling it properly are just buggy and should be fixed. Maybe update the aggregator doc with some blurbs on thread safety requirements.

That said, we need synchronization only for the realtime indexing code path, and historical nodes pay the penalty of thread safety unnecessarily. If we could do something systematic to change the two code paths in a way that allows historicals to avoid paying for thread safety, that would be good.

@gianm (Contributor) commented Jan 18, 2019

The isThreadSafe() method sounds like an interesting approach. Or: maybe an asThreadSafe() method on Aggregator / BufferAggregator that returns a thread-safe clone of the aggregator. Some might return this and some might return a new impl that synchronizes stuff.

The idea would be to avoid using synchronization for cases where it's not necessary, like usage of aggregators by query engines.
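A sketch of how the asThreadSafe() idea might look (SimpleAggregator and PlainLongSum are illustrative names, not Druid's actual Aggregator interface):

```java
// Hypothetical sketch of an asThreadSafe() method: the default wraps the
// aggregator so aggregate() and get() are mutually excluded; implementations
// that are already thread safe can override it to return `this`.
interface SimpleAggregator {
    void aggregate(long value);
    long get();

    default SimpleAggregator asThreadSafe() {
        SimpleAggregator delegate = this;
        return new SimpleAggregator() {
            @Override
            public synchronized void aggregate(long value) {
                delegate.aggregate(value);
            }

            @Override
            public synchronized long get() {
                return delegate.get();
            }
        };
    }
}

// Plain, unsynchronized implementation for use where no locking is needed.
class PlainLongSum implements SimpleAggregator {
    private long sum;

    @Override
    public void aggregate(long value) {
        sum += value;
    }

    @Override
    public long get() {
        return sum;
    }
}
```

Under this sketch, query engines would use the plain aggregator directly, while the realtime path would call asThreadSafe() once per aggregator, so only the code path that needs protection pays for it.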

stale bot commented Mar 19, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Mar 19, 2019
stale bot commented Mar 26, 2019

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

stale bot closed this Mar 26, 2019
stale bot commented Aug 27, 2019

This pull request/issue is no longer marked as stale.

stale bot removed the stale label Aug 27, 2019
stale bot commented Oct 26, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Oct 26, 2019
stale bot commented Nov 23, 2019

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
