Skip to content

Vectorized versions of HllSketch aggregators.#11115

Merged
gianm merged 3 commits intoapache:masterfrom
gianm:performance-vectorize-datasketches-hll
Apr 17, 2021
Merged

Vectorized versions of HllSketch aggregators.#11115
gianm merged 3 commits intoapache:masterfrom
gianm:performance-vectorize-datasketches-hll

Conversation

@gianm
Copy link
Contributor

@gianm gianm commented Apr 14, 2021

The patch uses the same "helper" approach as #10767 and #10304, and
extends the tests to run in both vectorized and non-vectorized modes.

Also includes some minor changes to the theta sketch vector aggregator:

  • Cosmetic changes to make the hll and theta implementations look
    more similar.
  • Extends the theta SQL tests to run in vectorized mode.

The patch uses the same "helper" approach as apache#10767 and apache#10304, and
extends the tests to run in both vectorized and non-vectorized modes.

Also includes some minor changes to the theta sketch vector aggregator:

- Cosmetic changes to make the hll and theta implementations look
  more similar.
- Extends the theta SQL tests to run in vectorized mode.
public void testThetaSketchPostAggs() throws Exception
{
// Can't vectorize due to CONCAT expression.
cannotVectorize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, after #11010 CONCAT should be vectorizable, did this fail to vectorize? (You might need to add context to vectorize expressions, I can't remember off the top of my head for tests but it isn't enabled by default yet...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the context parameter enabled this test to pass — thanks.

public void testHllSketchPostAggs() throws Exception
{
// Can't vectorize due to CONCAT expression.
cannotVectorize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about vectorized concat

/**
* Utility for locking positions in a buffer.
*/
public class StripedBufferLock
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder does maybe have wider utility than just this extension? Nothing seems specific to datasketches, should we move this to core to make this available to other aggregators? Though, if we drop the locks from buffer/vector aggs, maybe it isn't needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted it from the latest commit based on the conversation in #11115 (comment).

{
final WritableMemory mem = WritableMemory.wrap(buf, ByteOrder.LITTLE_ENDIAN).writableRegion(position, size);
final Lock lock = stripedLock.getLock(position).readLock();
lock.lock();
Copy link
Member

@clintropolis clintropolis Apr 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, based on discussion in #6381, #3956, #5148 (comment), and a few other places, while Aggregator definitely needs to be threadsafe, i think BufferAggregator only needs it for OffHeapIncrementalIndex (and oak incremental index, https://github.com/apache/druid/pull/10001/files#diff-8a952057b2e59239cbd4cbff14c3202ea7d8cd85b4bb4c174d9375b00a0e7b40R384, if we get that in).

Given that there are many buffer aggregators which are not thread-safe, I think it might make sense to drop this for now, and revisit it as part of #10001, since OffHeapIncremental index isn't really used at ingestion time afaik.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. I'll get rid of the locking, and include a comment about where to get it back if we need to revive it in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also: #11124, which was inspired by this thread and clears up some stuff.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

@Test
public void testApproxCountDistinctHllSketch() throws Exception
{
// Can't vectorize due to CONCAT expression.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: but this is actually due to SUBSTRING i think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I definitely don't think this is worth fixing in this PR, we can just fix whenever we remove this when we add more vectorized string expressions

@gianm gianm merged commit f2b54de into apache:master Apr 17, 2021
@gianm gianm deleted the performance-vectorize-datasketches-hll branch April 17, 2021 01:45
gianm added a commit to gianm/druid that referenced this pull request Apr 29, 2021
Also removes synchronization for the BufferAggregator and VectorAggregator
implementations, since it is not necessary (similar to apache#11115).

Extends DoublesSketchAggregatorTest and DoublesSketchSqlAggregatorTest
to run all test cases in vectorized mode.
clintropolis pushed a commit that referenced this pull request May 2, 2021
* Vectorize the DataSketches quantiles aggregator.

Also removes synchronization for the BufferAggregator and VectorAggregator
implementations, since it is not necessary (similar to #11115).

Extends DoublesSketchAggregatorTest and DoublesSketchSqlAggregatorTest
to run all test cases in vectorized mode.

* Style fix.
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 22, 2021
* Update security overview with additional recommendations (apache#11016)

* updatee security overview with additional recommendations for improved security

* address first set of review questions

* Update docs/operations/security-overview.md

* Update docs/operations/security-overview.md

* apply changes from review

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update docs/operations/security-overview.md

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Update security-overview.md

fix additional comments & typos cc: @suneet-s, @jihoonsoon

Co-authored-by: Suneet Saldanha <suneet@apache.org>

* Enable rewriting certain inner joins as filters. (apache#11068)

* Enable rewriting certain inner joins as filters.

The main logic for doing the rewrite is in JoinableFactoryWrapper's
segmentMapFn method. The requirements are:

- It must be an inner equi-join.
- The right-hand columns referenced by the condition must not contain any
  duplicate values. (If they did, the inner join would not be guaranteed
  to return at most one row for each left-hand-side row.)
- No columns from the right-hand side can be used by anything other than
  the join condition itself.

HashJoinSegmentStorageAdapter is also modified to pass through to
the base adapter (even allowing vectorization!) in the case where 100%
of join clauses could be rewritten as filters.

In support of this goal:

- Add Query getRequiredColumns() method to help us figure out whether
  the right-hand side of a join datasource is being used or not.
- Add JoinConditionAnalysis getRequiredColumns() method to help us
  figure out if the right-hand side of a join is being used by later
  join clauses acting on the same base.
- Add Joinable getNonNullColumnValuesIfAllUnique method to enable
  retrieving the set of values that will form the "in" filter.
- Add LookupExtractor canGetKeySet() and keySet() methods to support
  LookupJoinable in its efforts to implement the new Joinable method.
- Add "enableRewriteJoinToFilter" feature flag to
  JoinFilterRewriteConfig. The default is disabled.

* Test improvements.

* Test fixes.

* Avoid slow size() call.

* Remove invalid test.

* Fix style.

* Fix mistaken default.

* Small fixes.

* Fix logic error.

* Doc updates for union datasources. (apache#11103)

The main one is updating datasources.md to talk about SQL. (It still said
that table unions are not supported in SQL.) Also, this doc update adds
some clarifying details on limitations.

* [Security] Bump netty4.version from 4.1.48.Final to 4.1.63.Final (apache#11117)

* Vectorized versions of HllSketch aggregators. (apache#11115)

* Vectorized versions of HllSketch aggregators.

The patch uses the same "helper" approach as apache#10767 and apache#10304, and
extends the tests to run in both vectorized and non-vectorized modes.

Also includes some minor changes to the theta sketch vector aggregator:

- Cosmetic changes to make the hll and theta implementations look
  more similar.
- Extends the theta SQL tests to run in vectorized mode.

* Updates post-code-review.

* Fix javadoc.

* Web console: update dev dependencies (apache#11119)

* Update some dev dependencies, prettify, tslint-fix

* Sort tsconfig keys for easy comparison

* Set noImplicitThis

* Slightly more accurate types

* Bump Jest and related

* Bump react to latest on v16

* Bump node-sass, sass-loader for node14 support

* Remove node-sass-chokidar (unused)

* More unused dependencies

* Fix blueprint imports

* Webpack 5

* Update webpack config for 'process' usage

* Update playwright-chromium

* Emit esnext modules for tree shaking

* Enable source maps in development

* Dedupe

* Bump babel and things

* npm audit fix

* Add .editorconfig file to match prettier settings

* Update licenses (tslib is 0BSD as of 1.11.2)

microsoft/tslib#96

* Require node >= 10

* Use Node 10 to run e2e tests

* Use 'ws' transport mode for dev server (will be default in next version)

* Remove an 'any'

* No sourcemaps in prod

* Exclude .editorconfig from license checks

* Try nvm for setting node version

Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com>
Co-authored-by: Suneet Saldanha <suneet@apache.org>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Co-authored-by: Sandeep <isandeep41@gmail.com>
Co-authored-by: John Gozde <john@gozde.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants