Vectorized versions of HllSketch aggregators.#11115
Conversation
The patch uses the same "helper" approach as apache#10767 and apache#10304, and extends the tests to run in both vectorized and non-vectorized modes. Also includes some minor changes to the theta sketch vector aggregator: - Cosmetic changes to make the hll and theta implementations look more similar. - Extends the theta SQL tests to run in vectorized mode.
| public void testThetaSketchPostAggs() throws Exception | ||
| { | ||
| // Can't vectorize due to CONCAT expression. | ||
| cannotVectorize(); |
There was a problem hiding this comment.
hmm, after #11010 CONCAT should be vectorizable, did this fail to vectorize? (You might need to add context to vectorize expressions, I can't remember off the top of my head for tests but it isn't enabled by default yet...)
There was a problem hiding this comment.
Adding the context parameter enabled this test to pass — thanks.
| public void testHllSketchPostAggs() throws Exception | ||
| { | ||
| // Can't vectorize due to CONCAT expression. | ||
| cannotVectorize(); |
There was a problem hiding this comment.
same comment about vectorized concat
| /** | ||
| * Utility for locking positions in a buffer. | ||
| */ | ||
| public class StripedBufferLock |
There was a problem hiding this comment.
I wonder does maybe have wider utility than just this extension? Nothing seems specific to datasketches, should we move this to core to make this available to other aggregators? Though, if we drop the locks from buffer/vector aggs, maybe it isn't needed.
There was a problem hiding this comment.
I deleted it from the latest commit based on the conversation in #11115 (comment).
| { | ||
| final WritableMemory mem = WritableMemory.wrap(buf, ByteOrder.LITTLE_ENDIAN).writableRegion(position, size); | ||
| final Lock lock = stripedLock.getLock(position).readLock(); | ||
| lock.lock(); |
There was a problem hiding this comment.
Hmm, based on discussion in #6381, #3956, #5148 (comment), and a few other places, while Aggregator definitely needs to be threadsafe, i think BufferAggregator only needs it for OffHeapIncrementalIndex (and oak incremental index, https://github.com/apache/druid/pull/10001/files#diff-8a952057b2e59239cbd4cbff14c3202ea7d8cd85b4bb4c174d9375b00a0e7b40R384, if we get that in).
Given that there are many buffer aggregators which are not thread-safe, I think it might make sense to drop this for now, and revisit it as part of #10001, since OffHeapIncremental index isn't really used at ingestion time afaik.
There was a problem hiding this comment.
Makes sense to me. I'll get rid of the locking, and include a comment about where to get it back if we need to revive it in the future.
There was a problem hiding this comment.
See also: #11124, which was inspired by this thread and clears up some stuff.
| @Test | ||
| public void testApproxCountDistinctHllSketch() throws Exception | ||
| { | ||
| // Can't vectorize due to CONCAT expression. |
There was a problem hiding this comment.
super nit: but this is actually due to SUBSTRING i think
There was a problem hiding this comment.
btw, I definitely don't think this is worth fixing in this PR, we can just fix whenever we remove this when we add more vectorized string expressions
Also removes synchronization for the BufferAggregator and VectorAggregator implementations, since it is not necessary (similar to apache#11115). Extends DoublesSketchAggregatorTest and DoublesSketchSqlAggregatorTest to run all test cases in vectorized mode.
* Vectorize the DataSketches quantiles aggregator. Also removes synchronization for the BufferAggregator and VectorAggregator implementations, since it is not necessary (similar to #11115). Extends DoublesSketchAggregatorTest and DoublesSketchSqlAggregatorTest to run all test cases in vectorized mode. * Style fix.
* Update security overview with additional recommendations (apache#11016) * updatee security overview with additional recommendations for improved security * address first set of review questions * Update docs/operations/security-overview.md * Update docs/operations/security-overview.md * apply changes from review * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update docs/operations/security-overview.md Co-authored-by: Suneet Saldanha <suneet@apache.org> * Update security-overview.md fix additional comments & typos cc: @suneet-s, @jihoonsoon Co-authored-by: Suneet Saldanha <suneet@apache.org> * Enable rewriting certain inner joins as filters. (apache#11068) * Enable rewriting certain inner joins as filters. The main logic for doing the rewrite is in JoinableFactoryWrapper's segmentMapFn method. The requirements are: - It must be an inner equi-join. - The right-hand columns referenced by the condition must not contain any duplicate values. (If they did, the inner join would not be guaranteed to return at most one row for each left-hand-side row.) - No columns from the right-hand side can be used by anything other than the join condition itself. HashJoinSegmentStorageAdapter is also modified to pass through to the base adapter (even allowing vectorization!) in the case where 100% of join clauses could be rewritten as filters. In support of this goal: - Add Query getRequiredColumns() method to help us figure out whether the right-hand side of a join datasource is being used or not. - Add JoinConditionAnalysis getRequiredColumns() method to help us figure out if the right-hand side of a join is being used by later join clauses acting on the same base. - Add Joinable getNonNullColumnValuesIfAllUnique method to enable retrieving the set of values that will form the "in" filter. - Add LookupExtractor canGetKeySet() and keySet() methods to support LookupJoinable in its efforts to implement the new Joinable method. - Add "enableRewriteJoinToFilter" feature flag to JoinFilterRewriteConfig. The default is disabled. * Test improvements. * Test fixes. * Avoid slow size() call. * Remove invalid test. * Fix style. * Fix mistaken default. * Small fixes. * Fix logic error. * Doc updates for union datasources. (apache#11103) The main one is updating datasources.md to talk about SQL. (It still said that table unions are not supported in SQL.) Also, this doc update adds some clarifying details on limitations. * [Security] Bump netty4.version from 4.1.48.Final to 4.1.63.Final (apache#11117) * Vectorized versions of HllSketch aggregators. (apache#11115) * Vectorized versions of HllSketch aggregators. The patch uses the same "helper" approach as apache#10767 and apache#10304, and extends the tests to run in both vectorized and non-vectorized modes. Also includes some minor changes to the theta sketch vector aggregator: - Cosmetic changes to make the hll and theta implementations look more similar. - Extends the theta SQL tests to run in vectorized mode. * Updates post-code-review. * Fix javadoc. * Web console: update dev dependencies (apache#11119) * Update some dev dependencies, prettify, tslint-fix * Sort tsconfig keys for easy comparison * Set noImplicitThis * Slightly more accurate types * Bump Jest and related * Bump react to latest on v16 * Bump node-sass, sass-loader for node14 support * Remove node-sass-chokidar (unused) * More unused dependencies * Fix blueprint imports * Webpack 5 * Update webpack config for 'process' usage * Update playwright-chromium * Emit esnext modules for tree shaking * Enable source maps in development * Dedupe * Bump babel and things * npm audit fix * Add .editorconfig file to match prettier settings * Update licenses (tslib is 0BSD as of 1.11.2) microsoft/tslib#96 * Require node >= 10 * Use Node 10 to run e2e tests * Use 'ws' transport mode for dev server (will be default in next version) * Remove an 'any' * No sourcemaps in prod * Exclude .editorconfig from license checks * Try nvm for setting node version Co-authored-by: Charles Smith <38529548+techdocsmith@users.noreply.github.com> Co-authored-by: Suneet Saldanha <suneet@apache.org> Co-authored-by: github-actions <github-actions@github.com> Co-authored-by: Gian Merlino <gianmerlino@gmail.com> Co-authored-by: Sandeep <isandeep41@gmail.com> Co-authored-by: John Gozde <john@gozde.ca>
The patch uses the same "helper" approach as #10767 and #10304, and
extends the tests to run in both vectorized and non-vectorized modes.
Also includes some minor changes to the theta sketch vector aggregator:
more similar.