make bitmap index usage optional for bloom filters #6633
clintropolis wants to merge 3 commits into apache:master
There is no need to add this blank line.
It would be better if you could add the @Nullable annotation to this method.
|
@clintropolis is this #3878? |
|
I think deciding whether or not to use a bitmap index is too much to ask of the clients sending the queries; they are not in a good position to make an optimal decision. Instead, how about adding a config for a cardinality threshold in the bloom filter, so that we only use bitmaps for columns with cardinality below the configured threshold? Additionally, for more flexibility, we can make this threshold overridable via query context. |
|
I agree that there should be a threshold rather than a flag. However, a global config doesn't make a lot of sense. There should be a heuristic (in the future the heuristic could be aided by constant-size info written in the segment format), overridable through the query context. |
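To make the threshold idea concrete, here is a minimal sketch of how such a decision could look. It is not code from this PR or from Druid: the class name, the context key `bloomFilterBitmapIndexMaxCardinality`, and the default of 10,000 are all hypothetical placeholders, and the thread does not establish what a good default or heuristic would be.

```java
import java.util.Map;

// Hypothetical sketch: none of these names exist in Druid or in this PR.
final class BitmapIndexHeuristic
{
  // Assumed query context key, for illustration only.
  static final String CTX_KEY = "bloomFilterBitmapIndexMaxCardinality";

  // Placeholder default; a real value would have to come from benchmarking.
  static final int DEFAULT_MAX_CARDINALITY = 10_000;

  private BitmapIndexHeuristic() {}

  // Resolve the threshold: a query context value wins over the built-in default.
  static int resolveThreshold(Map<String, Object> queryContext)
  {
    if (queryContext == null) {
      return DEFAULT_MAX_CARDINALITY;
    }
    Object value = queryContext.get(CTX_KEY);
    if (value instanceof Number) {
      return ((Number) value).intValue();
    }
    if (value instanceof String) {
      try {
        return Integer.parseInt(((String) value).trim());
      }
      catch (NumberFormatException e) {
        // Unparseable override: fall back to the default.
        return DEFAULT_MAX_CARDINALITY;
      }
    }
    return DEFAULT_MAX_CARDINALITY;
  }

  // Use the bitmap index only when the column cardinality is below the threshold.
  static boolean shouldUseBitmapIndex(int columnCardinality, Map<String, Object> queryContext)
  {
    return columnCardinality < resolveThreshold(queryContext);
  }
}
```

With this shape, a query that sets nothing gets the default behavior, while a user who knows their data can raise or lower the cutoff per query without a cluster-wide config change.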
I think it's trying to achieve the same goal, though this PR is very primitive and manually controlled.
Yeah, I made the parameter optional and defaulted it to skipping indexes to err on the side of likely better performance, but I think you're probably right that it's not useful unless the user is deeply familiar with both the data and the implications of using indexes or not during query processing. I was planning to follow this up with an investigation into what a good threshold would be for setting this value automatically, so that at least a way to control this behavior would be out sooner rather than later. Adaptively using bitmaps or not based on cardinality is probably useful for a lot of filters beyond the bloom filter, so based on the feedback so far I will go ahead and start thinking about a general solution for #3878 as soon as I have a chance. |
…lse for better performance
01f860b to c992415
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@druid.apache.org list. Thank you for your contributions. |
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions. |
In my experiments so far, large filters against higher cardinality dimensions perform better when bitmap indexes are skipped during filtering. Defaulting the newly introduced `useBitmapIndex` parameter to `false` assumes that these sorts of filters will be more useful on such higher cardinality dimensions, but I can default the value to `true` to preserve existing behavior if preferred. I think optimally this value would probably be adaptive based on cardinality.

The metrics below were collected by continuously running identical bloom filter queries, differing only in the value of `useBitmapIndex`, for a period of time.

query/cpu/time

query/time

query/segment/time

In my tests, the effect is more dramatic when the bloom filter is combined with additional filters as part of an `and` filter. I will try to produce metrics to show this as well when I have the chance.