Dynamic filters blog post #102
I plan to review this PR first thing tomorrow.
Pull Request Overview
This PR introduces a comprehensive blog post about DataFusion's dynamic filters optimization feature. The post explains how dynamic filters enable sideways information passing between operators to achieve significant query performance improvements.
- Introduces dynamic filters concept and implementation for TopK and Hash Join operators
- Documents performance improvements showing up to 22x speedups for certain query patterns
- Provides technical implementation details and future work considerations
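For readers of this thread who have not seen the post yet, the TopK idea summarized above can be sketched roughly as follows. This is a minimal illustration in Python, not DataFusion's implementation: the `TopK` class, the `scan` function, and the per-batch refresh of the threshold are all assumptions made for the example.

```python
import heapq


class TopK:
    """Keeps the k smallest values seen so far (ORDER BY value ASC LIMIT k)."""

    def __init__(self, k):
        self.k = k
        self.heap = []  # values stored negated, so -heap[0] is the largest kept value

    def threshold(self):
        # The dynamic filter: None until k rows are held, then the current
        # k-th smallest value. Rows >= this value can never enter the result.
        return -self.heap[0] if len(self.heap) == self.k else None

    def insert(self, value):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -value)
        elif value < -self.heap[0]:
            heapq.heapreplace(self.heap, -value)

    def result(self):
        return sorted(-v for v in self.heap)


def scan(batches, topk):
    """Hypothetical scan that re-reads the operator's filter before each batch."""
    skipped = 0
    for batch in batches:
        limit = topk.threshold()
        for value in batch:
            if limit is not None and value >= limit:
                skipped += 1  # filtered out in the scan, never reaches TopK
                continue
            topk.insert(value)
    return skipped


batches = [[50, 10, 40], [30, 20, 60], [70, 5, 80]]
topk = TopK(k=2)
skipped = scan(batches, topk)
print(topk.result(), skipped)
```

The point of the sketch is that the filter tightens as the heap fills with better rows, so later batches are pruned more aggressively than earlier ones.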
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Review comment on the line `## Implementation for Hash Join Operator`:
Do we have some perf results as well here?
I'll put something together
I need to either find an inner join query where our current optimization applies and shows a significant benefit, or wait until the other join cases are implemented.
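To make the kind of inner join case under discussion concrete, here is a rough sketch of a build side that derives a filter for the probe side. It assumes the filter is a simple min/max range on the join key, which this thread does not confirm; the function name and row layout are hypothetical.

```python
def hash_join_with_bounds_filter(build_rows, probe_rows):
    """Inner join on the first tuple element, pruning probe rows by key range."""
    # Build phase: hash the (small) build side.
    table = {}
    for key, payload in build_rows:
        table.setdefault(key, []).append(payload)
    if not table:
        return [], len(probe_rows)  # empty build side: nothing can match

    # Assumed dynamic filter: the range of keys present on the build side.
    lo, hi = min(table), max(table)

    joined, pruned = [], 0
    for key, payload in probe_rows:
        if key < lo or key > hi:
            pruned += 1  # a scan could skip this row without probing
            continue
        for build_payload in table.get(key, []):
            joined.append((key, build_payload, payload))
    return joined, pruned


build = [(10, "a"), (12, "b")]
probe = [(1, "p"), (10, "q"), (11, "r"), (12, "s"), (99, "t")]
joined, pruned = hash_join_with_bounds_filter(build, probe)
print(joined, pruned)
```

A benchmark query would benefit most when the build-side key range is narrow relative to the probe side, since a range filter cannot prune keys that fall inside the range but have no match (key 11 above).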
* Support for more types of joins: we only implemented support for hash inner joins so far. There's the potential to expand this to other join types both in terms of the physical implementation (nested loop joins, etc.) and join type (e.g. left outer joins, cross joins, etc.).
* Push down entire hash tables to the scan operator: this could potentially help a lot with join keys that are not naturally ordered or have a lot of skew.
* Use file level statistics to order files to match the `ORDER BY` clause as best we can: this will help TopK dynamic filters be more effective by skipping more work earlier in the scan.
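The last bullet can be illustrated with a small sketch. It assumes each file carries a `min` statistic for the sort column and that a file can be skipped outright once its minimum cannot beat the current TopK threshold; the file layout and function name are made up for the example and are not DataFusion APIs.

```python
import heapq


def scan_files_topk(files, k, order_by_stats):
    """ORDER BY value ASC LIMIT k over files; returns (result, files opened)."""
    if order_by_stats:
        # Read the files most likely to contain the smallest values first.
        files = sorted(files, key=lambda f: f["min"])
    heap = []  # negated values: -heap[0] is the current k-th smallest
    opened = []
    for f in files:
        # File-level pruning with the dynamic threshold and the min statistic.
        if len(heap) == k and f["min"] >= -heap[0]:
            continue
        opened.append(f["name"])
        for value in f["rows"]:
            if len(heap) < k:
                heapq.heappush(heap, -value)
            elif value < -heap[0]:
                heapq.heapreplace(heap, -value)
    return sorted(-v for v in heap), opened


files = [
    {"name": "a", "min": 100, "rows": [100, 150, 120]},
    {"name": "b", "min": 50, "rows": [50, 70, 60]},
    {"name": "c", "min": 1, "rows": [1, 3, 2]},
]
unordered = scan_files_topk(files, k=2, order_by_stats=False)
ordered = scan_files_topk(files, k=2, order_by_stats=True)
print(unordered, ordered)
```

Both runs return the same rows, but in this toy input the statistics-ordered scan opens one file instead of three because the threshold becomes tight after the first file.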
Do we want to add the idea to have a single / merged heap for topk to make the pushdown more selective?
I decided not to include it because I'm hoping we merge apache/datafusion#16433 before this blog post is published.
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Thank you very much for this post, @adriangb. It looks like there are some layout / visual things that could be improved (likely due to the quirky Pelican markdown rendering).
Also, I really like the hook / explanation of the pydantic use case and the results achieved. I think this blog would be even stronger (that is, likely to attract a wider audience) with a more generic background on dynamic filtering before the discussion of how they work in DataFusion. In general I have found that blogs get more readers and attention if they teach more generic database concepts first and then talk about DataFusion specifically (some ideas are here: apache/datafusion#15513 (comment)). Some other thoughts:
@adriangb I am happy to help (or just do) any of the above suggestions, but I didn't want to take over and push a bunch of changes to this blog without checking with you first.
Please go ahead and push! Maybe you can handle those bits and I'll work on getting a figure for join performance?
Will do -- ✍️ 🤓
I started cleaning this PR up a bit, but I couldn't push commits to it because it is in the pydantic fork and I don't have permissions. So instead, I made a new PR based on a branch in this repo. Among other benefits, you can now see the previewed version on the staged site: https://datafusion.staged.apache.org/blog/
Let's keep working on #103, so I'm closing this PR.
