Dynamic filters blog post #102

Closed

adriangb wants to merge 14 commits into apache:main from pydantic:dynamic-filters-blog

Contributor

adriangb commented Aug 19, 2025 (edited by alamb)

Closes apache/datafusion#15513 (Blog post about TopK filter pushdown)

adriangb added 3 commits

August 16, 2025 21:07


          start dynamic filters

b3634fb


          start blog post

5a2d721


          update image

0e7ae50

adriangb marked this pull request as ready for review

August 19, 2025 17:13

adriangb added 2 commits

August 19, 2025 18:50


          finish first draft

ccc195d


          cleanup

e1904e8

adriangb changed the title ~~DRAFT: Dynamic filters blog post~~ Dynamic filters blog post

adriangb requested review from alamb and Copilot

August 19, 2025 23:51

adriangb mentioned this pull request

Blog post about TopK filter pushdown apache/datafusion#15513

Closed

This comment was marked as outdated.

Contributor

alamb commented Aug 20, 2025

I plan to review this PR first thing tomorrow

adriangb requested a review from Copilot

August 21, 2025 13:31

Copilot AI reviewed

Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive blog post about DataFusion's dynamic filters optimization feature. The post explains how dynamic filters enable sideways information passing between operators to achieve significant query performance improvements.

* Introduces the dynamic filters concept and its implementation for the TopK and Hash Join operators
* Documents performance improvements showing up to 22x speedups for certain query patterns
* Provides technical implementation details and future work considerations

content/blog/2025-08-16-dynamic-filters.md (6 review comments, all outdated and resolved)

adriangb and others added 4 commits

August 21, 2025 08:34


          fix typos

bb0c116

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


          Update content/blog/2025-08-16-dynamic-filters.md

2da25de

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


          Update content/blog/2025-08-16-dynamic-filters.md

33fb6dd

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


          Update content/blog/2025-08-16-dynamic-filters.md

48a343a

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Dandandan reviewed

content/blog/2025-08-16-dynamic-filters.md (outdated, resolved)

Dandandan reviewed

content/blog/2025-08-16-dynamic-filters.md

+              └───────────────────────────┘
+              ```
+              ## Implementation for Hash Join Operator

Contributor

Dandandan Aug 21, 2025

Do we have some perf results as well here?

Contributor Author

adriangb Aug 21, 2025

I'll put something together

Contributor Author

adriangb Aug 21, 2025

I need to either find an inner join query where our current optimization applies and shows a significant benefit, or wait until the other join cases are implemented.

Dandandan reviewed

content/blog/2025-08-16-dynamic-filters.md (outdated, resolved)

Dandandan reviewed

content/blog/2025-08-16-dynamic-filters.md (outdated, resolved)

Dandandan reviewed

content/blog/2025-08-16-dynamic-filters.md

+              * Support for more types of joins: we only implemented support for hash inner joins so far. There's the potential to expand this to other join types both in terms of the physical implementation (nested loop joins, etc.) and join type (e.g. left outer joins, cross joins, etc.).
+              * Push down entire hash tables to the scan operator: this could potentially help a lot with join keys that are not naturally ordered or have a lot of skew.
+              * Use file level statistics to order files to match the `ORDER BY` clause as best we can: this will help TopK dynamic filters be more effective by skipping more work earlier in the scan.

Contributor

Dandandan Aug 21, 2025

Do we want to add the idea of having a single / merged heap for TopK to make the pushdown more selective?

Contributor Author

adriangb Aug 21, 2025

I decided not to include it because I'm hoping we merge apache/datafusion#16433 before this blog post is published
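
For context on the TopK pushdown discussed in this thread, here is a minimal sketch of how a TopK heap can produce a threshold that a scan uses to skip rows early. It is an illustration only, not DataFusion's implementation: the class `TopKWithThreshold`, the function `scan_with_dynamic_filter`, and the per-batch refresh of the bound are all hypothetical names and choices, and the query is assumed to be `ORDER BY v ASC LIMIT k` over plain integers.

```python
import heapq


class TopKWithThreshold:
    """Hypothetical helper: keeps the k smallest values seen so far."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # max-heap emulated by storing negated values

    def threshold(self):
        """Largest value currently kept, or None until k values are held."""
        if len(self._heap) < self.k:
            return None
        return -self._heap[0]

    def insert(self, value):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -value)
        elif value < -self._heap[0]:
            # The new value beats the current worst entry, so swap it in.
            heapq.heapreplace(self._heap, -value)

    def results(self):
        return sorted(-v for v in self._heap)


def scan_with_dynamic_filter(batches, k):
    """Scan batches, skipping rows that cannot enter the top k.

    The bound is re-read once per batch (an assumption of this sketch),
    mimicking a filter that tightens as the heap fills with better values.
    """
    topk = TopKWithThreshold(k)
    skipped = 0
    for batch in batches:
        bound = topk.threshold()
        for value in batch:
            if bound is not None and value >= bound:
                skipped += 1  # filtered at the "scan" before reaching TopK
                continue
            topk.insert(value)
    return topk.results(), skipped


results, skipped = scan_with_dynamic_filter([[5, 1, 9], [7, 3, 8], [2, 10, 6]], k=2)
print(results, skipped)
```

In this sketch a tighter bound means more rows are skipped. That is the intuition behind the single / merged heap question above: a heap shared across partitions would hold the globally best values, so its threshold could be at least as selective as any per-partition one.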

Dandandan reviewed

content/blog/2025-08-16-dynamic-filters.md (outdated, resolved)

Dandandan reviewed

content/blog/2025-08-16-dynamic-filters.md (outdated, resolved)

adriangb and others added 5 commits

August 21, 2025 09:06


          Update content/blog/2025-08-16-dynamic-filters.md

5d11fdd

Co-authored-by: Daniël Heres <danielheres@gmail.com>


          Update content/blog/2025-08-16-dynamic-filters.md

2135e2d

Co-authored-by: Daniël Heres <danielheres@gmail.com>


          Update content/blog/2025-08-16-dynamic-filters.md

0c7deb3

Co-authored-by: Daniël Heres <danielheres@gmail.com>


          Update content/blog/2025-08-16-dynamic-filters.md

b8184c8

Co-authored-by: Daniël Heres <danielheres@gmail.com>


          Update content/blog/2025-08-16-dynamic-filters.md

230a5f6

Co-authored-by: Daniël Heres <danielheres@gmail.com>

Contributor

alamb commented Aug 22, 2025

Thank you very much for this post @adriangb

It looks like there are some layout / visual things that could be improved (likely due to the quirky Pelican Markdown rendering)

[Screenshot, 2025-08-22 at 6:52:53 AM]

Also, I really like the hook / explanation of the pydantic use case and the results achieved.

I think this blog would be even stronger (i.e. more likely to attract a wider audience of readers) with a more generic background on dynamic filtering before the discussion of how they work in DataFusion. In general I have found that blogs get more readers and attention if they teach more generic database concepts first and then talk about DataFusion specifically (some ideas are here: apache/datafusion#15513 (comment))

Some other thoughts:

* I think the blog would be stronger if it led with a diagram / visual
* Perhaps we can add an "About the author" and "About DataFusion" section (similar to https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/)

Contributor

alamb commented Aug 22, 2025

@adriangb I am happy to help (or just do) any of the above suggestions, but I didn't want to take over and push a bunch of changes to this blog without checking with you first

Contributor Author

adriangb commented Aug 22, 2025

> @adriangb I am happy to help (or just do) any of the above suggestions, but I didn't want to take over and push a bunch of changes to this blog without checking with you first

Please go ahead and push! Maybe you can handle those bits and I'll work on getting a figure for join performance?

Contributor

alamb commented Aug 22, 2025

> Please go ahead and push! Maybe you can handle those bits and I'll work on getting a figure for join performance?

Will do -- ✍️ 🤓

alamb mentioned this pull request

Dynamic filters blog post (rev 2) #103

Merged

Contributor

alamb commented Aug 22, 2025

I started cleaning this PR up a bit, but I couldn't push commits to it as it was in the pydantic fork, and I don't have permissions

To github.com:apache/datafusion-site.git
 ! [remote rejected] HEAD -> refs/pull/102/head (deny updating a hidden ref)
error: failed to push some refs to 'github.com:apache/datafusion-site.git'

So instead, I made a new PR based on a branch in this repo

Dynamic filters blog post (rev 2) #103

Among other benefits, you can now see the previewed version on the staged site: https://datafusion.staged.apache.org/blog/

Contributor

alamb commented Aug 22, 2025

Let's keep working on #103, so I'm closing this PR

alamb closed this

Contributor

alamb commented Oct 16, 2025

URL is https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/
