Conversation

@adriangb
Contributor

@adriangb adriangb commented Aug 19, 2025

@adriangb adriangb marked this pull request as ready for review August 19, 2025 17:13
@adriangb adriangb changed the title from "DRAFT: Dynamic filters blog post" to "Dynamic filters blog post" Aug 19, 2025
@adriangb adriangb requested review from alamb and Copilot August 19, 2025 23:51

This comment was marked as outdated.

@alamb
Contributor

alamb commented Aug 20, 2025

I plan to review this PR first thing tomorrow

@adriangb adriangb requested a review from Copilot August 21, 2025 13:31
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a comprehensive blog post about DataFusion's dynamic filters optimization feature. The post explains how dynamic filters enable sideways information passing between operators to achieve significant query performance improvements.

  • Introduces dynamic filters concept and implementation for TopK and Hash Join operators
  • Documents performance improvements showing up to 22x speedups for certain query patterns
  • Provides technical implementation details and future work considerations

adriangb and others added 4 commits August 21, 2025 08:34
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
└───────────────────────────┘
```

## Implementation for Hash Join Operator
Contributor

Do we have some perf results as well here?

Contributor Author

I'll put something together

Contributor Author

I need to either find an inner join query where our current optimization applies and shows a significant benefit, or wait until the other join cases are implemented.


* Support for more types of joins: so far we have only implemented support for hash inner joins. This could be extended both to other physical implementations (e.g. nested loop joins) and to other join types (e.g. left outer joins, cross joins).
* Push down entire hash tables to the scan operator: this could potentially help a lot with join keys that are not naturally ordered or have a lot of skew.
* Use file level statistics to order files to match the `ORDER BY` clause as best we can: this will help TopK dynamic filters be more effective by skipping more work earlier in the scan.
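
The file-ordering idea in the last bullet can be illustrated with a small sketch. This is not DataFusion code: `FileStats`, `top_k_smallest`, and the sample values are hypothetical names and data chosen for illustration, and the query is assumed to be an ascending `ORDER BY ... LIMIT k`. The sketch keeps the current k smallest values in a heap, treats the k-th smallest as the dynamic filter threshold, and skips any file whose minimum statistic cannot beat that threshold.

```python
import heapq
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics for the ORDER BY column."""
    name: str
    min_value: int
    rows: list[int]


def top_k_smallest(files, k, order_by_stats):
    """Return (the k smallest values, number of files actually scanned)."""
    if order_by_stats:
        # Assumption: scanning files with the lowest minimum first
        # tightens the threshold as early as possible.
        files = sorted(files, key=lambda f: f.min_value)
    heap = []  # max-heap via negation; holds the current k smallest values
    scanned = 0
    for f in files:
        # Dynamic filter: once the heap is full, only values below the
        # current k-th smallest (-heap[0]) can change the result.
        if len(heap) == k and f.min_value >= -heap[0]:
            continue  # the whole file is pruned using its statistics
        scanned += 1
        for v in f.rows:
            if len(heap) < k:
                heapq.heappush(heap, -v)
            elif v < -heap[0]:
                heapq.heapreplace(heap, -v)
    return sorted(-v for v in heap), scanned


files = [
    FileStats("a", 90, [90, 95, 99]),
    FileStats("b", 50, [50, 60, 70]),
    FileStats("c", 1, [1, 2, 3]),
]
unordered = top_k_smallest(files, k=2, order_by_stats=False)
ordered = top_k_smallest(files, k=2, order_by_stats=True)
```

With this sample data both calls return the same top-2 values, but the unordered scan has to read all three files, while the statistics-ordered scan reads only file `c` and prunes the other two.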
Contributor

Do we want to add the idea to have a single / merged heap for topk to make the pushdown more selective?

Contributor Author

I decided not to include it because I'm hoping we merge apache/datafusion#16433 before this blog post is published

adriangb and others added 5 commits August 21, 2025 09:06
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
Co-authored-by: Daniël Heres <danielheres@gmail.com>
@alamb
Contributor

alamb commented Aug 22, 2025

Thank you very much for this post @adriangb

It looks like there are some layout / visual things that could be improved (likely due to the quirky pelican markdown rendering)

[Image: Screenshot 2025-08-22 at 6 52 53 AM]

Also, I really like the hook / explanation of the pydantic use case and the results achieved.

I think this blog would be even stronger (i.e. more likely to attract a wider audience) with a more generic background on dynamic filtering before the discussion of how it works in DataFusion. In general I have found that blogs get more readers and attention if they teach more generic database concepts first and then talk about DataFusion specifically (some ideas are here: apache/datafusion#15513 (comment))

Some other thoughts:

  1. I think the blog would be stronger if it led with a diagram / visual
  2. Perhaps we can add an "About the author" and "About DataFusion" section (similar to https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/)

@alamb
Contributor

alamb commented Aug 22, 2025

@adriangb I am happy to help (or just do) any of the above suggestions, but I didn't want to take over and push a bunch of changes to this blog without checking with you first

@adriangb
Contributor Author

> @adriangb I am happy to help (or just do) any of the above suggestions, but I didn't want to take over and push a bunch of changes to this blog without checking with you first

Please go ahead and push! Maybe you can handle those bits and I'll work on getting a figure for join performance?

@alamb
Contributor

alamb commented Aug 22, 2025

> Please go ahead and push! Maybe you can handle those bits and I'll work on getting a figure for join performance?

Will do -- ✍️ 🤓

@alamb
Contributor

alamb commented Aug 22, 2025

I started cleaning this PR up a bit, but I couldn't push commits to it as it was in the pydantic fork, and I don't have permissions

To github.com:apache/datafusion-site.git
 ! [remote rejected] HEAD -> refs/pull/102/head (deny updating a hidden ref)
error: failed to push some refs to 'github.com:apache/datafusion-site.git'

So instead, I made a new PR based on a branch in this repo

Among other benefits, you can now see the previewed version on the staged site: https://datafusion.staged.apache.org/blog/

@alamb
Contributor

alamb commented Aug 22, 2025

Let's keep working on #103 so closing this PR

@alamb alamb closed this Aug 22, 2025
@alamb
Contributor

alamb commented Oct 16, 2025

Development

Successfully merging this pull request may close these issues.

Blog post about TopK filter pushdown

3 participants