Add blog post on extending SQL in DataFusion #130
alamb
left a comment
Thank you @geoffreyclaude -- this is really great. I had some small comments but I also think the blog could be published as is.
Note we are having some troubles at the moment with publishing (see https://issues.apache.org/jira/browse/INFRA-27512 for gory details) but I expect that will be sorted out shortly
> If you embed [DataFusion][apache datafusion] in your product, your users will eventually run SQL that DataFusion does not recognize. Not because the query is unreasonable, but because SQL in practice includes many dialects and system-specific statements.
> Suppose you store data as Parquet files on S3 and want users to attach an external catalog to query them. DataFusion has `CREATE EXTERNAL TABLE` for individual tables, but no built-in equivalent for catalogs. DuckDB has `ATTACH`, SQLite has its own variant, but what you really want is something more flexible:
Minor nit suggestion: change "but what you really want is something more flexible" to "and maybe you really want something even more flexible".
> Each stage has extension points.
> `<figure>`
And you should see the static webpage I vibe-coded to tune the svg 😆
> DataFusion turns SQL into executable work in stages:
> 1. **Parse**: SQL text is parsed into an AST (`Statement` from [sqlparser-rs])
Minor -- it would be nice to add links to the docs for these structures:
Statement: https://docs.rs/sqlparser/latest/sqlparser/ast/enum.Statement.html
SqlToRel: https://docs.rs/datafusion/latest/datafusion/sql/planner/struct.SqlToRel.html
LogicalPlan: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html
PhysicalPlanner: https://docs.rs/datafusion/latest/datafusion/physical_planner/trait.PhysicalPlanner.html
ExecutionPlan: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html
(not only might this help readers, it also subtly shows off the documentation available in DataFusion)
> ## 1) Extending parsing: wrapping `DFParser` for custom statements
> The `CREATE EXTERNAL CATALOG` syntax from the introduction fails at the parser because DataFusion only recognizes `CREATE EXTERNAL TABLE`. To support new statement-level syntax, you can **wrap `DFParser`**. Peek ahead to detect your custom syntax, handle it yourself, and delegate everything else to DataFusion.
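The wrap-and-delegate pattern described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: it splits on whitespace instead of using the real `DFParser`/sqlparser token stream, and the `ParsedStatement` type is invented for the sketch, not part of DataFusion's API.

```rust
// Hypothetical sketch of the "peek ahead, handle it yourself, delegate the
// rest" pattern. Real code would wrap DFParser and inspect its TokenStream.
#[derive(Debug, PartialEq)]
enum ParsedStatement {
    /// Our custom statement, handled before DataFusion ever sees the SQL.
    CreateExternalCatalog { name: String },
    /// Everything else: passed through untouched to the inner parser.
    Delegated(String),
}

fn parse_statement(sql: &str) -> ParsedStatement {
    // Peek ahead: does the statement start with our custom keywords?
    let tokens: Vec<String> = sql.split_whitespace().map(|t| t.to_uppercase()).collect();
    if tokens.len() >= 4
        && tokens[0] == "CREATE"
        && tokens[1] == "EXTERNAL"
        && tokens[2] == "CATALOG"
    {
        // Handle the custom syntax ourselves (keep the original-case name).
        let name = sql
            .split_whitespace()
            .nth(3)
            .unwrap()
            .trim_end_matches(';')
            .to_string();
        ParsedStatement::CreateExternalCatalog { name }
    } else {
        // Not ours: delegate the untouched SQL to the wrapped parser.
        ParsedStatement::Delegated(sql.to_string())
    }
}

fn main() {
    println!("{:?}", parse_statement("CREATE EXTERNAL CATALOG my_cat LOCATION 's3://bucket'"));
    println!("{:?}", parse_statement("SELECT 1"));
}
```

The key design point survives the simplification: the wrapper only intercepts statements it recognizes, so every other statement reaches DataFusion's parser byte-for-byte unchanged.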
Suggestion: change "Peek ahead to detect your custom syntax" to "Peek ahead in the token stream to detect your custom syntax".
> [@theirix]'s `TABLESAMPLE` work ([#13563], [#17633]) demonstrated exactly where the gap was: the extension only worked when `TABLESAMPLE` appeared at the query root and any `TABLESAMPLE` inside a CTE or JOIN would error. That limitation motivated [#17843], which introduced `RelationPlanner` to intercept relations at any nesting level. The same hook now supports `PIVOT`, `UNPIVOT`, `TABLESAMPLE`, and can translate dialect-specific FROM-clause syntax (for example, bridging Trino constructs into DataFusion plans).
FYI @theirix ❤️
> Some extensions change what a _relation_ means, not just expressions or types. `RelationPlanner` intercepts FROM-clause constructs while SQL is being converted into a `LogicalPlan`.
> `RelationPlanner` originally came out of trying to build `MATCH_RECOGNIZE` support in DataFusion as a Datadog hackathon project. `MATCH_RECOGNIZE` is a complex SQL feature for detecting patterns in sequences of rows, and it made sense to prototype as an extension first. At the time, DataFusion had no extension point at the right stage of SQL-to-rel planning to intercept and reinterpret relations.
I suggest considering moving these paragraphs about the design history to after showing how it works (strategy A and strategy B sections) and putting it in its own sub-section named something like "Background" or "Origin of the API"
I think that would
- Make this section more consistent with the rest of the sections
- Make it easier to quickly find the (great) examples here for people who are rushing
It might also make sense to mention that RelationPlanner will be available starting in DataFusion 52
> ### Strategy B: custom logical + physical (TABLESAMPLE)
> Sometimes rewriting is not sufficient. `TABLESAMPLE` returns a random subset of rows from a tableand is useful for approximations or debugging on large datasets. Because it requires runtime randomness, you cannot express it as a rewrite to existing operators. Instead, you need a custom logical node and physical operator to execute it.
Suggestion: fix the typo "tableand" to "table and".
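The "requires runtime randomness" point can be made concrete with a toy sketch. This is purely illustrative and not DataFusion's actual operator: a Bernoulli-style `TABLESAMPLE` keeps each row with some probability, so the output depends on a per-row coin flip at execution time, which no static plan rewrite over existing operators can reproduce. A tiny linear congruential generator stands in for a real RNG so the sketch has no dependencies.

```rust
// Minimal LCG (constants from Knuth's MMIX) so the sketch needs no crates.
struct Lcg(u64);

impl Lcg {
    /// Next pseudo-random value in [0, 1), taken from the high bits.
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Keep each row independently with probability `fraction`. The per-row
/// coin flip at execution time is why this needs a custom physical
/// operator rather than a rewrite into existing ones.
fn bernoulli_sample(rows: &[i64], fraction: f64, seed: u64) -> Vec<i64> {
    let mut rng = Lcg(seed);
    rows.iter().copied().filter(|_| rng.next_f64() < fraction).collect()
}

fn main() {
    let rows: Vec<i64> = (0..1000).collect();
    let sample = bernoulli_sample(&rows, 0.3, 42);
    println!("kept {} of {} rows", sample.len(), rows.len());
}
```

In the real extension, this filtering loop would live inside an `ExecutionPlan` implementation that wraps its child's record-batch stream, with the fraction carried down from the custom logical node.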
> `println!("{}", df.logical_plan().display_indent());`
> ### Use `EXPLAIN`
Maybe worth adding a link to the docs: https://datafusion.apache.org/user-guide/sql/explain.html
> ## Acknowledgements
> Thank you to [@jayzhan211] for designing and implementing the original `ExprPlanner` API ([#11180]), to [@goldmedal] for adding `TypePlanner` ([#13294]), and to [@theirix] for the `TABLESAMPLE` work ([#13563], [#17633]) that helped shape `RelationPlanner`. Thank you to [@alamb] for driving DataFusion's extensibility philosophy and for feedback on this post.
Thanks for the quick review @alamb! Let's hold off publishing until DataFusion 52 is released, otherwise we'll have dead links to docs and mention unavailable features 🫣 EDIT: I did a quick pass to update with your suggestions in a new commit.
Makes sense to me -- thanks @geoffreyclaude
❤️
Great read! I had no clue about some of these capabilities. Thanks @geoffreyclaude 😄
Me neither tbh 😄 I learned a lot researching this. Thing is, at Datadog we use the Substrait path, which hooks in after all the SQL parsing magic, so we directly build our custom logical and physical nodes.
Yeah, this is why I think blogs like this are so valuable -- they give a high-level description of what is possible. Without them, people need to dig into the code to figure it out, and deciding to dive into the code is a pretty high bar when just deciding whether to use a system or not. So thank you (again) @geoffreyclaude
With the impending release of DataFusion 52.0.0, I am hoping we can publish the blog early next week (Jan 12, 13) so that we can then refer to it in the DataFusion 52 release blog.
All good for me of course! Especially now that 52 is officially released!
Awesome -- thank you -- I updated the date to today and I plan to publish it shortly |
Ok, let's get this thing published!
The blog is live: https://datafusion.apache.org/blog/2026/01/12/extending-sql/ 🎉
A great read and incredible work - thank you, @geoffreyclaude and reviewers!



Which issue does this PR close?
Rationale for this change
DataFusion's SQL extensibility APIs are powerful but not widely known outside the contributor community. The library user guide added in apache/datafusion#19265 documents the interfaces, but there wasn't a narrative introduction showing when and why you'd use each one.
This post walks through real scenarios and shows which extension point to reach for in each case.
What changes are included in this PR?
New blog post at `content/blog/2025-12-18-extending-sql.md`. The post uses `CREATE EXTERNAL CATALOG` as a running example to show how custom syntax flows through DataFusion's parse → plan → execute pipeline. It then covers each extension point: parser wrapping for custom DDL, `ExprPlanner` for operators like `->>`, `TypePlanner` for dialect-specific types, and `RelationPlanner` for FROM-clause constructs like `PIVOT` and `TABLESAMPLE`. All code snippets link to working examples in `datafusion-examples`. There's also an architecture diagram showing where each hook fits in the pipeline.