Add blog post on extending SQL in DataFusion #130
alamb
left a comment
Thank you @geoffreyclaude -- this is really great. I had some small comments but I also think the blog could be published as is.
Note we are having some troubles at the moment with publishing (see https://issues.apache.org/jira/browse/INFRA-27512 for gory details) but I expect that will be sorted out shortly
> If you embed [DataFusion][apache datafusion] in your product, your users will eventually run SQL that DataFusion does not recognize. Not because the query is unreasonable, but because SQL in practice includes many dialects and system-specific statements.
> Suppose you store data as Parquet files on S3 and want users to attach an external catalog to query them. DataFusion has `CREATE EXTERNAL TABLE` for individual tables, but no built-in equivalent for catalogs. DuckDB has `ATTACH`, SQLite has its own variant, but what you really want is something more flexible:
Minor nit suggestion: change "but what you really want is something more flexible" to "and maybe you really want something even more flexible".
> Each stage has extension points.
> `<figure>`
And you should see the static webpage I vibe-coded to tune the svg 😆
> DataFusion turns SQL into executable work in stages:
> 1. **Parse**: SQL text is parsed into an AST (`Statement` from [sqlparser-rs])
Minor -- it would be nice to add links to the docs for these structures:
Statement: https://docs.rs/sqlparser/latest/sqlparser/ast/enum.Statement.html
SqlToRel: https://docs.rs/datafusion/latest/datafusion/sql/planner/struct.SqlToRel.html
LogicalPlan: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.LogicalPlan.html
PhysicalPlanner: https://docs.rs/datafusion/latest/datafusion/physical_planner/trait.PhysicalPlanner.html
ExecutionPlan: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html
(not only might this help readers, it also subtly shows off the documentation available in DataFusion)
> ## 1) Extending parsing: wrapping `DFParser` for custom statements
> The `CREATE EXTERNAL CATALOG` syntax from the introduction fails at the parser because DataFusion only recognizes `CREATE EXTERNAL TABLE`. To support new statement-level syntax, you can **wrap `DFParser`**. Peek ahead to detect your custom syntax, handle it yourself, and delegate everything else to DataFusion.
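The wrap-and-delegate pattern described above can be sketched in a few lines. This is a minimal, hypothetical illustration only: it splits on whitespace instead of using the real `DFParser`/sqlparser token stream, and the `ParsedStatement` type is invented for the sketch, not part of DataFusion's API.

```rust
// Hypothetical sketch of the "peek ahead, handle it yourself, delegate the
// rest" pattern. Real code would wrap DFParser and inspect its TokenStream.
#[derive(Debug, PartialEq)]
enum ParsedStatement {
    /// Our custom statement, handled before DataFusion ever sees the SQL.
    CreateExternalCatalog { name: String },
    /// Everything else: passed through untouched to the inner parser.
    Delegated(String),
}

fn parse_statement(sql: &str) -> ParsedStatement {
    // Peek ahead: does the statement start with our custom keywords?
    let tokens: Vec<String> = sql.split_whitespace().map(|t| t.to_uppercase()).collect();
    if tokens.len() >= 4
        && tokens[0] == "CREATE"
        && tokens[1] == "EXTERNAL"
        && tokens[2] == "CATALOG"
    {
        // Handle the custom syntax ourselves (keep the original-case name).
        let name = sql
            .split_whitespace()
            .nth(3)
            .unwrap()
            .trim_end_matches(';')
            .to_string();
        ParsedStatement::CreateExternalCatalog { name }
    } else {
        // Not ours: delegate the untouched SQL to the wrapped parser.
        ParsedStatement::Delegated(sql.to_string())
    }
}

fn main() {
    println!("{:?}", parse_statement("CREATE EXTERNAL CATALOG my_cat LOCATION 's3://bucket'"));
    println!("{:?}", parse_statement("SELECT 1"));
}
```

The key design point survives the simplification: the wrapper only intercepts statements it recognizes, so every other statement reaches DataFusion's parser byte-for-byte unchanged.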
Suggestion: change "Peek ahead to detect your custom syntax" to "Peek ahead in the token stream to detect your custom syntax".
> [@theirix]'s `TABLESAMPLE` work ([#13563], [#17633]) demonstrated exactly where the gap was: the extension only worked when `TABLESAMPLE` appeared at the query root and any `TABLESAMPLE` inside a CTE or JOIN would error. That limitation motivated [#17843], which introduced `RelationPlanner` to intercept relations at any nesting level. The same hook now supports `PIVOT`, `UNPIVOT`, `TABLESAMPLE`, and can translate dialect-specific FROM-clause syntax (for example, bridging Trino constructs into DataFusion plans).
FYI @theirix ❤️
> Some extensions change what a _relation_ means, not just expressions or types. `RelationPlanner` intercepts FROM-clause constructs while SQL is being converted into a `LogicalPlan`.
> `RelationPlanner` originally came out of trying to build `MATCH_RECOGNIZE` support in DataFusion as a Datadog hackathon project. `MATCH_RECOGNIZE` is a complex SQL feature for detecting patterns in sequences of rows, and it made sense to prototype as an extension first. At the time, DataFusion had no extension point at the right stage of SQL-to-rel planning to intercept and reinterpret relations.
I suggest considering moving these paragraphs about the design history to after showing how it works (strategy A and strategy B sections) and putting it in its own sub-section named something like "Background" or "Origin of the API"
I think that would
- Make this section more consistent with the rest of the sections
- Make it easier to quickly find the (great) examples here for people who are rushing
It might also make sense to mention that RelationPlanner will be available starting in DataFusion 52
> ### Strategy B: custom logical + physical (TABLESAMPLE)
> Sometimes rewriting is not sufficient. `TABLESAMPLE` returns a random subset of rows from a tableand is useful for approximations or debugging on large datasets. Because it requires runtime randomness, you cannot express it as a rewrite to existing operators. Instead, you need a custom logical node and physical operator to execute it.
Suggestion: fix the typo "tableand" to "table and".
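The "requires runtime randomness" point can be made concrete with a toy sketch. This is purely illustrative and not DataFusion's actual operator: a Bernoulli-style `TABLESAMPLE` keeps each row with some probability, so the output depends on a per-row coin flip at execution time, which no static plan rewrite over existing operators can reproduce. A tiny linear congruential generator stands in for a real RNG so the sketch has no dependencies.

```rust
// Minimal LCG (constants from Knuth's MMIX) so the sketch needs no crates.
struct Lcg(u64);

impl Lcg {
    /// Next pseudo-random value in [0, 1), taken from the high bits.
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Keep each row independently with probability `fraction`. The per-row
/// coin flip at execution time is why this needs a custom physical
/// operator rather than a rewrite into existing ones.
fn bernoulli_sample(rows: &[i64], fraction: f64, seed: u64) -> Vec<i64> {
    let mut rng = Lcg(seed);
    rows.iter().copied().filter(|_| rng.next_f64() < fraction).collect()
}

fn main() {
    let rows: Vec<i64> = (0..1000).collect();
    let sample = bernoulli_sample(&rows, 0.3, 42);
    println!("kept {} of {} rows", sample.len(), rows.len());
}
```

In the real extension, this filtering loop would live inside an `ExecutionPlan` implementation that wraps its child's record-batch stream, with the fraction carried down from the custom logical node.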
> `println!("{}", df.logical_plan().display_indent());`
> ### Use `EXPLAIN`
Maybe worth adding a link to the docs: https://datafusion.apache.org/user-guide/sql/explain.html
> ## Acknowledgements
> Thank you to [@jayzhan211] for designing and implementing the original `ExprPlanner` API ([#11180]), to [@goldmedal] for adding `TypePlanner` ([#13294]), and to [@theirix] for the `TABLESAMPLE` work ([#13563], [#17633]) that helped shape `RelationPlanner`. Thank you to [@alamb] for driving DataFusion's extensibility philosophy and for feedback on this post.
Thanks for the quick review @alamb! Let's hold off publishing until DataFusion 52 is released, otherwise we'll have dead links to docs and mention unavailable features 🫣 EDIT: I did a quick pass to update with your suggestions in a new commit.
Makes sense to me -- thanks @geoffreyclaude
❤️
Great read! I had no clue about some of these capabilities. Thanks @geoffreyclaude 😄
Me neither tbh 😄 I learned a lot researching this. Thing is, at Datadog we use the Substrait path, which hooks in after all the SQL parsing magic, so we directly build our custom logical and physical nodes.
Yeah, this is why I think blogs like this are so valuable -- they give a high-level description of what is possible. Without them, people need to dig into the code to figure it out, and deciding to dive into the code is a pretty high bar when just deciding whether to use a system or not. So thank you (again) @geoffreyclaude
With the impending release of DataFusion 52.0.0, I am hoping we can publish the blog early next week (Jan 12, 13) so that we can then refer to it in the DataFusion 52 release blog.
All good for me of course! Especially now that 52 is officially released!
Awesome -- thank you -- I updated the date to today and I plan to publish it shortly |
Ok, let's get this thing published!
The blog is live: https://datafusion.apache.org/blog/2026/01/12/extending-sql/ 🎉
A great read and incredible work - thank you, @geoffreyclaude and reviewers!



Which issue does this PR close?
Rationale for this change
DataFusion's SQL extensibility APIs are powerful but not widely known outside the contributor community. The library user guide added in apache/datafusion#19265 documents the interfaces, but there wasn't a narrative introduction showing when and why you'd use each one.
This post walks through real scenarios and shows which extension point to reach for in each case.
What changes are included in this PR?
New blog post at `content/blog/2025-12-18-extending-sql.md`. The post uses `CREATE EXTERNAL CATALOG` as a running example to show how custom syntax flows through DataFusion's parse → plan → execute pipeline. It then covers each extension point: parser wrapping for custom DDL, `ExprPlanner` for operators like `->>`, `TypePlanner` for dialect-specific types, and `RelationPlanner` for FROM-clause constructs like `PIVOT` and `TABLESAMPLE`. All code snippets link to working examples in `datafusion-examples`. There's also an architecture diagram showing where each hook fits in the pipeline.