Skip to content

Conversation

@geoffreyclaude
Copy link
Contributor

@geoffreyclaude geoffreyclaude commented Dec 18, 2025

Which issue does this PR close?

Rationale for this change

DataFusion's SQL extensibility APIs are powerful but not widely known outside the contributor community. The library user guide added in apache/datafusion#19265 documents the interfaces, but there wasn't a narrative introduction showing when and why you'd use each one.

This post walks through real scenarios and shows which extension point to reach for in each case.

What changes are included in this PR?

New blog post at content/blog/2025-12-18-extending-sql.md.

The post uses CREATE EXTERNAL CATALOG as a running example to show how custom syntax flows through DataFusion's parse → plan → execute pipeline. It then covers each extension point: parser wrapping for custom DDL, ExprPlanner for operators like ->>, TypePlanner for dialect-specific types, and RelationPlanner for FROM-clause constructs like PIVOT and TABLESAMPLE.

All code snippets link to working examples in datafusion-examples. There's also an architecture diagram showing where each hook fits in the pipeline.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @geoffreyclaude -- this is really great. I had some small comments but I also think the blog could be published as is.

Note we are having some troubles at the moment with publishing (see https://issues.apache.org/jira/browse/INFRA-27512 for gory details) but I expect that will be sorted out shortly


If you embed [DataFusion][apache datafusion] in your product, your users will eventually run SQL that DataFusion does not recognize. Not because the query is unreasonable, but because SQL in practice includes many dialects and system-specific statements.

Suppose you store data as Parquet files on S3 and want users to attach an external catalog to query them. DataFusion has `CREATE EXTERNAL TABLE` for individual tables, but no built-in equivalent for catalogs. DuckDB has `ATTACH`, SQLite has its own variant, but what you really want is something more flexible:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit suggestion:

Suggested change
Suppose you store data as Parquet files on S3 and want users to attach an external catalog to query them. DataFusion has `CREATE EXTERNAL TABLE` for individual tables, but no built-in equivalent for catalogs. DuckDB has `ATTACH`, SQLite has its own variant, but what you really want is something more flexible:
Suppose you store data as Parquet files on S3 and want users to attach an external catalog to query them. DataFusion has `CREATE EXTERNAL TABLE` for individual tables, but no built-in equivalent for catalogs. DuckDB has `ATTACH`, SQLite has its own variant, and maybe you really want something even more flexible:


Each stage has extension points.

<figure>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an amazing figure ❤️
Image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you should see the static webpage I vibe-coded to tune the svg 😆


DataFusion turns SQL into executable work in stages:

1. **Parse**: SQL text is parsed into an AST (`Statement` from [sqlparser-rs])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


## 1) Extending parsing: wrapping `DFParser` for custom statements

The `CREATE EXTERNAL CATALOG` syntax from the introduction fails at the parser because DataFusion only recognizes `CREATE EXTERNAL TABLE`. To support new statement-level syntax, you can **wrap `DFParser`**. Peek ahead to detect your custom syntax, handle it yourself, and delegate everything else to DataFusion.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `CREATE EXTERNAL CATALOG` syntax from the introduction fails at the parser because DataFusion only recognizes `CREATE EXTERNAL TABLE`. To support new statement-level syntax, you can **wrap `DFParser`**. Peek ahead to detect your custom syntax, handle it yourself, and delegate everything else to DataFusion.
The `CREATE EXTERNAL CATALOG` syntax from the introduction fails at the parser because DataFusion only recognizes `CREATE EXTERNAL TABLE`. To support new statement-level syntax, you can **wrap `DFParser`**. Peek ahead in the token stream to detect your custom syntax, handle it yourself, and delegate everything else to DataFusion.


`RelationPlanner` originally came out of trying to build `MATCH_RECOGNIZE` support in DataFusion as a Datadog hackathon project. `MATCH_RECOGNIZE` is a complex SQL feature for detecting patterns in sequences of rows, and it made sense to prototype as an extension first. At the time, DataFusion had no extension point at the right stage of SQL-to-rel planning to intercept and reinterpret relations.

[@theirix]'s `TABLESAMPLE` work ([#13563], [#17633]) demonstrated exactly where the gap was: the extension only worked when `TABLESAMPLE` appeared at the query root and any `TABLESAMPLE` inside a CTE or JOIN would error. That limitation motivated [#17843], which introduced `RelationPlanner` to intercept relations at any nesting level. The same hook now supports `PIVOT`, `UNPIVOT`, `TABLESAMPLE`, and can translate dialect-specific FROM-clause syntax (for example, bridging Trino constructs into DataFusion plans).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI @theirix ❤️


Some extensions change what a _relation_ means, not just expressions or types. `RelationPlanner` intercepts FROM-clause constructs while SQL is being converted into a `LogicalPlan`.

`RelationPlanner` originally came out of trying to build `MATCH_RECOGNIZE` support in DataFusion as a Datadog hackathon project. `MATCH_RECOGNIZE` is a complex SQL feature for detecting patterns in sequences of rows, and it made sense to prototype as an extension first. At the time, DataFusion had no extension point at the right stage of SQL-to-rel planning to intercept and reinterpret relations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest considering moving these paragraphs about the design history to after showing how it works (strategy A and strategy B sections) and putting it in its own sub section named something like like "Background" or "Origin of the API"

I think that would

  1. Make this section more consistent with the rest of the sections
  2. Make it easier to quickly find the (great) examples here for people who are rushing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also make sense to mention that RelationPlanner will be available starting in DataFusion 52


### Strategy B: custom logical + physical (TABLESAMPLE)

Sometimes rewriting is not sufficient. `TABLESAMPLE` returns a random subset of rows from a tableand is useful for approximations or debugging on large datasets. Because it requires runtime randomness, you cannot express it as a rewrite to existing operators. Instead, you need a custom logical node and physical operator to execute it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Sometimes rewriting is not sufficient. `TABLESAMPLE` returns a random subset of rows from a tableand is useful for approximations or debugging on large datasets. Because it requires runtime randomness, you cannot express it as a rewrite to existing operators. Instead, you need a custom logical node and physical operator to execute it.
Sometimes rewriting is not sufficient. `TABLESAMPLE` returns a random subset of rows from a table and is useful for approximations or debugging on large datasets. Because it requires runtime randomness, you cannot express it as a rewrite to existing operators. Instead, you need a custom logical node and physical operator to execute it.

println!("{}", df.logical_plan().display_indent());
```

### Use `EXPLAIN`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth adding a link to the docs: https://datafusion.apache.org/user-guide/sql/explain.html


## Acknowledgements

Thank you to [@jayzhan211] for designing and implementing the original `ExprPlanner` API ([#11180]), to [@goldmedal] for adding `TypePlanner` ([#13294]), and to [@theirix] for the `TABLESAMPLE` work ([#13563], [#17633]) that helped shape `RelationPlanner`. Thank you to [@alamb] for driving DataFusion's extensibility philosophy and for feedback on this post.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@geoffreyclaude
Copy link
Contributor Author

geoffreyclaude commented Dec 19, 2025

Thank you @geoffreyclaude -- this is really great. I had some small comments but I also think the blog could be published as is.

Note we are having some troubles at the moment with publishing (see https://issues.apache.org/jira/browse/INFRA-27512 for gory details) but I expect that will be sorted out shortly

Thanks for the quick review @alamb ! Let's hold off publishing until DataFusion 52 is released, otherwise we'll have dead links to docs and mention unavailable features 🫣

EDIT: I did a quick pass to update with your suggestions in a new commit.

@alamb
Copy link
Contributor

alamb commented Dec 20, 2025

Thanks for the quick review @alamb ! Let's hold off publishing until DataFusion 52 is released, otherwise we'll have dead links to docs and mention unavailable features 🫣

Makes sense to me -- thanks @geoffreyclaude

EDIT: I did a quick pass to update with your suggestions in a new commit.

❤️

@gene-bordegaray
Copy link
Contributor

gene-bordegaray commented Dec 20, 2025

Great read! I had no clue about some of these capabilities. Thanks @geoffreyclaude 😄

@geoffreyclaude
Copy link
Contributor Author

Great read! I had no clue about some of these capabilities. Thanks @geoffreyclaude 😄

Me neither tbh 😄I learned a lot researching this. Thing is, at Datadog we use the Substrait path which hooks in after all the SQL parsing magic, so we directly build our custom logical and physical nodes.

@alamb
Copy link
Contributor

alamb commented Dec 21, 2025

Great read! I had no clue about some of these capabilities. Thanks @geoffreyclaude 😄

Me neither tbh 😄I learned a lot researching this. Thing is, at Datadog we use the Substrait path which hooks in after all the SQL parsing magic, so we directly build our custom logical and physical nodes.

Yeah, this is why I think blogs like this are so valuable -- they give a high level description of what is possible. Without them people need to be into the code to figure it out, and deciding to dive into the code is a pretty high bar when just deciding to use a system or not

So thank you (again) @geoffreyclaude

@alamb
Copy link
Contributor

alamb commented Jan 10, 2026

With the impending release of DataFusion 52.0.0

I am hoping we can publish blog early next week (Jan 12, 13) so that we can then refer to it in the DataFusion 52 release blog

@geoffreyclaude
Copy link
Contributor Author

With the impending release of DataFusion 52.0.0

I am hoping we can publish blog early next week (Jan 12, 13) so that we can then refer to it in the DataFusion 52 release blog

All good for me of course! Especially now that 52 is officially released!

@alamb
Copy link
Contributor

alamb commented Jan 12, 2026

With the impending release of DataFusion 52.0.0

I am hoping we can publish blog early next week (Jan 12, 13) so that we can then refer to it in the DataFusion 52 release blog

All good for me of course! Especially now that 52 is officially released!

Awesome -- thank you -- I updated the date to today and I plan to publish it shortly

@alamb
Copy link
Contributor

alamb commented Jan 12, 2026

I pushed two commits to fix issues I noticed when doing a final proofread

60af67a
Screenshot 2026-01-12 at 4 14 27 PM

c3e94a9
Screenshot 2026-01-12 at 4 15 51 PM

@alamb
Copy link
Contributor

alamb commented Jan 12, 2026

Ok, let's get this thing published!

@alamb alamb merged commit 6e852ae into apache:main Jan 12, 2026
@alamb
Copy link
Contributor

alamb commented Jan 12, 2026

The blog is live: https://datafusion.apache.org/blog/2026/01/12/extending-sql/ 🎉

@theirix
Copy link

theirix commented Jan 12, 2026

A great read and incredible work - thank you, @geoffreyclaude and reviewers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BLOG] Blog post about writing your own SQL dialect / extending SQL with DataFusion

4 participants