[BLOG] Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet #99
…es on Apache Parquet
FYI @XiangpengHao @zhuqi-lucas and @JigaoLuo as you may be interested in this content

FYI @nuno-faria @shehabgamin @jonathanc-n @zhuqi-lucas and @etseidl as you are mentioned in the blog post

This PR is now ready for review

Really solid blog post!
zhuqi-lucas
left a comment
Thank you @alamb, great work, LGTM! Left minor comments about the typo.
nuno-faria
left a comment
@alamb nice post and thanks for the shout-out! I've left some minor suggestions below.
[Clickhouse MergeTree]: https://clickhouse.com/docs/engines/table-engines/mergetree-family/mergetree
[Clickhouse indexing strategy]: https://clickhouse.com/docs/guides/best-practices/sparse-primary-indexes#clickhouse-index-design
[Parquet Format]: https://parquet.apache.org/documentation/latest/
To https://parquet.apache.org/documentation/latest/ :
This link doesn’t seem to work on my end—though I’m not sure if it’s an issue on my side.
Do we need this page instead? https://parquet.apache.org/docs/
Yes, thank you -- this appears to be a link hallucinated by Copilot. Fixed.
[ClickBench]: https://clickbench.com/
[companion video]: https://www.youtube.com/watch?v=74YsJT1-Rdk

# Apache Parquet Overview
To Section "Apache Parquet Overview":
This might just be a reflection of my reading habits, but I feel the Parquet overview as background just appears in the middle of the blog.
Since we've already covered a lot about Parquet in previous sections, perhaps we can skip the “Parquet 101” section and focus solely on showcasing the pushdown?
I agree for many readers the Parquet background may not be necessary
However, I would like to leave it in this blog post so that the post is as self contained as possible -- I would like the content to be approachable by anyone with an interest in learning about how these filtering / pushdowns work, not just those who already know they want to build an external index for Parquet based systems
Hopefully this makes sense. Do you think adding a footnote explaining the rationale would help?
Yes, that makes sense to me. Keeping it self-contained is a good idea, and a footnote is also helpful to clarify the intent without breaking the flow. 🚀
Nice blog @alamb. Thanks for having me here! I’ve done my first pass, and I think the topic is great. I’ve left a few comments in the review. One note on the structure, and it might be worth discussing here as well:

So glad to see this blog post coming together, nice work @alamb 🚀
adamreeve
left a comment
Nice blog post, thanks Andrew. I've left some minor suggestions.
comphead
left a comment
Thanks @alamb
Solid, easy to read, fundamental and well structured
I am starting to work through the comments left on this PR

Filed a ticket to address that
…site into site/external_indexes
Co-authored-by: Adam Reeve <adreeve@gmail.com>
…site into site/external_indexes
❤️ thank you for taking the time to provide feedback
@nuno-faria mentioned this too and I have demoted all sections one level. Hopefully that is clearer.
In my mind there is a balance between:
I believe the post will be more widely read if it is about more than just Parquet and DataFusion, and that with the background content it will be easier for people to realize this is a technique they can use. So I guess I would say the structure is deliberate, but I can see how it may not be obvious. Let me know if that makes sense.
… indexes comparing to file level indexes in terms of IO
Thank you all for your comments. I think I have addressed all of them, and I plan to address any further comments and publish this blog on Friday, Aug 15. Thanks again -- the feedback on this draft post was really helpful.


This is my attempt at technical evangelism / explanation about when one would use external indexes and how to do so with DataFusion.
Rendered Preview: https://datafusion.staged.apache.org/blog/2025/08/15/external-parquet-indexes/