Conversation

@alamb
Contributor

@alamb alamb commented Aug 8, 2025

This is my attempt at technical evangelism / explanation about when one would use external indexes and how to do so with DataFusion

Rendered Preview: https://datafusion.staged.apache.org/blog/2025/08/15/external-parquet-indexes/

@alamb
Contributor Author

alamb commented Aug 8, 2025

FYI @XiangpengHao @zhuqi-lucas and @JigaoLuo as you may be interested in this content

@alamb
Contributor Author

alamb commented Aug 8, 2025

FYI @nuno-faria @shehabgamin @jonathanc-n @zhuqi-lucas and @etseidl as you are mentioned in the blog post

@alamb
Contributor Author

alamb commented Aug 8, 2025

This PR is now ready for review

@shehabgamin

Really solid blog post!

Contributor

@zhuqi-lucas zhuqi-lucas left a comment

Thank you @alamb, great work, LGTM! I left a few minor comments about typos.

Contributor

@nuno-faria nuno-faria left a comment

@alamb nice post and thanks for the shout-out! I've left some minor suggestions below.

@nuno-faria
Contributor

Unrelated to this blog post, but I feel the content should have a max width so it becomes easier to read in larger displays. For example, this is how it looks by default:
image

While this is how it looks with a 1000px max width:
image


[Clickhouse MergeTree]: https://clickhouse.com/docs/engines/table-engines/mergetree-family/mergetree
[Clickhouse indexing strategy]: https://clickhouse.com/docs/guides/best-practices/sparse-primary-indexes#clickhouse-index-design
[Parquet Format]: https://parquet.apache.org/documentation/latest/
Contributor

@JigaoLuo JigaoLuo Aug 9, 2025

To https://parquet.apache.org/documentation/latest/ :
This link doesn’t seem to work on my end—though I’m not sure if it’s an issue on my side.

Do we need this page instead? https://parquet.apache.org/docs/

Contributor Author

Yes, thank you -- this appears to be a link hallucinated by Copilot. Fixed.

[ClickBench]: https://clickbench.com/
[companion video]: https://www.youtube.com/watch?v=74YsJT1-Rdk

# Apache Parquet Overview
Contributor

@JigaoLuo JigaoLuo Aug 9, 2025

To Section "Apache Parquet Overview":

This might just be a reflection of my reading habits, but I feel the Parquet overview as background just appears in the middle of the blog.

Since we've already covered a lot about Parquet in previous sections, perhaps we can skip the “Parquet 101” section and focus solely on showcasing the pushdown?

Contributor Author

@alamb alamb Aug 12, 2025

I agree for many readers the Parquet background may not be necessary

However, I would like to leave it in this blog post so that the post is as self-contained as possible -- I would like the content to be approachable by anyone with an interest in learning how this filtering / these pushdowns work, not just those who already know they want to build an external index for Parquet-based systems.

Hopefully this makes sense. Do you think adding a footnote explaining the rationale would help?

Contributor

Yes, that makes sense to me. Keeping it self-contained is a good idea, and a footnote is also helpful to clarify the intent without breaking the flow. 🚀

@JigaoLuo
Contributor

JigaoLuo commented Aug 9, 2025

Nice blog @alamb. Thanks for having me here! I’ve done my first pass, and I think the topic is great. I’ve left a few comments in the review.

One note on the structure, and it might be worth discussing here as well:

  • I found the titles of the subtopics we’re covering to be quite clear, but the number of top-level sections (# in Markdown) seems to exceed the actual number of distinct subtopics.
  • That’s also why I think the “Apache Parquet Overview” section felt a bit out of place—it appears suddenly in the middle of the blog.
    • I definitely think including background on Parquet is important. What I meant was that we could consolidate all the Parquet-related background content into a single top-level section, rather than having it scattered throughout.
  • It’s possible I misunderstood the intended structure, so feel free to clarify if that’s the case.

@Omega359
Contributor

Omega359 commented Aug 9, 2025

So glad to see this blog post coming together, nice work @alamb 🚀

Contributor

@adamreeve adamreeve left a comment

Nice blog post, thanks Andrew. I've left some minor suggestions.

Contributor

@comphead comphead left a comment

Thanks @alamb
Solid, easy to read, fundamental and well structured

@alamb
Contributor Author

alamb commented Aug 12, 2025

I am starting to work through the comments left on this PR

@alamb
Contributor Author

alamb commented Aug 12, 2025

Unrelated to this blog post, but I feel the content should have a max width so it becomes easier to read in larger displays. For example, this is how it looks by default:

Filed a ticket to address that

@alamb
Contributor Author

alamb commented Aug 12, 2025

@JigaoLuo

Nice blog @alamb. Thanks for having me here! I’ve done my first pass, and I think the topic is great. I’ve left a few comments in the review.

❤️ thank you for taking the time to provide feedback

One note on the structure, and it might be worth discussing here as well:

  • I found the titles of the subtopics we’re covering to be quite clear, but the number of top-level sections (# in Markdown) seems to exceed the actual number of distinct subtopics.

@nuno-faria mentioned this too and I have demoted all sections one level. Hopefully that is clearer.

  • That’s also why I think the “Apache Parquet Overview” section felt a bit out of place—it appears suddenly in the middle of the blog.
    • I definitely think including background on Parquet is important. What I meant was that we could consolidate all the Parquet-related background content into a single top-level section, rather than having it scattered throughout.
  • It’s possible I misunderstood the intended structure, so feel free to clarify if that’s the case.

In my mind there is a balance between:

  1. Showing / demonstrating how to use external indexes for Parquet using DataFusion
  2. Explaining the general concept of external indexes / hierarchical pruning

I believe the post will be more widely read if it is about more than just Parquet and DataFusion, and that the background content will make it easier for people to realize this is a technique they can use.

So I guess I would say the structure is deliberate, but I can see how it may not be obvious

Let me know if that makes sense

@alamb
Contributor Author

alamb commented Aug 12, 2025

Thank you all for your comments. I think I have addressed all of them. I plan to address any further comments and publish this blog post on Friday, Aug 15.

Thanks again -- the feedback on this draft post was really helpful

@alamb alamb merged commit b23bd7a into main Aug 15, 2025
1 check passed
@alamb alamb deleted the site/external_indexes branch August 15, 2025 17:58
Development

Successfully merging this pull request may close these issues.

Blog post about using external indexes with Parquet

8 participants