[Website]: Blog post about new Rust Parquet Metadata parser #711
|
Preview URL: https://alamb.github.io/arrow-site If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview. |
|
FYI @etseidl I started writing a blog post about this work. Right now it is a brain dump and isn't ready for review but I wanted to get it out of my mind |
I know this is just a rough draft, but I noticed a typo in one of the embedded images I thought might end up getting missed in review due to the size of the font. ❤️
etseidl
left a comment
Nice start! Thanks for doing this! ❤️
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
…ite into alamb/new_parquet_metadata
|
I'm unable to access the "preview" link; the bot comment suggests that @alamb needs to enable some setting on his fork? |
I tried to follow the directions, but it did not seem to work. I'll double check |
For some reason the publish-to-fork workflow was skipped: https://github.com/apache/arrow-site/actions/runs/18410526447/job/52461618507?pr=711
The workflow file that GitHub reports, https://github.com/apache/arrow-site/actions/runs/18410526447/workflow?pr=711, seems to be OK, but I am not a GitHub Actions expert:
name: Deploy on fork
if: >-
  github.event_name == 'push' &&
  github.repository != 'apache/arrow-site'
needs: build |
|
The preview on a fork is deployed by your fork's GitHub Actions, not by apache/arrow-site's GitHub Actions: https://github.com/alamb/arrow-site/actions/runs/18410524340 You need to do the following on https://github.com/alamb/arrow-site/settings/pages and https://github.com/alamb/arrow-site/settings/environments : https://github.com/apache/arrow-site/blob/main/README.md#forks
|
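A minimal sketch of why the job is skipped for the PR run but not for the fork run. It assumes the run inside apache/arrow-site was triggered by a `pull_request` event (the event name is an assumption; the repository names come from the URLs above). The helper `deploy_on_fork_runs` is hypothetical and only mirrors the quoted `if:` expression in Python; it is not part of GitHub Actions.

```python
def deploy_on_fork_runs(event_name: str, repository: str) -> bool:
    """Hypothetical mirror of the quoted `if:` condition of the 'Deploy on fork' job."""
    return event_name == "push" and repository != "apache/arrow-site"


# Run inside apache/arrow-site for the PR (event name assumed): condition is false.
skipped_case = deploy_on_fork_runs("pull_request", "apache/arrow-site")

# Push to the fork's branch: condition is true, so the job is eligible to run there.
fork_case = deploy_on_fork_runs("push", "alamb/arrow-site")

print(skipped_case, fork_case)
```

Under these assumptions the job can only run in the fork's own Actions, which matches the explanation above; whether it then succeeds still depends on the fork's Pages and environment settings.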
Thank you @kou. I had previously tried to follow those instructions and could not get it to work for some reason. Here is the content of https://github.com/alamb/arrow-site/settings/environments : |
|
For some reason the branch protection rule is preventing it: https://github.com/alamb/arrow-site/actions/runs/18410524340
I will look more carefully |
|
Update: I changed my branch protection rule to "no restrictions"
Probably not the best approach, but now the preview link does work: https://alamb.github.io/arrow-site/blog/2025/10/08/rust-parquet-metadata/ |
…ite into alamb/new_parquet_metadata
approach. Please see the [final PR] for details of the level of effort involved.

[final PR]: https://github.com/apache/arrow-rs/pull/8530
[Jörn Horstmann]: https://github.com/jhorstmann
FYI @jhorstmann you have a shout out in this blog -- please let me know if you would like any changes
Thanks, looks good! I actually plan to do a presentation on the macros at an internal Rust meetup and will then also update the README of the compact-thrift repository with more details. The details of how the macros work are probably out of scope for this blog post, but could be added to the arrow-rs code base later.
Nice -- note that @etseidl also wrote up a README here that is quite good:
https://github.com/apache/arrow-rs/blob/49d92fa163f61d677a971143c598ad9f020f8fec/parquet/THRIFT.md
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
…ite into alamb/new_parquet_metadata
|
@etseidl I re-ran the numbers with the latest code from the arrow-rs main branch. Your optimization work is paying off -- we are now 3x faster than 56.2 😮 🤓
|
I was just re-running your benchmark branch on my workstation and was composing a message to the same effect 😁 |
I can't tell you how much I am currently grinning. You have basically achieved what @XiangpengHao predicted 2 years ago (2x-4x speedup). |
|
Now we just need to implement the metadata index and parse the footer in parallel 🤣 |
2. It typically maps one-to-one with Thrift definitions, limiting
   additional optimizations such as zero-copy parsing, field
   skipping, and amortized memory allocation strategies.
3. Its API is very stable (hard to change), which is important for easy maintenance when a large number
It might be worth mentioning that arrow-rs already did some postprocessing on the generated code, and also included a custom implementation of the compact protocol api. That makes the step to a completely custom parser slightly smaller and less crazy :)
Added in 7621a3e
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
…ite into alamb/new_parquet_metadata
|
My plan here is to wait for the arrow 57 release to be published (ETA in about 2 days), then rerun the benchmarks with the final released version, and then publish this blog. If anyone else would like more time to review, please just leave comments |
|
BTW I re-ran the numbers with the final arrow 57 release. It is even better than before (thanks to @jhorstmann's late-arriving optimization 🚀 )
I am going to read this blog once more and then publish it |
- Closes apache/arrow-rs#8463

Preview URL: https://alamb.github.io/arrow-site/blog/2025/09/04/arrow-rs-57.0.0/

This release has a crazy amount of content so we should tell the world about it. Here are two related blogs:
- #712
- #711

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>






Part of the work to write a new metadata parser in Rust is to tell people about it:
Preview URL: