@alamb
Contributor

@alamb alamb commented Oct 2, 2025

@github-actions

github-actions bot commented Oct 2, 2025

Preview URL: https://alamb.github.io/arrow-site

If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
See https://github.com/apache/arrow-site/blob/main/README.md#forks for how to configure it.

@alamb
Contributor Author

alamb commented Oct 2, 2025

FYI @etseidl I started writing a blog post about this work. Right now it is a brain dump and isn't ready for review but I wanted to get it out of my mind

@cannonpalms cannonpalms left a comment

I know this is just a rough draft, but I noticed a typo in one of the embedded images that I thought might get missed in review due to the size of the font. ❤️

Contributor

@etseidl etseidl left a comment

Nice start! Thanks for doing this! ❤️

alamb and others added 5 commits October 10, 2025 09:30
@scovich

scovich commented Oct 10, 2025

I'm unable to access the "preview" link; the bot comment suggests that @alamb needs to enable some setting on his fork?

@alamb
Contributor Author

alamb commented Oct 10, 2025

> I'm unable to access the "preview" link; the bot comment suggests that @alamb needs to enable some setting on his fork?

I tried to follow the directions, but it did not seem to work. I'll double check

@alamb
Contributor Author

alamb commented Oct 10, 2025

> > I'm unable to access the "preview" link; the bot comment suggests that @alamb needs to enable some setting on his fork?
>
> I tried to follow the directions, but it did not seem to work. I'll double check

For some reason the publish to fork workflow was skipped: https://github.com/apache/arrow-site/actions/runs/18410526447/job/52461618507?pr=711

The workflow file that github reports https://github.com/apache/arrow-site/actions/runs/18410526447/workflow?pr=711 seems to be ok, but I am not a github actions expert

    name: Deploy on fork
    if: >-
      github.event_name == 'push' &&
      github.repository != 'apache/arrow-site'
    needs: build
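
One way to read the `if:` guard above, as a minimal Python sketch: the job runs only when both comparisons hold, that is, for `push` events in a repository other than `apache/arrow-site`. The function name `should_deploy_on_fork` is hypothetical, the `pull_request` event name in the first example is an assumption about how the linked run was triggered, and the sketch ignores that GitHub Actions compares strings case-insensitively.

```python
def should_deploy_on_fork(event_name: str, repository: str) -> bool:
    """Mirror the workflow guard: push events only, and never in the upstream repo."""
    return event_name == "push" and repository != "apache/arrow-site"


# A run triggered by a pull request against the upstream repository
# (the event name here is an assumption, not taken from the linked run).
print(should_deploy_on_fork("pull_request", "apache/arrow-site"))  # False

# A push to a branch on a fork.
print(should_deploy_on_fork("push", "alamb/arrow-site"))  # True
```

Under this reading, any run inside `apache/arrow-site` skips the job regardless of the event, so a preview can only come from a run in the fork's own Actions.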

@kou
Member

kou commented Oct 11, 2025

The preview on a fork is deployed by your fork's GitHub Actions, not apache/arrow-site's GitHub Actions: https://github.com/alamb/arrow-site/actions/runs/18410524340

You need to do the following on https://github.com/alamb/arrow-site/settings/pages and https://github.com/alamb/arrow-site/settings/environments:

https://github.com/apache/arrow-site/blob/main/README.md#forks

  1. Enable GitHub Pages on your fork:
    1. Open https://github.com/${YOUR_GITHUB_ACCOUNT}/arrow-site/settings/pages
    2. Select "GitHub Actions" as "Source"
  2. Accept publishing GitHub Pages from all branches on your fork:
    1. Open https://github.com/${YOUR_GITHUB_ACCOUNT}/arrow-site/settings/environments
    2. Select the "github-pages" environment
    3. Change the default "Deployment branches and tags" rule:
      1. Press the "Edit" button
      2. Change the "Name pattern" to `*` from `main` or `gh-pages`

@alamb
Contributor Author

alamb commented Oct 14, 2025

> The preview on fork is deployed on your fork's GitHub Actions not apache/arrow-site's GitHub Actions: https://github.com/alamb/arrow-site/actions/runs/18410524340

Thank you @kou

I had previously tried to follow those instructions and could not get it to work for some reason

Here is the content of
https://github.com/alamb/arrow-site/settings/pages
Screenshot 2025-10-14 at 12 31 14 PM

https://github.com/alamb/arrow-site/settings/environments -->
https://github.com/alamb/arrow-site/settings/environments/9063572923/edit
Screenshot 2025-10-14 at 12 32 50 PM

@alamb
Contributor Author

alamb commented Oct 14, 2025

For some reason the branch protection rule is preventing it: https://github.com/alamb/arrow-site/actions/runs/18410524340

Screenshot 2025-10-14 at 12 36 08 PM

I will look more carefully

@alamb
Contributor Author

alamb commented Oct 14, 2025

Update: I changed my branch protection rule to "no restrictions"

Screenshot 2025-10-14 at 1 05 37 PM

Probably not the best approach, but now the preview link does work: https://alamb.github.io/arrow-site/blog/2025/10/08/rust-parquet-metadata/

approach. Please see the [final PR] for details of the level of effort involved.

[final PR]: https://github.com/apache/arrow-rs/pull/8530
[Jörn Horstmann]: https://github.com/jhorstmann
Contributor Author

FYI @jhorstmann you have a shout out in this blog -- please let me know if you would like any changes

Thanks, looks good! I actually plan to do a presentation on the macros at an internal Rust meetup and will then also update the README of the compact-thrift repository with more details. The details of how the macros work are probably out of scope for this blog post, but could be added to the arrow-rs code base later.

alamb and others added 3 commits October 15, 2025 13:27
@alamb
Contributor Author

alamb commented Oct 15, 2025

@etseidl I re-ran the numbers with the latest code from the arrow-rs main branch. Your optimization work is paying off -- we are now 3x faster than 56.2 😮 🤓

results

@etseidl
Contributor

etseidl commented Oct 15, 2025

> @etseidl I re-ran the numbers with the latest code from the arrow-rs main branch. Your optimization work is paying off -- we are now 3x faster than 56.2 😮 🤓

I was just re-running your benchmark branch on my workstation and was composing a message to the same effect 😁

@alamb
Contributor Author

alamb commented Oct 15, 2025

> > @etseidl I re-ran the numbers with the latest code from the arrow-rs main branch. Your optimization work is paying off -- we are now 3x faster than 56.2 😮 🤓
>
> I was just re-running your benchmark branch on my workstation and was composing a message to the same effect 😁

I can't tell you how much I am currently grinning. You have basically achieved what @XiangpengHao predicted 2 years ago (2x-4x speedup).

@etseidl
Contributor

etseidl commented Oct 15, 2025

Now we just need to implement the metadata index and parse the footer in parallel 🤣

2. It typically maps one-to-one with Thrift definitions, limiting
additional optimizations such as zero-copy parsing, field
skipping, and amortized memory allocation strategies.
3. Its API is very stable (hard to change), which is important for easy maintenance when a large number

It might be worth mentioning that arrow-rs already did some postprocessing on the generated code, and also included a custom implementation of the compact protocol api. That makes the step to a completely custom parser slightly smaller and less crazy :)

Contributor Author

Added in 7621a3e

@alamb
Contributor Author

alamb commented Oct 20, 2025

My plan here is to wait for the arrow 57 release to be published (eta in about 2 days), and then rerun the benchmarks again with the final released version and then publish this blog

If anyone else would like more time to review, please just leave comments

@alamb
Contributor Author

alamb commented Oct 23, 2025

BTW I re-ran the numbers with the final arrow 57 release

It is even better than before (thanks to @jhorstmann's late-arriving optimization 🚀)

results

I am going to read this blog once more and then publish it

@alamb alamb merged commit 5367c4d into apache:main Oct 23, 2025
3 checks passed
@alamb
Contributor Author

alamb commented Oct 23, 2025

Blog is live: https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/

@alamb alamb deleted the alamb/new_parquet_metadata branch October 23, 2025 18:26
alamb added a commit that referenced this pull request Oct 30, 2025
- Closes  apache/arrow-rs#8463

Preview URL:
https://alamb.github.io/arrow-site/blog/2025/09/04/arrow-rs-57.0.0/

This release has a crazy amount of content so we should tell the world
about it. Here are two related blogs:
- #712
- #711

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Blog post about new rust Metadata Parser

6 participants