@tw4l tw4l commented Sep 23, 2025

Fixes #2648

Replaces #2805

This PR introduces a preferSingleWACZ query parameter to the /all-crawls/<crawl_id>/download and /crawls/<crawl_id>/download endpoints. When set to true, these endpoints will only create multi-WACZs when a crawl has more than one WACZ file, and otherwise will stream the original crawl WACZ.
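As a rough illustration of the behavior described above, the following sketch shows how a client might build the download URL and how the single-versus-multi decision could be expressed. It is not the code from this PR: `build_download_url`, `choose_download_mode`, and the `base_url` argument are hypothetical names, and the handling of crawls with zero WACZ files is an assumption.

```python
from urllib.parse import urlencode


def build_download_url(base_url, crawl_id, prefer_single_wacz=False, all_crawls=False):
    """Build a download URL for a crawl (hypothetical helper).

    The /crawls/ and /all-crawls/ paths and the preferSingleWACZ parameter
    come from the PR description; base_url is an assumption.
    """
    collection = "all-crawls" if all_crawls else "crawls"
    url = f"{base_url.rstrip('/')}/{collection}/{crawl_id}/download"
    if prefer_single_wacz:
        # Only send the flag when opting in, since it is off by default.
        url += "?" + urlencode({"preferSingleWACZ": "true"})
    return url


def choose_download_mode(wacz_files, prefer_single_wacz=False):
    """Return "single" to stream the original WACZ, else "multi".

    Assumption: anything other than exactly one file (including none)
    falls back to the existing multi-WACZ behavior.
    """
    if prefer_single_wacz and len(wacz_files) == 1:
        return "single"
    return "multi"
```

Without the flag, a crawl with a single WACZ file is still repackaged as a multi-WACZ, which matches the note that existing API behavior is unchanged by default.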

This flag is not enabled by default to prevent introducing breaking changes to the API, but the frontend is updated to use it in all places where it seemed appropriate.

A new backend test is also added to account for the change.

Comments and suggestions on other ways to implement this behavior are very welcome!

For crawls and all-crawls endpoints, this commit adds an optional
preferSingleWACZ query param which will download an archived item
as a single WACZ file when possible instead of repackaging single
WACZs into multiWACZs.
@tw4l tw4l requested review from SuaYoo and ikreymer September 23, 2025 20:03
@tw4l tw4l changed the title from "Downlaod archived items as single WACZ file when possible" to "Download archived items as single WACZ file when possible via new download endpoint query parameter" Sep 23, 2025
Download Files -> Download All
update help text to indicate all files downloaded as a single WACZ
@ikreymer ikreymer left a comment

Looks good! Made some tweaks to the button and label, e.g. 'Download All' instead of 'Download Files', and changed the label to 'Download all as a single WACZ' so that it covers all cases.

ikreymer commented Oct 1, 2025

This works well, but I wonder if it would be worth implementing this as a redirect to the S3 endpoint instead.
I think we might also be able to do a custom presign to override the Content-Disposition.
That would save having to stream content that's already available.
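A minimal sketch of the redirect idea, assuming an S3 client whose presign method follows the boto3 `generate_presigned_url(ClientMethod, Params, ExpiresIn)` convention. The helper name `build_download_redirect`, the injected `presign` callable, and the 307 status are illustrative assumptions, not part of this PR.

```python
def build_download_redirect(presign, bucket, key, filename, expires_in=3600):
    """Return (status, headers) redirecting the client to a presigned S3 URL.

    `presign` is expected to behave like boto3's
    S3 client.generate_presigned_url; it is passed in so the sketch does
    not depend on a particular S3 library (an assumption for illustration).
    """
    params = {
        "Bucket": bucket,
        "Key": key,
        # Ask S3 to return this Content-Disposition header with the object,
        # so the browser saves the file under the desired name.
        "ResponseContentDisposition": f'attachment; filename="{filename}"',
    }
    url = presign("get_object", Params=params, ExpiresIn=expires_in)
    # A temporary redirect hands the transfer to S3 rather than
    # streaming the file through the API server.
    return 307, {"Location": url}
```

With boto3, the callable could be `boto3.client("s3").generate_presigned_url`. Because the override is part of the signed request, the filename cannot be changed by the client after the URL is issued.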

ikreymer commented Oct 1, 2025

> This works well, but wonder if it would be worth it to just implement this as a redirect to the S3 endpoint instead? I think we might be able to do a custom presign to override the content-disposition also. It would save having to stream the content that's already available.

Since a redirect would have the same API, we can revisit it as a future optimization and merge this for now in the interest of solving the issue.

@ikreymer ikreymer merged commit c508d48 into main Oct 1, 2025
29 checks passed
@ikreymer ikreymer deleted the issue-2648-single-wacz branch October 1, 2025 00:15
Successfully merging this pull request may close these issues.

[Change]: If only one WACZ file is available to download from archived items, provide a regular WACZ
