fix: validate and buffer downloaded assets before S3 upload during migrations #8865
No reviewers
Labels
No labels
arch
riscv64
backport/v1.19
backport/v1.20
backport/v1.21/forgejo
backport/v10.0/forgejo
backport/v11.0/forgejo
backport/v12.0/forgejo
backport/v13.0/forgejo
backport/v14.0/forgejo
backport/v15.0/forgejo
backport/v7.0/forgejo
backport/v8.0/forgejo
backport/v9.0/forgejo
breaking
bug
bug
confirmed
bug
duplicate
bug
needs-more-info
bug
new-report
bug
reported-upstream
code/actions
code/api
code/auth
code/auth/faidp
code/auth/farp
code/email
code/federation
code/git
code/migrations
code/packages
code/wiki
database
MySQL
database
PostgreSQL
database
SQLite
dependency-upgrade
dependency
Chi
dependency
Chroma
dependency
F3
dependency
ForgeFed
dependency
garage
dependency
Gitea
dependency
Golang
Discussion
duplicate
enhancement/feature
forgejo/accessibility
forgejo/branding
forgejo/ci
forgejo/commit-graph
forgejo/documentation
forgejo/furnace cleanup
forgejo/i18n
forgejo/interop
forgejo/moderation
forgejo/privacy
forgejo/release
forgejo/scaling
forgejo/security
forgejo/ui
Gain
High
Gain
Nice to have
Gain
Undefined
Gain
Very High
good first issue
i18n/backport-stable
impact
large
impact
medium
impact
small
impact
unknown
Incompatible license
issue
closed
issue
do-not-exist-yet
issue
open
manual test
Manually tested during feature freeze
OS
FreeBSD
OS
Linux
OS
macOS
OS
Windows
problem
QA
regression
release blocker
Release Cycle
Feature Freeze
release-blocker
v7.0
release-blocker
v7.0.1
release-blocker
v7.0.2
release-blocker
v7.0.3
release-blocker
v7.0.4
release-blocker
v8.0.0
release-blocker/v9.0.0
run-all-playwright-tests
run-end-to-end-tests
stage
2-research
stage
3-design
stage
4-implementation
test
manual
test
needed
test
needs-help
test
not-needed
test
present
untested
User research - time-tracker
valuable code
worth a release-note
User research - Accessibility
User research - Blocked
User research - Community
User research - Config (instance)
User research - Errors
User research - Filters
User research - Future backlog
User research - Git workflow
User research - Labels
User research - Moderation
User research - Needs input
User research - Notifications/Dashboard
User research - Rendering
User research - Repo creation
User research - Repo units
User research - Security
User research - Settings (in-app)
No milestone
No project
No assignees
3 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo/forgejo!8865
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "pat-s/forgejo:fix/migrate-buffer-s3"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
This fixes an issue when migrating a repo (with release assets) into a FJ instance with an S3 backend for asset storage.
It ensures that the download fully completes and is validated before it is pushed to S3, preventing discrepancies between the expected content length and the actual body.
Without this fix, the following error occurs during a migration:
Before: GitHub Download → S3 Upload (while streaming)
After: GitHub → [Download Complete] → [Buffer & Validate] → S3 Storage
I verified that this patch prevents the error reported above.
(should likely be backported)
Checklist
The contributor guide contains information that will be helpful to first time contributors. There also are a few conditions for merging Pull Requests in Forgejo repositories. You are also welcome to join the Forgejo development chatroom.
Tests
*_test.gofor unit tests.tests/integrationdirectory if it involves interactions with a live Forgejo server.web_src/js/*.test.jsif it can be unit tested.tests/e2e/*.test.e2e.jsif it requires interactions with a live Forgejo server (see also the developer guide for JavaScript testing).Documentation
Release notes
release-notes/<pull request number>.mdto be be used for the release notes instead of the title.Could you explain the root cause of this fix? I'm not seeing it. I do get that the asset is read in memory before being sent, but I don't get why it would fail if it is not? Why is there a size difference between the advertised content length and the size of the body?
The downside of this fix is that large assets will significantly more memory when migrating. I'm not saying it is a blocker in case this is the only way to fix the bug.
I don't know why that is, I only see that there is a difference.
The process flow I posted in OP is more or less just a guess based on the result (i.e. a premature write request before the actual download is fully complete and/or buffered, causing body length != content length for the subsequent upload request).
The issue should be reproducible with any instance that uses S3 for asset storage and also unlikely to be linked to the size of release assets. At first I thought it might be related to reverse proxy settings, but I was able to out rule it after some tests. Then I checked for the above and it turned out to be the culprit.
And it might be slower overall since there is more waiting time for the new process. But I don't mind if the alternative is that the migration would error out.
Happy to take a different approach if you have one, but I generally also think this fine overall.
You won't be able to reproduce it on Codeberg as S3 isn't used for the asset storage, but if you have another instance with S3, it might be worth giving it a try.
(I fell over this yesterday attempting to migrate one of my repos from GH).
I'm surprised that was never reported but it is good you can reproduce it. Can you add a reproducer in the integration tests?
t.Skip("Test skipped for non-Minio-storage.")is one place you can check see how to guard a Minio specific test.Me too but I guess most instances don't use S3 for asset storage (?). Maybe there's also another player in all of this which I am not seeing yet. But given that the patch here makes it work/fail reliably, it looks to me as if this is the main culprit.
I'd appreciate if one with an s3 asset backend could also double-verify this issue.
(Maybe all of this is also somehow related to the S3/minio closer debate a few weeks ago.)
@pat-s wrote in #8865 (comment):
That's precisely why it would be most valuable to have a reproducer.
@pat-s wrote in #8865 (comment):
Let's wait a few days for someone to manifest themselves.
@ -0,0 +23,4 @@return true}func TestValidateAndBufferAsset(t *testing.T) {Does it fail without the change?
I hope the new test can serve as a proof.
@ -0,0 +34,4 @@return n, nil}func isS3Storage() bool {it returns true even if not S3
I ran the tests locally with the change reverted and it passes. Marking a
request changesuntil there is either a reproducer or report from other users to pursue this further.Confused about this. How can you run the new test with changes reverted? It only works with the new added function.
Also I don't think that no-test failures in current HEAD are a proof that everything is OK as the test might just not catch the effective real world behavior. As this is about race conditions between download/upload, this is also not straightforward to replicate imo.
Are you fully sure the local minio test setup reliably mimics a S3 upload scenario?
Anyhow, I am out of time and ideas for this one, also with respect to proofing this in tests. The test I added proofs that the new logic works per se.
All I can say that is I can reliably reproduce it on a v12 instance that uses S3 for asset storage.
As I am building from source anyhow, I can also live with my local fix for the time being.
#8906 is what I mean by reverting the change.
It does, there is no doubt.
I want to see that as well. Since the integration tests you added work with or without the fix, there is something else going on.
What can I do to see that problem for myself?
If you have a toy instance and access to S3-like storage, configure it to use it and run a migration on a repo which contains release assets (as these would be then pushed to the S3 bucket during the migration).
It works for me.
Migrated https://code.forgejo.org/forgejo/release-notes-assistant (with "Releases" box checked). It completed and I get:
@pat-s wrote in #8865 (comment):
I appreciate that your time is precious. My time as a reviewer is also precious.
When you file a bug with no way to reproduce it while claiming it can be reproduced, I trust your word and I spend time figuring out what is wrong. I will not doubt that the test your wrote does not reproduce the problem, I will first look for the problem. Only when I don't figure it out will I revert the change to pinpoint exactly where the tests fails and get a better understanding of the problem. And when the test still passes, demonstrating that, after all, it is not a reproducer, the time I spent looking was wasted. It is the same when you ask me to spin a Forgejo instance and do a migration, claiming this is a reliable reproducer. When I do that and it works for me, that is also time I wasted.
When you provide test that is reproducer, it saves you and I valuable time.
As a conclusion, I suggest closing this pull request and opening an issue with "need more information". It can sit there as long as it need for someone else to run into the same problem that you have. It will provide valuable insight as to which environmental conditions are responsible for this behavior.
Could you please do that?
@earl-warren
I apologize overall as I was like 99% sure that this is reproducible, given I tried to outrule the influence of the reverse proxy or other local env-related matters.
Turns out it wasn't (given your tests, which I highly appreciate).
I understand that such cases are never easy and yes, I couldn't verify it 100% and was just "very sure", which is not enough.
I am still at loss for the actual reason, as I checked on the whole topic ~ 3-4h before even opening this PR, outruling many other parts before even attempting to patch the source, which then eventually worked.
I understand that, yet I hope you wouldn't have to do everything from scratch but just make a "quick edit" to the app.ini and referencing a S3 bucket there. But yes, I also could have done that with another instance of mine. Again, apologies. The best outcome would have been if anyone else with an existing S3 could have tried a migration call and reported this here.
Anyhow, I am still at loss which component in the overall process flow would be the culprit here. I don't see another one besides the reverse proxy taking major action in it, maybe I did a bad job outruling its influence.
One (minor) issue I am facing now is: even though the issue is very likely env related, the fix proposed here would actually fix it (and also not break other existing functionality), so it could also be included, also because it has tests. It seems to be a more conservative approach to the process overall then the current implementation. Yes, I know that such action is potential dangerous as always something could break and that the patch proposed might also result in worse performance for this task, but yeah - the current implementation is at least more vulnerable to env-related settings which give admins a hard time to even relate to this issue in the first place (including potential frustration like "this software is not ready yet") (not saying that this would be my take on it, ofc).
Until I've figured out (if this will ever happen), which component might be causing this (e.g. a size-related constrain on the proxy), I am bound to carry this manual patch along with me in my instances, which is somewhat of a burden :/
A Message from an account created to circumvent a ban from Forgejo spaces were deleted, please get in touch with the moderation team for more information or checkout the moderation process.
Pull request closed