march: fix deadlock in --no-traverse when all transfers skipped#8647
Closed
jeremy wants to merge 1 commit into
Closed
march: fix deadlock in --no-traverse when all transfers skipped#8647jeremy wants to merge 1 commit into
jeremy wants to merge 1 commit into
Conversation
When every source object already exists at the destination and --no-traverse is set, march deadlocked after 100 entries because dstChan lagged srcChan. Buffer all sources and destinations, then replay them through balanced channels. Introduced in v1.70.0 with callback-based syncing.
Member
|
Thanks for finding this problem. Yes this code is all new to help with directories with 100,000,000 entries where we sort on disk. Your code appears to revert to an in memory sort which is exactly what we are trying to avoid so I think we are going to need a different fix. |
Closed
5 tasks
Contributor
Author
Member
|
I have taken your excellent writeup and test routine and created an issue from them here #8656 I'll close this PR since we both think it is the wrong approach. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What is the purpose of this change?
When every source object already exists at the destination and
--no-traverseis set, march stalled/deadlocked after 100 entries because dstChan lagged srcChan. Introduced in v1.70.0 with callback-based syncing.Work-around by buffering all sources and destinations then replaying through balanced channels.
This fix is naive and regresses on the intent of the new algorithm, but gets things working again. There's probably a more sophisticated approach that preserves streaming and global ordering without stalls, e.g. using an intermediary goroutine to track latest "frontier" key for each worker, then only emitting keys to dstChan that are less than the smallest of all frontier keys.
I couldn't get a test case working reliably, but here's a script to reproduce:
Results on 1.69.3, 1.70.2, and with this patch:
Using rclone: rclone Version: rclone v1.69.3 Creating 200 test files... Initial copy to destination... Testing --no-traverse with all files existing (10s timeout)... Transferred: 0 B / 0 B, -, 0 B/s, ETA - Checks: 200 / 200, 100% Elapsed time: 0.0s ✅ PASS: Command completed successfullyUsing rclone: rclone Version: rclone v1.70.2 Creating 200 test files... Initial copy to destination... Testing --no-traverse with all files existing (10s timeout)... Transferred: 0 B / 0 B, -, 0 B/s, ETA - Checks: 0 / 0, -, Listed 200 Elapsed time: 9.5s ❌ FAIL: Command timed out (deadlock detected)Using rclone: ./rclone Version: rclone v1.71.0-DEV Creating 200 test files... Initial copy to destination... Testing --no-traverse with all files existing (10s timeout)... Transferred: 0 B / 0 B, -, 0 B/s, ETA - Checks: 200 / 200, 100%, Listed 200 Elapsed time: 0.0s ✅ PASS: Command completed successfullyWas the change discussed in an issue or in the forum before?
Checklist