Skip to content

march: fix deadlock in --no-traverse when all transfers skipped using sequence numbering#8648

Closed
jeremy wants to merge 1 commit into
rclone:masterfrom
basecamp:no-traverse-seqnum
Closed

march: fix deadlock in --no-traverse when all transfers skipped using sequence numbering#8648
jeremy wants to merge 1 commit into
rclone:masterfrom
basecamp:no-traverse-seqnum

Conversation

@jeremy

@jeremy jeremy commented Jun 28, 2025

Copy link
Copy Markdown
Contributor

What is the purpose of this change?

When every source object already exists at the destination and --no-traverse is set, march deadlocked after 100 entries because dstChan lagged srcChan. Buffer all sources and destinations, then replay them through balanced channels. Introduced in v1.70.0 with callback-based syncing.

See #8647 for the revert-to-old-behavior fix and repro script. This PR attempts a fix that retains streaming behavior.

  • Replaces slice-accumulating sorter with O(checkers) streaming
  • Each incoming src entry gets a monotonic sequence number; a map buffers out-of-order destination HEAD results so dstChan is replayed in sequence order
  • Directories get a zero-cost placeholder to guarantee 1:1 src/dst event flow without blocking

Using repro script from #8647:

Using rclone: ./rclone
Version: rclone v1.71.0-DEV

Creating 200 test files...
Initial copy to destination...
Testing --no-traverse with all files existing (10s timeout)...
Transferred:   	          0 B / 0 B, -, 0 B/s, ETA -
Checks:               200 / 200, 100%, Listed 200
Elapsed time:         0.0s

✅ PASS: Command completed successfully

Was the change discussed in an issue or in the forum before?

Checklist

  • I have read the contribution guidelines.
  • I have added tests for all changes in this PR if appropriate.
  • I have added documentation for the changes if appropriate.
  • All commit messages are in house style.
  • I'm done, this Pull Request is ready for review :-)

When every source object already exists at the destination and
--no-traverse is set, march deadlocked after 100 entries because
dstChan lagged srcChan. Buffer all sources and destinations,
then replay them through balanced channels.

Introduced in v1.70.0 with callback-based syncing.

* Replaces slice-accumulating sorter with O(checkers) streaming
* Each incoming src entry gets a monotonic sequence number; a map
  buffers out-of-order destination HEAD results so dstChan is
  replayed in sequence order
* Directories get a zero-cost placeholder to guarantee 1:1 src/dst
  event flow without blocking

@ncw ncw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a correct way of doing the fix, but it looks a lot more complicated than it needs to be.

I also have a small concern about the seq overflowing. It should be int64 as more than 4 billion files in a directory is not unheard of on s3!

Anyway I decided to see if I could come up with a simpler approach.

This is what I came up with - I would appreciate your feedback. The basic idea is to keep two channels, one for the workers and one for the output but instead of passing objects about, pass a chan to write/read the object. This then allows the workers to run concurrently and keep the output in order. I've explained that really badly - hopefully you'll find the code clearer!

f41ac33

v1.71.0-beta.8831.f41ac3342.fix-8656-march-deadlock on branch fix-8656-march-deadlock (uploaded in 15-30 mins)

@jeremy

jeremy commented Jul 2, 2025

Copy link
Copy Markdown
Contributor Author

Nice approach! Dig passing the "slot" channel down the pipeline so we get ordering for free, by construction. No notes.

@jeremy jeremy closed this Jul 2, 2025
@jeremy jeremy deleted the no-traverse-seqnum branch July 2, 2025 20:41
@ncw

ncw commented Jul 3, 2025

Copy link
Copy Markdown
Member

Nice approach! Dig passing the "slot" channel down the pipeline so we get ordering for free, by construction. No notes.

Thanks for the review @jeremy . Sorry about not merging your code - I started of noodling with the chan of chans idea to give you feedback on a simpler implementation but I couldn't get it to work and by the time I did get it to work I had implemented it completely!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants