Skip to content

march: fix deadlock in --no-traverse when all transfers skipped#8647

Closed
jeremy wants to merge 1 commit into
rclone:masterfrom
basecamp:fix-no-traverse-deadlock
Closed

march: fix deadlock in --no-traverse when all transfers skipped#8647
jeremy wants to merge 1 commit into
rclone:masterfrom
basecamp:fix-no-traverse-deadlock

Conversation

@jeremy

@jeremy jeremy commented Jun 28, 2025

Copy link
Copy Markdown
Contributor

What is the purpose of this change?

When every source object already exists at the destination and --no-traverse is set, march stalled/deadlocked after 100 entries because dstChan lagged srcChan. Introduced in v1.70.0 with callback-based syncing.

Work-around by buffering all sources and destinations then replaying through balanced channels.

This fix is naive and regresses on the intent of the new algorithm, but gets things working again. There's probably a more sophisticated approach that preserves streaming and global ordering without stalls, e.g. using an intermediary goroutine to track latest "frontier" key for each worker, then only emitting keys to dstChan that are less than the smallest of all frontier keys.

I couldn't get a test case working reliably, but here's a script to reproduce:

#!/usr/bin/env bash
# Reproduces rclone v1.70+ deadlock with --no-traverse when all files exist
# Usage: ./repro.sh [rclone-binary-path]

# Use provided rclone binary or default to system rclone
RCLONE="${1:-rclone}"

echo "Using rclone: $RCLONE"
echo "Version: $($RCLONE version | head -1)"
echo

# Create test directories
src=$(mktemp -d)
dst=$(mktemp -d)

# Create 200 test files (enough to trigger the ~100 file deadlock)
echo "Creating 200 test files..."
for i in {1..200}; do
    echo "content$i" > "$src/file$i.txt"
done

# Initial copy to establish destination
echo "Initial copy to destination..."
$RCLONE copy "$src/" "$dst/" --quiet

# Create file list
filelist=$(mktemp)
ls "$src" > "$filelist"

# This command deadlocks in v1.70+ when all files exist at destination
echo "Testing --no-traverse with all files existing (10s timeout)..."
timeout 10s $RCLONE copy "$src/" "$dst/" \
    --no-traverse \
    --files-from "$filelist" \
    --ignore-existing \
    --progress

if [ $? -eq 124 ]; then
    echo -e "\n❌ FAIL: Command timed out (deadlock detected)"
    exit 1
else
    echo -e "\n✅ PASS: Command completed successfully"
fi

# Cleanup
rm -rf "$src" "$dst" "$filelist"

Results on 1.69.3, 1.70.2, and with this patch:

Using rclone: rclone
Version: rclone v1.69.3

Creating 200 test files...
Initial copy to destination...
Testing --no-traverse with all files existing (10s timeout)...
Transferred:   	          0 B / 0 B, -, 0 B/s, ETA -
Checks:               200 / 200, 100%
Elapsed time:         0.0s

✅ PASS: Command completed successfully
Using rclone: rclone
Version: rclone v1.70.2

Creating 200 test files...
Initial copy to destination...
Testing --no-traverse with all files existing (10s timeout)...
Transferred:   	          0 B / 0 B, -, 0 B/s, ETA -
Checks:                 0 / 0, -, Listed 200
Elapsed time:         9.5s
❌ FAIL: Command timed out (deadlock detected)
Using rclone: ./rclone
Version: rclone v1.71.0-DEV

Creating 200 test files...
Initial copy to destination...
Testing --no-traverse with all files existing (10s timeout)...
Transferred:   	          0 B / 0 B, -, 0 B/s, ETA -
Checks:               200 / 200, 100%, Listed 200
Elapsed time:         0.0s

✅ PASS: Command completed successfully

Was the change discussed in an issue or in the forum before?

Checklist

  • I have read the contribution guidelines.
  • I have added tests for all changes in this PR if appropriate.
  • I have added documentation for the changes if appropriate.
  • All commit messages are in house style.
  • I'm done, this Pull Request is ready for review :-)

When every source object already exists at the destination and
--no-traverse is set, march deadlocked after 100 entries because
dstChan lagged srcChan. Buffer all sources and destinations,
then replay them through balanced channels.

Introduced in v1.70.0 with callback-based syncing.
@ncw

ncw commented Jun 28, 2025

Copy link
Copy Markdown
Member

Thanks for finding this problem. Yes this code is all new to help with directories with 100,000,000 entries where we sort on disk.

Your code appears to revert to an in memory sort which is exactly what we are trying to avoid so I think we are going to need a different fix.

@jeremy

jeremy commented Jun 28, 2025

Copy link
Copy Markdown
Contributor Author

@ncw see #8648 for a forward-looking approach using sequence numbering to maintain global order. Tried implementing a min-heap approach but kept hitting subtle bugs.

@ncw

ncw commented Jul 2, 2025

Copy link
Copy Markdown
Member

I have taken your excellent writeup and test routine and created an issue from them here #8656

I'll close this PR since we both think it is the wrong approach.

@ncw ncw closed this Jul 2, 2025
@jeremy jeremy deleted the fix-no-traverse-deadlock branch July 2, 2025 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants