Skip to content

fix: resolve dangling dedup symlink infinite redownload loop#119

Merged
GeiserX merged 3 commits intomainfrom
fix/dangling-symlink-verify-media
Apr 18, 2026
Merged

fix: resolve dangling dedup symlink infinite redownload loop#119
GeiserX merged 3 commits intomainfrom
fix/dangling-symlink-verify-media

Conversation

@GeiserX
Copy link
Copy Markdown
Owner

@GeiserX GeiserX commented Apr 18, 2026

Summary

  • Fix VERIFY_MEDIA + DEDUPLICATE_MEDIA interaction where dangling symlinks cause infinite redownload loops with Errno 17 (file exists) and Errno 2 (file not found) errors
  • Three root causes fixed: os.path.exists() follows symlinks (now lexists()), Telethon download_media() return value was discarded (now captured), stale symlinks not removed before recreation (now pre-unlinked)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that causes existing functionality to change)
  • Documentation update
  • Infrastructure/CI change

Database Changes

  • Schema changes (Alembic migration required)
  • Data migration script added in scripts/
  • No database changes

Data Consistency Checklist

N/A — no database operations modified.

Testing

  • Tests pass locally (python -m pytest tests/ -v)
  • Linting passes (ruff check .)
  • Formatting passes (ruff format --check .)
  • Manually tested in development environment

9 new tests in tests/test_symlink_dedup.py covering:

  • Dangling symlink detection (lexists vs exists)
  • Stale symlink removal before recreation
  • Download return value capture for actual on-disk filenames
  • _process_media integration with dedup enabled
  • _verify_and_redownload_media cleanup of dangling symlinks
  • Listener _download_media symlink handling
  • shutil.copy2 fallback when shared file already exists

Security Checklist

  • No secrets or credentials committed
  • User input properly validated/sanitized
  • Authentication/authorization properly checked

Deployment Notes

  • No configuration changes required — fix is transparent for existing DEDUPLICATE_MEDIA + VERIFY_MEDIA users
  • Existing dangling symlinks will be auto-repaired on next backup cycle

Closes #115

Summary by CodeRabbit

  • Documentation

    • Updated changelog with media deduplication symlink handling fix details.
  • Bug Fixes

    • Improved media deduplication symlink handling when both DEDUPLICATE_MEDIA and VERIFY_MEDIA are enabled. The system now properly detects dangling symlinks, removes stale links before recreation, and prevents "file exists" errors in both scheduled backup and real-time listener flows.
  • Tests

    • Added comprehensive test suite for media deduplication and dangling symlink handling scenarios.

Use os.path.lexists() instead of os.path.exists() to detect dangling
symlinks during VERIFY_MEDIA cleanup. Capture Telethon download_media()
return value to use actual on-disk filename for symlink targets. Remove
stale symlinks before recreation to prevent Errno 17 (file exists).

Applies to both scheduled backup and real-time listener flows.

Closes #115
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 18, 2026

Warning

Rate limit exceeded

@GeiserX has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 50 minutes and 23 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 50 minutes and 23 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 38018cd7-6884-4742-a7a5-f7a4c08ad6d5

📥 Commits

Reviewing files that changed from the base of the PR and between 64f5162 and c804071.

📒 Files selected for processing (2)
  • src/telegram_backup.py
  • tests/test_symlink_dedup.py
📝 Walkthrough

Walkthrough

Fixes dangling symlink handling in media deduplication and verification workflows by detecting broken links with os.path.lexists(), capturing actual downloaded paths returned by download_media(), and removing stale symlinks before creating replacements to prevent EEXIST errors.

Changes

Cohort / File(s) Summary
Documentation
docs/CHANGELOG.md
Added changelog entry documenting the fix for dangling dedup symlinks in both scheduled backup and real-time listener flows.
Listener & Backup Media Handling
src/listener.py, src/telegram_backup.py
Updated _download_media() and _process_media() to use os.path.lexists() for symlink detection, capture and normalize actual paths returned by download_media(), pre-unlink existing paths before creating symlinks, and fall back to shutil.copy2/shutil.move on symlink creation failure. Updated _verify_and_redownload_media() to remove dangling symlinks before re-download.
Test Suite
tests/test_symlink_dedup.py
New comprehensive test suite validating dangling symlink detection and cleanup, deduplication symlink creation with path capture, symlink failure fallback behavior, and verification-time symlink removal across both backup and listener flows.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.04% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly summarizes the main fix: resolving dangling dedup symlink infinite redownload loops, matching the primary changeset objective.
Description check ✅ Passed The description covers all required template sections: summary, type of change, database changes, data consistency, testing, security, and deployment notes with appropriate selections and details.
Linked Issues check ✅ Passed Changes implement all three root fixes from issue #115: using os.path.lexists() for dangling symlink detection, capturing download_media() return values for actual filenames, and pre-unlinking stale symlinks before recreation.
Out of Scope Changes check ✅ Passed All changes directly address issue #115: docs update, listener.py symlink handling, telegram_backup.py dedup/verify logic, and comprehensive test coverage. No extraneous modifications present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/dangling-symlink-verify-media

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

🐳 Dev images published!

  • drumsergio/telegram-archive:dev
  • drumsergio/telegram-archive-viewer:dev

The dev/test instance will pick up these changes automatically (Portainer GitOps).

To test locally:

docker pull drumsergio/telegram-archive:dev
docker pull drumsergio/telegram-archive-viewer:dev

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 18, 2026

Codecov Report

❌ Patch coverage is 96.55172% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 44.56%. Comparing base (f3df668) to head (c804071).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/listener.py 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #119      +/-   ##
==========================================
+ Coverage   41.90%   44.56%   +2.66%     
==========================================
  Files          21       21              
  Lines        5847     5865      +18     
==========================================
+ Hits         2450     2614     +164     
+ Misses       3397     3251     -146     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Cover all changed lines in listener.py _download_media (dangling symlink
pre-unlink, copy2 fallback, download return capture for both dedup and
non-dedup paths) and telegram_backup.py _process_media (non-dedup return
capture, dedup symlink-fail fallback return capture).
@github-actions
Copy link
Copy Markdown

🐳 Dev images published!

  • drumsergio/telegram-archive:dev
  • drumsergio/telegram-archive-viewer:dev

The dev/test instance will pick up these changes automatically (Portainer GitOps).

To test locally:

docker pull drumsergio/telegram-archive:dev
docker pull drumsergio/telegram-archive-viewer:dev

@GeiserX
Copy link
Copy Markdown
Owner Author

GeiserX commented Apr 18, 2026

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 18, 2026

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/telegram_backup.py (1)

1349-1362: ⚠️ Potential issue | 🟠 Major

Pre-unlink stale chat symlinks in the shared-file branch too.

This branch still calls os.symlink() while a dangling file_path may already exist, so scheduled backups can still hit EEXIST. Mirror the listener/first-download path and remove the stale link before recreating it.

🐛 Proposed fix
                     if os.path.exists(shared_file_path):
                         # File exists in shared - create symlink
                         try:
                             # Use relative symlink for portability
                             rel_path = os.path.relpath(shared_file_path, chat_media_dir)
+                            if os.path.lexists(file_path):
+                                os.unlink(file_path)
                             os.symlink(rel_path, file_path)
                             logger.debug(f"Created symlink for deduplicated media: {file_name}")
                         except OSError as e:
-                            # Symlink failed (e.g., Windows), copy reference instead
-                            logger.warning(f"Symlink failed, downloading copy: {e}")
-                            actual_path = await self.client.download_media(message, file_path)
-                            if actual_path and isinstance(actual_path, str):
-                                file_path = actual_path
+                            # Symlink not supported (e.g., Windows), copy shared file instead
+                            logger.warning(f"Symlink not supported, using direct path: {e}")
+                            import shutil
+
+                            shutil.copy2(shared_file_path, file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/telegram_backup.py` around lines 1349 - 1362, The shared-file branch
should pre-unlink any stale/dangling symlink at file_path before calling
os.symlink to avoid EEXIST; update the block handling shared_file_path so that
if os.path.exists(file_path) is True and os.path.islink(file_path) and not
os.path.exists(os.readlink(file_path) if os.path.islink(file_path) else ""), or
more simply if os.path.islink(file_path) and not os.path.exists(file_path), call
os.unlink(file_path) (mirroring the listener/first-download path) prior to
creating the relative symlink with os.symlink(rel_path, file_path); keep the
existing exception handling that falls back to
self.client.download_media(message, file_path) and the same logger calls
(logger.debug/logger.warning) and references to file_name.
🧹 Nitpick comments (1)
tests/test_symlink_dedup.py (1)

145-155: Set reply_to = None on MagicMock messages.

Add this to the message helpers/ad-hoc message mocks to avoid accidental topic-filtering behavior as these tests evolve. As per coding guidelines, “Always set msg.reply_to = None on MagicMock message objects to prevent false-positive topic filtering in tests”.

Also applies to: 194-204, 520-528, 716-725, 779-788

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_symlink_dedup.py` around lines 145 - 155, The MagicMock message
instances (e.g., the local variable msg used in tests/test_symlink_dedup.py and
its helper mocks) must explicitly set msg.reply_to = None to avoid accidental
topic-filtering; update each mock block where msg is created (the block with
msg.id, msg.date, msg.media and related return-value mocks and any similar
blocks at the other noted locations) to include msg.reply_to = None immediately
after instantiating msg.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/listener.py`:
- Around line 638-644: The final return currently ignores the possibly-updated
file_path from the download step; update the return to use the captured
file_path (or its relative path under self.config.media_path) instead of always
constructing f"{self.config.media_path}/{chat_id}/{file_name}"; specifically,
after calling self.client.download_media(...) and possibly getting an
actual_path string, return that actual_path (or os.path.relpath(actual_path,
self.config.media_path) if you must store a path relative to the media root) so
the DB records the real file Telethon wrote rather than the originally requested
file_name.

In `@tests/test_symlink_dedup.py`:
- Around line 162-166: The test currently only conditionally checks the symlink
on disk (using os.path.lexists) and compares the resolved path, which allows
download_media() to be ignored; instead assert the function's returned/stored
path directly: remove the if os.path.lexists(...) guard and replace with a
direct equality assertion that actual_returned_path (the value returned by
download_media()) equals the expected resolved path for chat_link (use
os.path.realpath on both sides if needed). Apply the same change pattern to the
other occurrences referenced (around lines 230-234, 619-623, 686-688, 748-752,
822-824) so each test tightens the assertion to validate the captured/returned
path instead of only checking existence.
- Around line 373-380: Add an assertion that the dangling symlink was actually
removed during _verify_and_redownload_media: after the existing assertions that
_process_media and db.insert_media were awaited, assert that os.path.lexists was
called for the problematic symlink and that os.remove was called with that
symlink path (i.e. verify os.path.lexists(...)=True check and
os.remove(symlink_path) invocation), referencing the test's call to
self.backup._verify_and_redownload_media(), self.backup._process_media,
self.backup.db.insert_media, os.path.lexists, and os.remove to ensure the
cleanup step ran rather than only the re-download.

---

Outside diff comments:
In `@src/telegram_backup.py`:
- Around line 1349-1362: The shared-file branch should pre-unlink any
stale/dangling symlink at file_path before calling os.symlink to avoid EEXIST;
update the block handling shared_file_path so that if os.path.exists(file_path)
is True and os.path.islink(file_path) and not
os.path.exists(os.readlink(file_path) if os.path.islink(file_path) else ""), or
more simply if os.path.islink(file_path) and not os.path.exists(file_path), call
os.unlink(file_path) (mirroring the listener/first-download path) prior to
creating the relative symlink with os.symlink(rel_path, file_path); keep the
existing exception handling that falls back to
self.client.download_media(message, file_path) and the same logger calls
(logger.debug/logger.warning) and references to file_name.

---

Nitpick comments:
In `@tests/test_symlink_dedup.py`:
- Around line 145-155: The MagicMock message instances (e.g., the local variable
msg used in tests/test_symlink_dedup.py and its helper mocks) must explicitly
set msg.reply_to = None to avoid accidental topic-filtering; update each mock
block where msg is created (the block with msg.id, msg.date, msg.media and
related return-value mocks and any similar blocks at the other noted locations)
to include msg.reply_to = None immediately after instantiating msg.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8aac0c38-766e-42b8-9fea-a588b2c0bc37

📥 Commits

Reviewing files that changed from the base of the PR and between f3df668 and 64f5162.

📒 Files selected for processing (4)
  • docs/CHANGELOG.md
  • src/listener.py
  • src/telegram_backup.py
  • tests/test_symlink_dedup.py

Comment thread src/listener.py
Comment on lines 638 to 644
if not os.path.exists(file_path):
await self.client.download_media(message, file_path)
actual_path = await self.client.download_media(message, file_path)
if actual_path and isinstance(actual_path, str):
file_path = actual_path

# Return the path as stored in DB (relative to media root)
return f"{self.config.media_path}/{chat_id}/{file_name}"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Return file_path after capturing Telethon’s actual path.

Line 644 discards the updated file_path from Lines 639-641. In non-dedup mode, if Telethon writes photo.mp4 instead of the requested photo, the DB still stores the non-existent requested path.

🐛 Proposed fix
             # Return the path as stored in DB (relative to media root)
-            return f"{self.config.media_path}/{chat_id}/{file_name}"
+            return file_path
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not os.path.exists(file_path):
await self.client.download_media(message, file_path)
actual_path = await self.client.download_media(message, file_path)
if actual_path and isinstance(actual_path, str):
file_path = actual_path
# Return the path as stored in DB (relative to media root)
return f"{self.config.media_path}/{chat_id}/{file_name}"
if not os.path.exists(file_path):
actual_path = await self.client.download_media(message, file_path)
if actual_path and isinstance(actual_path, str):
file_path = actual_path
# Return the path as stored in DB (relative to media root)
return file_path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/listener.py` around lines 638 - 644, The final return currently ignores
the possibly-updated file_path from the download step; update the return to use
the captured file_path (or its relative path under self.config.media_path)
instead of always constructing
f"{self.config.media_path}/{chat_id}/{file_name}"; specifically, after calling
self.client.download_media(...) and possibly getting an actual_path string,
return that actual_path (or os.path.relpath(actual_path, self.config.media_path)
if you must store a path relative to the media root) so the DB records the real
file Telethon wrote rather than the originally requested file_name.

Comment thread tests/test_symlink_dedup.py
Comment on lines +373 to +380
self._run(self.backup._verify_and_redownload_media())

# The cleanup code should have removed the dangling symlink before re-download
# (lexists check + os.remove in the verification loop)
# Verify _process_media was called (re-download attempted)
self.backup._process_media.assert_awaited_once()
# Verify db.insert_media was called (successful re-download)
self.backup.db.insert_media.assert_awaited_once()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Assert the dangling symlink cleanup happened.

This test currently only proves redownload was attempted. If the os.path.lexists() cleanup regresses, the mocked _process_media() still lets the test pass.

🧪 Proposed assertion
         self._run(self.backup._verify_and_redownload_media())
 
+        self.assertFalse(os.path.lexists(dangling_link))
         # The cleanup code should have removed the dangling symlink before re-download
         # (lexists check + os.remove in the verification loop)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
self._run(self.backup._verify_and_redownload_media())
# The cleanup code should have removed the dangling symlink before re-download
# (lexists check + os.remove in the verification loop)
# Verify _process_media was called (re-download attempted)
self.backup._process_media.assert_awaited_once()
# Verify db.insert_media was called (successful re-download)
self.backup.db.insert_media.assert_awaited_once()
self._run(self.backup._verify_and_redownload_media())
self.assertFalse(os.path.lexists(dangling_link))
# The cleanup code should have removed the dangling symlink before re-download
# (lexists check + os.remove in the verification loop)
# Verify _process_media was called (re-download attempted)
self.backup._process_media.assert_awaited_once()
# Verify db.insert_media was called (successful re-download)
self.backup.db.insert_media.assert_awaited_once()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_symlink_dedup.py` around lines 373 - 380, Add an assertion that
the dangling symlink was actually removed during _verify_and_redownload_media:
after the existing assertions that _process_media and db.insert_media were
awaited, assert that os.path.lexists was called for the problematic symlink and
that os.remove was called with that symlink path (i.e. verify
os.path.lexists(...)=True check and os.remove(symlink_path) invocation),
referencing the test's call to self.backup._verify_and_redownload_media(),
self.backup._process_media, self.backup.db.insert_media, os.path.lexists, and
os.remove to ensure the cleanup step ran rather than only the re-download.

Add lexists+unlink guard before os.symlink in the shared-file-exists
branch of _process_media (was missing, could still hit EEXIST). Change
fallback from re-downloading to shutil.copy2 for consistency with
listener. Set msg.reply_to=None on all test mock messages per project
convention.
@github-actions
Copy link
Copy Markdown

🐳 Dev images published!

  • drumsergio/telegram-archive:dev
  • drumsergio/telegram-archive-viewer:dev

The dev/test instance will pick up these changes automatically (Portainer GitOps).

To test locally:

docker pull drumsergio/telegram-archive:dev
docker pull drumsergio/telegram-archive-viewer:dev

@GeiserX GeiserX merged commit aafb7a9 into main Apr 18, 2026
10 checks passed
@GeiserX GeiserX deleted the fix/dangling-symlink-verify-media branch April 18, 2026 12:59
GeiserX pushed a commit that referenced this pull request May 7, 2026
`_verify_and_redownload_media` checks every previously recorded media
file with `os.path.exists`, which follows symlink targets. For archived
layouts where the actual content is held in an external store
(git-annex `objects/`, lazy mount, separate filesystem) the symlink may
be intact while its target is unreachable from the running process.
The old behavior flagged such entries as "missing" and re-downloaded,
atomic-renaming a freshly-downloaded regular file on top of the user's
symlink and mutating the working tree (issue #143).

Two-part fix in the Phase 1 detection loop:

1. Use `os.path.lexists` to detect "truly missing" entries. A symlink
   in place -- even a broken one -- is no longer added to the
   missing-files list.
2. After the lexists check, short-circuit on `os.path.islink`. Symlink
   contents are managed externally, so we cannot meaningfully verify
   size or emptiness without following the link, and we will not
   replace a symlink we did not write.

Truly absent paths (no entry on disk at all) are still re-downloaded.
The cleanup loop in Phase 2 already used `os.path.lexists` since #119,
so the two phases are now consistent.

Refs: #143

Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: VERIFY_MEDIA redownload fails on dangling dedup symlinks (Errno 17 + Errno 2)

1 participant