feat: content-hash dedup for cross-type media duplicates #139
Conversation
📝 Walkthrough
This PR implements content-hash based media deduplication to prevent re-downloading identical files with different message IDs. It adds a `content_hash` column to the `media` table (Alembic migration 011) and wires SHA-256 lookups into both the backup and listener download flows.
Changes: Content-Hash Media Deduplication
🎯 Review effort: 3 (Moderate) | ⏱️ Estimated review time: ~25 minutes
Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/telegram_backup.py (3)
1509-1533: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win — Hash direct downloads too.
When `deduplicate_media` is false, this path still writes `content_hash=None`. That leaves scheduled-backup rows inconsistent with the listener flow and prevents later hash-based reuse if dedup is turned back on. Compute the hash from `file_path` before building `media_data`.
Proposed fix:
```diff
 else:
     # No deduplication - download directly to chat directory
     if not os.path.exists(file_path):
         actual_path = await self.client.download_media(message, file_path)
         if actual_path and isinstance(actual_path, str):
             file_path = actual_path
         logger.debug(f"Downloaded media: {file_name}")
     # Update file_size with actual size from disk
     if os.path.exists(file_path):
         file_size = os.path.getsize(file_path)
+        content_hash = compute_file_hash(file_path)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/telegram_backup.py` around lines 1509 - 1533, When deduplicate_media is False the code leaves content_hash as None; after the download branch that may update file_path (the await self.client.download_media(...) block) compute the content hash from the downloaded file_path (only if the file exists) and set content_hash before constructing media_data so scheduled-backup rows include the actual hash; update the logic around file_path/content_hash prior to the media_data dict creation (referencing file_path, content_hash, download_media / self.client.download_media, and media_data).
1477-1501: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Don't move a reused shared file out of `_shared`.
After a content-hash hit, `shared_file_path` may point at an older canonical blob. If `os.symlink()` then fails, `shutil.move(shared_file_path, file_path)` removes that canonical file from `_shared`, which can leave existing symlinks dangling and makes later dedup miss. Copy in this branch, or only move files created by this call. The same fallback exists in `src/listener.py`.
Proposed fix:
```diff
+        reused_existing_shared = False
         content_hash = compute_file_hash(shared_file_path)
         if content_hash and hasattr(self, "db"):
             existing = await self.db.find_media_by_content_hash(content_hash)
             if existing and existing.get("file_name"):
                 existing_shared = os.path.join(shared_dir, existing["file_name"])
                 if os.path.exists(existing_shared) and existing_shared != shared_file_path:
                     os.remove(shared_file_path)
                     shared_file_path = existing_shared
+                    reused_existing_shared = True
                     logger.debug(
                         f"Content-hash dedup: {file_name} matches existing {existing['file_name']}"
                     )
@@
         except OSError as e:
             # Symlink not supported (e.g., Windows) - move file to chat dir instead
             logger.warning(f"Symlink not supported, using direct path: {e}")
             import shutil
-            shutil.move(shared_file_path, file_path)
+            if reused_existing_shared:
+                shutil.copy2(shared_file_path, file_path)
+            else:
+                shutil.move(shared_file_path, file_path)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/telegram_backup.py` around lines 1477 - 1501, The fallback branch currently moves shared_file_path into file_path which deletes canonical blobs when a content-hash hit reused an existing file; change the fallback to copy instead of move when shared_file_path points into the shared_dir or when the file was resolved via db.find_media_by_content_hash (i.e. after compute_file_hash and existing lookup), so that existing canonical files are preserved; only perform shutil.move if the file was created by this call (detectable by a local-temp path or a flag you set when writing the file), otherwise use shutil.copy2(shared_file_path, file_path) and preserve permissions/mtime.
101-104: ⚠️ Potential issue | 🔴 Critical — Fix Python 3 exception syntax at line 103.
`except ValueError, TypeError:` is invalid Python 3 syntax and prevents the module from importing. Change to tuple syntax: `except (ValueError, TypeError):`.
Proposed fix:
```diff
-    except ValueError, TypeError:
+    except (ValueError, TypeError):
         log_threshold_seconds = 10
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/telegram_backup.py` around lines 101 - 104, The except clause in the log_threshold_seconds parsing block uses Python 2 syntax (`except ValueError, TypeError:`) which breaks imports; update the exception handling in that try/except around int(os.getenv("FLOOD_WAIT_LOG_THRESHOLD", "10")) so it catches both exceptions using tuple syntax `except (ValueError, TypeError):`, leaving the fallback assignment to log_threshold_seconds = 10 unchanged.
src/listener.py (1)
910-923: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Persist `file_name` for listener-created media rows.
`find_media_by_content_hash()` rebuilds the canonical `_shared` path from `Media.file_name`. These inserts leave `file_name` null, so a listener-downloaded row can be selected for a hash hit and then treated as unusable, which makes later backup/listener dedup miss. Return `file_name` from `_download_media()` and store it here.
Proposed fix:
```diff
-    async def _download_media(self, message, chat_id: int) -> tuple[str, str | None] | None:
+    async def _download_media(self, message, chat_id: int) -> tuple[str, str, str | None] | None:
@@
-        return f"{self.config.media_path}/{chat_id}/{file_name}", file_content_hash
+        return f"{self.config.media_path}/{chat_id}/{file_name}", file_name, file_content_hash
@@
-        if download_result:
-            media_path, content_hash = download_result
+        if download_result:
+            media_path, file_name, content_hash = download_result
             # Create media record (FK to messages now satisfied)
             media_id = f"{chat_id}_{message.id}_{media_type}"
             await self.db.insert_media(
                 {
                     "id": media_id,
                     "message_id": message.id,
                     "chat_id": chat_id,
                     "type": media_type,
+                    "file_name": file_name,
                     "file_path": media_path,
                     "content_hash": content_hash,
                     "downloaded": True,
                     "download_date": datetime.utcnow(),
                 }
             )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/listener.py` around lines 910 - 923, The listener inserts downloaded media rows without setting Media.file_name which breaks find_media_by_content_hash; update _download_media to return file_name (in addition to media_path and content_hash), change the unpacking of download_result in listener.py to media_path, content_hash, file_name, and include "file_name": file_name in the dict passed to self.db.insert_media (keep media_id, message_id, chat_id, type, file_path, content_hash, downloaded as-is) so listener-created rows have file_name populated for canonical path reconstruction.
tests/test_listener_extended.py (1)
1612-1638: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win — Add a `content_hash` assertion to complete the coverage for this feature.
The mock now returns `(path, content_hash)` but the test only asserts on `file_path` and `downloaded`. If the listener silently drops the hash when calling `insert_media`, this test won't catch it — and persisting `content_hash` is the core objective of this PR.
✅ Proposed assertion addition:
```diff
 media_data = db.insert_media.call_args[0][0]
 assert media_data["file_path"] == "/tmp/media/-100/photo.jpg"
 assert media_data["downloaded"] is True
+assert media_data.get("content_hash") is None  # None passed from mock tuple's second element
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_listener_extended.py` around lines 1612 - 1638, The test mocks listener._download_media to return ("/tmp/media/-100/photo.jpg", "expected_hash") but never asserts that the listener forwards the content_hash into db.insert_media; update the test after calling handler(event) to check media_data = db.insert_media.call_args[0][0] and add an assertion that media_data["content_hash"] == "expected_hash" (ensuring the mocked content_hash value is used), leaving the existing assertions for file_path and downloaded intact and referencing listener._download_media, handler, and db.insert_media to locate the relevant lines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: df26cb70-2164-4b3c-ac5e-4660b67a2cdd
⛔ Files ignored due to path filters (1)
`alembic/versions/20260503_011_add_media_content_hash.py` is excluded by `!alembic/versions/**`
📒 Files selected for processing (9)
- .coderabbit.yaml
- CLAUDE.md
- scripts/entrypoint.sh
- src/db/adapter.py
- src/db/models.py
- src/listener.py
- src/message_utils.py
- src/telegram_backup.py
- tests/test_listener_extended.py
| - "CLAUDE.md" | ||
| - "CLAUDE.md" |
Remove duplicate CLAUDE.md entry.
Line 64 duplicates line 63.
🔧 Proposed fix
```diff
 filePatterns:
   - "CLAUDE.md"
-  - "CLAUDE.md"
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - "CLAUDE.md" | |
| - "CLAUDE.md" | |
| filePatterns: | |
| - "CLAUDE.md" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.coderabbit.yaml around lines 63 - 64, Remove the duplicate list entry
"CLAUDE.md" so the YAML list contains a single unique "CLAUDE.md" line; locate
the duplicate strings "CLAUDE.md" in the list and delete one of them (leaving
one entry), ensuring list formatting/indentation remains valid YAML.
```python
async def find_media_by_content_hash(self, content_hash: str) -> dict[str, Any] | None:
    """Find an existing downloaded media record with the given SHA-256 content hash."""
    async with self.db_manager.async_session_factory() as session:
        stmt = select(Media).where(and_(Media.content_hash == content_hash, Media.downloaded == 1)).limit(1)
        result = await session.execute(stmt)
        media = result.scalar_one_or_none()
        if media is None:
            return None
        return {
            "file_path": media.file_path,
            "file_name": media.file_name,
            "content_hash": media.content_hash,
        }
```
Hash lookup alone won't stop concurrent duplicate blobs.
Both download paths write a `_shared/...` file first and only then call this unlocked `SELECT ... LIMIT 1`. If the backup and listener download the same bytes at the same time, they can both miss here and keep separate shared files under different Telegram IDs, so disk dedup still fails. This needs a hash-scoped lock or another canonicalization step around winner selection, not just a post-download lookup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/db/adapter.py` around lines 677 - 689, The current
find_media_by_content_hash lookup is vulnerable to race conditions because both
downloaders write a _shared file then run this unlocked SELECT; instead
implement a hash-scoped canonicalization around winner selection: in
find_media_by_content_hash (or a new helper used by both download paths) acquire
a lock keyed by Media.content_hash (e.g., DB advisory lock or a dedicated
canonicalization table row locked with SELECT ... FOR UPDATE inside a
transaction), then re-check/select the Media row with Media.content_hash and
Media.downloaded inside that lock/transaction; if absent, allow the caller
designated as winner to insert or upsert a canonical Media record (using INSERT
... ON CONFLICT or an explicit insert-after-lock) and return its
file_path/file_name/content_hash, otherwise return the existing row—ensure both
download paths use this locked canonicalization function so only one shared file
is adopted per content_hash.
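A minimal sketch of the locked canonicalization the prompt describes, assuming a PostgreSQL backend (SQLite would need a different mechanism, such as a dedicated canonicalization table) and the async SQLAlchemy session the adapter already uses. The function name, the 60-bit key derivation, and the transaction handling below are assumptions, not the repository's code:

```python
from sqlalchemy import select, text


async def canonicalize_by_content_hash(session, content_hash: str, candidate_row: dict):
    """Pick exactly one canonical media row per content hash (PostgreSQL sketch).

    Serializes on a transaction-scoped advisory lock derived from the hash,
    re-checks for an existing row under the lock, and inserts candidate_row
    as the canonical record only if no other downloader won first.
    """
    # Advisory locks are keyed by bigint; fold the hex digest into 60 bits.
    lock_key = int(content_hash[:15], 16)

    async with session.begin():
        # Held until this transaction ends, so check-then-insert is atomic
        # with respect to any other downloader hashing the same bytes.
        await session.execute(text("SELECT pg_advisory_xact_lock(:k)"), {"k": lock_key})

        stmt = (
            select(Media)
            .where(Media.content_hash == content_hash, Media.downloaded == 1)
            .limit(1)
        )
        existing = (await session.execute(stmt)).scalar_one_or_none()
        if existing is not None:
            # Loser: reuse the winner's shared file and discard the duplicate.
            return {"file_path": existing.file_path, "file_name": existing.file_name}, False

        # Winner: persist the freshly downloaded file as the canonical blob.
        session.add(Media(**candidate_row))
        return candidate_row, True
```

Because the advisory lock is transaction-scoped, the winner's insert commits before the lock is released, so a racing caller re-checking under the same lock always sees the committed canonical row.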
Telegram assigns different file IDs when the same file is sent via different methods (streaming vs document). The existing filename-based dedup misses these. SHA-256 content hashing after download catches identical binary content regardless of Telegram file ID.

- Add content_hash column to Media model with index
- Add Alembic migration 011 (idempotent, PostgreSQL + SQLite)
- Add compute_file_hash() utility in message_utils (a sketch of such a helper follows below)
- Add find_media_by_content_hash() DB lookup
- Integrate content-hash dedup in both backup and listener flows
- Update entrypoint.sh stamping logic for migration 011

Closes #138
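The commit does not show `compute_file_hash()` itself; a chunked SHA-256 helper along the following lines would match the behavior described. Treat it as a sketch of the idea, not the committed implementation (the chunk size and error handling are guesses):

```python
import hashlib


def compute_file_hash(file_path: str, chunk_size: int = 1 << 20) -> str | None:
    """Return the SHA-256 hex digest of a file, reading in 1 MiB chunks."""
    try:
        digest = hashlib.sha256()
        with open(file_path, "rb") as f:
            # Stream the file so large media never loads into memory at once.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()
    except OSError:
        # Unreadable or vanished file: callers treat None as "no hash".
        return None
```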
- Move content-hash dedup into shared deduplicate_shared_file() in message_utils.py (DRY: both backup and listener use it; see the sketch after this list)
- Add path traversal guard via realpath containment check
- Handle TOCTOU race on os.remove with FileNotFoundError catch
- Return reused flag so callers use copy2 instead of move for canonical blobs (prevents destroying shared store entries)
- Return file_name from listener's _download_media for DB consistency
- Compute content_hash in backup's no-dedup branch for completeness
- Fix test mocks to supply db.find_media_by_content_hash
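Pieced together from those bullet points, the shared helper plausibly fits this shape; the signature, return convention, and guard details below are reconstructions from the commit message, not the actual `message_utils.py` code:

```python
import os


async def deduplicate_shared_file(db, shared_dir: str, shared_file_path: str) -> tuple[str, bool]:
    """Return (path_to_use, reused) for a freshly downloaded shared file.

    reused=True means the returned path is an existing canonical blob, so
    callers must copy2() instead of move() in the symlink-fallback branch.
    compute_file_hash() is the SHA-256 helper this PR adds to message_utils.
    """
    content_hash = compute_file_hash(shared_file_path)
    if not content_hash:
        return shared_file_path, False

    existing = await db.find_media_by_content_hash(content_hash)
    if not existing or not existing.get("file_name"):
        return shared_file_path, False

    candidate = os.path.join(shared_dir, existing["file_name"])
    # Path traversal guard: the canonical blob must resolve inside shared_dir.
    shared_root = os.path.realpath(shared_dir)
    if not os.path.realpath(candidate).startswith(shared_root + os.sep):
        return shared_file_path, False

    if os.path.exists(candidate) and candidate != shared_file_path:
        try:
            os.remove(shared_file_path)  # drop the duplicate just written
        except FileNotFoundError:
            pass  # TOCTOU: another worker already removed it
        return candidate, True

    return shared_file_path, False
```

The `reused` flag is what lets the fallback path choose `shutil.copy2` over `shutil.move`, addressing the canonical-blob deletion flagged in the review above.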
Force-pushed: 36e45a2 to ce35180
🐳 Dev images published!
The dev/test instance will pick up these changes automatically (Portainer GitOps). To test locally:

```
docker pull drumsergio/telegram-archive:dev
docker pull drumsergio/telegram-archive-viewer:dev
```
Codecov Report
❌ Patch coverage is 66.19%.
❌ Your patch status has failed because the patch coverage (66.19%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
- Coverage   94.41%   94.07%   -0.34%
==========================================
  Files          21       21
  Lines        6066     6127      +61
==========================================
+ Hits         5727     5764      +37
- Misses        339      363      +24
```
The "do I already have this media?" check in `_process_media` and the
listener's `_download_media` used `os.path.exists`, which follows the
symlink chain to its ultimate target. For archived layouts where media
files are symlinks (e.g. into `.git/annex/objects/...`), the target may
be unreachable from the running process -- typical when the Docker /
Podman bind mount only covers the working tree but not `.git/`. In that
case `os.path.exists` returns False even though the symlink is intact,
and the script enters the "first-time download" branch:
1. `_shared/<name>` is overwritten via atomic rename (.part -> .),
replacing a git-annex symlink with a freshly-downloaded regular
file (visible as `typechange:` in `git status`).
2. Content-hash dedup (#139) then queries the DB for a SHA-256 match
and may rewrite the chat-dir symlink to point at a different
canonical blob processed earlier in the same run -- a non-
deterministic mutation of the working tree across reruns.
Switch the gate to `os.path.lexists`, which is True for any symlink
regardless of target validity. A previously recorded symlink now
short-circuits the entire download path, preserving the user's archived
layout byte-for-byte. Hashing of the shared blob is gated behind
`os.path.exists` so we don't crash on broken links.
Mirrors the change in both the scheduled backup flow
(`src/telegram_backup.py`) and the real-time listener
(`src/listener.py`).
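For illustration, the gate change described above amounts to something like the following; the helper names are hypothetical, and `compute_file_hash` is the SHA-256 utility this PR added:

```python
import os


def media_already_recorded(file_path: str) -> bool:
    """True for any existing directory entry, including a broken symlink.

    os.path.exists() follows the symlink chain and returns False when the
    target (e.g. .git/annex/objects/...) isn't reachable from the container,
    which previously triggered a spurious re-download. os.path.lexists()
    only asks whether the entry itself exists, so an archived symlink
    short-circuits the download path and is preserved byte-for-byte.
    """
    return os.path.lexists(file_path)


def hash_shared_blob(shared_file_path: str) -> str | None:
    """Hash only resolvable files so broken links don't crash the run."""
    if os.path.exists(shared_file_path):  # follows symlinks to the target
        return compute_file_hash(shared_file_path)
    return None
```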
Existing tests that asserted the old "replace dangling symlink" behavior
have been updated to assert the new "preserve existing symlink" contract,
which is what idempotent rerun requires.
Refs: #143
Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `content_hash` column on the `media` table with Alembic migration 011 (idempotent, supports both PostgreSQL and SQLite; a sketch of such a migration follows this list)
- Content-hash dedup integrated into both the scheduled backup (`_process_media`) and real-time listener (`_download_media`) flows
- `entrypoint.sh` stamping logic updated for both DB engines
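The migration file itself is excluded from the review by path filters (see the excluded-files note above), but an idempotent add-column migration covering both engines could look roughly like this; the revision identifiers and index name are assumptions:

```python
import sqlalchemy as sa
from alembic import op

revision = "011"
down_revision = "010"  # assumed previous revision


def upgrade() -> None:
    # Inspect first so re-running against an already-migrated database
    # (PostgreSQL or SQLite) is a no-op rather than an error.
    bind = op.get_bind()
    columns = {c["name"] for c in sa.inspect(bind).get_columns("media")}
    if "content_hash" not in columns:
        # SHA-256 hex digests are 64 characters.
        op.add_column("media", sa.Column("content_hash", sa.String(64), nullable=True))
        op.create_index("ix_media_content_hash", "media", ["content_hash"])


def downgrade() -> None:
    op.drop_index("ix_media_content_hash", table_name="media")
    op.drop_column("media", "content_hash")
```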
How it works

- After downloading a file to `_shared/`, compute its SHA-256 hash
- Look the hash up in the `media` table via `find_media_by_content_hash()`; on a match, reuse the existing file instead of keeping the duplicate
- Store `content_hash` on every media record for future lookups
Test plan

- `ruff check .` and `ruff format --check .` pass
- Verify the `content_hash` column exists after migration

Closes #138
Summary by CodeRabbit
Release Notes
New Features
Tests