feat: content-hash dedup for cross-type media duplicates #139
Conversation
📝 Walkthrough
This PR implements content-hash based media deduplication to prevent re-downloading identical files with different message IDs. It adds a `content_hash` column to the `media` table (Alembic migration 011) and wires SHA-256 lookups into both the backup and listener download flows.
Changes: Content-Hash Media Deduplication
🎯 Review effort: 3 (Moderate) | ⏱️ Estimated review time: ~25 minutes
Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/telegram_backup.py (3)
1509-1533: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win — Hash direct downloads too.
When `deduplicate_media` is false, this path still writes `content_hash=None`. That leaves scheduled-backup rows inconsistent with the listener flow and prevents later hash-based reuse if dedup is turned back on. Compute the hash from `file_path` before building `media_data`.
Proposed fix:
```diff
 else:
     # No deduplication - download directly to chat directory
     if not os.path.exists(file_path):
         actual_path = await self.client.download_media(message, file_path)
         if actual_path and isinstance(actual_path, str):
             file_path = actual_path
         logger.debug(f"Downloaded media: {file_name}")
     # Update file_size with actual size from disk
     if os.path.exists(file_path):
         file_size = os.path.getsize(file_path)
+        content_hash = compute_file_hash(file_path)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/telegram_backup.py` around lines 1509 - 1533, When deduplicate_media is False the code leaves content_hash as None; after the download branch that may update file_path (the await self.client.download_media(...) block) compute the content hash from the downloaded file_path (only if the file exists) and set content_hash before constructing media_data so scheduled-backup rows include the actual hash; update the logic around file_path/content_hash prior to the media_data dict creation (referencing file_path, content_hash, download_media / self.client.download_media, and media_data).
1477-1501: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Don't move a reused shared file out of `_shared`.
After a content-hash hit, `shared_file_path` may point at an older canonical blob. If `os.symlink()` then fails, `shutil.move(shared_file_path, file_path)` removes that canonical file from `_shared`, which can leave existing symlinks dangling and makes later dedup miss. Copy in this branch, or only move files created by this call. The same fallback exists in `src/listener.py`.
Proposed fix:
```diff
+        reused_existing_shared = False
         content_hash = compute_file_hash(shared_file_path)
         if content_hash and hasattr(self, "db"):
             existing = await self.db.find_media_by_content_hash(content_hash)
             if existing and existing.get("file_name"):
                 existing_shared = os.path.join(shared_dir, existing["file_name"])
                 if os.path.exists(existing_shared) and existing_shared != shared_file_path:
                     os.remove(shared_file_path)
                     shared_file_path = existing_shared
+                    reused_existing_shared = True
                     logger.debug(
                         f"Content-hash dedup: {file_name} matches existing {existing['file_name']}"
                     )
@@
         except OSError as e:
             # Symlink not supported (e.g., Windows) - move file to chat dir instead
             logger.warning(f"Symlink not supported, using direct path: {e}")
             import shutil
-            shutil.move(shared_file_path, file_path)
+            if reused_existing_shared:
+                shutil.copy2(shared_file_path, file_path)
+            else:
+                shutil.move(shared_file_path, file_path)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/telegram_backup.py` around lines 1477 - 1501, The fallback branch currently moves shared_file_path into file_path which deletes canonical blobs when a content-hash hit reused an existing file; change the fallback to copy instead of move when shared_file_path points into the shared_dir or when the file was resolved via db.find_media_by_content_hash (i.e. after compute_file_hash and existing lookup), so that existing canonical files are preserved; only perform shutil.move if the file was created by this call (detectable by a local-temp path or a flag you set when writing the file), otherwise use shutil.copy2(shared_file_path, file_path) and preserve permissions/mtime.
101-104: ⚠️ Potential issue | 🔴 Critical — Fix Python 3 exception syntax at line 103.
`except ValueError, TypeError:` is invalid Python 3 syntax and prevents the module from importing. Change to tuple syntax: `except (ValueError, TypeError):`.
Proposed fix:
```diff
-    except ValueError, TypeError:
+    except (ValueError, TypeError):
         log_threshold_seconds = 10
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/telegram_backup.py` around lines 101 - 104, The except clause in the log_threshold_seconds parsing block uses Python 2 syntax (`except ValueError, TypeError:`) which breaks imports; update the exception handling in that try/except around int(os.getenv("FLOOD_WAIT_LOG_THRESHOLD", "10")) so it catches both exceptions using tuple syntax `except (ValueError, TypeError):`, leaving the fallback assignment to log_threshold_seconds = 10 unchanged.
src/listener.py (1)
910-923: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Persist `file_name` for listener-created media rows.
`find_media_by_content_hash()` rebuilds the canonical `_shared` path from `Media.file_name`. These inserts leave `file_name` null, so a listener-downloaded row can be selected for a hash hit and then treated as unusable, which makes later backup/listener dedup miss. Return `file_name` from `_download_media()` and store it here.
Proposed fix:
```diff
-    async def _download_media(self, message, chat_id: int) -> tuple[str, str | None] | None:
+    async def _download_media(self, message, chat_id: int) -> tuple[str, str, str | None] | None:
@@
-        return f"{self.config.media_path}/{chat_id}/{file_name}", file_content_hash
+        return f"{self.config.media_path}/{chat_id}/{file_name}", file_name, file_content_hash
@@
-        if download_result:
-            media_path, content_hash = download_result
+        if download_result:
+            media_path, file_name, content_hash = download_result
             # Create media record (FK to messages now satisfied)
             media_id = f"{chat_id}_{message.id}_{media_type}"
             await self.db.insert_media(
                 {
                     "id": media_id,
                     "message_id": message.id,
                     "chat_id": chat_id,
                     "type": media_type,
+                    "file_name": file_name,
                     "file_path": media_path,
                     "content_hash": content_hash,
                     "downloaded": True,
                     "download_date": datetime.utcnow(),
                 }
             )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/listener.py` around lines 910 - 923, The listener inserts downloaded media rows without setting Media.file_name which breaks find_media_by_content_hash; update _download_media to return file_name (in addition to media_path and content_hash), change the unpacking of download_result in listener.py to media_path, content_hash, file_name, and include "file_name": file_name in the dict passed to self.db.insert_media (keep media_id, message_id, chat_id, type, file_path, content_hash, downloaded as-is) so listener-created rows have file_name populated for canonical path reconstruction.
tests/test_listener_extended.py (1)
1612-1638: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win — Add a `content_hash` assertion to complete the coverage for this feature.
The mock now returns `(path, content_hash)` but the test only asserts on `file_path` and `downloaded`. If the listener silently drops the hash when calling `insert_media`, this test won't catch it — and persisting `content_hash` is the core objective of this PR.
✅ Proposed assertion addition:
```diff
 media_data = db.insert_media.call_args[0][0]
 assert media_data["file_path"] == "/tmp/media/-100/photo.jpg"
 assert media_data["downloaded"] is True
+assert media_data.get("content_hash") is None  # None passed from mock tuple's second element
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_listener_extended.py` around lines 1612 - 1638, The test mocks listener._download_media to return ("/tmp/media/-100/photo.jpg", "expected_hash") but never asserts that the listener forwards the content_hash into db.insert_media; update the test after calling handler(event) to check media_data = db.insert_media.call_args[0][0] and add an assertion that media_data["content_hash"] == "expected_hash" (ensuring the mocked content_hash value is used), leaving the existing assertions for file_path and downloaded intact and referencing listener._download_media, handler, and db.insert_media to locate the relevant lines.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: df26cb70-2164-4b3c-ac5e-4660b67a2cdd
⛔ Files ignored due to path filters (1)
`alembic/versions/20260503_011_add_media_content_hash.py` is excluded by `!alembic/versions/**`
📒 Files selected for processing (9)
- .coderabbit.yaml
- CLAUDE.md
- scripts/entrypoint.sh
- src/db/adapter.py
- src/db/models.py
- src/listener.py
- src/message_utils.py
- src/telegram_backup.py
- tests/test_listener_extended.py
| - "CLAUDE.md" | ||
| - "CLAUDE.md" |
Remove duplicate CLAUDE.md entry.
Line 64 duplicates line 63.
🔧 Proposed fix
```diff
 filePatterns:
   - "CLAUDE.md"
-  - "CLAUDE.md"
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - "CLAUDE.md" | |
| - "CLAUDE.md" | |
| filePatterns: | |
| - "CLAUDE.md" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.coderabbit.yaml around lines 63 - 64, Remove the duplicate list entry
"CLAUDE.md" so the YAML list contains a single unique "CLAUDE.md" line; locate
the duplicate strings "CLAUDE.md" in the list and delete one of them (leaving
one entry), ensuring list formatting/indentation remains valid YAML.
```python
async def find_media_by_content_hash(self, content_hash: str) -> dict[str, Any] | None:
    """Find an existing downloaded media record with the given SHA-256 content hash."""
    async with self.db_manager.async_session_factory() as session:
        stmt = select(Media).where(and_(Media.content_hash == content_hash, Media.downloaded == 1)).limit(1)
        result = await session.execute(stmt)
        media = result.scalar_one_or_none()
        if media is None:
            return None
        return {
            "file_path": media.file_path,
            "file_name": media.file_name,
            "content_hash": media.content_hash,
        }
```
Hash lookup alone won't stop concurrent duplicate blobs.
Both download paths write a `_shared/...` file first and only then call this unlocked `SELECT ... LIMIT 1`. If the backup and listener download the same bytes at the same time, they can both miss here and keep separate shared files under different Telegram IDs, so disk dedup still fails. This needs a hash-scoped lock or another canonicalization step around winner selection, not just a post-download lookup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/db/adapter.py` around lines 677 - 689, The current
find_media_by_content_hash lookup is vulnerable to race conditions because both
downloaders write a _shared file then run this unlocked SELECT; instead
implement a hash-scoped canonicalization around winner selection: in
find_media_by_content_hash (or a new helper used by both download paths) acquire
a lock keyed by Media.content_hash (e.g., DB advisory lock or a dedicated
canonicalization table row locked with SELECT ... FOR UPDATE inside a
transaction), then re-check/select the Media row with Media.content_hash and
Media.downloaded inside that lock/transaction; if absent, allow the caller
designated as winner to insert or upsert a canonical Media record (using INSERT
... ON CONFLICT or an explicit insert-after-lock) and return its
file_path/file_name/content_hash, otherwise return the existing row—ensure both
download paths use this locked canonicalization function so only one shared file
is adopted per content_hash.
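A minimal sketch of the locked canonicalization the prompt describes, assuming a PostgreSQL backend (SQLite would need a different mechanism, such as a dedicated canonicalization table) and the async SQLAlchemy session the adapter already uses. The function name, the 60-bit key derivation, and the transaction handling below are assumptions, not the repository's code:

```python
from sqlalchemy import select, text


async def canonicalize_by_content_hash(session, content_hash: str, candidate_row: dict):
    """Pick exactly one canonical media row per content hash (PostgreSQL sketch).

    Serializes on a transaction-scoped advisory lock derived from the hash,
    re-checks for an existing row under the lock, and inserts candidate_row
    as the canonical record only if no other downloader won first.
    """
    # Advisory locks are keyed by bigint; fold the hex digest into 60 bits.
    lock_key = int(content_hash[:15], 16)

    async with session.begin():
        # Held until this transaction ends, so check-then-insert is atomic
        # with respect to any other downloader hashing the same bytes.
        await session.execute(text("SELECT pg_advisory_xact_lock(:k)"), {"k": lock_key})

        stmt = (
            select(Media)
            .where(Media.content_hash == content_hash, Media.downloaded == 1)
            .limit(1)
        )
        existing = (await session.execute(stmt)).scalar_one_or_none()
        if existing is not None:
            # Loser: reuse the winner's shared file and discard the duplicate.
            return {"file_path": existing.file_path, "file_name": existing.file_name}, False

        # Winner: persist the freshly downloaded file as the canonical blob.
        session.add(Media(**candidate_row))
        return candidate_row, True
```

Because the advisory lock is transaction-scoped, the winner's insert commits before the lock is released, so a racing caller re-checking under the same lock always sees the committed canonical row.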
Telegram assigns different file IDs when the same file is sent via different methods (streaming vs document). The existing filename-based dedup misses these. SHA-256 content hashing after download catches identical binary content regardless of Telegram file ID.

- Add content_hash column to Media model with index
- Add Alembic migration 011 (idempotent, PostgreSQL + SQLite)
- Add compute_file_hash() utility in message_utils (a sketch of such a helper follows below)
- Add find_media_by_content_hash() DB lookup
- Integrate content-hash dedup in both backup and listener flows
- Update entrypoint.sh stamping logic for migration 011

Closes #138
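The commit does not show `compute_file_hash()` itself; a chunked SHA-256 helper along the following lines would match the behavior described. Treat it as a sketch of the idea, not the committed implementation (the chunk size and error handling are guesses):

```python
import hashlib


def compute_file_hash(file_path: str, chunk_size: int = 1 << 20) -> str | None:
    """Return the SHA-256 hex digest of a file, reading in 1 MiB chunks."""
    try:
        digest = hashlib.sha256()
        with open(file_path, "rb") as f:
            # Stream the file so large media never loads into memory at once.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()
    except OSError:
        # Unreadable or vanished file: callers treat None as "no hash".
        return None
```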
- Move content-hash dedup into shared deduplicate_shared_file() in message_utils.py (DRY: both backup and listener use it; see the sketch after this list)
- Add path traversal guard via realpath containment check
- Handle TOCTOU race on os.remove with FileNotFoundError catch
- Return reused flag so callers use copy2 instead of move for canonical blobs (prevents destroying shared store entries)
- Return file_name from listener's _download_media for DB consistency
- Compute content_hash in backup's no-dedup branch for completeness
- Fix test mocks to supply db.find_media_by_content_hash
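Pieced together from those bullet points, the shared helper plausibly fits this shape; the signature, return convention, and guard details below are reconstructions from the commit message, not the actual `message_utils.py` code:

```python
import os


async def deduplicate_shared_file(db, shared_dir: str, shared_file_path: str) -> tuple[str, bool]:
    """Return (path_to_use, reused) for a freshly downloaded shared file.

    reused=True means the returned path is an existing canonical blob, so
    callers must copy2() instead of move() in the symlink-fallback branch.
    compute_file_hash() is the SHA-256 helper this PR adds to message_utils.
    """
    content_hash = compute_file_hash(shared_file_path)
    if not content_hash:
        return shared_file_path, False

    existing = await db.find_media_by_content_hash(content_hash)
    if not existing or not existing.get("file_name"):
        return shared_file_path, False

    candidate = os.path.join(shared_dir, existing["file_name"])
    # Path traversal guard: the canonical blob must resolve inside shared_dir.
    shared_root = os.path.realpath(shared_dir)
    if not os.path.realpath(candidate).startswith(shared_root + os.sep):
        return shared_file_path, False

    if os.path.exists(candidate) and candidate != shared_file_path:
        try:
            os.remove(shared_file_path)  # drop the duplicate just written
        except FileNotFoundError:
            pass  # TOCTOU: another worker already removed it
        return candidate, True

    return shared_file_path, False
```

The `reused` flag is what lets the fallback path choose `shutil.copy2` over `shutil.move`, addressing the canonical-blob deletion flagged in the review above.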
Force-pushed: 36e45a2 to ce35180
🐳 Dev images published!
The dev/test instance will pick up these changes automatically (Portainer GitOps). To test locally:

```
docker pull drumsergio/telegram-archive:dev
docker pull drumsergio/telegram-archive-viewer:dev
```
Codecov Report
❌ Patch coverage is 66.19%.
❌ Your patch status has failed because the patch coverage (66.19%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
- Coverage   94.41%   94.07%   -0.34%
==========================================
  Files          21       21
  Lines        6066     6127      +61
==========================================
+ Hits         5727     5764      +37
- Misses        339      363      +24
```
The "do I already have this media?" check in `_process_media` and the
listener's `_download_media` used `os.path.exists`, which follows the
symlink chain to its ultimate target. For archived layouts where media
files are symlinks (e.g. into `.git/annex/objects/...`), the target may
be unreachable from the running process -- typical when the Docker /
Podman bind mount only covers the working tree but not `.git/`. In that
case `os.path.exists` returns False even though the symlink is intact,
and the script enters the "first-time download" branch:
1. `_shared/<name>` is overwritten via atomic rename (.part -> .),
replacing a git-annex symlink with a freshly-downloaded regular
file (visible as `typechange:` in `git status`).
2. Content-hash dedup (#139) then queries the DB for a SHA-256 match
and may rewrite the chat-dir symlink to point at a different
canonical blob processed earlier in the same run -- a non-
deterministic mutation of the working tree across reruns.
Switch the gate to `os.path.lexists`, which is True for any symlink
regardless of target validity. A previously recorded symlink now
short-circuits the entire download path, preserving the user's archived
layout byte-for-byte. Hashing of the shared blob is gated behind
`os.path.exists` so we don't crash on broken links.
Mirrors the change in both the scheduled backup flow
(`src/telegram_backup.py`) and the real-time listener
(`src/listener.py`).
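For illustration, the gate change described above amounts to something like the following; the helper names are hypothetical, and `compute_file_hash` is the SHA-256 utility this PR added:

```python
import os


def media_already_recorded(file_path: str) -> bool:
    """True for any existing directory entry, including a broken symlink.

    os.path.exists() follows the symlink chain and returns False when the
    target (e.g. .git/annex/objects/...) isn't reachable from the container,
    which previously triggered a spurious re-download. os.path.lexists()
    only asks whether the entry itself exists, so an archived symlink
    short-circuits the download path and is preserved byte-for-byte.
    """
    return os.path.lexists(file_path)


def hash_shared_blob(shared_file_path: str) -> str | None:
    """Hash only resolvable files so broken links don't crash the run."""
    if os.path.exists(shared_file_path):  # follows symlinks to the target
        return compute_file_hash(shared_file_path)
    return None
```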
Existing tests that asserted the old "replace dangling symlink" behavior
have been updated to assert the new "preserve existing symlink" contract,
which is what idempotent rerun requires.
Refs: #143
Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `content_hash` column on the `media` table with Alembic migration 011 (idempotent, supports both PostgreSQL and SQLite; a sketch of such a migration follows this list)
- Content-hash dedup integrated into both the scheduled backup (`_process_media`) and real-time listener (`_download_media`) flows
- `entrypoint.sh` stamping logic updated for both DB engines
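The migration file itself is excluded from the review by path filters (see the excluded-files note above), but an idempotent add-column migration covering both engines could look roughly like this; the revision identifiers and index name are assumptions:

```python
import sqlalchemy as sa
from alembic import op

revision = "011"
down_revision = "010"  # assumed previous revision


def upgrade() -> None:
    # Inspect first so re-running against an already-migrated database
    # (PostgreSQL or SQLite) is a no-op rather than an error.
    bind = op.get_bind()
    columns = {c["name"] for c in sa.inspect(bind).get_columns("media")}
    if "content_hash" not in columns:
        # SHA-256 hex digests are 64 characters.
        op.add_column("media", sa.Column("content_hash", sa.String(64), nullable=True))
        op.create_index("ix_media_content_hash", "media", ["content_hash"])


def downgrade() -> None:
    op.drop_index("ix_media_content_hash", table_name="media")
    op.drop_column("media", "content_hash")
```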
How it works

- After downloading a file to `_shared/`, compute its SHA-256 hash
- Look the hash up in the `media` table via `find_media_by_content_hash()`; on a match, reuse the existing file instead of keeping the duplicate
- Store `content_hash` on every media record for future lookups
Test plan

- `ruff check .` and `ruff format --check .` pass
- Verify the `content_hash` column exists after migration

Closes #138
Summary by CodeRabbit
Release Notes
New Features
Tests