
feat: content-hash dedup for cross-type media duplicates #139

Merged
GeiserX merged 3 commits into main from fix/content-hash-dedup on May 3, 2026
Conversation

GeiserX (Owner) commented May 3, 2026

Summary

  • Adds SHA-256 content hashing after media download to detect identical binary content across different Telegram file IDs (e.g., same video sent as streaming vs document)
  • New content_hash column on the media table with Alembic migration 011 (idempotent, supports both PostgreSQL and SQLite; a sketch follows this list)
  • Content-hash dedup integrated in both scheduled backup (_process_media) and real-time listener (_download_media) flows
  • entrypoint.sh stamping logic updated for both DB engines
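
A minimal sketch of what an idempotent migration like 011 could look like, assuming only the column described above (the revision identifiers and index name here are illustrative, not copied from the PR):

    # alembic/versions/20260503_011_add_media_content_hash.py (sketch)
    import sqlalchemy as sa
    from alembic import op

    revision = "011"        # illustrative revision id
    down_revision = "010"   # assumed previous revision

    def upgrade() -> None:
        # Idempotent on PostgreSQL and SQLite: skip if the column already exists.
        cols = [c["name"] for c in sa.inspect(op.get_bind()).get_columns("media")]
        if "content_hash" not in cols:
            # 64 chars: length of a SHA-256 hex digest.
            op.add_column("media", sa.Column("content_hash", sa.String(64), nullable=True))
            op.create_index("ix_media_content_hash", "media", ["content_hash"])

    def downgrade() -> None:
        op.drop_index("ix_media_content_hash", table_name="media")
        op.drop_column("media", "content_hash")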

How it works

  1. After downloading a new file to _shared/, compute its SHA-256 hash (sketched after this list)
  2. Look up the hash in the media table via find_media_by_content_hash()
  3. If a match exists and the shared file is present on disk, delete the new download and symlink to the existing file instead
  4. Store the content_hash on every media record for future lookups
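
A minimal sketch of the hashing step, based on the walkthrough's description of compute_file_hash (SHA-256 streamed in 64 KB chunks, None on I/O errors); the real helper in src/message_utils.py also handles symlinks:

    import hashlib

    def compute_file_hash(path: str, chunk_size: int = 64 * 1024) -> str | None:
        """SHA-256 hex digest of a file, streamed in 64 KB chunks (sketch)."""
        sha = hashlib.sha256()
        try:
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    sha.update(chunk)
        except OSError:
            return None  # unreadable file: caller skips hash-based dedup
        return sha.hexdigest()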

Test plan

  • All 1163 non-web tests pass locally
  • ruff check . and ruff format --check . pass
  • CI lint + test workflows pass
  • Verify migration runs cleanly on fresh PostgreSQL database
  • Verify migration runs cleanly on fresh SQLite database
  • Verify pre-Alembic database stamps correctly at version 011 when the content_hash column exists (see the detection sketch below)
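
For the stamping check, a hedged Python equivalent of the SQLite-side detection (the real logic lives in scripts/entrypoint.sh; the function name here is illustrative):

    import sqlite3

    def should_stamp_at_011(db_path: str) -> bool:
        """True if a pre-Alembic SQLite DB already has media.content_hash."""
        con = sqlite3.connect(db_path)
        try:
            # Row layout of PRAGMA table_info: (cid, name, type, ...).
            cols = {row[1] for row in con.execute("PRAGMA table_info(media)")}
            return "content_hash" in cols
        finally:
            con.close()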

Closes #138

Summary by CodeRabbit

Release Notes

  • New Features

    • Added content hash-based deduplication for media files to prevent storing duplicate content across different downloads.
    • Improved database migration version detection for better upgrade handling.
  • Tests

    • Updated media handling tests to reflect new deduplication behavior.

coderabbitai (bot) commented May 3, 2026

Warning

Rate limit exceeded

@GeiserX has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 33 minutes before requesting another review.

To keep reviews running without waiting, you can enable the usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 560f85f0-45bc-4fe9-b5e8-f5c1d9aa25d4

📥 Commits

Reviewing files that changed from the base of the PR and between 29a1c50 and ce35180.

⛔ Files ignored due to path filters (1)
  • alembic/versions/20260503_011_add_media_content_hash.py is excluded by !alembic/versions/**
📒 Files selected for processing (10)
  • .coderabbit.yaml
  • CLAUDE.md
  • scripts/entrypoint.sh
  • src/db/adapter.py
  • src/db/models.py
  • src/listener.py
  • src/message_utils.py
  • src/telegram_backup.py
  • tests/test_listener_extended.py
  • tests/test_symlink_dedup.py
📝 Walkthrough

This PR implements content-hash based media deduplication to prevent re-downloading identical files with different message IDs. It adds a content_hash field to the Media model, computes SHA-256 hashes for downloaded files, and uses them to reuse existing media instead of creating duplicates.

Changes

Content-Hash Media Deduplication

  • Database Schema (src/db/models.py): Added nullable `content_hash: Mapped[str | None]` column (indexed) to the Media model.
  • Database Adapter (src/db/adapter.py): New find_media_by_content_hash method queries media by hash; insert_media now stores the content_hash field.
  • Migration Detection (scripts/entrypoint.sh, .coderabbit.yaml): Entrypoint detects the migration 011 schema (presence of media.content_hash) for both PostgreSQL and SQLite to stamp the alembic version correctly. Configuration updated for knowledge base file patterns.
  • Hash Computation (src/message_utils.py): New compute_file_hash function computes the SHA-256 hex digest of files in 64 KB chunks, handling symlinks and OSError gracefully.
  • Deduplication Logic (src/listener.py, src/telegram_backup.py): _download_media now returns a (file_path, content_hash) tuple and checks find_media_by_content_hash to reuse existing shared files when content is identical; media records include the computed content_hash (condensed in the sketch after this list).
  • Tests (tests/test_listener_extended.py): Mock for _download_media updated to return a (file_path, content_hash) tuple instead of a string.
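
Condensing the two dedup flows above, the post-download decision looks roughly like this (the wrapper name is illustrative; error handling and the symlink fallback discussed in the review below are omitted):

    import os

    from src.message_utils import compute_file_hash  # real helper, sketched earlier

    async def resolve_shared_path(db, shared_dir: str, shared_file_path: str) -> str:
        """Return the canonical _shared path for a fresh download, dropping it
        if a byte-identical file is already recorded under another file ID."""
        content_hash = compute_file_hash(shared_file_path)
        if content_hash:
            existing = await db.find_media_by_content_hash(content_hash)
            if existing and existing.get("file_name"):
                existing_shared = os.path.join(shared_dir, existing["file_name"])
                if os.path.exists(existing_shared) and existing_shared != shared_file_path:
                    os.remove(shared_file_path)  # delete the duplicate blob
                    return existing_shared       # symlink the chat dir here instead
        return shared_file_path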

Estimated review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes



🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title accurately and concisely describes the main change: adding content-hash-based deduplication for media across different Telegram file types.
  • Description check: ✅ Passed. The description covers the key aspects: the feature's purpose, implementation approach, database changes with migration details, and a test plan. All major required sections are addressed.
  • Linked Issues check: ✅ Passed. The PR fully implements the requirements from #138: SHA-256 content hashing for deduplication, symlink logic to avoid duplicate storage, and support for SQLite deployments with the DEDUPLICATE_MEDIA flag.
  • Out of Scope Changes check: ✅ Passed. All code changes align with the stated objectives. The config update to .coderabbit.yaml and the exception-handling fix in the entrypoint are minor supporting changes that enable the core functionality.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



coderabbitai (bot) left a comment
Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
src/telegram_backup.py (3)

1509-1533: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Hash direct downloads too.

When deduplicate_media is false, this path still writes content_hash=None. That leaves scheduled-backup rows inconsistent with the listener flow and prevents later hash-based reuse if dedup is turned back on. Compute the hash from file_path before building media_data.

Proposed fix
             else:
                 # No deduplication - download directly to chat directory
                 if not os.path.exists(file_path):
                     actual_path = await self.client.download_media(message, file_path)
                     if actual_path and isinstance(actual_path, str):
                         file_path = actual_path
                     logger.debug(f"Downloaded media: {file_name}")
 
                 # Update file_size with actual size from disk
                 if os.path.exists(file_path):
                     file_size = os.path.getsize(file_path)
+                    content_hash = compute_file_hash(file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/telegram_backup.py` around lines 1509 - 1533, When deduplicate_media is
False the code leaves content_hash as None; after the download branch that may
update file_path (the await self.client.download_media(...) block) compute the
content hash from the downloaded file_path (only if the file exists) and set
content_hash before constructing media_data so scheduled-backup rows include the
actual hash; update the logic around file_path/content_hash prior to the
media_data dict creation (referencing file_path, content_hash, download_media /
self.client.download_media, and media_data).

1477-1501: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't move a reused shared file out of _shared.

After a content-hash hit, shared_file_path may point at an older canonical blob. If os.symlink() then fails, shutil.move(shared_file_path, file_path) removes that canonical file from _shared, which can leave existing symlinks dangling and makes later dedup miss. Copy in this branch, or only move files created by this call. The same fallback exists in src/listener.py.

Proposed fix
+                        reused_existing_shared = False
                         content_hash = compute_file_hash(shared_file_path)
                         if content_hash and hasattr(self, "db"):
                             existing = await self.db.find_media_by_content_hash(content_hash)
                             if existing and existing.get("file_name"):
                                 existing_shared = os.path.join(shared_dir, existing["file_name"])
                                 if os.path.exists(existing_shared) and existing_shared != shared_file_path:
                                     os.remove(shared_file_path)
                                     shared_file_path = existing_shared
+                                    reused_existing_shared = True
                                     logger.debug(
                                         f"Content-hash dedup: {file_name} matches existing {existing['file_name']}"
                                     )
@@
                         except OSError as e:
                             # Symlink not supported (e.g., Windows) - move file to chat dir instead
                             logger.warning(f"Symlink not supported, using direct path: {e}")
                             import shutil
 
-                            shutil.move(shared_file_path, file_path)
+                            if reused_existing_shared:
+                                shutil.copy2(shared_file_path, file_path)
+                            else:
+                                shutil.move(shared_file_path, file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/telegram_backup.py` around lines 1477 - 1501, The fallback branch
currently moves shared_file_path into file_path which deletes canonical blobs
when a content-hash hit reused an existing file; change the fallback to copy
instead of move when shared_file_path points into the shared_dir or when the
file was resolved via db.find_media_by_content_hash (i.e. after
compute_file_hash and existing lookup), so that existing canonical files are
preserved; only perform shutil.move if the file was created by this call
(detectable by a local-temp path or a flag you set when writing the file),
otherwise use shutil.copy2(shared_file_path, file_path) and preserve
permissions/mtime.

101-104: ⚠️ Potential issue | 🔴 Critical

Fix Python 3 exception syntax at line 103.

except ValueError, TypeError: is invalid Python 3 syntax and prevents the module from importing. Change to tuple syntax: except (ValueError, TypeError):.

Proposed fix
-    except ValueError, TypeError:
+    except (ValueError, TypeError):
         log_threshold_seconds = 10
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/telegram_backup.py` around lines 101 - 104, The except clause in the
log_threshold_seconds parsing block uses Python 2 syntax (`except ValueError,
TypeError:`) which breaks imports; update the exception handling in that
try/except around int(os.getenv("FLOOD_WAIT_LOG_THRESHOLD", "10")) so it catches
both exceptions using tuple syntax `except (ValueError, TypeError):`, leaving
the fallback assignment to log_threshold_seconds = 10 unchanged.
src/listener.py (1)

910-923: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist file_name for listener-created media rows.

find_media_by_content_hash() rebuilds the canonical _shared path from Media.file_name. These inserts leave file_name null, so a listener-downloaded row can be selected for a hash hit and then treated as unusable, which makes later backup/listener dedup miss. Return file_name from _download_media() and store it here.

Proposed fix
-    async def _download_media(self, message, chat_id: int) -> tuple[str, str | None] | None:
+    async def _download_media(self, message, chat_id: int) -> tuple[str, str, str | None] | None:
@@
-            return f"{self.config.media_path}/{chat_id}/{file_name}", file_content_hash
+            return f"{self.config.media_path}/{chat_id}/{file_name}", file_name, file_content_hash
@@
-                            if download_result:
-                                media_path, content_hash = download_result
+                            if download_result:
+                                media_path, file_name, content_hash = download_result
                                 # Create media record (FK to messages now satisfied)
                                 media_id = f"{chat_id}_{message.id}_{media_type}"
                                 await self.db.insert_media(
                                     {
                                         "id": media_id,
                                         "message_id": message.id,
                                         "chat_id": chat_id,
                                         "type": media_type,
+                                        "file_name": file_name,
                                         "file_path": media_path,
                                         "content_hash": content_hash,
                                         "downloaded": True,
                                         "download_date": datetime.utcnow(),
                                     }
                                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/listener.py` around lines 910 - 923, The listener inserts downloaded
media rows without setting Media.file_name which breaks
find_media_by_content_hash; update _download_media to return file_name (in
addition to media_path and content_hash), change the unpacking of
download_result in listener.py to media_path, content_hash, file_name, and
include "file_name": file_name in the dict passed to self.db.insert_media (keep
media_id, message_id, chat_id, type, file_path, content_hash, downloaded as-is)
so listener-created rows have file_name populated for canonical path
reconstruction.
tests/test_listener_extended.py (1)

1612-1638: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add content_hash assertion to complete the coverage for this feature.

The mock now returns (path, content_hash) but the test only asserts on file_path and downloaded. If the listener silently drops the hash when calling insert_media, this test won't catch it — and persisting content_hash is the core objective of this PR.

✅ Proposed assertion addition
 media_data = db.insert_media.call_args[0][0]
 assert media_data["file_path"] == "/tmp/media/-100/photo.jpg"
 assert media_data["downloaded"] is True
+assert media_data.get("content_hash") is None  # None passed from mock tuple's second element
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_listener_extended.py` around lines 1612 - 1638, The test mocks
listener._download_media to return ("/tmp/media/-100/photo.jpg",
"expected_hash") but never asserts that the listener forwards the content_hash
into db.insert_media; update the test after calling handler(event) to check
media_data = db.insert_media.call_args[0][0] and add an assertion that
media_data["content_hash"] == "expected_hash" (ensuring the mocked content_hash
value is used), leaving the existing assertions for file_path and downloaded
intact and referencing listener._download_media, handler, and db.insert_media to
locate the relevant lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.coderabbit.yaml:
- Around line 63-64: Remove the duplicate list entry "CLAUDE.md" so the YAML
list contains a single unique "CLAUDE.md" line; locate the duplicate strings
"CLAUDE.md" in the list and delete one of them (leaving one entry), ensuring
list formatting/indentation remains valid YAML.

In `@src/db/adapter.py`:
- Around line 677-689: The current find_media_by_content_hash lookup is
vulnerable to race conditions because both downloaders write a _shared file then
run this unlocked SELECT; instead implement a hash-scoped canonicalization
around winner selection: in find_media_by_content_hash (or a new helper used by
both download paths) acquire a lock keyed by Media.content_hash (e.g., DB
advisory lock or a dedicated canonicalization table row locked with SELECT ...
FOR UPDATE inside a transaction), then re-check/select the Media row with
Media.content_hash and Media.downloaded inside that lock/transaction; if absent,
allow the caller designated as winner to insert or upsert a canonical Media
record (using INSERT ... ON CONFLICT or an explicit insert-after-lock) and
return its file_path/file_name/content_hash, otherwise return the existing
row—ensure both download paths use this locked canonicalization function so only
one shared file is adopted per content_hash.

---

Outside diff comments:
In `@src/listener.py`:
- Around line 910-923: The listener inserts downloaded media rows without
setting Media.file_name which breaks find_media_by_content_hash; update
_download_media to return file_name (in addition to media_path and
content_hash), change the unpacking of download_result in listener.py to
media_path, content_hash, file_name, and include "file_name": file_name in the
dict passed to self.db.insert_media (keep media_id, message_id, chat_id, type,
file_path, content_hash, downloaded as-is) so listener-created rows have
file_name populated for canonical path reconstruction.

In `@src/telegram_backup.py`:
- Around line 1509-1533: When deduplicate_media is False the code leaves
content_hash as None; after the download branch that may update file_path (the
await self.client.download_media(...) block) compute the content hash from the
downloaded file_path (only if the file exists) and set content_hash before
constructing media_data so scheduled-backup rows include the actual hash; update
the logic around file_path/content_hash prior to the media_data dict creation
(referencing file_path, content_hash, download_media /
self.client.download_media, and media_data).
- Around line 1477-1501: The fallback branch currently moves shared_file_path
into file_path which deletes canonical blobs when a content-hash hit reused an
existing file; change the fallback to copy instead of move when shared_file_path
points into the shared_dir or when the file was resolved via
db.find_media_by_content_hash (i.e. after compute_file_hash and existing
lookup), so that existing canonical files are preserved; only perform
shutil.move if the file was created by this call (detectable by a local-temp
path or a flag you set when writing the file), otherwise use
shutil.copy2(shared_file_path, file_path) and preserve permissions/mtime.
- Around line 101-104: The except clause in the log_threshold_seconds parsing
block uses Python 2 syntax (`except ValueError, TypeError:`) which breaks
imports; update the exception handling in that try/except around
int(os.getenv("FLOOD_WAIT_LOG_THRESHOLD", "10")) so it catches both exceptions
using tuple syntax `except (ValueError, TypeError):`, leaving the fallback
assignment to log_threshold_seconds = 10 unchanged.

In `@tests/test_listener_extended.py`:
- Around line 1612-1638: The test mocks listener._download_media to return
("/tmp/media/-100/photo.jpg", "expected_hash") but never asserts that the
listener forwards the content_hash into db.insert_media; update the test after
calling handler(event) to check media_data = db.insert_media.call_args[0][0] and
add an assertion that media_data["content_hash"] == "expected_hash" (ensuring
the mocked content_hash value is used), leaving the existing assertions for
file_path and downloaded intact and referencing listener._download_media,
handler, and db.insert_media to locate the relevant lines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: df26cb70-2164-4b3c-ac5e-4660b67a2cdd

📥 Commits

Reviewing files that changed from the base of the PR and between 91ccfd3 and 29a1c50.

⛔ Files ignored due to path filters (1)
  • alembic/versions/20260503_011_add_media_content_hash.py is excluded by !alembic/versions/**
📒 Files selected for processing (9)
  • .coderabbit.yaml
  • CLAUDE.md
  • scripts/entrypoint.sh
  • src/db/adapter.py
  • src/db/models.py
  • src/listener.py
  • src/message_utils.py
  • src/telegram_backup.py
  • tests/test_listener_extended.py

Comment thread .coderabbit.yaml
Comment on lines +63 to 64
- "CLAUDE.md"
- "CLAUDE.md"

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove duplicate CLAUDE.md entry.

Line 64 duplicates line 63.

🔧 Proposed fix
     filePatterns:
       - "CLAUDE.md"
-      - "CLAUDE.md"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.coderabbit.yaml around lines 63 - 64, Remove the duplicate list entry
"CLAUDE.md" so the YAML list contains a single unique "CLAUDE.md" line; locate
the duplicate strings "CLAUDE.md" in the list and delete one of them (leaving
one entry), ensuring list formatting/indentation remains valid YAML.

Comment thread src/db/adapter.py
Comment on lines +677 to +689
async def find_media_by_content_hash(self, content_hash: str) -> dict[str, Any] | None:
    """Find an existing downloaded media record with the given SHA-256 content hash."""
    async with self.db_manager.async_session_factory() as session:
        stmt = select(Media).where(and_(Media.content_hash == content_hash, Media.downloaded == 1)).limit(1)
        result = await session.execute(stmt)
        media = result.scalar_one_or_none()
        if media is None:
            return None
        return {
            "file_path": media.file_path,
            "file_name": media.file_name,
            "content_hash": media.content_hash,
        }

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Hash lookup alone won't stop concurrent duplicate blobs.

Both download paths write a _shared/... file first and only then call this unlocked SELECT ... LIMIT 1. If the backup and listener download the same bytes at the same time, they can both miss here and keep separate shared files under different Telegram IDs, so disk dedup still fails. This needs a hash-scoped lock or another canonicalization step around winner selection, not just a post-download lookup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/db/adapter.py` around lines 677 - 689, The current
find_media_by_content_hash lookup is vulnerable to race conditions because both
downloaders write a _shared file then run this unlocked SELECT; instead
implement a hash-scoped canonicalization around winner selection: in
find_media_by_content_hash (or a new helper used by both download paths) acquire
a lock keyed by Media.content_hash (e.g., DB advisory lock or a dedicated
canonicalization table row locked with SELECT ... FOR UPDATE inside a
transaction), then re-check/select the Media row with Media.content_hash and
Media.downloaded inside that lock/transaction; if absent, allow the caller
designated as winner to insert or upsert a canonical Media record (using INSERT
... ON CONFLICT or an explicit insert-after-lock) and return its
file_path/file_name/content_hash, otherwise return the existing row—ensure both
download paths use this locked canonicalization function so only one shared file
is adopted per content_hash.
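
One possible shape for that hash-scoped lock, as a PostgreSQL-only sketch (SQLite would need a different mechanism, e.g. a process-level lock; the function name is illustrative and the Media import path is assumed):

    from sqlalchemy import select, text

    from src.db.models import Media  # assumed module path

    async def find_or_claim_canonical(session, content_hash: str):
        """Re-check for an existing blob while holding a per-hash advisory lock.

        Must run inside an open transaction: pg_advisory_xact_lock is released
        only when that transaction commits or rolls back.
        """
        # Fold the 64-char hex digest into a signed 64-bit advisory-lock key.
        key = int.from_bytes(bytes.fromhex(content_hash)[:8], "big", signed=True)
        await session.execute(text("SELECT pg_advisory_xact_lock(:key)"), {"key": key})
        # Re-check under the lock: only one caller per hash can miss here,
        # and that caller becomes the winner that adopts the shared file.
        stmt = (
            select(Media)
            .where(Media.content_hash == content_hash, Media.downloaded == 1)
            .limit(1)
        )
        return (await session.execute(stmt)).scalar_one_or_none()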

GeiserX added 3 commits May 3, 2026 22:36
Telegram assigns different file IDs when the same file is sent via
different methods (streaming vs document). The existing filename-based
dedup misses these. SHA-256 content hashing after download catches
identical binary content regardless of Telegram file ID.

- Add content_hash column to Media model with index
- Add Alembic migration 011 (idempotent, PostgreSQL + SQLite)
- Add compute_file_hash() utility in message_utils
- Add find_media_by_content_hash() DB lookup
- Integrate content-hash dedup in both backup and listener flows
- Update entrypoint.sh stamping logic for migration 011

Closes #138

- Move content-hash dedup into shared deduplicate_shared_file() in
  message_utils.py (DRY: both backup and listener use it)
- Add path traversal guard via realpath containment check
- Handle TOCTOU race on os.remove with FileNotFoundError catch
- Return reused flag so callers use copy2 instead of move for
  canonical blobs (prevents destroying shared store entries)
- Return file_name from listener's _download_media for DB consistency
- Compute content_hash in backup's no-dedup branch for completeness
- Fix test mocks to supply db.find_media_by_content_hash
GeiserX force-pushed the fix/content-hash-dedup branch from 36e45a2 to ce35180 on May 3, 2026 20:38
github-actions (bot) commented May 3, 2026

🐳 Dev images published!

  • drumsergio/telegram-archive:dev
  • drumsergio/telegram-archive-viewer:dev

The dev/test instance will pick up these changes automatically (Portainer GitOps).

To test locally:

docker pull drumsergio/telegram-archive:dev
docker pull drumsergio/telegram-archive-viewer:dev

codecov (bot) commented May 3, 2026

Codecov Report

❌ Patch coverage is 66.19718% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.07%. Comparing base (91ccfd3) to head (ce35180).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/message_utils.py 60.60% 13 Missing ⚠️
src/db/adapter.py 12.50% 7 Missing ⚠️
src/listener.py 89.47% 2 Missing ⚠️
src/telegram_backup.py 80.00% 2 Missing ⚠️

❌ Your patch status has failed because the patch coverage (66.19%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
- Coverage   94.41%   94.07%   -0.34%     
==========================================
  Files          21       21              
  Lines        6066     6127      +61     
==========================================
+ Hits         5727     5764      +37     
- Misses        339      363      +24     
Files with missing lines Coverage Δ
src/db/models.py 100.00% <100.00%> (ø)
src/listener.py 97.99% <89.47%> (-0.26%) ⬇️
src/telegram_backup.py 96.10% <80.00%> (-0.16%) ⬇️
src/db/adapter.py 88.78% <12.50%> (-0.76%) ⬇️
src/message_utils.py 67.50% <60.60%> (-32.50%) ⬇️

GeiserX merged commit af26914 into main on May 3, 2026
10 of 11 checks passed
GeiserX deleted the fix/content-hash-dedup branch on May 3, 2026 20:40
GeiserX mentioned this pull request on May 3, 2026 (2 tasks)
GeiserX pushed a commit that referenced this pull request May 7, 2026
The "do I already have this media?" check in `_process_media` and the
listener's `_download_media` used `os.path.exists`, which follows the
symlink chain to its ultimate target. For archived layouts where media
files are symlinks (e.g. into `.git/annex/objects/...`), the target may
be unreachable from the running process -- typical when the Docker /
Podman bind mount only covers the working tree but not `.git/`. In that
case `os.path.exists` returns False even though the symlink is intact,
and the script enters the "first-time download" branch:

  1. `_shared/<name>` is overwritten via atomic rename (.part -> .),
     replacing a git-annex symlink with a freshly-downloaded regular
     file (visible as `typechange:` in `git status`).
  2. Content-hash dedup (#139) then queries the DB for a SHA-256 match
     and may rewrite the chat-dir symlink to point at a different
     canonical blob processed earlier in the same run -- a non-
     deterministic mutation of the working tree across reruns.

Switch the gate to `os.path.lexists`, which is True for any symlink
regardless of target validity. A previously recorded symlink now
short-circuits the entire download path, preserving the user's archived
layout byte-for-byte. Hashing of the shared blob is gated behind
`os.path.exists` so we don't crash on broken links.

Mirrors the change in both the scheduled backup flow
(`src/telegram_backup.py`) and the real-time listener
(`src/listener.py`).

Existing tests that asserted the old "replace dangling symlink" behavior
have been updated to assert the new "preserve existing symlink" contract,
which is what idempotent rerun requires.

Refs: #143

Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
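
The exists/lexists distinction this commit relies on, in a self-contained example (the paths are illustrative):

    import os
    import tempfile

    d = tempfile.mkdtemp()
    link = os.path.join(d, "video.mp4")
    # Dangling symlink, like an annex target outside the bind mount:
    os.symlink(os.path.join(d, "missing-annex-object"), link)

    print(os.path.exists(link))   # False: follows the link to a missing target
    print(os.path.lexists(link))  # True: the symlink itself is intact, so skip re-download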