Skip to content

ENH: idempotent rerun on git-annex'ed archives (symlinks etc)#146

Merged
GeiserX merged 5 commits intoGeiserX:mainfrom
yarikoptic:enh-idempotent-rerun
May 7, 2026
Merged

ENH: idempotent rerun on git-annex'ed archives (symlinks etc)#146
GeiserX merged 5 commits intoGeiserX:mainfrom
yarikoptic:enh-idempotent-rerun

Conversation

@yarikoptic-gitmate
Copy link
Copy Markdown

@yarikoptic-gitmate yarikoptic-gitmate commented May 6, 2026

Re-runs of python -m src.telegram_backup against an archived media tree
(committed to git-annex / DataLad, where files are symlinks pointing into
.git/annex/objects/) currently mutate the working tree on every run:

  • _shared/<name> symlinks get atomic-renamed and become regular files
    (visible as typechange in git status),
  • chat-dir symlinks get repointed at different _shared/ blobs across
    runs because content-hash dedup (feat: content-hash dedup for cross-type media duplicates #139) picks a non-deterministic
    canonical name from the cross-chat duplicate set.

Root cause: every "do I already have this?" gate uses os.path.exists,
which follows symlinks to the bottom. When the ultimate target lives
outside the running process's filesystem view (e.g. the bind mount only
covers the working tree, not .git/), the gate concludes the file is
missing and triggers a re-download that overwrites the symlink. See
#143 for the full diagnosis with line references.

Changes

Each commit is a single conceptual change so they can be reviewed (or
backed out) independently.

  1. fix(media): trust existing symlinks via os.path.lexists in dedup gate — switches both arms of the dedup gate (chat path and _shared/-path) from os.path.exists to os.path.lexists in _process_media (src/telegram_backup.py) and the listener twin in src/listener.py. Updates the existing dangling-symlink tests in tests/test_symlink_dedup.py and the listener-side tests in tests/test_listener_extended.py to assert the new "preserve existing symlink" contract instead of the old "replace dangling symlink" one.
  2. fix(avatars): trust existing avatar symlinks via os.path.lexists — same gate change in _ensure_profile_photo and the listener's _download_avatar. Adds a regression test (test_existing_symlink_avatar_is_preserved) that reproduces the exact [Errno 2] No such file or directory log line from the issue.
  3. fix(verify): preserve symlinks during VERIFY_MEDIA scans — Phase 1 of _verify_and_redownload_media now uses lexists for missing-file detection and short-circuits on os.path.islink, so an intact symlink is never re-downloaded. Truly absent paths still trigger recovery. New tests cover both branches.
  4. docs(readme): note bind-mount caveat for git-annex / DataLad backups — adds a Troubleshooting row + a short subsection showing the recommended -v $PWD:/repo + -w /repo/ses-acct mount layout so the symlink targets resolve from inside the container.

Test plan

  • python -m pytest tests/ (excluding the FastAPI/pydantic-only viewer suites, same as CI scope) — 1419 passed, 153 warnings, 6 subtests passed
  • ruff check . && ruff format --check . — clean
  • Existing dedup / dangling-symlink / verify tests updated to assert the new contract; no test asserts both old and new behaviors.
  • Reporter (@yarikoptic) confirms an idempotent rerun against the original archived layout from [Question]: Account for previously downloaded (but now symlinks) files #143.
    • yet to do after recreating db

Notes for review

Summary by CodeRabbit

  • Documentation

    • Added a new troubleshooting subsection for git-annex/DataLad layouts and moved PostgreSQL duplicate-key guidance into its own troubleshooting subsection.
  • Bug Fixes

    • Improved idempotency by trusting existing symlinks and avoiding redundant media/avatar re-downloads; safer handling of deduplicated/shared media.
  • Tests

    • Expanded and hardened tests to cover symlink-preserving behavior, deduplication paths, avatar handling, and listener/backup lifecycle cases.

yarikoptic and others added 4 commits May 5, 2026 17:02
The "do I already have this media?" check in `_process_media` and the
listener's `_download_media` used `os.path.exists`, which follows the
symlink chain to its ultimate target. For archived layouts where media
files are symlinks (e.g. into `.git/annex/objects/...`), the target may
be unreachable from the running process -- typical when the Docker /
Podman bind mount only covers the working tree but not `.git/`. In that
case `os.path.exists` returns False even though the symlink is intact,
and the script enters the "first-time download" branch:

  1. `_shared/<name>` is overwritten via atomic rename (.part -> .),
     replacing a git-annex symlink with a freshly-downloaded regular
     file (visible as `typechange:` in `git status`).
  2. Content-hash dedup (GeiserX#139) then queries the DB for a SHA-256 match
     and may rewrite the chat-dir symlink to point at a different
     canonical blob processed earlier in the same run -- a non-
     deterministic mutation of the working tree across reruns.

Switch the gate to `os.path.lexists`, which is True for any symlink
regardless of target validity. A previously recorded symlink now
short-circuits the entire download path, preserving the user's archived
layout byte-for-byte. Hashing of the shared blob is gated behind
`os.path.exists` so we don't crash on broken links.

Mirrors the change in both the scheduled backup flow
(`src/telegram_backup.py`) and the real-time listener
(`src/listener.py`).

Existing tests that asserted the old "replace dangling symlink" behavior
have been updated to assert the new "preserve existing symlink" contract,
which is what idempotent rerun requires.

Refs: GeiserX#143

Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an avatar file in `media/avatars/{users,chats}/` is a symlink whose
ultimate target is unreachable from the running process (e.g. a
git-annex object outside the bind mount), `os.path.exists(avatar_path)`
returns False, so the previous "needs_download" gate concluded the
avatar was missing and called `download_profile_photo(file=avatar_path)`.
That call followed the symlink while opening for write and surfaced the
inner ENOENT as:

    Failed to download avatar for <id>: [Errno 2] No such file or
    directory: '/data/backups/media/avatars/chats/<id>_<photo_id>.jpg'

Switch the gate to `os.path.lexists`. A symlink already in place is
trusted unconditionally; a regular file is still re-downloaded only
when it is zero-byte (interrupted prior download). Mirrors the dedup-
gate change in the previous commit and applies to both the scheduled
backup (`_ensure_profile_photo`) and the listener's `_download_avatar`.

Refs: GeiserX#143

Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_verify_and_redownload_media` checks every previously recorded media
file with `os.path.exists`, which follows symlink targets. For archived
layouts where the actual content is held in an external store
(git-annex `objects/`, lazy mount, separate filesystem) the symlink may
be intact while its target is unreachable from the running process.
The old behavior flagged such entries as "missing" and re-downloaded,
atomic-renaming a freshly-downloaded regular file on top of the user's
symlink and mutating the working tree (issue GeiserX#143).

Two-part fix in the Phase 1 detection loop:

1. Use `os.path.lexists` to detect "truly missing" entries. A symlink
   in place -- even a broken one -- is no longer added to the
   missing-files list.
2. After the lexists check, short-circuit on `os.path.islink`. Symlink
   contents are managed externally, so we cannot meaningfully verify
   size or emptiness without following the link, and we will not
   replace a symlink we did not write.

Truly absent paths (no entry on disk at all) are still re-downloaded.
The cleanup loop in Phase 2 already used `os.path.lexists` since GeiserX#119,
so the two phases are now consistent.

Refs: GeiserX#143

Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backups committed to git-annex (or DataLad) end up as symlinks pointing
into the repository's annex object store. Code changes in this series
make the backup process trust those symlinks and skip re-downloads, but
content-hash dedup can only match against existing `_shared/` blobs
when their targets are reachable from the running process. If only the
working tree is exposed to the backup (e.g. via a narrow bind mount),
the annex object store sits outside that view and dedup misses.

Add a Troubleshooting row + a short, deployment-agnostic subsection
explaining the constraint and the general remedy (expose the
repository root, then point the data path at the session subdirectory
inside it) without prescribing a specific runtime.

Refs: GeiserX#143

Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 6, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR updates backup and listener code to use os.path.lexists() so existing chat-directory symlinks (including dangling git-annex/DataLad layouts) are treated as authoritative during verification and deduplication; README troubleshooting was added and tests expanded to cover lexists-based behavior.

Changes

Symlink-Aware Deduplication & Idempotent Re-runs

Layer / File(s) Summary
Core Verification & Dedup
src/telegram_backup.py
Verification and download guards now use os.path.lexists; skip size/hash for symlinks and compute content hashes only when shared targets are resolvable.
Listener Logic
src/listener.py
Avatar and media download guards switched to os.path.lexists; dedup path checks shared-file resolvability and avoids clobbering existing symlinks.
Documentation
README.md
Added “git-annex / DataLad layouts” Troubleshooting subsection and moved PostgreSQL duplicate-key guidance into its own subsection.
Listener Tests
tests/test_listener_extended.py
Refactored existence checks to deterministic downloaded-flag gating and updated mocks to use lexists/islink/getsize.
Symlink Dedup Tests
tests/test_symlink_dedup.py
Added/rewrote tests asserting existing chat-dir symlinks (including dangling) are preserved and verification trusts them; retained fallback-on-symlink-creation-failure cases.
Profile-photo Tests
tests/test_telegram_backup_extended.py
Added tests ensuring dangling avatar symlinks are preserved and empty files trigger a download.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • GeiserX/Telegram-Archive#119: Also switches checks to os.path.lexists() and refactors symlink creation and filename capture for dedup handling.
  • GeiserX/Telegram-Archive#139: Modifies the same deduplication and download logic in src/listener.py and src/telegram_backup.py, including content-hash usage and shared-storage handling.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 73.53% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title directly describes the main enhancement: enabling idempotent reruns for git-annex archives by using symlink-aware file checks.
Description check ✅ Passed Description comprehensively covers the issue, root cause, changes across 4 commits, test results, and deployment notes; follows template structure with all key sections present.
Linked Issues check ✅ Passed All changes directly address issue #143 objectives: replacing os.path.exists with os.path.lexists in media/avatar dedup gates and verification to preserve symlinks and prevent spurious re-downloads when targets are unreachable.
Out of Scope Changes check ✅ Passed All changes are scoped to symlink handling (media dedup, avatars, verify phase) and documentation; no unrelated refactoring or feature creep detected beyond issue #143 objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@GeiserX
Copy link
Copy Markdown
Owner

GeiserX commented May 6, 2026

Multi-Agent Review — Consolidated Findings

Reviewed from 11 specialized angles (architecture, security, code quality, testing, debugging, verification, backward compatibility, tracing, documentation, scientific analysis, general-purpose).

Verdict: ✅ APPROVE (with minor suggestions)

No blocking issues. The lexists migration is correct and well-motivated — os.path.exists() returns False for symlinks to missing targets, causing spurious re-downloads that break deduplication.


Severity-Rated Findings

MEDIUM (edge cases, non-blocking)

  1. VERIFY_MEDIA silently skips all symlinks_verify_and_redownload_media now does if os.path.islink(file_path): continue, meaning dangling symlinks (genuinely broken dedup references) are never flagged or repaired during verification. Consider a logger.debug or optional VERIFY_SYMLINKS=true env var for users who want to audit broken links.

  2. Cross-chat dedup can create dangling symlinks — When a file is first seen in chat B but the _shared/ canonical copy was created by chat A, the symlink in chat B points to _shared/. If the user later deletes chat A's backup directory (which contains _shared/), chat B's symlink becomes dangling. This is an inherent tradeoff of the dedup design, not a bug in this PR — but worth a brief README note.

LOW (style/docs, no functional impact)

  1. PR title uses ENH: prefix — Project convention is Conventional Commits (fix:, feat:, etc.). This is a bugfix (prevents spurious re-downloads), so fix: would be more accurate.

  2. _ensure_profile_photo restructured logic — The old needs_download variable pattern was replaced with lexists + islink or getsize > 0. The new logic is correct but slightly harder to follow at a glance. A one-line comment like # Symlink is valid regardless of target existence could help future readers.

  3. Test assertionstest_symlink_dedup.py thoroughly covers the happy path (symlink preserved, no re-download). Could benefit from one negative test: what happens when lexists is True but the file is neither a symlink nor >0 bytes (empty regular file)? Current code already handles this correctly (falls through to download), but an explicit test would document the behavior.


Suggestions (non-blocking)

  • Consider adding a brief note to the README's VERIFY_MEDIA section clarifying that symlinked media is intentionally skipped during verification (since the canonical copy is the source of truth).
  • For the _shared/ dangling symlink scenario: a logger.warning when creating a symlink whose target doesn't exist yet would help operators diagnose issues without adding runtime cost.

What Looks Good

  • Consistent lexists migration across both telegram_backup.py and listener.py — no path was missed.
  • Tests cover the critical scenario (symlink to missing target is not re-downloaded).
  • No security implications (no path traversal, no TOCTOU race conditions introduced).
  • Backward compatible — existing non-symlink setups are unaffected (lexists behaves identically to exists for regular files).
  • Docker volume mounts unaffected (relative symlinks within the backup tree resolve correctly inside containers).

Reviewed by 11 specialized agents (architecture, security, code-review, testing, debugging, verification, backward-compat, tracing, documentation, scientific, general-purpose).

@yarikoptic
Copy link
Copy Markdown
Contributor

could other workflows run before I direct my HI + AI over responses?

Adds an explicit negative test for `_ensure_profile_photo`: when
`avatar_path` lexists as a 0-byte regular file (e.g. left by a prior
interrupted download), the gate must fall through and trigger a fresh
download rather than trust the empty entry.

The existing tests covered the symlink-preserved and missing-path
branches but not this third branch, which the multi-agent review on
PR GeiserX#146 (issuecomment-4392222710) flagged as worth documenting via
test.

Co-Authored-By: Claude Code 2.1.132 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yarikoptic-gitmate
Copy link
Copy Markdown
Author

Thanks for the consolidated multi-agent review! Going through the findings:

Medium

  1. VERIFY_MEDIA silently skips all symlinks. Agreed — and this is exactly the trade-off called out in the PR description's "Notes for review": the unconditional rescue path was the actual mutation source observed in [Question]: Account for previously downloaded (but now symlinks) files #143, so this PR makes "leave symlinks alone" the default. Re-introducing dangling-symlink recovery behind an opt-in env var (e.g. VERIFY_MEDIA_REDOWNLOAD_DANGLING=true, default false) is intentionally being held as a follow-up so that this PR stays narrowly scoped to the bug fix. Happy to file the follow-up issue right after this lands; the audit-only logger.debug is also a small enough nudge that I can fold it in here if you'd prefer.

  2. Cross-chat dedup can produce dangling symlinks if a chat directory is later deleted. Real concern, but it pre-dates this PR and is a property of the dedup design from feat: content-hash dedup for cross-type media duplicates #139, not the lexists migration. I'd rather document it in a focused follow-up alongside feat: content-hash dedup for cross-type media duplicates #139's design notes than expand the README in this PR. Happy to file a tracking issue.

Low

  1. PR title prefix ENH: vs Conventional fix:. Fair — the four commits already use fix: / docs:. Renaming the PR title to fix: idempotent rerun on git-annex'ed archives (symlinks etc).

  2. One-line comment in _ensure_profile_photo. There's already a 7-line comment block immediately above the gate (src/telegram_backup.py:1359-1370) explaining lexists semantics, the broken-but-intentional symlink case, and the ENOENT failure mode from [Question]: Account for previously downloaded (but now symlinks) files #143 — I think a one-liner on top of that block would actually be redundant. Open to a more specific suggestion if you had a particular phrasing in mind.

  3. Negative test for empty-regular-file fallthrough. Good catch — the existing tests covered the symlink-preserved and missing-path branches but not the "lexists True, neither symlink nor non-empty" branch that falls through to download. Added in 3ef30ab test(avatars): cover empty-regular-file fallthrough in lexists gate (a new test_empty_regular_file_avatar_triggers_download in tests/test_telegram_backup_extended.py).

Non-blocking suggestions

Let me know if any of the deferred items should be folded into this PR after all.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
tests/test_listener_extended.py (1)

786-835: ⚡ Quick win

Make remaining avatar-path tests patch os.path.lexists instead of os.path.exists.

After the production guard switched to lexists, tests still patching exists don’t control the branch condition and can become environment-dependent.

Suggested test adjustment
-    `@patch`("os.path.exists", return_value=False)
+    `@patch`("os.path.lexists", return_value=False)
     async def test_downloads_avatar_when_file_missing(...):

-    `@patch`("os.path.exists", return_value=True)
+    `@patch`("os.path.lexists", return_value=True)
     `@patch`("os.path.getsize", return_value=0)
     async def test_downloads_avatar_when_file_empty(...):

-    `@patch`("os.path.exists", return_value=False)
+    `@patch`("os.path.lexists", return_value=False)
     async def test_handles_none_download_result(...):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_listener_extended.py` around lines 786 - 835, Update the failing
avatar-path tests to patch os.path.lexists (not os.path.exists) so the
production guard in TelegramListener._download_avatar is controlled;
specifically change the `@patch` target in
test_downloads_avatar_when_file_missing, test_downloads_avatar_when_file_empty
and test_handles_none_download_result (and any other tests in this file still
patching exists) from "os.path.exists" to "os.path.lexists" so the branch that
checks lexists is exercised deterministically and the mocked return values take
effect.
src/telegram_backup.py (1)

648-652: ⚡ Quick win

Expose skipped symlink counts in verify output.

Verify can now return “no issues found” after intentionally bypassing symlink records. Adding an aggregated skipped_symlinks count would make that new contract visible without leaking per-chat details.

Suggested diff
         missing_files = []
         corrupted_files = []
+        skipped_symlinks = 0
@@
             if os.path.islink(file_path):
+                skipped_symlinks += 1
                 continue
@@
         total_issues = len(missing_files) + len(corrupted_files)
         if total_issues == 0:
-            logger.info("✓ All media files verified - no issues found")
+            logger.info(
+                "✓ All media files verified - no issues found%s",
+                f" ({skipped_symlinks} symlink entries skipped)" if skipped_symlinks else "",
+            )
             logger.info("=" * 60)
             return
As per coding guidelines, "src/**/*.py: Never log chat IDs, topic IDs, or topic titles — log only aggregated counts to avoid PII exposure".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/telegram_backup.py` around lines 648 - 652, The verify routine currently
skips symlinks in the islink(file_path) branch without accounting for them; add
an integer counter (e.g., skipped_symlinks) in the verify function (the function
containing the os.path.islink(file_path) check) and increment it each time you
continue for a symlink, then include this aggregated count in the function's
returned result or verification summary (e.g., add "skipped_symlinks" to the
returned dict/object) so callers/logs can see the total skipped without exposing
per-chat or per-file details; ensure any logging uses only the aggregated count
(no chat IDs, topic IDs, or filenames).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/telegram_backup.py`:
- Around line 648-652: The verify routine currently skips symlinks in the
islink(file_path) branch without accounting for them; add an integer counter
(e.g., skipped_symlinks) in the verify function (the function containing the
os.path.islink(file_path) check) and increment it each time you continue for a
symlink, then include this aggregated count in the function's returned result or
verification summary (e.g., add "skipped_symlinks" to the returned dict/object)
so callers/logs can see the total skipped without exposing per-chat or per-file
details; ensure any logging uses only the aggregated count (no chat IDs, topic
IDs, or filenames).

In `@tests/test_listener_extended.py`:
- Around line 786-835: Update the failing avatar-path tests to patch
os.path.lexists (not os.path.exists) so the production guard in
TelegramListener._download_avatar is controlled; specifically change the `@patch`
target in test_downloads_avatar_when_file_missing,
test_downloads_avatar_when_file_empty and test_handles_none_download_result (and
any other tests in this file still patching exists) from "os.path.exists" to
"os.path.lexists" so the branch that checks lexists is exercised
deterministically and the mocked return values take effect.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6fab4e41-2275-4e32-a801-d6064f85ffcb

📥 Commits

Reviewing files that changed from the base of the PR and between 8a88cb1 and 3ef30ab.

📒 Files selected for processing (6)
  • README.md
  • src/listener.py
  • src/telegram_backup.py
  • tests/test_listener_extended.py
  • tests/test_symlink_dedup.py
  • tests/test_telegram_backup_extended.py

@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.00%. Comparing base (8a88cb1) to head (3ef30ab).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #146      +/-   ##
==========================================
- Coverage   94.07%   94.00%   -0.07%     
==========================================
  Files          21       21              
  Lines        6127     6139      +12     
==========================================
+ Hits         5764     5771       +7     
- Misses        363      368       +5     
Files with missing lines Coverage Δ
src/listener.py 97.56% <100.00%> (-0.43%) ⬇️
src/telegram_backup.py 96.02% <100.00%> (-0.09%) ⬇️

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@GeiserX GeiserX merged commit 867a741 into GeiserX:main May 7, 2026
9 checks passed
GeiserX added a commit that referenced this pull request May 7, 2026
- Patch os.path.lexists (not os.path.exists) in avatar tests to match
  the production guard in _download_avatar
- Add skipped_symlinks counter to verify output so users see how many
  symlink entries were intentionally bypassed
- Bump version to 7.7.5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Question]: Account for previously downloaded (but now symlinks) files

3 participants