ENH: idempotent rerun on git-annex'ed archives (symlinks etc)#146
ENH: idempotent rerun on git-annex'ed archives (symlinks etc)#146GeiserX merged 5 commits intoGeiserX:mainfrom
Conversation
The "do I already have this media?" check in `_process_media` and the
listener's `_download_media` used `os.path.exists`, which follows the
symlink chain to its ultimate target. For archived layouts where media
files are symlinks (e.g. into `.git/annex/objects/...`), the target may
be unreachable from the running process -- typical when the Docker /
Podman bind mount only covers the working tree but not `.git/`. In that
case `os.path.exists` returns False even though the symlink is intact,
and the script enters the "first-time download" branch:
1. `_shared/<name>` is overwritten via atomic rename (.part -> .),
replacing a git-annex symlink with a freshly-downloaded regular
file (visible as `typechange:` in `git status`).
2. Content-hash dedup (GeiserX#139) then queries the DB for a SHA-256 match
and may rewrite the chat-dir symlink to point at a different
canonical blob processed earlier in the same run -- a non-
deterministic mutation of the working tree across reruns.
Switch the gate to `os.path.lexists`, which is True for any symlink
regardless of target validity. A previously recorded symlink now
short-circuits the entire download path, preserving the user's archived
layout byte-for-byte. Hashing of the shared blob is gated behind
`os.path.exists` so we don't crash on broken links.
Mirrors the change in both the scheduled backup flow
(`src/telegram_backup.py`) and the real-time listener
(`src/listener.py`).
Existing tests that asserted the old "replace dangling symlink" behavior
have been updated to assert the new "preserve existing symlink" contract,
which is what idempotent rerun requires.
Refs: GeiserX#143
Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an avatar file in `media/avatars/{users,chats}/` is a symlink whose
ultimate target is unreachable from the running process (e.g. a
git-annex object outside the bind mount), `os.path.exists(avatar_path)`
returns False, so the previous "needs_download" gate concluded the
avatar was missing and called `download_profile_photo(file=avatar_path)`.
That call followed the symlink while opening for write and surfaced the
inner ENOENT as:
Failed to download avatar for <id>: [Errno 2] No such file or
directory: '/data/backups/media/avatars/chats/<id>_<photo_id>.jpg'
Switch the gate to `os.path.lexists`. A symlink already in place is
trusted unconditionally; a regular file is still re-downloaded only
when it is zero-byte (interrupted prior download). Mirrors the dedup-
gate change in the previous commit and applies to both the scheduled
backup (`_ensure_profile_photo`) and the listener's `_download_avatar`.
Refs: GeiserX#143
Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_verify_and_redownload_media` checks every previously recorded media file with `os.path.exists`, which follows symlink targets. For archived layouts where the actual content is held in an external store (git-annex `objects/`, lazy mount, separate filesystem) the symlink may be intact while its target is unreachable from the running process. The old behavior flagged such entries as "missing" and re-downloaded, atomic-renaming a freshly-downloaded regular file on top of the user's symlink and mutating the working tree (issue GeiserX#143). Two-part fix in the Phase 1 detection loop: 1. Use `os.path.lexists` to detect "truly missing" entries. A symlink in place -- even a broken one -- is no longer added to the missing-files list. 2. After the lexists check, short-circuit on `os.path.islink`. Symlink contents are managed externally, so we cannot meaningfully verify size or emptiness without following the link, and we will not replace a symlink we did not write. Truly absent paths (no entry on disk at all) are still re-downloaded. The cleanup loop in Phase 2 already used `os.path.lexists` since GeiserX#119, so the two phases are now consistent. Refs: GeiserX#143 Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backups committed to git-annex (or DataLad) end up as symlinks pointing into the repository's annex object store. Code changes in this series make the backup process trust those symlinks and skip re-downloads, but content-hash dedup can only match against existing `_shared/` blobs when their targets are reachable from the running process. If only the working tree is exposed to the backup (e.g. via a narrow bind mount), the annex object store sits outside that view and dedup misses. Add a Troubleshooting row + a short, deployment-agnostic subsection explaining the constraint and the general remedy (expose the repository root, then point the data path at the session subdirectory inside it) without prescribing a specific runtime. Refs: GeiserX#143 Co-Authored-By: Claude Code 2.1.128 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR updates backup and listener code to use os.path.lexists() so existing chat-directory symlinks (including dangling git-annex/DataLad layouts) are treated as authoritative during verification and deduplication; README troubleshooting was added and tests expanded to cover lexists-based behavior. ChangesSymlink-Aware Deduplication & Idempotent Re-runs
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Multi-Agent Review — Consolidated FindingsReviewed from 11 specialized angles (architecture, security, code quality, testing, debugging, verification, backward compatibility, tracing, documentation, scientific analysis, general-purpose). Verdict: ✅ APPROVE (with minor suggestions)No blocking issues. The Severity-Rated FindingsMEDIUM (edge cases, non-blocking)
LOW (style/docs, no functional impact)
Suggestions (non-blocking)
What Looks Good
Reviewed by 11 specialized agents (architecture, security, code-review, testing, debugging, verification, backward-compat, tracing, documentation, scientific, general-purpose). |
|
could other workflows run before I direct my HI + AI over responses? |
Adds an explicit negative test for `_ensure_profile_photo`: when `avatar_path` lexists as a 0-byte regular file (e.g. left by a prior interrupted download), the gate must fall through and trigger a fresh download rather than trust the empty entry. The existing tests covered the symlink-preserved and missing-path branches but not this third branch, which the multi-agent review on PR GeiserX#146 (issuecomment-4392222710) flagged as worth documenting via test. Co-Authored-By: Claude Code 2.1.132 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the consolidated multi-agent review! Going through the findings: Medium
Low
Non-blocking suggestions
Let me know if any of the deferred items should be folded into this PR after all. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/test_listener_extended.py (1)
786-835: ⚡ Quick winMake remaining avatar-path tests patch
os.path.lexistsinstead ofos.path.exists.After the production guard switched to
lexists, tests still patchingexistsdon’t control the branch condition and can become environment-dependent.Suggested test adjustment
- `@patch`("os.path.exists", return_value=False) + `@patch`("os.path.lexists", return_value=False) async def test_downloads_avatar_when_file_missing(...): - `@patch`("os.path.exists", return_value=True) + `@patch`("os.path.lexists", return_value=True) `@patch`("os.path.getsize", return_value=0) async def test_downloads_avatar_when_file_empty(...): - `@patch`("os.path.exists", return_value=False) + `@patch`("os.path.lexists", return_value=False) async def test_handles_none_download_result(...):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_listener_extended.py` around lines 786 - 835, Update the failing avatar-path tests to patch os.path.lexists (not os.path.exists) so the production guard in TelegramListener._download_avatar is controlled; specifically change the `@patch` target in test_downloads_avatar_when_file_missing, test_downloads_avatar_when_file_empty and test_handles_none_download_result (and any other tests in this file still patching exists) from "os.path.exists" to "os.path.lexists" so the branch that checks lexists is exercised deterministically and the mocked return values take effect.src/telegram_backup.py (1)
648-652: ⚡ Quick winExpose skipped symlink counts in verify output.
Verify can now return “no issues found” after intentionally bypassing symlink records. Adding an aggregated
skipped_symlinkscount would make that new contract visible without leaking per-chat details.As per coding guidelines, "src/**/*.py: Never log chat IDs, topic IDs, or topic titles — log only aggregated counts to avoid PII exposure".Suggested diff
missing_files = [] corrupted_files = [] + skipped_symlinks = 0 @@ if os.path.islink(file_path): + skipped_symlinks += 1 continue @@ total_issues = len(missing_files) + len(corrupted_files) if total_issues == 0: - logger.info("✓ All media files verified - no issues found") + logger.info( + "✓ All media files verified - no issues found%s", + f" ({skipped_symlinks} symlink entries skipped)" if skipped_symlinks else "", + ) logger.info("=" * 60) return🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/telegram_backup.py` around lines 648 - 652, The verify routine currently skips symlinks in the islink(file_path) branch without accounting for them; add an integer counter (e.g., skipped_symlinks) in the verify function (the function containing the os.path.islink(file_path) check) and increment it each time you continue for a symlink, then include this aggregated count in the function's returned result or verification summary (e.g., add "skipped_symlinks" to the returned dict/object) so callers/logs can see the total skipped without exposing per-chat or per-file details; ensure any logging uses only the aggregated count (no chat IDs, topic IDs, or filenames).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/telegram_backup.py`:
- Around line 648-652: The verify routine currently skips symlinks in the
islink(file_path) branch without accounting for them; add an integer counter
(e.g., skipped_symlinks) in the verify function (the function containing the
os.path.islink(file_path) check) and increment it each time you continue for a
symlink, then include this aggregated count in the function's returned result or
verification summary (e.g., add "skipped_symlinks" to the returned dict/object)
so callers/logs can see the total skipped without exposing per-chat or per-file
details; ensure any logging uses only the aggregated count (no chat IDs, topic
IDs, or filenames).
In `@tests/test_listener_extended.py`:
- Around line 786-835: Update the failing avatar-path tests to patch
os.path.lexists (not os.path.exists) so the production guard in
TelegramListener._download_avatar is controlled; specifically change the `@patch`
target in test_downloads_avatar_when_file_missing,
test_downloads_avatar_when_file_empty and test_handles_none_download_result (and
any other tests in this file still patching exists) from "os.path.exists" to
"os.path.lexists" so the branch that checks lexists is exercised
deterministically and the mocked return values take effect.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6fab4e41-2275-4e32-a801-d6064f85ffcb
📒 Files selected for processing (6)
README.mdsrc/listener.pysrc/telegram_backup.pytests/test_listener_extended.pytests/test_symlink_dedup.pytests/test_telegram_backup_extended.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #146 +/- ##
==========================================
- Coverage 94.07% 94.00% -0.07%
==========================================
Files 21 21
Lines 6127 6139 +12
==========================================
+ Hits 5764 5771 +7
- Misses 363 368 +5
🚀 New features to boost your workflow:
|
- Patch os.path.lexists (not os.path.exists) in avatar tests to match the production guard in _download_avatar - Add skipped_symlinks counter to verify output so users see how many symlink entries were intentionally bypassed - Bump version to 7.7.5
Re-runs of
python -m src.telegram_backupagainst an archived media tree(committed to git-annex / DataLad, where files are symlinks pointing into
.git/annex/objects/) currently mutate the working tree on every run:_shared/<name>symlinks get atomic-renamed and become regular files(visible as
typechangeingit status),_shared/blobs acrossruns because content-hash dedup (feat: content-hash dedup for cross-type media duplicates #139) picks a non-deterministic
canonical name from the cross-chat duplicate set.
Root cause: every "do I already have this?" gate uses
os.path.exists,which follows symlinks to the bottom. When the ultimate target lives
outside the running process's filesystem view (e.g. the bind mount only
covers the working tree, not
.git/), the gate concludes the file ismissing and triggers a re-download that overwrites the symlink. See
#143 for the full diagnosis with line references.
Changes
Each commit is a single conceptual change so they can be reviewed (or
backed out) independently.
fix(media): trust existing symlinks via os.path.lexists in dedup gate— switches both arms of the dedup gate (chat path and_shared/-path) fromos.path.existstoos.path.lexistsin_process_media(src/telegram_backup.py) and the listener twin insrc/listener.py. Updates the existing dangling-symlink tests intests/test_symlink_dedup.pyand the listener-side tests intests/test_listener_extended.pyto assert the new "preserve existing symlink" contract instead of the old "replace dangling symlink" one.fix(avatars): trust existing avatar symlinks via os.path.lexists— same gate change in_ensure_profile_photoand the listener's_download_avatar. Adds a regression test (test_existing_symlink_avatar_is_preserved) that reproduces the exact[Errno 2] No such file or directorylog line from the issue.fix(verify): preserve symlinks during VERIFY_MEDIA scans— Phase 1 of_verify_and_redownload_medianow useslexistsfor missing-file detection and short-circuits onos.path.islink, so an intact symlink is never re-downloaded. Truly absent paths still trigger recovery. New tests cover both branches.docs(readme): note bind-mount caveat for git-annex / DataLad backups— adds a Troubleshooting row + a short subsection showing the recommended-v $PWD:/repo+-w /repo/ses-acctmount layout so the symlink targets resolve from inside the container.Test plan
python -m pytest tests/(excluding the FastAPI/pydantic-only viewer suites, same as CI scope) —1419 passed, 153 warnings, 6 subtests passedruff check . && ruff format --check .— cleanNotes for review
_verify_and_redownload_mediaused to "rescue" dangling symlinks by re-downloading. That recovery is unsafe for archived layouts and was the actual mutation source observed in [Question]: Account for previously downloaded (but now symlinks) files #143; users on plain installs are unaffected (their entries are regular files, not symlinks). If preserving the rescue path is important, we can gate it behind an env var (e.g.VERIFY_MEDIA_REDOWNLOAD_DANGLING=true, default false) — happy to add as a follow-up.lexistsis load-bearing.Summary by CodeRabbit
Documentation
Bug Fixes
Tests