Skip to content

Add file deduplication support#644

Merged
rusq merged 3 commits intorusq:masterfrom
volker-fr:dedup-downloads
Mar 28, 2026
Merged

Add file deduplication support#644
rusq merged 3 commits intorusq:masterfrom
volker-fr:dedup-downloads

Conversation

@volker-fr
Copy link
Copy Markdown
Contributor

@volker-fr volker-fr commented Mar 17, 2026

On slackdump resume files get downloaded over and over again even if they exist already on disk.

This PR adds the filesize to sqlite to compare it in the future with new downloads. The Slack API returns the file size itself and not a checksum. The ID should also be unique based on upload, therefore the filesize is more optional but could be used in the future to compare files on disk with the DB.

For real world testing I used the -v flag on resume and it worked fine

...
2026-03-16 22:32:17 DEBUG skipping duplicate file
                      ├ file_id: XXXXXXXXX
                      └ size: 123456

@rusq rusq requested a review from Copilot March 19, 2026 10:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds file-download deduplication for slackdump resume by persisting Slack file sizes in the SQLite archive and using (file_id, size) to detect already-recorded files before downloading again.

Changes:

  • Add SIZE column (and index) to the FILE table via a new goose migration.
  • Extend the DB file model/repository with Size and a GetByIDAndSize lookup method.
  • Add a DeduplicatingFileProcessor wrapper and wire it into the resume controller path.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
internal/chunk/backend/dbase/repository/migrations/20260308000000_file_size.sql Adds FILE.SIZE column and an (ID, SIZE) index to support dedup lookups.
internal/chunk/backend/dbase/repository/dbfile.go Persists slack.File.Size, adds repository method for dedup lookup.
internal/chunk/backend/dbase/repository/mock_repository/mock_file.go Regenerates/extends mock to include GetByIDAndSize.
internal/convert/transform/fileproc/dedup.go Implements DB-backed deduplicating filer wrapper.
internal/convert/transform/fileproc/dedup_test.go Adds placeholder tests; currently skips the main behavior test.
cmd/slackdump/internal/archive/archive.go Wraps the filer with dedup logic for resume.
internal/fixtures/assets/source_database.db Updates/adds a DB fixture reflecting the new schema (binary).

@volker-fr
Copy link
Copy Markdown
Contributor Author

@rusq changes to the PR comments have been applied.

The optimization of existing files hasn't been done since it can backfire with Slack instances that have a lot of files, though.

@rusq
Copy link
Copy Markdown
Owner

rusq commented Mar 22, 2026

Thanks! I'm eyeballing the code, but it takes some time. My current issue is that it will call the file processor that inserts the file row, and only then it calls GetFileByNameAndSize, which would return this row. I may add to this branch as I figure out the best way to go around it without rearchitecting the controller.

@volker-fr
Copy link
Copy Markdown
Contributor Author

My current issue is that it will call the file processor that inserts the file row, and only then it calls GetFileByNameAndSize, which would return this row.

Can you please point me to the exact issue here? I checked the code 3x and it checks the DB before downloading anything and only writes to the DB after a download happened (old way).

@rusq
Copy link
Copy Markdown
Owner

rusq commented Mar 28, 2026

Can you please point me to the exact issue here? I checked the code 3x and it checks the DB before downloading anything and only writes to the DB after a download happened (old way).

I was wrong, the dedup filer executes before the actual Conversation Recorder's filer, sorry about false positive. I also found a problem with V_EMPTY_THREADS where it references a non-existent MESSAGE.SESSION_ID column, will fix in follow up PR.

The DROP COLUMN was breaking because of the invalid reference in that view, appears that setting writable_schema allows the op to proceed ignoring the schema problem.

@rusq rusq merged commit af9e1c3 into rusq:master Mar 28, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants