Add file deduplication support by volker-fr · Pull Request #644 · rusq/slackdump

volker-fr · 2026-03-17T03:41:46Z

On `slackdump resume`, files are downloaded again on every run, even if they already exist on disk.

This PR stores the file size in SQLite so that it can be compared against new downloads later. The Slack API returns the file size itself, not a checksum. The file ID should also be unique per upload, so the size check is largely optional, but it could be used in the future to compare files on disk with the DB.

For real-world testing I ran resume with the -v flag and it worked fine:

...
2026-03-16 22:32:17 DEBUG skipping duplicate file
                      ├ file_id: XXXXXXXXX
                      └ size: 123456

Copilot

Pull request overview

This PR adds file-download deduplication for slackdump resume by persisting Slack file sizes in the SQLite archive and using (file_id, size) to detect already-recorded files before downloading again.

Changes:

Add SIZE column (and index) to the FILE table via a new goose migration.
Extend the DB file model/repository with Size and a GetByIDAndSize lookup method.
Add a DeduplicatingFileProcessor wrapper and wire it into the resume controller path.

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| internal/chunk/backend/dbase/repository/migrations/20260308000000_file_size.sql | Adds `FILE.SIZE` column and an `(ID, SIZE)` index to support dedup lookups. |
| internal/chunk/backend/dbase/repository/dbfile.go | Persists `slack.File.Size`, adds repository method for dedup lookup. |
| internal/chunk/backend/dbase/repository/mock_repository/mock_file.go | Regenerates/extends mock to include `GetByIDAndSize`. |
| internal/convert/transform/fileproc/dedup.go | Implements DB-backed deduplicating filer wrapper. |
| internal/convert/transform/fileproc/dedup_test.go | Adds placeholder tests; currently skips the main behavior test. |
| cmd/slackdump/internal/archive/archive.go | Wraps the filer with dedup logic for `resume`. |
| internal/fixtures/assets/source_database.db | Updates/adds a DB fixture reflecting the new schema (binary). |

volker-fr · 2026-03-19T23:31:59Z

@rusq changes to the PR comments have been applied.

The optimization for existing files hasn't been done, though, since it can backfire on Slack instances that have a lot of files.

rusq · 2026-03-22T12:05:49Z

Thanks! I'm eyeballing the code, but it takes some time. My current issue is that it will call the file processor that inserts the file row, and only then it calls GetFileByNameAndSize, which would return this row. I may add to this branch as I figure out the best way to go around it without rearchitecting the controller.

volker-fr · 2026-03-22T16:04:59Z

> My current issue is that it will call the file processor that inserts the file row, and only then it calls GetFileByNameAndSize, which would return this row.

Can you please point me to the exact issue here? I checked the code 3x and it checks the DB before downloading anything and only writes to the DB after a download happened (old way).

rusq · 2026-03-28T04:38:10Z

> Can you please point me to the exact issue here? I checked the code 3x and it checks the DB before downloading anything and only writes to the DB after a download happened (old way).

I was wrong: the dedup filer executes before the actual Conversation Recorder's filer, sorry about the false positive. I also found a problem with V_EMPTY_THREADS, which references a non-existent MESSAGE.SESSION_ID column; I will fix it in a follow-up PR.

The DROP COLUMN was breaking because of the invalid reference in that view; it appears that setting writable_schema allows the operation to proceed, ignoring the schema problem.
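The ordering point settled in this thread (the lookup runs first, the row is written only after a download) can be sketched as follows. The names `store`, `processor`, `recordAfterDownload` and `withDedup` are hypothetical and only model the behaviour described here; they are not the project's actual types.

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for the FILE table (ID -> size).
type store map[string]int64

// processor is a hypothetical file-processing step.
type processor func(id string, size int64) error

// recordAfterDownload mimics the inner step: it "downloads" the file and
// only then inserts the row.
func recordAfterDownload(db store, downloads *int) processor {
	return func(id string, size int64) error {
		// Stands in for the actual download.
		*downloads++
		// The row is written only after the download happened.
		db[id] = size
		return nil
	}
}

// withDedup wraps next so that the lookup happens before next has any
// chance to insert the row for this file.
func withDedup(db store, next processor) processor {
	return func(id string, size int64) error {
		if got, ok := db[id]; ok && got == size {
			return nil // already recorded: skip
		}
		return next(id, size)
	}
}

func main() {
	db := store{}
	downloads := 0
	p := withDedup(db, recordAfterDownload(db, &downloads))

	_ = p("F1", 100) // first run: not in the store yet, so it is downloaded
	_ = p("F1", 100) // resume: found with the same size, so it is skipped

	fmt.Println(downloads) // 1
}
```

Because the check sits in front of the insert, a file seen for the first time is never mistaken for a duplicate of its own row.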

rusq requested a review from Copilot March 19, 2026 10:01

Copilot AI reviewed Mar 19, 2026

Add file deduplication support

f420988

volker-fr force-pushed the dedup-downloads branch from 4f3b5e6 to f420988 Compare March 19, 2026 23:28

rusq added 2 commits March 28, 2026 14:24

backfill the size column and add the drop column

1c49dad

add tests

4e7f14d

rusq merged commit af9e1c3 into rusq:master Mar 28, 2026
3 checks passed

rusq merged 3 commits into rusq:master from volker-fr:dedup-downloads