Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds file-download deduplication for slackdump resume by persisting Slack file sizes in the SQLite archive and using (file_id, size) to detect already-recorded files before downloading again.
Changes:
- Add
SIZEcolumn (and index) to theFILEtable via a new goose migration. - Extend the DB file model/repository with
Sizeand aGetByIDAndSizelookup method. - Add a
DeduplicatingFileProcessorwrapper and wire it into the resume controller path.
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/chunk/backend/dbase/repository/migrations/20260308000000_file_size.sql | Adds FILE.SIZE column and an (ID, SIZE) index to support dedup lookups. |
| internal/chunk/backend/dbase/repository/dbfile.go | Persists slack.File.Size, adds repository method for dedup lookup. |
| internal/chunk/backend/dbase/repository/mock_repository/mock_file.go | Regenerates/extends mock to include GetByIDAndSize. |
| internal/convert/transform/fileproc/dedup.go | Implements DB-backed deduplicating filer wrapper. |
| internal/convert/transform/fileproc/dedup_test.go | Adds placeholder tests; currently skips the main behavior test. |
| cmd/slackdump/internal/archive/archive.go | Wraps the filer with dedup logic for resume. |
| internal/fixtures/assets/source_database.db | Updates/adds a DB fixture reflecting the new schema (binary). |
internal/chunk/backend/dbase/repository/migrations/20260308000000_file_size.sql
Outdated
Show resolved
Hide resolved
4f3b5e6 to
f420988
Compare
|
@rusq changes to the PR comments have been applied. The optimization of existing files hasn't been done since it can backfire with Slack instances that have a lot of files, though. |
|
Thanks! I'm eyeballing the code, but it takes some time. My current issue is that it will call the file processor that inserts the file row, and only then it calls GetFileByNameAndSize, which would return this row. I may add to this branch as I figure out the best way to go around it without rearchitecting the controller. |
Can you please point me to the exact issue here? I checked the code 3x and it checks the DB before downloading anything and only writes to the DB after a download happened (old way). |
I was wrong, the dedup filer executes before the actual Conversation Recorder's filer, sorry about false positive. I also found a problem with V_EMPTY_THREADS where it references a non-existent MESSAGE.SESSION_ID column, will fix in follow up PR. The DROP COLUMN was breaking because of the invalid reference in that view, appears that setting writable_schema allows the op to proceed ignoring the schema problem. |
On
slackdump resumefiles get downloaded over and over again even if they exist already on disk.This PR adds the filesize to sqlite to compare it in the future with new downloads. The Slack API returns the file size itself and not a checksum. The ID should also be unique based on upload, therefore the filesize is more optional but could be used in the future to compare files on disk with the DB.
For real world testing I used the
-vflag on resume and it worked fine