feat: reuse compressor for each warc record#180
Conversation
…ndling with new compressionType struct
There was a problem hiding this comment.
Pull request overview
This PR aims to reduce the overhead of WARC writing by reusing a compressor instance across records (instead of constructing a new compressor per record), and refactors compression handling away from raw strings into typed values.
Changes:
- Introduces a
Compressorabstraction onWriterand addsWriter.Reset()to reuse the same compressor across WARC records. - Replaces string-based compression selection with a
compressionTypevalue (CompressionNone/Gzip/Zstd) across rotator/settings and filename generation. - Updates tests to use the new compression constants.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| write.go | Replaces per-format compressor fields with a single reusable Compressor interface on Writer. |
| warc.go | Adds typed compression values, adds Writer.Reset(), and updates rotation/record writing logic to reuse compressors. |
| utils.go | Refactors NewWriter to build Writer based on typed compression and to support compressor reuse. |
| file.go | Updates WARC filename generation to use typed compression values. |
| gzip_interface.go | Extends gzip writer interface to include Reset(io.Writer) for reuse. |
| *_test.go | Refactors tests to use CompressionGzip/CompressionZstd constants. |
Comments suppressed due to low confidence (1)
warc_test.go:19
- The assertion in this test is inverted: it reports an error when the filename does contain all expected substrings. As written, the test will pass even if the filename is wrong (e.g. because the condition is rarely true due to the serial substring). Flip the condition (and consider asserting the expected serial, which appears to start at 1, i.e.
00001, not00000).
split := []string{"test", "00000", ".warc.gz.open"}
if strings.Contains(result, split[0]) && strings.Contains(result, split[1]) && strings.Contains(result, split[2]) {
t.Errorf("Expected filename to contain %s, %s, and %s, but got %s", split[0], split[1], split[2], result)
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I'll check copilot's comments later. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
NGTmeaty
left a comment
There was a problem hiding this comment.
These improvements are amazing! I actually had meant to look into this a few weeks ago but had some other priorities, so I really appreciate the PR!
Oh, and before I forget, thank you for the cleanup!
| var err error | ||
| eopts := []zstd.EOption{zstd.WithEncoderLevel(zstd.SpeedBetterCompression)} | ||
| if len(dictionary) > 0 { | ||
| eopts = append(eopts, zstd.WithEncoderDict(dictionary)) |
| // If the [Compressor] is nil, this will just flush the [Writer.BufWriter]. | ||
| func (w *Writer) FlushAndCloseCompressor() (err error) { | ||
| if w.Compressor != nil { | ||
| err1 := w.BufWriter.Flush() |
There was a problem hiding this comment.
I would prefer a more descriptive error variables here, but not a big deal.
|
I'm going to merge, but feel free to change those error variables (or anything else) in another commit or PR. |


issue: we create compressor for each warc record, this is very expensive.