Skip to content

feat: reuse compressor for each warc record#180

Merged
NGTmeaty merged 12 commits into
masterfrom
reuse-compressor
Mar 6, 2026
Merged

feat: reuse compressor for each warc record#180
NGTmeaty merged 12 commits into
masterfrom
reuse-compressor

Conversation

@yzqzss

@yzqzss yzqzss commented Mar 2, 2026

Copy link
Copy Markdown
Collaborator

issue: we create compressor for each warc record, this is very expensive.

image Only this is core change, the others are refactor.

@yzqzss yzqzss added the bug Something isn't working label Mar 2, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to reduce the overhead of WARC writing by reusing a compressor instance across records (instead of constructing a new compressor per record), and refactors compression handling away from raw strings into typed values.

Changes:

  • Introduces a Compressor abstraction on Writer and adds Writer.Reset() to reuse the same compressor across WARC records.
  • Replaces string-based compression selection with a compressionType value (CompressionNone/Gzip/Zstd) across rotator/settings and filename generation.
  • Updates tests to use the new compression constants.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
write.go Replaces per-format compressor fields with a single reusable Compressor interface on Writer.
warc.go Adds typed compression values, adds Writer.Reset(), and updates rotation/record writing logic to reuse compressors.
utils.go Refactors NewWriter to build Writer based on typed compression and to support compressor reuse.
file.go Updates WARC filename generation to use typed compression values.
gzip_interface.go Extends gzip writer interface to include Reset(io.Writer) for reuse.
*_test.go Refactors tests to use CompressionGzip/CompressionZstd constants.
Comments suppressed due to low confidence (1)

warc_test.go:19

  • The assertion in this test is inverted: it reports an error when the filename does contain all expected substrings. As written, the test will pass even if the filename is wrong (e.g. because the condition is rarely true due to the serial substring). Flip the condition (and consider asserting the expected serial, which appears to start at 1, i.e. 00001, not 00000).
	split := []string{"test", "00000", ".warc.gz.open"}

	if strings.Contains(result, split[0]) && strings.Contains(result, split[1]) && strings.Contains(result, split[2]) {
		t.Errorf("Expected filename to contain %s, %s, and %s, but got %s", split[0], split[1], split[2], result)
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread utils.go
Comment thread utils.go Outdated
Comment thread warc.go Outdated
Comment thread warc.go Outdated
Comment thread warc.go Outdated
Comment thread warc.go
@yzqzss yzqzss marked this pull request as draft March 2, 2026 20:46
@yzqzss

yzqzss commented Mar 2, 2026

Copy link
Copy Markdown
Collaborator Author

I'll check copilot's comments later.

@yzqzss yzqzss self-assigned this Mar 2, 2026
@yzqzss yzqzss force-pushed the reuse-compressor branch from f05a2e9 to e1f9c0c Compare March 3, 2026 00:48
@yzqzss yzqzss marked this pull request as ready for review March 3, 2026 02:34
@yzqzss yzqzss requested a review from Copilot March 3, 2026 02:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread write.go Outdated
Comment thread warc.go Outdated
Comment thread file.go Outdated
Comment thread write.go Outdated
@yzqzss yzqzss marked this pull request as draft March 3, 2026 03:08
yzqzss and others added 4 commits March 3, 2026 11:22
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@yzqzss yzqzss marked this pull request as ready for review March 3, 2026 03:27
@yzqzss

yzqzss commented Mar 3, 2026

Copy link
Copy Markdown
Collaborator Author

before:

gzip: archived 3702 urls in 10s
51s cpu usage, a lots of GC actions.

image

after:

none, gzip, zstd: 35~40s cpu usage.
none: archived 3767 urls in 10s
gzip: archived 3793 urls in 10s
zstd: archived 3824 urls in 10s

zstd has the fewest GC actions.

Screenshots_20260303_040447

@NGTmeaty NGTmeaty left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These improvements are amazing! I actually had meant to look into this a few weeks ago but had some other priorities, so I really appreciate the PR!

Oh, and before I forget, thank you for the cleanup!

Comment thread utils.go
var err error
eopts := []zstd.EOption{zstd.WithEncoderLevel(zstd.SpeedBetterCompression)}
if len(dictionary) > 0 {
eopts = append(eopts, zstd.WithEncoderDict(dictionary))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this 😄

Comment thread warc.go
// If the [Compressor] is nil, this will just flush the [Writer.BufWriter].
func (w *Writer) FlushAndCloseCompressor() (err error) {
if w.Compressor != nil {
err1 := w.BufWriter.Flush()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer a more descriptive error variables here, but not a big deal.

@NGTmeaty

NGTmeaty commented Mar 6, 2026

Copy link
Copy Markdown
Collaborator

I'm going to merge, but feel free to change those error variables (or anything else) in another commit or PR.

@NGTmeaty NGTmeaty merged commit 3085677 into master Mar 6, 2026
6 checks passed
@NGTmeaty NGTmeaty deleted the reuse-compressor branch March 6, 2026 02:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants