perf: unroll loops (saves ~20-40% cpu) by arturmelanchyk · Pull Request #158 · gofiber/utils

arturmelanchyk · 2025-11-03T14:35:11Z

Benchmarks on AMD EPYC

goos: linux
goarch: amd64
pkg: github.com/gofiber/utils/v2
cpu: AMD EPYC-Milan Processor
                          │   old.txt   │               new.txt               │
                          │   sec/op    │   sec/op     vs base                │
_ToLowerBytes/fiber-2       45.19n ± 1%   26.96n ± 0%  -40.34% (p=0.000 n=20)
_ToLowerBytes/default-2     266.1n ± 1%   265.9n ± 1%        ~ (p=0.358 n=20)
_ToUpperBytes/fiber-2       26.87n ± 0%   26.84n ± 0%        ~ (p=0.067 n=20)
_ToUpperBytes/default-2     297.9n ± 0%   298.1n ± 0%        ~ (p=0.815 n=20)
_EqualFoldBytes/fiber-2     44.91n ± 1%   35.39n ± 0%  -21.21% (p=0.000 n=20)
_EqualFoldBytes/default-2   154.8n ± 1%   156.6n ± 2%        ~ (p=0.298 n=20)
_EqualFold/fiber-2          45.42n ± 1%   34.74n ± 0%  -23.52% (p=0.000 n=20)
_EqualFold/default-2        158.5n ± 1%   159.1n ± 1%        ~ (p=0.372 n=20)
geomean                     91.27n        80.46n       -11.85%

                          │   old.txt    │               new.txt               │
                          │     B/op     │    B/op     vs base                 │
_ToLowerBytes/fiber-2       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToLowerBytes/default-2     80.00 ± 0%     80.00 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/fiber-2       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/default-2     80.00 ± 0%     80.00 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/fiber-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/default-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/fiber-2          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/default-2        0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
geomean                                ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                          │   old.txt    │               new.txt               │
                          │  allocs/op   │ allocs/op   vs base                 │
_ToLowerBytes/fiber-2       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToLowerBytes/default-2     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/fiber-2       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/default-2     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/fiber-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/default-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/fiber-2          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/default-2        0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
geomean                                ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

Benchmarks on Apple M3 Max

goos: darwin
goarch: arm64
pkg: github.com/gofiber/utils/v2
cpu: Apple M3 Max
                           │   old.txt   │               new.txt               │
                           │   sec/op    │   sec/op     vs base                │
_ToLowerBytes/fiber-16       22.19n ± 1%   18.06n ± 1%  -18.59% (p=0.000 n=20)
_ToLowerBytes/default-16     75.92n ± 1%   76.93n ± 1%   +1.32% (p=0.002 n=20)
_ToUpperBytes/fiber-16       26.19n ± 0%   17.95n ± 1%  -31.46% (p=0.000 n=20)
_ToUpperBytes/default-16     81.73n ± 0%   81.56n ± 1%        ~ (p=0.331 n=20)
_EqualFoldBytes/fiber-16     29.25n ± 0%   31.05n ± 1%   +6.17% (p=0.000 n=20)
_EqualFoldBytes/default-16   58.96n ± 0%   61.63n ± 1%   +4.53% (p=0.000 n=20)
_EqualFold/fiber-16          39.42n ± 0%   30.68n ± 1%  -22.17% (p=0.000 n=20)
_EqualFold/default-16        58.99n ± 0%   59.39n ± 1%   +0.69% (p=0.000 n=20)
geomean                      44.16n        40.40n        -8.52%

                           │   old.txt    │               new.txt               │
                           │     B/op     │    B/op     vs base                 │
_ToLowerBytes/fiber-16       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToLowerBytes/default-16     80.00 ± 0%     80.00 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/fiber-16       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/default-16     80.00 ± 0%     80.00 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/fiber-16     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/default-16   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/fiber-16          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/default-16        0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
geomean                                 ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                           │   old.txt    │               new.txt               │
                           │  allocs/op   │ allocs/op   vs base                 │
_ToLowerBytes/fiber-16       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToLowerBytes/default-16     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/fiber-16       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/default-16     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/fiber-16     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/default-16   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/fiber-16          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/default-16        0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
geomean                                 ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

Summary by CodeRabbit

Chores
- Internal performance optimizations for case-insensitive comparisons and case conversions via loop unrolling.
Tests
- Benchmarks renamed for clarity, split into sub-benchmarks and now include per-run validation to better track performance.
Compatibility
- The IPv4 loop now ranges over an integer (`for i := range net.IPv4len`), a form that compiles only with Go 1.22 or newer; the build checks listed for this PR use Go 1.23.x and 1.24.x.

coderabbitai · 2025-11-03T14:35:20Z

Walkthrough

Replaces per-element loops with 4-item unrolled loops in ASCII case conversion and case-insensitive comparison; rewrites the IPv4 loop in range-over-integer form (valid from Go 1.22 onward); splits the IP benchmarks into sub-benchmarks and adds assertions. No exported APIs changed.

Changes

Cohort / File(s)	Summary
Loop unrolling — byte case conversion `bytes.go`	Replaced per-byte range loop in `ToLowerBytes` / `ToUpperBytes` with an explicit 4-byte unrolled loop using a local table alias and a residual tail loop; behavior and signatures unchanged.
Loop unrolling — case-insensitive compare `byteseq.go`	Replaced simple index-based equality loop with a 4-iteration unrolled block that uses a local table reference plus a tail loop; preserves behavior and public API.
IPv4 iteration change `ips.go`	Changed `for i := 0; i < net.IPv4len; i++ {` to `for i := range net.IPv4len {` (converts the index-bound loop to a range over an integer constant). This form is valid from Go 1.22 onward and fails to compile only on older toolchains.
Benchmarks — split sub-benchmarks & assertions `ips_test.go`	Converted `Benchmark_IsIPv4` and `Benchmark_IsIPv6` to run two sub-benchmarks (`"fiber"` and `"default"` using `net.ParseIP`), added per-sub-benchmark result checks via `require.True`.
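
The unrolled conversion described in the `bytes.go` row can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in the PR: `lowerTable` and `toLowerBytes` are hypothetical names, and the table is built locally so the example is self-contained.

```go
package main

import "fmt"

// lowerTable is a hypothetical 256-entry lookup table that maps
// ASCII 'A'..'Z' to 'a'..'z' and leaves every other byte unchanged.
var lowerTable = func() [256]byte {
	var t [256]byte
	for i := range t {
		c := byte(i)
		if c >= 'A' && c <= 'Z' {
			c += 'a' - 'A'
		}
		t[i] = c
	}
	return t
}()

// toLowerBytes lowercases b in place, four bytes per iteration,
// then finishes the remaining 0-3 bytes in a tail loop.
func toLowerBytes(b []byte) []byte {
	table := &lowerTable // local alias to the table
	n := len(b)
	limit := n &^ 3 // largest multiple of 4 that is <= n
	i := 0
	for ; i < limit; i += 4 {
		// Read all four bytes before writing any of them back.
		b0, b1, b2, b3 := b[i], b[i+1], b[i+2], b[i+3]
		b[i] = table[b0]
		b[i+1] = table[b1]
		b[i+2] = table[b2]
		b[i+3] = table[b3]
	}
	for ; i < n; i++ {
		b[i] = table[b[i]]
	}
	return b
}

func main() {
	fmt.Println(string(toLowerBytes([]byte("Hello, WORLD!")))) // prints "hello, world!"
}
```

An upper-casing variant would differ only in the table it indexes.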

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ToLower_old as ToLowerBytes (old)
    participant ToLower_new as ToLowerBytes (new)
    note right of ToLower_old #f6f8ff: Old flow — simple per-byte loop
    Caller->>ToLower_old: call(bytes)
    ToLower_old->>ToLower_old: for i { out[i]=table[bytes[i]] }
    ToLower_old-->>Caller: result

    note right of ToLower_new #f0fff4: New flow — unrolled 4-byte blocks + tail
    Caller->>ToLower_new: call(bytes)
    ToLower_new->>ToLower_new: for i += 4 { map b0..b3 via table -> out[i..i+3] }
    ToLower_new->>ToLower_new: for remaining { map b -> out[i] }
    ToLower_new-->>Caller: result

sequenceDiagram
    participant Caller
    participant Eq_old as EqualFold (old)
    participant Eq_new as EqualFold (new)
    note right of Eq_old #fffaf0: Old flow — per-index compare
    Caller->>Eq_old: call(a,b)
    Eq_old->>Eq_old: for i { map[a[i]] == map[b[i]] ? continue : return false }
    Eq_old-->>Caller: true/false

    note right of Eq_new #f9f8ff: New flow — unrolled compare + tail
    Caller->>Eq_new: call(a,b)
    Eq_new->>Eq_new: for i += 4 { compare mapped a[i..i+3] to b[i..i+3] }
    Eq_new->>Eq_new: for remaining { compare mapped a[i] to b[i] }
    Eq_new-->>Caller: true/false
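
The "new flow" for the comparison can be sketched along the same lines. Again this is a hedged illustration rather than the PR's code: `foldTable` and `equalFold` are hypothetical names, and the sketch handles ASCII case folding only.

```go
package main

import "fmt"

// foldTable is a hypothetical lookup table mapping ASCII lowercase
// letters to uppercase; all other bytes map to themselves.
var foldTable = func() [256]byte {
	var t [256]byte
	for i := range t {
		c := byte(i)
		if c >= 'a' && c <= 'z' {
			c -= 'a' - 'A'
		}
		t[i] = c
	}
	return t
}()

// equalFold reports whether a and b are equal under ASCII case folding.
// It compares four positions per iteration and returns on the first mismatch.
func equalFold(a, b string) bool {
	n := len(a)
	if n != len(b) {
		return false
	}
	table := &foldTable
	limit := n &^ 3 // indices i..i+3 are in bounds whenever i < limit
	i := 0
	for ; i < limit; i += 4 {
		if table[a[i]] != table[b[i]] {
			return false
		}
		if table[a[i+1]] != table[b[i+1]] {
			return false
		}
		if table[a[i+2]] != table[b[i+2]] {
			return false
		}
		if table[a[i+3]] != table[b[i+3]] {
			return false
		}
	}
	// Tail loop for the remaining 0-3 positions.
	for ; i < n; i++ {
		if table[a[i]] != table[b[i]] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(equalFold("Content-Type", "content-TYPE")) // prints "true"
	fmt.Println(equalFold("abcde", "ABCDF"))               // prints "false"
}
```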

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Verify correctness for inputs with length not divisible by 4 and for very small lengths.
Inspect the ips.go change: `for i := range net.IPv4len` requires Go 1.22 or newer; confirm the module's minimum Go version supports it, and restore the index-based loop otherwise.
Confirm unrolled loops preserve exact mapping and have no subtle bounds or aliasing bugs.
Review benchmarks (ips_test.go) for correct result assertions and side effects.

Possibly related PRs

🧹 chore: Expand test coverage #136: modifies the tests for the same ToLowerBytes/ToUpperBytes and IP-related code; likely directly related.
Improve utils performance #116 — earlier ToLower/ToUpper optimizations and unrolling changes overlap the same code paths.

Suggested reviewers

gaby
sixcolors
ReneWerner87
efectn

Poem

🐇 I hopped through bytes in rhythmic score,

Four at once, then one — then four once more.
Maps stay the same, I nudge the pace,
Benchmarks split, a tiny race.
A rabbit smiles — faster, with grace.

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title 'perf: unroll loops (saves ~20-40% cpu)' accurately reflects the main objective of the changeset. The modifications to bytes.go and byteseq.go replace simple loops with unrolled implementations that process four bytes per iteration; the ips.go change only rewrites the loop syntax. The performance claim is supported by the benchmark results in the PR description, which show geomean latency improvements of 8-12% and improvements of up to 40% for individual functions. The title is concise, clear, and directly summarizes the primary change.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb8645b and 9ef41fd.

📒 Files selected for processing (1)

ips.go (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

ips.go

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Build (1.24.x, windows-latest)
GitHub Check: Build (1.23.x, windows-latest)
GitHub Check: Compare


codecov · 2025-11-03T14:36:46Z

Codecov Report

❌ Patch coverage is 86.66667% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.31%. Comparing base (69c3d29) to head (9ef41fd).
⚠️ Report is 5 commits behind head on master.

Files with missing lines	Patch %	Lines
byteseq.go	68.00%	4 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #158      +/-   ##
==========================================
+ Coverage   83.25%   83.31%   +0.05%     
==========================================
  Files          11       11              
  Lines         848      755      -93     
==========================================
- Hits          706      629      -77     
+ Misses        135      115      -20     
- Partials        7       11       +4

Flag	Coverage Δ
unittests	`83.31% <86.66%> (+0.05%)`	⬆️


ReneWerner87 · 2025-11-03T15:19:21Z

Nice catch

arturmelanchyk · 2025-11-03T15:42:41Z

@ReneWerner87 modernize lint seems unrelated to the changes made in the PR

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

bytes.go (1)

37-65: LGTM! Duplication is acceptable for this hot path.

ToUpperBytes mirrors ToLowerBytes exactly (differing only in the table used). While this creates duplication, the simplicity may aid compiler optimization and inlining. Given the 18–31% improvements the benchmarks show for these functions on Apple M3 Max, the duplication is justified for this performance-critical code.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69c3d29 and 9ea03e1.

📒 Files selected for processing (2)

bytes.go (1 hunks)
byteseq.go (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-09-18T00:32:59.671Z

Learnt from: sixcolors
Repo: gofiber/fiber PR: 0
File: :0-0
Timestamp: 2025-09-18T00:32:59.671Z
Learning: In Go 1.22, manual copying of loop variables (like `x := x`) is no longer necessary due to the fix for variable capture in closures. Each loop iteration now creates its own instance of the loop variable automatically.

Applied to files:

byteseq.go

🔇 Additional comments (2)

byteseq.go (1)

12-54: LGTM! Well-optimized loop unrolling with correct bounds handling.

The 4x unrolled comparison loop is implemented correctly:

The limit := n &^ 3 computation properly rounds down to the nearest multiple of 4, ensuring all accesses b[i+0] through b[i+3] are in bounds when i < limit

Early return on mismatch in each unrolled block maintains correctness

Tail loop correctly handles the remaining 0–3 elements

Local table reference may help the compiler with optimization

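The benchmarks confirm meaningful speedups on AMD EPYC (−21% on EqualFoldBytes/fiber, −23.5% on EqualFold/fiber). On Apple M3 Max, EqualFold/fiber improves by 22%, while EqualFoldBytes/fiber regresses by about 6%. The increased verbosity is justified by the overall performance gains.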

bytes.go (1)

8-35: LGTM! Correct loop unrolling with safe in-place mutation.

The 4x unrolled transformation is implemented correctly:

Loading all 4 bytes into locals (b0 through b3) before writing back ensures safe in-place mutation

The limit calculation and tail loop are correct (same pattern as in byteseq.go)

The benchmarks show substantial improvements (−40% on ToLowerBytes/fiber for AMD EPYC)

Signed-off-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>

arturmelanchyk · 2025-11-03T17:01:14Z

ips.go

 func IsIPv4(s string) bool {
 	//nolint:modernize-loop // old way is more readable
-	for i := 0; i < net.IPv4len; i++ {
+	for i := range net.IPv4len {


this was failing modernize lint https://github.com/gofiber/utils/actions/runs/19038208035/job/54368027903

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ea03e1 and eb8645b.

📒 Files selected for processing (2)

ips.go (1 hunks)
ips_test.go (2 hunks)

🧰 Additional context used

🧠 Learnings (4)

📓 Common learnings

Learnt from: sixcolors
Repo: gofiber/fiber PR: 0
File: :0-0
Timestamp: 2025-09-18T00:32:59.671Z
Learning: In Go 1.22, manual copying of loop variables (like `x := x`) is no longer necessary due to the fix for variable capture in closures. Each loop iteration now creates its own instance of the loop variable automatically.

📚 Learning: 2025-09-18T00:32:59.671Z

Learnt from: sixcolors
Repo: gofiber/fiber PR: 0
File: :0-0
Timestamp: 2025-09-18T00:32:59.671Z
Learning: In Go 1.22, manual copying of loop variables (like `x := x`) is no longer necessary due to the fix for variable capture in closures. Each loop iteration now creates its own instance of the loop variable automatically.

Applied to files:

ips.go

📚 Learning: 2025-10-22T13:48:12.485Z

Learnt from: CR
Repo: gofiber/fiber PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-22T13:48:12.485Z
Learning: Run benchmarks with `go test` (Make target: benchmark)

Applied to files:

ips_test.go

📚 Learning: 2024-10-16T12:12:30.506Z

Learnt from: gaby
Repo: gofiber/fiber PR: 3170
File: ctx_test.go:1721-1724
Timestamp: 2024-10-16T12:12:30.506Z
Learning: In the Go unit tests in `ctx_test.go`, it is acceptable to use invalid CIDR notation such as `"0.0.0.1/31junk"` for testing purposes.

Applied to files:

ips_test.go

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Compare

🔇 Additional comments (2)

ips_test.go (2)

104-122: LGTM! Improved benchmark structure.

The addition of sub-benchmarks for "fiber" vs "default" implementations allows for clear performance comparison between the custom IsIPv4 function and the standard library's net.ParseIP. The final assertions prevent compiler optimizations from eliminating the benchmarked code.

124-142: LGTM! Consistent benchmark improvements.

The benchmark structure mirrors the IPv4 changes appropriately, enabling performance comparison between the custom IsIPv6 implementation and the standard library.

ips.go

Signed-off-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>

arturmelanchyk · 2025-11-07T14:55:51Z

Hi @ReneWerner87 any plans on releasing this and upgrading dependency version in fiber repo?

ReneWerner87 · 2025-11-07T15:36:51Z

On the weekend

arturmelanchyk added 2 commits November 3, 2025 15:01

perf: unroll loop in ToLowerBytes/ToUpperBytes

b79dda8

perf: unroll loop in EqualFold

9ea03e1

arturmelanchyk changed the title ~~Unroll loops (saves ~20-40% cpu)~~ perf: unroll loops (saves ~20-40% cpu) Nov 3, 2025

ReneWerner87 added the 🧹 Updates label Nov 3, 2025

arturmelanchyk marked this pull request as ready for review November 3, 2025 15:42

arturmelanchyk requested a review from a team as a code owner November 3, 2025 15:42

arturmelanchyk requested review from ReneWerner87, efectn, gaby and sixcolors and removed request for a team November 3, 2025 15:42

coderabbitai bot reviewed Nov 3, 2025


chore: modernize loop in IsIPv4

eb8645b

Signed-off-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>

arturmelanchyk commented Nov 3, 2025


coderabbitai bot reviewed Nov 3, 2025


chore: modernize loop in IsIPv4

9ef41fd

Signed-off-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>

ReneWerner87 approved these changes Nov 3, 2025


arturmelanchyk merged commit 9417f80 into gofiber:master Nov 3, 2025
19 checks passed

This was referenced Feb 6, 2026

perf: performance optimizations #184

Merged

semantics: explicit mutating APIs, safe replacements #186

Merged

arturmelanchyk merged 4 commits into gofiber:master from arturmelanchyk:unroll-loop
