perf: unroll loops (saves ~20-40% cpu) #158

Merged
arturmelanchyk merged 4 commits into gofiber:master from arturmelanchyk:unroll-loop
Nov 3, 2025

@arturmelanchyk
Contributor

@arturmelanchyk arturmelanchyk commented Nov 3, 2025

Benchmarks on AMD EPYC

goos: linux
goarch: amd64
pkg: github.com/gofiber/utils/v2
cpu: AMD EPYC-Milan Processor
                          │   old.txt   │               new.txt               │
                          │   sec/op    │   sec/op     vs base                │
_ToLowerBytes/fiber-2       45.19n ± 1%   26.96n ± 0%  -40.34% (p=0.000 n=20)
_ToLowerBytes/default-2     266.1n ± 1%   265.9n ± 1%        ~ (p=0.358 n=20)
_ToUpperBytes/fiber-2       26.87n ± 0%   26.84n ± 0%        ~ (p=0.067 n=20)
_ToUpperBytes/default-2     297.9n ± 0%   298.1n ± 0%        ~ (p=0.815 n=20)
_EqualFoldBytes/fiber-2     44.91n ± 1%   35.39n ± 0%  -21.21% (p=0.000 n=20)
_EqualFoldBytes/default-2   154.8n ± 1%   156.6n ± 2%        ~ (p=0.298 n=20)
_EqualFold/fiber-2          45.42n ± 1%   34.74n ± 0%  -23.52% (p=0.000 n=20)
_EqualFold/default-2        158.5n ± 1%   159.1n ± 1%        ~ (p=0.372 n=20)
geomean                     91.27n        80.46n       -11.85%

                          │   old.txt    │               new.txt               │
                          │     B/op     │    B/op     vs base                 │
_ToLowerBytes/fiber-2       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToLowerBytes/default-2     80.00 ± 0%     80.00 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/fiber-2       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/default-2     80.00 ± 0%     80.00 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/fiber-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/default-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/fiber-2          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/default-2        0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
geomean                                ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                          │   old.txt    │               new.txt               │
                          │  allocs/op   │ allocs/op   vs base                 │
_ToLowerBytes/fiber-2       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToLowerBytes/default-2     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/fiber-2       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/default-2     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/fiber-2     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/default-2   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/fiber-2          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/default-2        0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
geomean                                ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

Benchmarks on Apple M3 Max

goos: darwin
goarch: arm64
pkg: github.com/gofiber/utils/v2
cpu: Apple M3 Max
                           │   old.txt   │               new.txt               │
                           │   sec/op    │   sec/op     vs base                │
_ToLowerBytes/fiber-16       22.19n ± 1%   18.06n ± 1%  -18.59% (p=0.000 n=20)
_ToLowerBytes/default-16     75.92n ± 1%   76.93n ± 1%   +1.32% (p=0.002 n=20)
_ToUpperBytes/fiber-16       26.19n ± 0%   17.95n ± 1%  -31.46% (p=0.000 n=20)
_ToUpperBytes/default-16     81.73n ± 0%   81.56n ± 1%        ~ (p=0.331 n=20)
_EqualFoldBytes/fiber-16     29.25n ± 0%   31.05n ± 1%   +6.17% (p=0.000 n=20)
_EqualFoldBytes/default-16   58.96n ± 0%   61.63n ± 1%   +4.53% (p=0.000 n=20)
_EqualFold/fiber-16          39.42n ± 0%   30.68n ± 1%  -22.17% (p=0.000 n=20)
_EqualFold/default-16        58.99n ± 0%   59.39n ± 1%   +0.69% (p=0.000 n=20)
geomean                      44.16n        40.40n        -8.52%

                           │   old.txt    │               new.txt               │
                           │     B/op     │    B/op     vs base                 │
_ToLowerBytes/fiber-16       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToLowerBytes/default-16     80.00 ± 0%     80.00 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/fiber-16       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/default-16     80.00 ± 0%     80.00 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/fiber-16     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/default-16   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/fiber-16          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/default-16        0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
geomean                                 ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                           │   old.txt    │               new.txt               │
                           │  allocs/op   │ allocs/op   vs base                 │
_ToLowerBytes/fiber-16       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToLowerBytes/default-16     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/fiber-16       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_ToUpperBytes/default-16     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/fiber-16     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFoldBytes/default-16   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/fiber-16          0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
_EqualFold/default-16        0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=20) ¹
geomean                                 ²               +0.00%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

Summary by CodeRabbit

  • Chores

    • Internal performance optimizations for case-insensitive comparisons and case conversions via loop unrolling.
  • Tests

    • Benchmarks renamed for clarity, split into sub-benchmarks and now include per-run validation to better track performance.
  • Bug

    • An invalid iteration form was introduced that can cause build failures; a fix or revert is required to restore successful builds.

@coderabbitai
Contributor

coderabbitai bot commented Nov 3, 2025

Walkthrough

Replaces per-element loops with 4-item unrolled loops in ASCII case-conversion and case-insensitive comparison; updates IPv4 loop to an incorrect range form; splits IP benchmarks into sub-benchmarks and adds assertions. No exported APIs changed.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Loop unrolling: byte case conversion (bytes.go) | Replaced per-byte range loop in ToLowerBytes / ToUpperBytes with an explicit 4-byte unrolled loop using a local table alias and a residual tail loop; behavior and signatures unchanged. |
| Loop unrolling: case-insensitive compare (byteseq.go) | Replaced simple index-based equality loop with a 4-iteration unrolled block that uses a local table reference plus a tail loop; preserves behavior and public API. |
| IPv4 iteration change, potentially invalid (ips.go) | Changed `for i := 0; i < net.IPv4len; i++ {` to `for i := range net.IPv4len {` (converts the index-bound loop to a range over an integer constant); this form is likely incorrect and may not compile. |
| Benchmarks: split sub-benchmarks and assertions (ips_test.go) | Converted Benchmark_IsIPv4 and Benchmark_IsIPv6 to run two sub-benchmarks ("fiber" and "default" using net.ParseIP), added per-sub-benchmark result checks via require.True. |

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant ToLower_old as ToLowerBytes (old)
    participant ToLower_new as ToLowerBytes (new)
    note right of ToLower_old #f6f8ff: Old flow — simple per-byte loop
    Caller->>ToLower_old: call(bytes)
    ToLower_old->>ToLower_old: for i { out[i]=table[bytes[i]] }
    ToLower_old-->>Caller: result

    note right of ToLower_new #f0fff4: New flow — unrolled 4-byte blocks + tail
    Caller->>ToLower_new: call(bytes)
    ToLower_new->>ToLower_new: for i += 4 { map b0..b3 via table -> out[i..i+3] }
    ToLower_new->>ToLower_new: for remaining { map b -> out[i] }
    ToLower_new-->>Caller: result
sequenceDiagram
    participant Caller
    participant Eq_old as EqualFold (old)
    participant Eq_new as EqualFold (new)
    note right of Eq_old #fffaf0: Old flow — per-index compare
    Caller->>Eq_old: call(a,b)
    Eq_old->>Eq_old: for i { map[a[i]] == map[b[i]] ? continue : return false }
    Eq_old-->>Caller: true/false

    note right of Eq_new #f9f8ff: New flow — unrolled compare + tail
    Caller->>Eq_new: call(a,b)
    Eq_new->>Eq_new: for i += 4 { compare mapped a[i..i+3] to b[i..i+3] }
    Eq_new->>Eq_new: for remaining { compare mapped a[i] to b[i] }
    Eq_new-->>Caller: true/false

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify correctness for inputs with length not divisible by 4 and for very small lengths.
  • Inspect ips.go change — for i := range net.IPv4len appears invalid; confirm intended loop and restore index-based loop.
  • Confirm unrolled loops preserve exact mapping and have no subtle bounds or aliasing bugs.
  • Review benchmarks (ips_test.go) for correct result assertions and side effects.

Suggested reviewers

  • gaby
  • sixcolors
  • ReneWerner87
  • efectn

Poem

🐇 I hopped through bytes in rhythmic score,

Four at once, then one — then four once more.
Maps stay the same, I nudge the pace,
Benchmarks split, a tiny race.
A rabbit smiles — faster, with grace.

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'perf: unroll loops (saves ~20-40% cpu)' accurately reflects the main objective of the changeset. The code modifications across bytes.go, byteseq.go, and ips.go all involve replacing simple loops with unrolled loop implementations that process multiple bytes per iteration. The performance claim is substantiated by the benchmark results documented in the PR objectives, showing geomean latency improvements of 8-12% and specific function improvements up to 40%. The title is concise, clear, and directly summarizes the primary change. |
📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb8645b and 9ef41fd.

📒 Files selected for processing (1)
  • ips.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ips.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build (1.24.x, windows-latest)
  • GitHub Check: Build (1.23.x, windows-latest)
  • GitHub Check: Compare


@arturmelanchyk arturmelanchyk changed the title from "Unroll loops (saves ~20-40% cpu)" to "perf: unroll loops (saves ~20-40% cpu)" on Nov 3, 2025
@codecov

codecov bot commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 86.66667% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.31%. Comparing base (69c3d29) to head (9ef41fd).
⚠️ Report is 5 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| byteseq.go | 68.00% | 4 Missing and 4 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #158      +/-   ##
==========================================
+ Coverage   83.25%   83.31%   +0.05%     
==========================================
  Files          11       11              
  Lines         848      755      -93     
==========================================
- Hits          706      629      -77     
+ Misses        135      115      -20     
- Partials        7       11       +4     
| Flag | Coverage Δ |
|---|---|
| unittests | 83.31% <86.66%> (+0.05%) ⬆️ |


@ReneWerner87
Member

Nice catch

@arturmelanchyk arturmelanchyk marked this pull request as ready for review November 3, 2025 15:42
@arturmelanchyk arturmelanchyk requested a review from a team as a code owner November 3, 2025 15:42
@arturmelanchyk arturmelanchyk requested review from ReneWerner87, efectn, gaby and sixcolors and removed request for a team November 3, 2025 15:42
@arturmelanchyk
Contributor Author

@ReneWerner87 modernize lint seems unrelated to the changes made in the PR

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
bytes.go (1)

37-65: LGTM! Duplication is acceptable for this hot path.

The implementation mirrors ToLowerBytes exactly (differing only in the table used). While this creates duplication, the simplicity aids compiler optimization and inlining. Given the 18–31% performance improvements shown in the benchmarks, the duplication is justified for this performance-critical code.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 69c3d29 and 9ea03e1.

📒 Files selected for processing (2)
  • bytes.go (1 hunks)
  • byteseq.go (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T00:32:59.671Z
Learnt from: sixcolors
Repo: gofiber/fiber PR: 0
File: :0-0
Timestamp: 2025-09-18T00:32:59.671Z
Learning: In Go 1.22, manual copying of loop variables (like `x := x`) is no longer necessary due to the fix for variable capture in closures. Each loop iteration now creates its own instance of the loop variable automatically.

Applied to files:

  • byteseq.go
🔇 Additional comments (2)
byteseq.go (1)

12-54: LGTM! Well-optimized loop unrolling with correct bounds handling.

The 4x unrolled comparison loop is implemented correctly:

  • The limit := n &^ 3 computation properly rounds down to the nearest multiple of 4, ensuring all accesses b[i+0] through b[i+3] are in bounds when i < limit
  • Early return on mismatch in each unrolled block maintains correctness
  • Tail loop correctly handles the remaining 0–3 elements
  • Local table reference may help the compiler with optimization

The benchmarks confirm meaningful speedups (−21% to −23% on EqualFold/fiber). The increased verbosity is justified by the performance gains.
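The bounds argument in the list above can be sketched as follows. This is a minimal illustration of the pattern, not the code from byteseq.go: `foldTable` and `equalFoldASCII` are hypothetical names, and the table is assumed to map each byte to its ASCII-lowercase form.

```go
package main

import "fmt"

// foldTable is an assumed stand-in for the library's lookup table:
// it maps 'A'..'Z' to 'a'..'z' and leaves every other byte unchanged.
var foldTable = func() (t [256]byte) {
	for i := range t {
		c := byte(i)
		if c >= 'A' && c <= 'Z' {
			c += 'a' - 'A'
		}
		t[i] = c
	}
	return t
}()

// equalFoldASCII reports whether a and b are equal under ASCII case folding.
func equalFoldASCII(a, b string) bool {
	n := len(a)
	if n != len(b) {
		return false
	}
	table := &foldTable // local alias, as described in the review
	// n &^ 3 clears the two low bits, i.e. rounds n down to a multiple of 4,
	// so i+3 < limit <= n holds inside the unrolled loop.
	limit := n &^ 3
	i := 0
	for ; i < limit; i += 4 {
		if table[a[i+0]] != table[b[i+0]] {
			return false
		}
		if table[a[i+1]] != table[b[i+1]] {
			return false
		}
		if table[a[i+2]] != table[b[i+2]] {
			return false
		}
		if table[a[i+3]] != table[b[i+3]] {
			return false
		}
	}
	// Tail loop: the remaining 0 to 3 bytes.
	for ; i < n; i++ {
		if table[a[i]] != table[b[i]] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(equalFoldASCII("Content-Type", "content-type")) // true
	fmt.Println(equalFoldASCII("abcde", "ABCDE"))               // true (4-byte block plus 1-byte tail)
	fmt.Println(equalFoldASCII("abcd", "abce"))                 // false
}
```

For n = 13, for instance, limit is 12: the unrolled loop covers indices 0 through 11 and the tail loop handles index 12.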

bytes.go (1)

8-35: LGTM! Correct loop unrolling with safe in-place mutation.

The 4x unrolled transformation is implemented correctly:

  • Loading all 4 bytes into locals (b0 through b3) before writing back ensures safe in-place mutation
  • The limit calculation and tail loop are correct (same pattern as in byteseq.go)
  • The benchmarks show substantial improvements (−40% on ToLowerBytes/fiber for AMD EPYC)
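A minimal sketch of the load-then-store pattern described above, under stated assumptions: `lowerTable` and `toLowerInPlace` are hypothetical names, and the real ToLowerBytes in bytes.go may differ in detail.

```go
package main

import "fmt"

// lowerTable is an assumed stand-in for the library's lowercase table:
// 'A'..'Z' map to 'a'..'z', all other bytes map to themselves.
var lowerTable = func() (t [256]byte) {
	for i := range t {
		c := byte(i)
		if c >= 'A' && c <= 'Z' {
			c += 'a' - 'A'
		}
		t[i] = c
	}
	return t
}()

// toLowerInPlace lowercases ASCII letters in b, mutating and returning b.
func toLowerInPlace(b []byte) []byte {
	table := &lowerTable
	n := len(b)
	limit := n &^ 3 // largest multiple of 4 that is <= n
	i := 0
	for ; i < limit; i += 4 {
		// Load all four bytes into locals first, then write the mapped
		// values back to the same positions.
		b0 := b[i+0]
		b1 := b[i+1]
		b2 := b[i+2]
		b3 := b[i+3]
		b[i+0] = table[b0]
		b[i+1] = table[b1]
		b[i+2] = table[b2]
		b[i+3] = table[b3]
	}
	// Tail loop for the last 0 to 3 bytes.
	for ; i < n; i++ {
		b[i] = table[b[i]]
	}
	return b
}

func main() {
	fmt.Printf("%s\n", toLowerInPlace([]byte("Hello, WORLD!"))) // hello, world!
}
```

An uppercase variant would differ only in the table, which is the duplication the nitpick comment on bytes.go refers to.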

Signed-off-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>
func IsIPv4(s string) bool {
//nolint:modernize-loop // old way is more readable
- for i := 0; i < net.IPv4len; i++ {
+ for i := range net.IPv4len {
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ea03e1 and eb8645b.

📒 Files selected for processing (2)
  • ips.go (1 hunks)
  • ips_test.go (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: sixcolors
Repo: gofiber/fiber PR: 0
File: :0-0
Timestamp: 2025-09-18T00:32:59.671Z
Learning: In Go 1.22, manual copying of loop variables (like `x := x`) is no longer necessary due to the fix for variable capture in closures. Each loop iteration now creates its own instance of the loop variable automatically.
📚 Learning: 2025-09-18T00:32:59.671Z
Learnt from: sixcolors
Repo: gofiber/fiber PR: 0
File: :0-0
Timestamp: 2025-09-18T00:32:59.671Z
Learning: In Go 1.22, manual copying of loop variables (like `x := x`) is no longer necessary due to the fix for variable capture in closures. Each loop iteration now creates its own instance of the loop variable automatically.

Applied to files:

  • ips.go
📚 Learning: 2025-10-22T13:48:12.485Z
Learnt from: CR
Repo: gofiber/fiber PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-22T13:48:12.485Z
Learning: Run benchmarks with `go test` (Make target: benchmark)

Applied to files:

  • ips_test.go
📚 Learning: 2024-10-16T12:12:30.506Z
Learnt from: gaby
Repo: gofiber/fiber PR: 3170
File: ctx_test.go:1721-1724
Timestamp: 2024-10-16T12:12:30.506Z
Learning: In the Go unit tests in `ctx_test.go`, it is acceptable to use invalid CIDR notation such as `"0.0.0.1/31junk"` for testing purposes.

Applied to files:

  • ips_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Compare
🔇 Additional comments (2)
ips_test.go (2)

104-122: LGTM! Improved benchmark structure.

The addition of sub-benchmarks for "fiber" vs "default" implementations allows for clear performance comparison between the custom IsIPv4 function and the standard library's net.ParseIP. The final assertions prevent compiler optimizations from eliminating the benchmarked code.
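The sub-benchmark layout described above might look roughly like this. It is a sketch under assumptions, not the code from ips_test.go: `isIPv4Sketch` is a hypothetical stand-in for the library's IsIPv4, `b.Fatal` replaces the `require.True` check so the example needs only the standard library, and `testing.Benchmark` is used so it can run outside `go test`.

```go
package main

import (
	"fmt"
	"net"
	"testing"
)

// isIPv4Sketch is a hypothetical hand-rolled validator, NOT the library's
// IsIPv4: four decimal octets (0-255, no leading zeros) separated by dots.
func isIPv4Sketch(s string) bool {
	for i := 0; i < net.IPv4len; i++ {
		if i > 0 {
			if len(s) == 0 || s[0] != '.' {
				return false
			}
			s = s[1:]
		}
		n, digits := 0, 0
		for digits < len(s) && s[digits] >= '0' && s[digits] <= '9' {
			n = n*10 + int(s[digits]-'0')
			if n > 0xFF {
				return false
			}
			digits++
		}
		if digits == 0 || (digits > 1 && s[0] == '0') {
			return false
		}
		s = s[digits:]
	}
	return len(s) == 0
}

// benchmarkIsIPv4 runs one sub-benchmark per implementation. In a real
// package it would be named BenchmarkXxx and live in a _test.go file.
func benchmarkIsIPv4(b *testing.B) {
	const ip = "174.23.33.100"

	b.Run("fiber", func(b *testing.B) {
		var ok bool
		b.ReportAllocs()
		for n := 0; n < b.N; n++ {
			ok = isIPv4Sketch(ip)
		}
		// Using the result after the loop keeps the call observable and
		// doubles as a correctness check.
		if !ok {
			b.Fatal("expected a valid IPv4 address")
		}
	})

	b.Run("default", func(b *testing.B) {
		var ok bool
		b.ReportAllocs()
		for n := 0; n < b.N; n++ {
			ok = net.ParseIP(ip) != nil
		}
		if !ok {
			b.Fatal("expected a valid IPv4 address")
		}
	})
}

func main() {
	// Runs both sub-benchmarks once through the standard harness.
	testing.Benchmark(benchmarkIsIPv4)
	fmt.Println("sub-benchmarks completed")
}
```

Under `go test -bench`, each sub-benchmark is reported on its own line, which is what makes the fiber and default rows in benchstat output directly comparable.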


124-142: LGTM! Consistent benchmark improvements.

The benchmark structure mirrors the IPv4 changes appropriately, enabling performance comparison between the custom IsIPv6 implementation and the standard library.

Signed-off-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>
@arturmelanchyk arturmelanchyk merged commit 9417f80 into gofiber:master Nov 3, 2025
19 checks passed
@arturmelanchyk
Contributor Author

Hi @ReneWerner87 any plans on releasing this and upgrading dependency version in fiber repo?

@ReneWerner87
Member

On the weekend
