perf: unroll loops (saves ~20-40% cpu)#158
Conversation
WalkthroughReplaces per-element loops with 4-item unrolled loops in ASCII case-conversion and case-insensitive comparison; updates IPv4 loop to an incorrect range form; splits IP benchmarks into sub-benchmarks and adds assertions. No exported APIs changed. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant ToLower_old as ToLowerBytes (old)
participant ToLower_new as ToLowerBytes (new)
note right of ToLower_old #f6f8ff: Old flow — simple per-byte loop
Caller->>ToLower_old: call(bytes)
ToLower_old->>ToLower_old: for i { out[i]=table[bytes[i]] }
ToLower_old-->>Caller: result
note right of ToLower_new #f0fff4: New flow — unrolled 4-byte blocks + tail
Caller->>ToLower_new: call(bytes)
ToLower_new->>ToLower_new: for i += 4 { map b0..b3 via table -> out[i..i+3] }
ToLower_new->>ToLower_new: for remaining { map b -> out[i] }
ToLower_new-->>Caller: result
sequenceDiagram
participant Caller
participant Eq_old as EqualFold (old)
participant Eq_new as EqualFold (new)
note right of Eq_old #fffaf0: Old flow — per-index compare
Caller->>Eq_old: call(a,b)
Eq_old->>Eq_old: for i { map[a[i]] == map[b[i]] ? continue : return false }
Eq_old-->>Caller: true/false
note right of Eq_new #f9f8ff: New flow — unrolled compare + tail
Caller->>Eq_new: call(a,b)
Eq_new->>Eq_new: for i += 4 { compare mapped a[i..i+3] to b[i..i+3] }
Eq_new->>Eq_new: for remaining { compare mapped a[i] to b[i] }
Eq_new-->>Caller: true/false
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #158 +/- ##
==========================================
+ Coverage 83.25% 83.31% +0.05%
==========================================
Files 11 11
Lines 848 755 -93
==========================================
- Hits 706 629 -77
+ Misses 135 115 -20
- Partials 7 11 +4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Nice catch |
|
@ReneWerner87 modernize lint seems unrelated to the changes made in the PR |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
bytes.go (1)
37-65: LGTM! Duplication is acceptable for this hot path.The implementation mirrors
ToLowerBytesexactly (differing only in the table used). While this creates duplication, the simplicity aids compiler optimization and inlining. Given the 18–31% performance improvements shown in the benchmarks, the duplication is justified for this performance-critical code.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
bytes.go(1 hunks)byteseq.go(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-18T00:32:59.671Z
Learnt from: sixcolors
Repo: gofiber/fiber PR: 0
File: :0-0
Timestamp: 2025-09-18T00:32:59.671Z
Learning: In Go 1.22, manual copying of loop variables (like `x := x`) is no longer necessary due to the fix for variable capture in closures. Each loop iteration now creates its own instance of the loop variable automatically.
Applied to files:
byteseq.go
🔇 Additional comments (2)
byteseq.go (1)
12-54: LGTM! Well-optimized loop unrolling with correct bounds handling.The 4x unrolled comparison loop is implemented correctly:
- The
limit := n &^ 3computation properly rounds down to the nearest multiple of 4, ensuring all accessesb[i+0]throughb[i+3]are in bounds wheni < limit- Early return on mismatch in each unrolled block maintains correctness
- Tail loop correctly handles the remaining 0–3 elements
- Local table reference may help the compiler with optimization
The benchmarks confirm meaningful speedups (−21% to −23% on EqualFold/fiber). The increased verbosity is justified by the performance gains.
bytes.go (1)
8-35: LGTM! Correct loop unrolling with safe in-place mutation.The 4x unrolled transformation is implemented correctly:
- Loading all 4 bytes into locals (
b0throughb3) before writing back ensures safe in-place mutation- The
limitcalculation and tail loop are correct (same pattern as inbyteseq.go)- The benchmarks show substantial improvements (−40% on ToLowerBytes/fiber for AMD EPYC)
Signed-off-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>
| func IsIPv4(s string) bool { | ||
| //nolint:modernize-loop // old way is more readable | ||
| for i := 0; i < net.IPv4len; i++ { | ||
| for i := range net.IPv4len { |
There was a problem hiding this comment.
this was failing modernize lint https://github.com/gofiber/utils/actions/runs/19038208035/job/54368027903
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
ips.go(1 hunks)ips_test.go(2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: sixcolors
Repo: gofiber/fiber PR: 0
File: :0-0
Timestamp: 2025-09-18T00:32:59.671Z
Learning: In Go 1.22, manual copying of loop variables (like `x := x`) is no longer necessary due to the fix for variable capture in closures. Each loop iteration now creates its own instance of the loop variable automatically.
📚 Learning: 2025-09-18T00:32:59.671Z
Learnt from: sixcolors
Repo: gofiber/fiber PR: 0
File: :0-0
Timestamp: 2025-09-18T00:32:59.671Z
Learning: In Go 1.22, manual copying of loop variables (like `x := x`) is no longer necessary due to the fix for variable capture in closures. Each loop iteration now creates its own instance of the loop variable automatically.
Applied to files:
ips.go
📚 Learning: 2025-10-22T13:48:12.485Z
Learnt from: CR
Repo: gofiber/fiber PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-22T13:48:12.485Z
Learning: Run benchmarks with `go test` (Make target: benchmark)
Applied to files:
ips_test.go
📚 Learning: 2024-10-16T12:12:30.506Z
Learnt from: gaby
Repo: gofiber/fiber PR: 3170
File: ctx_test.go:1721-1724
Timestamp: 2024-10-16T12:12:30.506Z
Learning: In the Go unit tests in `ctx_test.go`, it is acceptable to use invalid CIDR notation such as `"0.0.0.1/31junk"` for testing purposes.
Applied to files:
ips_test.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Compare
🔇 Additional comments (2)
ips_test.go (2)
104-122: LGTM! Improved benchmark structure.The addition of sub-benchmarks for "fiber" vs "default" implementations allows for clear performance comparison between the custom IsIPv4 function and the standard library's net.ParseIP. The final assertions prevent compiler optimizations from eliminating the benchmarked code.
124-142: LGTM! Consistent benchmark improvements.The benchmark structure mirrors the IPv4 changes appropriately, enabling performance comparison between the custom IsIPv6 implementation and the standard library.
Signed-off-by: Artur Melanchyk <13834276+arturmelanchyk@users.noreply.github.com>
|
Hi @ReneWerner87 any plans on releasing this and upgrading dependency version in fiber repo? |
|
On the weekend |
Benchmarks on
AMD EPYCBenchmarks on Apple M3 Max
Summary by CodeRabbit
Chores
Tests
Bug