[RISC-V RVV] Prune redundant mempy and ensure max memcpy length#1921
[RISC-V RVV] Prune redundant mempy and ensure max memcpy length#1921Dead2 merged 1 commit intozlib-ng:developfrom
Conversation
1. Skip aligning memcpy when dist >= len. Obviously aligning memcpy is redundant when dist >= len which contains extra very slow load&store instrutions. And I noticed that dist is way larger than len in most cases by adding printf in chunkcopy_rvv with apt install (very narrow situation but makes sense). So I tend to move the comparing before aligning memcpy since it is only needed by the overlap situation. 2. Make the largest copy while len > dist. Chunkcopy_rvv only copies as much memory as possible once after aligning memcpy then uses sizeof(chunk_t) to finish the rest copying. However, we should do the largest copy as long as len < dist.
WalkthroughThe Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant CHUNKCOPY
Caller->>CHUNKCOPY: Call CHUNKCOPY(out, from, len)
alt No overlap
CHUNKCOPY->>CHUNKCOPY: memcpy(out, from, len)
else Overlap detected
CHUNKCOPY->>CHUNKCOPY: Calculate alignment (vl)
CHUNKCOPY->>CHUNKCOPY: memcpy initial aligned chunk
loop While len > distance
CHUNKCOPY->>CHUNKCOPY: memcpy chunk of size vl
end
CHUNKCOPY->>CHUNKCOPY: memcpy remaining bytes
end
CHUNKCOPY-->>Caller: Return
Possibly related PRs
Suggested labels
Suggested reviewers
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
arch/riscv/chunkset_rvv.c (1)
103-109: Consider recalculating copy size in each iteration for optimal performance.The current implementation uses a fixed
vlsize throughout the loop. According to the PR objectives, you want to "ensure that the largest possible memory copy is performed while len is greater than dist." Aslendecreases in each iteration, the safe copy size might change.Consider recalculating the maximum safe copy size in each iteration:
- size_t vl = (dist / sizeof(chunk_t)) * sizeof(chunk_t); - while (len > dist) { - memcpy(out, from, vl); - out += vl; - from += vl; - len -= vl; + while (len > dist) { + size_t copy_size = (dist / sizeof(chunk_t)) * sizeof(chunk_t); + if (copy_size == 0) copy_size = dist; + memcpy(out, from, copy_size); + out += copy_size; + from += copy_size; + len -= copy_size; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
arch/riscv/chunkset_rvv.c(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
arch/riscv/chunkset_rvv.c (2)
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
| size_t vl = (dist / sizeof(chunk_t)) * sizeof(chunk_t); | ||
| while (len > dist) { |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Ensure vl is non-zero to avoid ineffective loop.
When dist < sizeof(chunk_t), the calculation (dist / sizeof(chunk_t)) * sizeof(chunk_t) results in vl = 0, making the while loop ineffective. This could lead to performance degradation as no bytes would be copied in the loop.
Consider ensuring a minimum copy size:
- size_t vl = (dist / sizeof(chunk_t)) * sizeof(chunk_t);
+ size_t vl = dist < sizeof(chunk_t) ? dist : (dist / sizeof(chunk_t)) * sizeof(chunk_t);🤖 Prompt for AI Agents
In arch/riscv/chunkset_rvv.c around lines 103 to 104, the variable vl can be
zero when dist is less than sizeof(chunk_t), causing the while loop to be
ineffective and potentially degrade performance. Modify the calculation of vl to
ensure it is never zero by setting a minimum value of sizeof(chunk_t) when the
computed vl would be zero, so the loop copies at least one chunk each iteration.
There was a problem hiding this comment.
When len is more than dist, it makes no sense to copy more than len bytes as the remaining bytes are undefined, thus the code falls back to using memcpy which can handle both overlapping copy and remaining bytes after reading full chunks worth of data. len must be at least as much as size of chunk_t type after aligning for chunked copy to work correctly and not cause buffer overrun.
There was a problem hiding this comment.
@mtl1979 Thank you for the clarification! You're absolutely right, and I apologize for the incorrect suggestion.
The behavior where vl = 0 when dist < sizeof(chunk_t) is indeed intentional and correct. When the distance is smaller than a chunk size, it's not safe to copy full chunks, so the while loop appropriately doesn't execute and the function falls back to using memcpy for the remaining bytes. This aligns with the existing learning about the deliberate alignment optimization in this function.
The logic is well-designed: the function only attempts to copy full chunk-sized blocks when it's safe to do so (when dist >= sizeof(chunk_t)), and otherwise relies on memcpy to handle the overlapping copy correctly.
✏️ Learnings added
Learnt from: mtl1979
PR: zlib-ng/zlib-ng#1921
File: arch/riscv/chunkset_rvv.c:103-104
Timestamp: 2025-06-10T07:38:03.285Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, when dist < sizeof(chunk_t), the vl variable intentionally becomes 0, causing the while loop to not execute. This is correct behavior because copying full chunks is not safe when the distance is smaller than chunk size, and the function appropriately falls back to memcpy for handling remaining bytes.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: mtl1979
PR: zlib-ng/zlib-ng#1921
File: arch/riscv/chunkset_rvv.c:97-102
Timestamp: 2025-06-09T16:46:28.443Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, the alignment logic intentionally copies sizeof(chunk_t) bytes but advances pointers by align bytes to avoid unaligned access. This "re-reading" technique is a deliberate optimization to maintain proper alignment for subsequent operations, not a bug.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-07T22:00:02.180Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
Learnt from: KungFuJesus
PR: zlib-ng/zlib-ng#1802
File: chunkset_tpl.h:92-108
Timestamp: 2024-10-08T19:37:14.998Z
Learning: In the `HALFCHUNKCOPY` function in `chunkset_tpl.h`, the `len` parameter is always bound to be small by the callers and will not exceed the limits of an `int32_t`.
| int32_t align = ((len - 1) % sizeof(chunk_t)) + 1; | ||
| memcpy(out, from, sizeof(chunk_t)); | ||
| out += align; | ||
| from += align; | ||
| len -= align; | ||
|
|
There was a problem hiding this comment.
Fix the alignment logic to prevent buffer corruption.
The current implementation has a critical issue: it copies sizeof(chunk_t) bytes but advances the pointers by align bytes, which can be different values. This causes incorrect memory copying when align != sizeof(chunk_t).
For example, if len = 33 and sizeof(chunk_t) = 32:
align = ((33 - 1) % 32) + 1 = 1- The code copies 32 bytes but only advances pointers by 1 byte
- This results in re-reading already copied data and potential buffer corruption
Consider this fix to ensure consistent copy size and pointer advancement:
- int32_t align = ((len - 1) % sizeof(chunk_t)) + 1;
- memcpy(out, from, sizeof(chunk_t));
- out += align;
- from += align;
- len -= align;
+ size_t copy_size = (len < sizeof(chunk_t)) ? len : sizeof(chunk_t);
+ memcpy(out, from, copy_size);
+ out += copy_size;
+ from += copy_size;
+ len -= copy_size;🤖 Prompt for AI Agents
In arch/riscv/chunkset_rvv.c around lines 97 to 102, the code copies
sizeof(chunk_t) bytes but advances the pointers by align bytes, causing buffer
corruption when align differs from sizeof(chunk_t). To fix this, ensure that the
number of bytes copied and the pointer increments are consistent by using the
same value for both operations, either copying and advancing by align bytes or
by sizeof(chunk_t) bytes, depending on the intended behavior.
There was a problem hiding this comment.
Re-reading is intended, it avoids unaligned access.
There was a problem hiding this comment.
@mtl1979 Thank you for the clarification! You're absolutely right - this is an intentional alignment optimization technique. I apologize for flagging this as a potential issue.
The code deliberately copies sizeof(chunk_t) bytes but advances pointers by only align bytes to ensure proper alignment for subsequent memory operations. This "re-reading" approach is a common optimization in high-performance memory copy functions, especially on architectures like RISC-V where unaligned access can be costly.
I appreciate you taking the time to explain the intent behind this implementation.
✏️ Learnings added
Learnt from: mtl1979
PR: zlib-ng/zlib-ng#1921
File: arch/riscv/chunkset_rvv.c:97-102
Timestamp: 2025-06-09T16:46:28.443Z
Learning: In RISC-V chunkset_rvv.c CHUNKCOPY function, the alignment logic intentionally copies sizeof(chunk_t) bytes but advances pointers by align bytes to avoid unaligned access. This "re-reading" technique is a deliberate optimization to maintain proper alignment for subsequent operations, not a bug.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #1921 +/- ##
===========================================
- Coverage 81.78% 81.77% -0.01%
===========================================
Files 147 147
Lines 13321 13319 -2
Branches 2955 2954 -1
===========================================
- Hits 10894 10892 -2
+ Misses 1511 1499 -12
- Partials 916 928 +12 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This most likely needs a rebase to rerun failed CI checks. At least those MSVC 2019 workflows should then pass as no relevant changes are in this PR. |
Optimize chunkcopy_rvv:
Skip aligning memcpy when dist >= len. Obviously aligning memcpy is redundant when dist >= len which contains extra very slow load&store instrutions. And I noticed that dist is way larger than len in most cases by adding printf in chunkcopy_rvv with apt install (very narrow situation but makes sense). So I tend to move the comparing before aligning memcpy since it is only needed by the overlap situation.
Make the largest copy while len > dist. Chunkcopy_rvv only copies as much memory as possible once after aligning memcpy then uses sizeof(chunk_t) to finish the rest copying. However, we should do the largest copy as long as len < dist.
Summary by CodeRabbit