xsk: i40e: Tx performance improvements#320
Closed
kernel-patches-bot wants to merge 6 commits intobpf-nextfrom
Closed
xsk: i40e: Tx performance improvements#320kernel-patches-bot wants to merge 6 commits intobpf-nextfrom
kernel-patches-bot wants to merge 6 commits intobpf-nextfrom
Conversation
Author
|
Master branch: f52b8fd |
Author
|
Master branch: 0e6f601 |
f68bf48 to
f99f7ee
Compare
Increment the statistics over how many Tx packets have been sent at the time of sending instead of at the time of completion. This as a completion event means that the buffer has been sent AND returned to user space. The packet always gets sent shortly after sendto() is called. The kernel might, for performance reasons, decide to not return every single buffer to user space immediately after sending, for example, only after a batch of packets have been transmitted. Incrementing the number of packets sent at completion, will in that case be confusing as if you send a single packet, the counter might show zero for a while even though the packet has been transmitted. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com>
Remove the unnecessary access to the software ring for the AF_XDP zero-copy driver. This was used to record the length of the packet so that the driver Tx completion code could sum this up to produce the total bytes sent. This is now performed during the transmission of the packet, so no need to record this in the software ring. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com>
Introduce one cache line worth of padding between the consumer pointer and the flags field as well as between the flags field and the start of the descriptors in all the lockless rings. This so that the x86 HW adjacency prefetcher will not prefetch the adjacent pointer/field when only one pointer/field is going to be used. This improves throughput performance for the l2fwd sample app with 1% on my machine with HW prefetching turned on in the BIOS. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com>
Introduce batched descriptor interfaces in the xsk core code for the Tx path to be used in the driver to write a code path with higher performance. This interface will be used by the i40e driver in the next patch. Though other drivers would likely benefit from this new interface too. Note that batching is only implemented for the common case when there is only one socket bound to the same device and queue id. When this is not the case, we fall back to the old non-batched version of the function. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Use the new batched xsk interfaces for the Tx path in the i40e driver to improve performance. On my machine, this yields a throughput increase of 4% for the l2fwd sample app in xdpsock. If we instead just look at the Tx part, this patch set increases throughput with above 20% for Tx. Note that I had to explicitly loop unroll the inner loop to get to this performance level, by using a pragma. It is honored by both clang and gcc and should be ignored by versions that do not support it. Using the -funroll-loops compiler command line switch on the source file resulted in a loop unrolling on a higher level that lead to a performance decrease instead of an increase. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Acked-by: John Fastabend <john.fastabend@gmail.com>
Author
|
Master branch: 9600d62 |
f99f7ee to
8df8e06
Compare
Author
|
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=381107 expired. Closing PR. |
kernel-patches-bot
pushed a commit
that referenced
this pull request
Jun 28, 2021
…format For incoming SCO connection with transparent coding format, alt setting of CVSD is getting applied instead of Transparent. Before fix: < HCI Command: Accept Synchron.. (0x01|0x0029) plen 21 #2196 [hci0] 321.342548 Address: 1C:CC:D6:E2:EA:80 (Xiaomi Communications Co Ltd) Transmit bandwidth: 8000 Receive bandwidth: 8000 Max latency: 13 Setting: 0x0003 Input Coding: Linear Input Data Format: 1's complement Input Sample Size: 8-bit # of bits padding at MSB: 0 Air Coding Format: Transparent Data Retransmission effort: Optimize for link quality (0x02) Packet type: 0x003f HV1 may be used HV2 may be used HV3 may be used EV3 may be used EV4 may be used EV5 may be used > HCI Event: Command Status (0x0f) plen 4 #2197 [hci0] 321.343585 Accept Synchronous Connection Request (0x01|0x0029) ncmd 1 Status: Success (0x00) > HCI Event: Synchronous Connect Comp.. (0x2c) plen 17 #2198 [hci0] 321.351666 Status: Success (0x00) Handle: 257 Address: 1C:CC:D6:E2:EA:80 (Xiaomi Communications Co Ltd) Link type: eSCO (0x02) Transmission interval: 0x0c Retransmission window: 0x04 RX packet length: 60 TX packet length: 60 Air mode: Transparent (0x03) ........ > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2336 [hci0] 321.383655 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2337 [hci0] 321.389558 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2338 [hci0] 321.393615 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2339 [hci0] 321.393618 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2340 [hci0] 321.393618 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2341 [hci0] 321.397070 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2342 [hci0] 321.403622 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2343 [hci0] 321.403625 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2344 [hci0] 321.403625 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2345 [hci0] 321.403625 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2346 [hci0] 321.404569 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2347 [hci0] 321.412091 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2348 [hci0] 321.413626 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2349 [hci0] 321.413630 > SCO Data RX: Handle 257 flags 0x00 dlen 48 #2350 [hci0] 321.413630 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #2351 [hci0] 321.419674 After fix: < HCI Command: Accept Synchronou.. (0x01|0x0029) plen 21 #309 [hci0] 49.439693 Address: 1C:CC:D6:E2:EA:80 (Xiaomi Communications Co Ltd) Transmit bandwidth: 8000 Receive bandwidth: 8000 Max latency: 13 Setting: 0x0003 Input Coding: Linear Input Data Format: 1's complement Input Sample Size: 8-bit # of bits padding at MSB: 0 Air Coding Format: Transparent Data Retransmission effort: Optimize for link quality (0x02) Packet type: 0x003f HV1 may be used HV2 may be used HV3 may be used EV3 may be used EV4 may be used EV5 may be used > HCI Event: Command Status (0x0f) plen 4 #310 [hci0] 49.440308 Accept Synchronous Connection Request (0x01|0x0029) ncmd 1 Status: Success (0x00) > HCI Event: Synchronous Connect Complete (0x2c) plen 17 #311 [hci0] 49.449308 Status: Success (0x00) Handle: 257 Address: 1C:CC:D6:E2:EA:80 (Xiaomi Communications Co Ltd) Link type: eSCO (0x02) Transmission interval: 0x0c Retransmission window: 0x04 RX packet length: 60 TX packet length: 60 Air mode: Transparent (0x03) < SCO Data TX: Handle 257 flags 0x00 dlen 60 #312 [hci0] 49.450421 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #313 [hci0] 49.457927 > HCI Event: Max Slots Change (0x1b) plen 3 #314 [hci0] 49.460345 Handle: 256 Max slots: 5 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #315 [hci0] 49.465453 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #316 [hci0] 49.470502 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #317 [hci0] 49.470519 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #318 [hci0] 49.472996 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #319 [hci0] 49.480412 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #320 [hci0] 49.480492 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #321 [hci0] 49.487989 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #322 [hci0] 49.490303 < SCO Data TX: Handle 257 flags 0x00 dlen 60 #323 [hci0] 49.495496 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #324 [hci0] 49.500304 > SCO Data RX: Handle 257 flags 0x00 dlen 60 #325 [hci0] 49.500311 Signed-off-by: Kiran K <kiran.k@intel.com> Signed-off-by: Lokendra Singh <lokendra.singh@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
kernel-patches-daemon-bpf bot
pushed a commit
that referenced
this pull request
Mar 5, 2024
Recent additions in BPF like cpu v4 instructions, test_bpf module exhibits the following failures: test_bpf: #82 ALU_MOVSX | BPF_B jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times) test_bpf: #83 ALU_MOVSX | BPF_H jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times) test_bpf: #84 ALU64_MOVSX | BPF_B jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times) test_bpf: #85 ALU64_MOVSX | BPF_H jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times) test_bpf: #86 ALU64_MOVSX | BPF_W jited:1 ret 2 != 1 (0x2 != 0x1)FAIL (1 times) test_bpf: #165 ALU_SDIV_X: -6 / 2 = -3 jited:1 ret 2147483645 != -3 (0x7ffffffd != 0xfffffffd)FAIL (1 times) test_bpf: #166 ALU_SDIV_K: -6 / 2 = -3 jited:1 ret 2147483645 != -3 (0x7ffffffd != 0xfffffffd)FAIL (1 times) test_bpf: #169 ALU_SMOD_X: -7 % 2 = -1 jited:1 ret 1 != -1 (0x1 != 0xffffffff)FAIL (1 times) test_bpf: #170 ALU_SMOD_K: -7 % 2 = -1 jited:1 ret 1 != -1 (0x1 != 0xffffffff)FAIL (1 times) test_bpf: #172 ALU64_SMOD_K: -7 % 2 = -1 jited:1 ret 1 != -1 (0x1 != 0xffffffff)FAIL (1 times) test_bpf: #313 BSWAP 16: 0x0123456789abcdef -> 0xefcd eBPF filter opcode 00d7 (@2) unsupported jited:0 301 PASS test_bpf: #314 BSWAP 32: 0x0123456789abcdef -> 0xefcdab89 eBPF filter opcode 00d7 (@2) unsupported jited:0 555 PASS test_bpf: #315 BSWAP 64: 0x0123456789abcdef -> 0x67452301 eBPF filter opcode 00d7 (@2) unsupported jited:0 268 PASS test_bpf: #316 BSWAP 64: 0x0123456789abcdef >> 32 -> 0xefcdab89 eBPF filter opcode 00d7 (@2) unsupported jited:0 269 PASS test_bpf: #317 BSWAP 16: 0xfedcba9876543210 -> 0x1032 eBPF filter opcode 00d7 (@2) unsupported jited:0 460 PASS test_bpf: #318 BSWAP 32: 0xfedcba9876543210 -> 0x10325476 eBPF filter opcode 00d7 (@2) unsupported jited:0 320 PASS test_bpf: #319 BSWAP 64: 0xfedcba9876543210 -> 0x98badcfe eBPF filter opcode 00d7 (@2) unsupported jited:0 222 PASS test_bpf: #320 BSWAP 64: 0xfedcba9876543210 >> 32 -> 0x10325476 eBPF filter opcode 00d7 (@2) unsupported jited:0 273 PASS test_bpf: #344 BPF_LDX_MEMSX | BPF_B eBPF filter opcode 0091 (@5) unsupported jited:0 432 PASS test_bpf: #345 BPF_LDX_MEMSX | BPF_H eBPF filter opcode 0089 (@5) unsupported jited:0 381 PASS test_bpf: #346 BPF_LDX_MEMSX | BPF_W eBPF filter opcode 0081 (@5) unsupported jited:0 505 PASS test_bpf: #490 JMP32_JA: Unconditional jump: if (true) return 1 eBPF filter opcode 0006 (@1) unsupported jited:0 261 PASS test_bpf: Summary: 1040 PASSED, 10 FAILED, [924/1038 JIT'ed] Fix them by adding missing processing. Fixes: daabb2b ("bpf/tests: add tests for cpuv4 instructions") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull request for series with
subject: xsk: i40e: Tx performance improvements
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=381107