Cherrypick #8657 and #8667 to v1.77.x #8690

Merged: arjan-bal merged 2 commits into grpc:v1.77.x from arjan-bal:cherrypick-copyless-framer on Nov 3, 2025

@arjan-bal

Contributor

Original PRs: #8657, #8667

RELEASE NOTES:

  • transport: Avoid copies when reading and writing Data frames.

This change incorporates changes from
golang/go#73560 to split reading HTTP/2 frame
headers and payloads. If the frame is not a Data frame, it's read
through the standard library framer as before. For Data frames, the
payload is read directly into a buffer from the buffer pool to avoid
copying it from the framer's buffer.

## Testing
For 1 MB payloads, this results in ~4% improvement in throughput.

```sh
# test command
go run benchmark/benchmain/main.go -benchtime=60s -workloads=streaming \
   -compression=off -maxConcurrentCalls=120 -trace=off \
   -reqSizeBytes=1000000 -respSizeBytes=1000000 -networkMode=Local -resultFile="${RUN_NAME}"

# comparison
go run benchmark/benchresult/main.go streaming-before streaming-after  
               Title       Before        After Percentage
            TotalOps        87536        91120     4.09%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op   4074102.92   4070489.30    -0.09%
           Allocs/op        83.60        76.55    -8.37%
             ReqT/op 11671466666.67 12149333333.33     4.09%
            RespT/op 11671466666.67 12149333333.33     4.09%
            50th-Lat  78.209875ms  75.159943ms    -3.90%
            90th-Lat 117.764228ms   107.8697ms    -8.40%
            99th-Lat 146.935704ms 139.069685ms    -5.35%
             Avg-Lat  82.310691ms  79.073282ms    -3.93%
           GoVersion     go1.24.7     go1.24.7
         GrpcVersion   1.77.0-dev   1.77.0-dev
```

For smaller payloads, the difference is minor.
```sh
go run benchmark/benchmain/main.go -benchtime=60s -workloads=streaming \
   -compression=off -maxConcurrentCalls=120 -trace=off \
   -reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"

go run benchmark/benchresult/main.go streaming-before streaming-after 
               Title       Before        After Percentage
            TotalOps     21490752     21477822    -0.06%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      1902.92      1902.94     0.00%
           Allocs/op        29.21        29.21     0.00%
             ReqT/op 286543360.00 286370960.00    -0.06%
            RespT/op 286543360.00 286370960.00    -0.06%
            50th-Lat    352.505µs    352.247µs    -0.07%
            90th-Lat    433.446µs    434.907µs     0.34%
            99th-Lat    536.445µs    539.759µs     0.62%
             Avg-Lat    333.403µs    333.457µs     0.02%
           GoVersion     go1.24.7     go1.24.7
         GrpcVersion   1.77.0-dev   1.77.0-dev
```

RELEASE NOTES:
* transport: Avoid a buffer copy when reading data.
…c#8667)

This PR removes 2 buffer copies while writing data frames to the
underlying net.Conn: one [within
gRPC](https://github.com/grpc/grpc-go/blob/58d4b2b1492dbcfdf26daa7ed93830ebb871faf1/internal/transport/controlbuf.go#L1009-L1022)
and the other [in the
framer](https://cs.opensource.google/go/x/net/+/master:http2/frame.go;l=743;drc=6e243da531559f8c99439dabc7647dec07191f9b).
Care is taken to avoid any extra heap allocations which can affect
performance for smaller payloads.

A [CL](https://go-review.git.corp.google.com/c/net/+/711620) is out for
review which allows using the framer to write frame headers. This PR
duplicates the header writing code as a temporary workaround. This PR
will be merged only after the CL is merged.

## Results

### Small payloads
Performance for small payloads increases slightly due to the removal
of a `defer` statement.
```
$ go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
   -compression=off -maxConcurrentCalls=120 -trace=off \
   -reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"

$ go run benchmark/benchresult/main.go unary-before unary-after
               Title       Before        After Percentage
            TotalOps      7600878      7653522     0.69%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     10007.07     10000.89    -0.07%
           Allocs/op       146.93       146.91     0.00%
             ReqT/op 101345040.00 102046960.00     0.69%
            RespT/op 101345040.00 102046960.00     0.69%
            50th-Lat    833.724µs    830.041µs    -0.44%
            90th-Lat   1.281969ms   1.275336ms    -0.52%
            99th-Lat   2.403961ms   2.360606ms    -1.80%
             Avg-Lat    946.123µs    939.734µs    -0.68%
           GoVersion     go1.24.8     go1.24.8
         GrpcVersion   1.77.0-dev   1.77.0-dev
```

### Large payloads
Local benchmarks show a ~5-10% regression with 1 MB payloads on my dev
machine. The profiles show increased time spent in the copy operation
[inside the buffered
writer](https://github.com/grpc/grpc-go/blob/58d4b2b1492dbcfdf26daa7ed93830ebb871faf1/internal/transport/http_util.go#L334).
Counterintuitively, copying the grpc header and message data into a
larger buffer increased the performance by 4% (compared to master).

To validate this behaviour (extra copy increasing performance) I ran
[the k8s benchmark for 1MB
payloads](https://github.com/grpc/grpc/blob/65c9be86830b0e423dd970c066c69a06a9240298/tools/run_tests/performance/scenario_config.py#L291-L305)
and 100 concurrent streams which showed ~5% increase in QPS without the
copies across multiple runs. Adding a copy reduced the performance.

Load test config file:
[loadtest.yaml](https://github.com/user-attachments/files/23055312/loadtest.yaml)

```
# 30 core client and server
Before
QPS: 498.284 (16.6095/server core)
Latencies (50/90/95/99/99.9%-ile): 233256/275972/281250/291803/298533 us
Server system time: 93.0164
Server user time:   142.533
Client system time: 97.2688
Client user time:   144.542

After
QPS: 526.776 (17.5592/server core)
Latencies (50/90/95/99/99.9%-ile): 211010/263189/270969/280656/288828 us
Server system time: 96.5959
Server user time:   147.668
Client system time: 101.973
Client user time:   150.234

# 8 core client and server
Before
QPS: 291.049 (36.3811/server core)
Latencies (50/90/95/99/99.9%-ile): 294552/685822/903554/1.48399e+06/1.50757e+06 us
Server system time: 49.0355
Server user time:   87.1783
Client system time: 60.1945
Client user time:   103.633

After
QPS: 334.119 (41.7649/server core)
Latencies (50/90/95/99/99.9%-ile): 279395/518849/706327/1.09273e+06/1.11629e+06 us
Server system time: 69.3136
Server user time:   102.549
Client system time: 80.9804
Client user time:   107.103
```

RELEASE NOTES:
* transport: Avoid two buffer copies when writing Data frames.
@arjan-bal arjan-bal added this to the 1.77 Release milestone Nov 3, 2025
@arjan-bal arjan-bal added the Type: Performance and Area: Transport labels Nov 3, 2025
@codecov

codecov Bot commented Nov 3, 2025


Codecov Report

❌ Patch coverage is 89.10256% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.23%. Comparing base (f959da6) to head (13bb904).
⚠️ Report is 1 commits behind head on v1.77.x.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/transport/controlbuf.go | 63.15% | 3 Missing and 4 partials ⚠️ |
| internal/transport/http_util.go | 93.02% | 3 Missing and 3 partials ⚠️ |
| mem/buffer_slice.go | 93.33% | 1 Missing and 1 partial ⚠️ |
| internal/transport/http2_client.go | 90.90% | 1 Missing ⚠️ |
| internal/transport/http2_server.go | 90.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff             @@
##           v1.77.x    #8690      +/-   ##
===========================================
+ Coverage    82.21%   83.23%   +1.01%     
===========================================
  Files          417      417              
  Lines        32198    32296      +98     
===========================================
+ Hits         26472    26880     +408     
- Misses        4021     4037      +16     
+ Partials      1705     1379     -326     

| Files with missing lines | Coverage Δ |
|---|---|
| mem/buffer_pool.go | 100.00% <ø> (ø) |
| internal/transport/http2_client.go | 92.71% <90.90%> (+15.78%) ⬆️ |
| internal/transport/http2_server.go | 91.30% <90.00%> (ø) |
| mem/buffer_slice.go | 96.45% <93.33%> (-0.85%) ⬇️ |
| internal/transport/http_util.go | 94.53% <93.02%> (-0.68%) ⬇️ |
| internal/transport/controlbuf.go | 89.50% <63.15%> (-0.75%) ⬇️ |

... and 23 files with indirect coverage changes

@arjan-bal arjan-bal merged commit 4288cfc into grpc:v1.77.x Nov 3, 2025
17 checks passed
@arjan-bal arjan-bal deleted the cherrypick-copyless-framer branch November 3, 2025 10:49
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators May 3, 2026