This is a particular problem for bindless workflows. Compute passes typically have to emit many barriers between dispatch calls to make sure that reads and writes from one dispatch don't affect the next when the same resource may be used by both.
Timings for issuing 1000 dispatches with 6000 resources bound once:
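To make the scaling concrete, here is a minimal sketch of a naive usage tracker. It is an illustrative model only, not wgpu's actual implementation: `Stats` and `simulate` are hypothetical names, and it assumes every bound resource may be written by every dispatch, so each one has to be checked (and possibly barriered) before each dispatch.

```rust
/// Work counters for the toy model (hypothetical, for illustration only).
#[derive(Debug, Default, PartialEq, Eq)]
struct Stats {
    /// How many (dispatch, resource) pairs the tracker had to look at.
    resources_visited: usize,
    /// How many barriers were emitted because a prior write had to be made visible.
    barriers_emitted: usize,
}

/// Simulates `dispatches` dispatch calls against `bound` resources.
/// Assumption: the tracker cannot know which resources a dispatch touches,
/// so it conservatively treats every bound resource as written each time.
fn simulate(dispatches: usize, bound: usize) -> Stats {
    // `true` once a resource has been (potentially) written by an earlier dispatch.
    let mut written_before = vec![false; bound];
    let mut stats = Stats::default();

    for _ in 0..dispatches {
        for written in written_before.iter_mut() {
            stats.resources_visited += 1;
            // A barrier is needed whenever an earlier dispatch may have written.
            if *written {
                stats.barriers_emitted += 1;
            }
            *written = true;
        }
    }
    stats
}
```

Under this model, `simulate(1000, 6000)` visits 6,000,000 resource slots, while `simulate(10_000, 6)` visits only 60,000, even though it issues 10x as many dispatches. The per-dispatch cost grows with the number of bound resources, not with the number the shader actually uses.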
Computepass: Bindless/1000 dispatch
time: [139.48 ms 140.17 ms 141.25 ms]
thrpt: [7.0795 Kelem/s 7.1343 Kelem/s 7.1692 Kelem/s]
Found 13 outliers among 100 measurements (13.00%)
5 (5.00%) high mild
8 (8.00%) high severe
For comparison, on the same machine, issuing 10x as many dispatches, each using 6 resources and binding before each dispatch:
Computepass: Single Threaded/1 computepasses x 10000 dispatches (Computepass Time)
time: [19.353 ms 19.565 ms 19.792 ms]
thrpt: [505.25 Kelem/s 511.11 Kelem/s 516.72 Kelem/s]
The bindless version is necessarily slower since it has to emit many more barriers speculatively, and there's no real way around that. But it would be surprising if we couldn't do a lot better.
On the same machine, the corresponding renderpass test runs with more than 10x the resources and draw calls, 10x faster (resulting in a 100x throughput of draw calls):
Renderpass: Bindless/10000 draws
time: [11.720 ms 11.820 ms 11.921 ms]
thrpt: [838.88 Kelem/s 846.04 Kelem/s 853.28 Kelem/s]
(There are only write-only resources involved here, so the comparison isn't quite accurate.)
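As a quick sanity check of the throughput comparison, the figures can be recomputed from the median timings reported above. This is a small sketch; `throughput` is a hypothetical helper, and the inputs are the median values copied from the benchmark output.

```rust
/// Throughput in elements per second for `elements` processed in `millis` milliseconds.
fn throughput(elements: f64, millis: f64) -> f64 {
    elements / (millis / 1000.0)
}

/// Ratio between the renderpass and computepass bindless throughputs,
/// using the median timings from the benchmark output above.
fn render_vs_compute_ratio() -> f64 {
    let compute = throughput(1_000.0, 140.17); // Computepass: Bindless/1000 dispatch
    let render = throughput(10_000.0, 11.820); // Renderpass: Bindless/10000 draws
    render / compute
}
```

With these medians, `render_vs_compute_ratio()` comes out at roughly 119, which matches the "100x throughput" estimate: 10x the calls in a bit under a tenth of the time.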