Use GpuArrayBuffer for MeshUniform by superdump · Pull Request #9254 · bevyengine/bevy

superdump · 2023-07-23T14:39:27Z

Objective

Reduce the number of rebindings to enable batching of draw commands

Solution

Use the new GpuArrayBuffer for MeshUniform data to store all MeshUniform data in arrays within fewer bindings
Sort opaque/alpha mask prepass, opaque/alpha mask main, and shadow phases also by the batch per-object data binding dynamic offset to improve performance on WebGL2.

Changelog

Changed: Per-object MeshUniform data is now managed by GpuArrayBuffer as arrays in buffers that need to be indexed into.

Migration Guide

The old way, where each MeshUniform was stored at its own dynamic offset, accessed the model member of an individual mesh object's shader Mesh struct directly:

struct Vertex {
    @location(0) position: vec3<f32>,
};

fn vertex(vertex: Vertex) -> VertexOutput {
    var out: VertexOutput;
    out.clip_position = mesh_position_local_to_clip(
        mesh.model,
        vec4<f32>(vertex.position, 1.0)
    );
    return out;
}

The new way where one needs to index into the array of Meshes for the batch:

struct Vertex {
    @builtin(instance_index) instance_index: u32,
    @location(0) position: vec3<f32>,
};

fn vertex(vertex: Vertex) -> VertexOutput {
    var out: VertexOutput;
    out.clip_position = mesh_position_local_to_clip(
        mesh[vertex.instance_index].model,
        vec4<f32>(vertex.position, 1.0)
    );
    return out;
}

Note that using the instance_index is the default way to pass the per-object index into the shader, but if you wish to do custom rendering approaches you can pass it in however you like.

github-actions · 2023-07-23T14:52:08Z

Example no_renderer failed to run, please try running it locally and check the result.

github-actions · 2023-07-23T14:58:09Z

Example no_renderer failed to run, please try running it locally and check the result.

superdump · 2023-07-23T15:30:57Z

Benchmarks

All benchmarks are running many_cubes -- sphere on an M1 Max MacBook Pro 16". Images show main in yellow and the new option in red.

(2560x1440) Storage buffer with Opaque3d sorted by view z

Not having to re-bind the per-object data per draw command gives a 15% reduction in frame time:

(2560x1440) Storage buffer with Opaque3d sorted by dynamic offset and view z

Sorting by dynamic offset is unnecessary as there is no dynamic offset for the storage buffer case but has no negative impact:

(WebGL2 at 1280x720) Uniform buffer with Opaque3d sorted by view z

This is basically the same as main. Because the sort ignores dynamic offsets and orders only by view z, consecutive objects can come from different batches, so rebinding at a different dynamic offset cannot be avoided:

(WebGL2 at 1280x720) Uniform buffer with Opaque3d sorted by dynamic offset and view z

Sorting by dynamic offset and then by view z avoids the need to re-bind the per-object uniform buffer at different dynamic offsets for each draw. This brings an 18.6% reduction in frame time:
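To illustrate why the sort order matters, here is a minimal sketch that counts how often the per-object binding would have to be re-bound under each ordering. `Item`, `sort_by_view_z`, `sort_by_offset_then_view_z`, and `count_rebinds` are hypothetical names for this sketch and are not Bevy types; it assumes a rebind is needed whenever two consecutive draws use different dynamic offsets.

```rust
/// Hypothetical stand-in for a phase item: the dynamic offset of the
/// uniform-buffer batch it lives in, and its view-space z.
#[derive(Clone, Copy, Debug)]
pub struct Item {
    pub dynamic_offset: u32,
    pub view_z: f32,
}

/// Sort front-to-back only. Values increase towards the camera, so this is
/// a descending sort on view z that ignores the dynamic offset.
pub fn sort_by_view_z(items: &mut [Item]) {
    items.sort_by(|a, b| b.view_z.total_cmp(&a.view_z));
}

/// Sort by dynamic offset first, then front-to-back within each offset.
pub fn sort_by_offset_then_view_z(items: &mut [Item]) {
    items.sort_by(|a, b| {
        a.dynamic_offset
            .cmp(&b.dynamic_offset)
            .then(b.view_z.total_cmp(&a.view_z))
    });
}

/// Count how many times the binding would be set, assuming a rebind is
/// needed whenever the dynamic offset differs from the previous draw.
pub fn count_rebinds(items: &[Item]) -> usize {
    let mut rebinds = 0;
    let mut current = None;
    for item in items {
        if current != Some(item.dynamic_offset) {
            rebinds += 1;
            current = Some(item.dynamic_offset);
        }
    }
    rebinds
}
```

With four items that alternate between two offsets in depth order, sorting by view z alone needs a rebind on every draw, while sorting by offset first needs only one per batch.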

nicopap · 2023-07-23T15:52:07Z

crates/bevy_pbr/src/render/mesh_bindings.rs

+    pub(super) fn model(render_device: &RenderDevice, binding: u32) -> BindGroupLayoutEntry {
+        GpuArrayBuffer::<MeshUniform>::binding_layout(
+            binding,
+            ShaderStages::VERTEX_FRAGMENT,
+            render_device,
+        )


I guess skinning and weights can get the same treatment in a future PR right? Since they could use a storage buffer in lieu of a uniform buffer when available.

Let me reformulate: we already use an index and a uniform buffer for skins and morphs, but is there any benefit to using a storage buffer when available?

We can pack all the skinning data into 1 storage buffer, instead of multiple uniform buffers. Fewer buffers = fewer rebinds = more performance :)

Right, it applies to all buffers of struct data, basically. One needs to plumb in the indices, so perhaps the per-object data holds an index or offset into a skinning array or similar. I don't know enough about how skins work to know how best to structure the data. Another aspect of having a single buffer is that if the data itself repeats with varying counts of things (like vertices in vertex buffers, or maybe the number of transforms in a skin?), then some kind of allocator is usually used to handle items of varying sizes and to load data in and out without having to reallocate buffers and rewrite everything all the time.
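As a rough sketch of the "single buffer plus per-object offset" idea discussed here, the following packs variable-length joint data into one array and hands back an offset and count per skin. `Range` and `JointArena` are hypothetical names, the CPU-side `Vec` only stands in for a GPU storage buffer, and the arena is append-only: it does not attempt the free/reuse handling that a real allocator would need.

```rust
/// Hypothetical range into a shared array: where one skin's data starts
/// and how many elements it has.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Range {
    pub offset: u32,
    pub count: u32,
}

/// Hypothetical append-only arena standing in for one storage buffer that
/// holds every skin's joint matrices back to back.
#[derive(Default)]
pub struct JointArena {
    data: Vec<[f32; 16]>,
}

impl JointArena {
    /// Append one skin's joints and return where they landed. The returned
    /// range is what the per-object data would carry into the shader.
    pub fn push_skin(&mut self, joints: &[[f32; 16]]) -> Range {
        let range = Range {
            offset: self.data.len() as u32,
            count: joints.len() as u32,
        };
        self.data.extend_from_slice(joints);
        range
    }

    /// Look up one skin's joints from its range.
    pub fn joints(&self, range: Range) -> &[[f32; 16]] {
        let start = range.offset as usize;
        &self.data[start..start + range.count as usize]
    }
}
```

Skins with different joint counts share the one array, so only the small `Range` differs per object.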

nicopap · 2023-07-23T15:55:25Z

crates/bevy_pbr/src/render/mesh_bindings.rs

+        GpuArrayBuffer::<MeshUniform>::binding_layout(
+            binding,
+            ShaderStages::VERTEX_FRAGMENT,
+            render_device,


In a future PR, it might be worthwhile to newtype device.limits().max_storage_buffers_per_shader_stage, and accept that newtype instead of RenderDevice as argument to GpuArrayBuffer so that it's clear what data it is using.

Mmm. I thought about passing the values explicitly, but then I also thought that maybe other features or limits may impact logic at some point. If it had been more difficult to plumb it in then I would have changed it.
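The newtype suggestion could look roughly like the sketch below. `MaxStorageBuffersPerShaderStage`, `BufferKind`, and `choose_buffer_kind` are hypothetical names, and the rule "fall back to batched uniform buffers only when the device reports zero storage buffers" is an assumption made for illustration, not a statement of what GpuArrayBuffer does.

```rust
/// Hypothetical newtype wrapping the single device limit mentioned in the
/// review, so callers can see exactly which value the decision depends on.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MaxStorageBuffersPerShaderStage(pub u32);

/// Hypothetical description of the two layouts discussed in this PR.
#[derive(Debug, PartialEq, Eq)]
pub enum BufferKind {
    /// One large storage buffer indexed by the per-object index.
    Storage,
    /// Batches of per-object data at dynamic offsets in a uniform buffer.
    BatchedUniform,
}

/// Assumption for this sketch: storage buffers are used whenever the
/// device reports that at least one is available per shader stage.
pub fn choose_buffer_kind(limit: MaxStorageBuffersPerShaderStage) -> BufferKind {
    if limit.0 == 0 {
        BufferKind::BatchedUniform
    } else {
        BufferKind::Storage
    }
}
```

The trade-off raised in the reply still applies: if other limits or features start to influence the choice, a single-value newtype would have to grow or be replaced.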

crates/bevy_pbr/src/render/mesh_bindings.wgsl

crates/bevy_pbr/src/render/mesh.rs

nicopap · 2023-07-23T16:54:14Z

Looks good. (I'll do some testing on my end before dropping an approval) What is the "TODO" item supposed to mean? Do you intend to iterate in future PRs or within this one?

superdump · 2023-07-23T19:33:44Z

Looks good. (I'll do some testing on my end before dropping an approval) What is the "TODO" item supposed to mean? Do you intend to iterate in future PRs or within this one?

I'll probably do it in this one.

JMS55 · 2023-07-23T20:22:16Z

Reviewed the PR in its current state. Minus the existing feedback, lgtm :)

Also good catch on GpuComponentArrayBufferPlugin not using finish().

…mic offset When using a uniform buffer with batches of per-object data, this provides a ~18% frame time reduction on the many_cubes -- sphere example in the Opaque3d phase and should benefit alpha mask, prepass, and shadow phases in a similar way.

github-actions · 2023-07-24T09:52:52Z

Example no_renderer failed to run, please try running it locally and check the result.

superdump · 2023-07-24T10:04:36Z

Ideally we'd have a good way of benchmarking the impact on prepass and shadows. I may modify the many_cubes example to also support some modes to test the performance of these.

The RenderApp sub app does not exist when the renderer is disabled so assuming it does is not a good idea.
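The pattern behind this fix can be sketched without Bevy itself: look the sub app up, and skip the render-side setup when it is absent. `App`, `SubApp`, `insert_sub_app`, `get_sub_app_mut`, and `finish_render_setup` below are simplified hypothetical stand-ins written for this sketch, not the real Bevy API.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a sub app; it only records that setup ran.
#[derive(Default)]
pub struct SubApp {
    pub setup_runs: u32,
}

/// Hypothetical stand-in for an app that may or may not own a render sub app.
#[derive(Default)]
pub struct App {
    sub_apps: HashMap<String, SubApp>,
}

impl App {
    pub fn insert_sub_app(&mut self, label: &str) {
        self.sub_apps.insert(label.to_string(), SubApp::default());
    }

    /// Returns None when the sub app was never added, for example when the
    /// renderer is disabled.
    pub fn get_sub_app_mut(&mut self, label: &str) -> Option<&mut SubApp> {
        self.sub_apps.get_mut(label)
    }
}

/// Run render-side setup only if the render sub app exists, instead of
/// assuming it does. Returns whether setup ran.
pub fn finish_render_setup(app: &mut App) -> bool {
    if let Some(render_app) = app.get_sub_app_mut("RenderApp") {
        render_app.setup_runs += 1;
        true
    } else {
        false
    }
}
```

In a no_renderer-style configuration the function simply returns false instead of panicking.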

github-actions · 2023-07-24T10:14:23Z

Example no_renderer failed to run, please try running it locally and check the result.

robtfm · 2023-07-24T20:11:24Z

crates/bevy_pbr/src/render/pbr.wgsl

        pbr_input.occlusion = occlusion;

-        pbr_input.flags = mesh.flags;
+        pbr_input.flags = mesh[in.instance_index].flags;


this is using in.instance_index unconditionally, but its existence depends on VERTEX_OUTPUT_INSTANCE_INDEX. i think that's probably fine since this frag shader is (currently) specific to standard materials, but it doesn't feel great.

as far as i can tell this is the only use of the instance index in any fragment stage. would it make more sense to push the mesh flags through the MeshVertexOutput struct directly (and unconditionally)? then we could get rid of the VERTEX_OUTPUT_INSTANCE_INDEX entirely.

this would not make sense if there are other potential uses for mesh data in the fragment stage but i think there shouldn't be generally.

I think that would add cost to the vertex stage / interpolators but I don't know if it's worth the tradeoff.

After this change, people will need to be aware of per-object data and how to access it. Long-term, various indices will be added into the per-object data for materials, animations, and probably more. At least material stuff I expect to be used in fragment stage. The instance index, or some other way of getting the per-object index into the shader, would be used to index into the per-object data array, and then within that type there would be a material index to index into a material data array.
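The two-level lookup described here (instance index into per-object data, then a material index into a material array) can be sketched on the CPU side as follows. `PerObject`, `Material`, and `material_for_instance` are hypothetical names, and the fields are illustrative only; the actual per-object layout is not specified in this thread.

```rust
/// Hypothetical per-object record: some flags plus an index into a
/// separate material array.
#[derive(Clone, Copy, Debug)]
pub struct PerObject {
    pub flags: u32,
    pub material_index: u32,
}

/// Hypothetical material record.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Material {
    pub base_color: [f32; 4],
}

/// First index the per-object array with the instance index, then use the
/// material index stored there to index the material array. Returns None
/// if either index is out of bounds.
pub fn material_for_instance<'a>(
    per_object: &[PerObject],
    materials: &'a [Material],
    instance_index: u32,
) -> Option<&'a Material> {
    let object = per_object.get(instance_index as usize)?;
    materials.get(object.material_index as usize)
}
```

Several objects can share one material entry this way, since only the small index is stored per object.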

crates/bevy_pbr/src/render/mesh.rs

robtfm · 2023-07-24T20:20:52Z

crates/bevy_core_pipeline/src/core_3d/mod.rs

+    // NOTE: (dynamic offset, -distance)
    // NOTE: Values increase towards the camera. Front-to-back ordering for opaque means we need a descending sort.
-    type SortKey = Reverse<FloatOrd>;
+    type SortKey = (u32, Reverse<FloatOrd>);


probably not for this pr but just wondering, is there any possibility for / value in some kind of spatial sorting before putting the mesh data in the gpu array? i think currently they are entity iterator sorted, but it seems like (at least for static meshes) there'd be some benefit to having a batch be as proximally local as possible and then to sort batches based on the nearest member, or something like that.

That will happen when the render set reordering and batching is implemented. The reason for reordering the render sets to extract, prepare assets, queue, sort, prepare+batch, render is to allow the order of draws to be known when preparing the data so that the order can be taken into account.

crates/bevy_core_pipeline/src/core_3d/mod.rs

crates/bevy_pbr/src/prepass/mod.rs

Elabajaba · 2023-07-26T05:05:58Z

This seems to be ignoring object transforms when rendering stuff, as everything seems to be rendering in the same place (tested 3d_scene: cube is lower than it should be, 3d_shapes: everything is mashed together instead of being separate, and scene_viewer (with a large custom scene): everything is mashed together, and the object clump jumps around weirdly when moving the camera around.)

edit: Works properly on vulkan, broken on dx12.

crates/bevy_pbr/src/render/mesh.rs

nicopap · 2023-07-26T08:29:42Z

I'm getting a 45% speedup on vulkan/linux with your latest set of changes for many_cubes (no additional flags). A 1% speedup on many_foxes. Looking at the sub app=RenderApp and schedule=Render traces.

I'm unable to test the OpenGL backend as I get a panic with WGPU_BACKEND=gl (but that's also true of main).

superdump · 2023-07-26T20:10:57Z

@nicopap nice!

@Elabajaba - oof, ok, I guess this isn't mergeable if it breaks DX12. Could anyone test DX12 on NVIDIA as I think @Elabajaba uses AMD. I will test on a mobile RTX 3080 as soon as I can.

robtfm · 2023-07-26T20:14:34Z

Could anyone test DX12 on NVIDIA

looks like always index 0 on nvidia/dx12 as well

Elabajaba · 2023-07-26T20:15:59Z

The vulkan gains seem big enough that it might be worth forcing dx12 to always draw instead of draw_indexed for now, and try and get it implemented in wgpu for their next release?

edit: Not sure how many people are actually using bevy dx12, or what any potential performance losses might be (or if forcing draw instead of draw_indexed would break stuff).

superdump · 2023-07-26T20:28:39Z

All of these benchmarks run many_cubes on an M1 Max, comparing main (yellow), which uses a uniform buffer with a dynamic offset per object, against this PR (red), which uses GpuArrayBuffer. On WebGL2, GpuArrayBuffer stores batches of per-object data at dynamic offsets in a uniform buffer; elsewhere it uses one large storage buffer.
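For the WebGL2 path, the arithmetic behind "batches of per-object data at dynamic offsets" might look like the sketch below. The layout is an assumption for illustration: each batch holds as many elements as fit in the maximum uniform binding size, and each batch starts at a multiple of the required offset alignment. `Location`, `batch_size`, `align_up`, and `locate` are hypothetical names, and the byte sizes used with them are example values, not the real size of MeshUniform.

```rust
/// Hypothetical result: which dynamic offset to bind, and which array
/// element inside that batch holds the object's data.
#[derive(Debug, PartialEq, Eq)]
pub struct Location {
    pub dynamic_offset: u32,
    pub index_in_batch: u32,
}

/// How many elements fit in one uniform binding.
pub fn batch_size(max_binding_bytes: u32, element_bytes: u32) -> u32 {
    max_binding_bytes / element_bytes
}

/// Round `value` up to the next multiple of `alignment`.
pub fn align_up(value: u32, alignment: u32) -> u32 {
    (value + alignment - 1) / alignment * alignment
}

/// Map a global object index to (dynamic offset, index within the batch),
/// assuming batches are laid out back to back at aligned offsets.
pub fn locate(
    object: u32,
    batch_size: u32,
    element_bytes: u32,
    offset_alignment: u32,
) -> Location {
    let batch = object / batch_size;
    let batch_stride = align_up(batch_size * element_bytes, offset_alignment);
    Location {
        dynamic_offset: batch * batch_stride,
        index_in_batch: object % batch_size,
    }
}
```

Under these assumptions, all objects in the same batch share one dynamic offset, which is why sorting draws by dynamic offset lets consecutive draws reuse the same binding.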

Storage buffers

Storage buffers with main pass only

Storage buffer with depth prepass and main pass

Storage buffer with cascaded shadow mapping, and main pass

Storage buffer with depth prepass, cascaded shadow mapping, and main pass

WebGL2

WebGL2 with main pass only

WebGL2 with depth prepass, and main pass

WebGL2 with cascaded shadow mapping, and main pass

WebGL2 with depth prepass, cascaded shadow mapping, and main pass

Summary

Looking only at median frame times:

Storage buffers
- main pass: -13.3%
- prepass + main pass: -18.6%
- CSM + main pass: -11.1%
- prepass + CSM + main pass: -17.8%
WebGL2
- main pass: -15.7%
- prepass + main pass: -23.5%
- CSM + main pass: -23.3%
- prepass + CSM + main pass: -18.0%

So, big gains all-round from looking at the CPU side. I would need to run the GPU timestamp query stuff on Windows to check what's happening on the GPU side.

superdump · 2023-07-27T14:13:43Z

On a 5900HS and mobile RTX 3080 (on the power adapter) using Vulkan - main (yellow) vs PR (red) at 1280x720:

Storage buffer

Storage buffer with main pass only

Storage buffer with prepass, and main pass

Storage buffer with cascaded shadow mapping, and main pass

Storage buffer with prepass, cascaded shadow mapping, and main pass

WebGL2

WebGL2 with main pass only

WebGL2 with prepass, and main pass

WebGL2 with cascaded shadow mapping, and main pass

WebGL2 with prepass, cascaded shadow mapping, and main pass

Summary

Storage buffer
- main pass: -7.4%
- prepass + main pass: -12.7%
- cascaded shadow mapping + main pass: -9.0%
- prepass + cascaded shadow mapping + main pass: -10.2%
WebGL2
- main pass: -12.8%
- prepass + main pass: -18.2%
- cascaded shadow mapping + main pass: -13.1%
- prepass + cascaded shadow mapping + main pass: -13.0%

superdump · 2023-07-27T14:41:24Z

I'm satisfied with the performance benefit. The blocking issue is DX12 instance_index being broken.

…d of vertex.instance_index

Dx12 instance_index bugfix

superdump · 2023-07-30T12:54:52Z

The DX12 issue was fixed by @Elabajaba - thanks for figuring that one out!

Objective

Fix shader_material_glsl example

Solution

Expose the `PER_OBJECT_BUFFER_BATCH_SIZE` shader def through the default `MeshPipeline` specialization.
Make use of it in the `custom_material.vert` shader to access the mesh binding.

Changelog

Added: Exposed the `PER_OBJECT_BUFFER_BATCH_SIZE` shader def through the default `MeshPipeline` specialization to use in custom shaders not using bevy_pbr::mesh_bindings that still want to use the mesh binding in some way.

Use GpuArrayBuffer for MeshUniform

382737d

superdump force-pushed the use-gpu-list-for-per-object-data branch from 574516f to 382737d Compare July 23, 2023 14:45

nicopap reviewed Jul 23, 2023

crates/bevy_pbr/src/render/mesh_bindings.wgsl (outdated)

nicopap reviewed Jul 23, 2023

crates/bevy_pbr/src/render/mesh.rs (outdated)

nicopap reviewed Jul 23, 2023

crates/bevy_pbr/src/render/mesh.rs (outdated)

nicopap added the A-Rendering (Drawing game state to the screen) and C-Performance (A change motivated by improving speed, memory usage or compile times) labels Jul 23, 2023

nicopap added this to the 0.12 milestone Jul 23, 2023

JMS55 approved these changes Jul 23, 2023

superdump added 3 commits July 24, 2023 11:37

Put group and binding directly above binding declarations

b0698be

Document MeshPipeline per_object_buffer_batch_size

40112bf

Rename INSTANCE_INDEX to VERTEX_OUTPUT_INSTANCE_INDEX for clarity

1f2f75e

Fix no_renderer

23b4793

The RenderApp sub app does not exist when the renderer is disabled so assuming it does is not a good idea.

robtfm self-requested a review July 24, 2023 10:17

robtfm reviewed Jul 24, 2023

robtfm and others added 2 commits July 25, 2023 19:12

Use SmallVec for flexible dynamic offset gathering in SetMeshBindGroup

505af3e

Clarify per-object binding dynamic offset's purpose in phase items

8c50fdb

superdump force-pushed the use-gpu-list-for-per-object-data branch from 392408d to 4c8fba0 Compare July 25, 2023 18:02

Use a module-associated shader def for PER_OBJECT_BUFFER_BATCH_SIZE

dae52b7

superdump force-pushed the use-gpu-list-for-per-object-data branch from 4c8fba0 to dae52b7 Compare July 25, 2023 18:11

nicopap reviewed Jul 26, 2023

crates/bevy_pbr/src/render/mesh.rs (outdated)

Replace SmallVec with slice for performance

331e4fc

Elabajaba added 2 commits July 29, 2023 00:11

fix dx12

ba8cf71

add comments on why we're using vertex_no_morph.instance index instea…

f1ece4a

…d of vertex.instance_index

superdump mentioned this pull request Jul 29, 2023

GPU Instancing #89

Closed

Elabajaba and others added 3 commits July 29, 2023 16:43

add links to naga issue

f41899a

Break long line

3525d82

Merge pull request #31 from Elabajaba/dx12-index-bugfix

4574cae

Dx12 instance_index bugfix

robtfm approved these changes Jul 30, 2023

superdump added this pull request to the merge queue Jul 30, 2023

Merged via the queue into bevyengine:main with commit e6405bb Jul 30, 2023

superdump added a commit to superdump/bevy that referenced this pull request Jul 30, 2023

Fix shader_material_glsl example after bevyengine#9254

fe41929

superdump added a commit to superdump/bevy that referenced this pull request Jul 31, 2023

Fix shader_material_glsl example after bevyengine#9254

0c03ae8

nicopap mentioned this pull request Aug 3, 2023

Add support for KHR_texture_transform #8266

Closed

mockersf mentioned this pull request Aug 6, 2023

WebGL2 rendering broken: correct transform are not used #9375

Closed

cart mentioned this pull request Oct 13, 2023

News: Release 0.12 bevyengine/bevy-website#754

Merged

43 tasks

Bcompartment mentioned this pull request Mar 27, 2024

Update to bevy 0.13+ pinkponk/bevy_efficient_forest_rendering#6

Open
