Build batches across phases in parallel. #17764
Currently, invocations of `batch_and_prepare_binned_render_phase` and `batch_and_prepare_sorted_render_phase` can't run in parallel because they write to scene-global GPU buffers. After PR bevyengine#17698, `batch_and_prepare_binned_render_phase` started accounting for the lion's share of the CPU time, causing us to be strongly CPU bound on scenes like Caldera when occlusion culling was on (because of the overhead of batching for the Z-prepass). Although I eventually plan to optimize `batch_and_prepare_binned_render_phase`, we can obtain significant wins now by parallelizing that system across phases.

This commit splits all GPU buffers that `batch_and_prepare_binned_render_phase` and `batch_and_prepare_sorted_render_phase` touch into separate buffers for each phase so that the scheduler will run those phases in parallel. At the end of batch preparation, we gather the render phases up into a single resource with a new *collection* phase. Because we already run mesh preprocessing separately for each phase in order to make occlusion culling work, this is actually a cleaner separation. For example, mesh output indices (the unique IDs that identify each mesh instance on the GPU) are now guaranteed to be sequential starting from 0, which will simplify the forthcoming work to remove them in favor of the compute dispatch ID.

On Caldera, this brings the frame time down to approximately 9.1 ms with occlusion culling on.
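The scheduling idea above can be sketched in plain Rust. This is only an illustration of the per-phase split, not Bevy's actual types: `PhaseBuffer` and `prepare_phase` are hypothetical names standing in for the per-phase GPU buffers and the batch-and-prepare systems. Each phase writes its own buffer on its own thread, and a final collection step gathers them; note that the stand-in "mesh output indices" are sequential from 0 within each phase.

```rust
use std::thread;

// Hypothetical per-phase batch buffer: each phase owns its own output
// vector, so phases can be prepared on separate threads without
// contending for a scene-global buffer.
#[derive(Debug)]
struct PhaseBuffer {
    phase_name: &'static str,
    instance_data: Vec<u32>, // stand-in for per-instance GPU data
}

fn prepare_phase(phase_name: &'static str, mesh_count: u32) -> PhaseBuffer {
    // Output indices are sequential from 0 *within each phase*.
    PhaseBuffer {
        phase_name,
        instance_data: (0..mesh_count).collect(),
    }
}

fn main() {
    // "Batch and prepare" each phase in parallel...
    let handles: Vec<_> = [("Opaque3d", 4), ("Prepass", 4), ("Shadow", 2)]
        .into_iter()
        .map(|(name, count)| thread::spawn(move || prepare_phase(name, count)))
        .collect();

    // ...then gather the per-phase buffers into a single collection,
    // mirroring the new collection step at the end of batch preparation.
    let collected: Vec<PhaseBuffer> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();

    for buf in &collected {
        println!("{}: {:?}", buf.phase_name, buf.instance_data);
    }
}
```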
examples/3d/occlusion_culling.rs

```rust
        ..default()
    })
    .set(PbrPlugin {
        allow_copies_from_indirect_parameters: true,
```
Could this be reused from the RenderPlugin, rather than having to set it in two places?
Could you elaborate as to how that would work? The problem is that the PbrPlugin can't reach into the RenderPlugin to check its value.
I think this is fine for now, but it's a rather particular option, and I might suggest subsuming it into some kind of broader "debug renderer" setting if we ever have a second instance of this kind of thing.
Yeah, good point. Or maybe a "debug flags"?
I went ahead and switched this to a RenderDebugFlags so that we can have more of them.
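For illustration, a bitflags-style debug-flags type could look like the following minimal sketch. The flag name mirrors the boolean option it replaced, but the exact shape here is an assumption for illustration, not necessarily Bevy's actual `RenderDebugFlags` API (which would typically use the `bitflags` crate).

```rust
// Minimal hand-rolled bitflags sketch (illustrative; Bevy's real
// RenderDebugFlags lives in bevy_render and may differ).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct RenderDebugFlags(u32);

impl RenderDebugFlags {
    const NONE: Self = Self(0);
    // Hypothetical flag corresponding to the old boolean option.
    const ALLOW_COPIES_FROM_INDIRECT_PARAMETERS: Self = Self(1 << 0);

    // True if every bit in `other` is also set in `self`.
    fn contains(self, other: Self) -> bool {
        self.0 & other.0 == other.0
    }
}

fn main() {
    let flags = RenderDebugFlags::ALLOW_COPIES_FROM_INDIRECT_PARAMETERS;
    assert!(flags.contains(RenderDebugFlags::ALLOW_COPIES_FROM_INDIRECT_PARAMETERS));
    assert!(!RenderDebugFlags::NONE.contains(flags));
    println!("debug flags ok");
}
```

The advantage over a lone boolean is that future debug toggles become new bits rather than new plugin fields.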
tychedelia left a comment
LGTM. Thanks for creating the debug flags, a lot cleaner. Being able to lean on the ECS scheduler for this is nice.
# Objective

Fix panic in `custom_render_phase`. This example was broken by #17764, but that breakage evolved into a panic after #17849. This new panic seems to illustrate the problem in a pretty straightforward way.

```
2025-02-15T00:44:11.833622Z  INFO bevy_diagnostic::system_information_diagnostics_plugin::internal: SystemInfo { os: "macOS 15.3 Sequoia", kernel: "24.3.0", cpu: "Apple M4 Max", core_count: "16", memory: "64.0 GiB" }
2025-02-15T00:44:11.908328Z  INFO bevy_render::renderer: AdapterInfo { name: "Apple M4 Max", vendor: 0, device: 0, device_type: IntegratedGpu, driver: "", driver_info: "", backend: Metal }
2025-02-15T00:44:12.314930Z  INFO bevy_winit::system: Creating new window App (0v1)
thread 'Compute Task Pool (1)' panicked at /Users/me/src/bevy/crates/bevy_ecs/src/system/function_system.rs:216:28:
bevy_render::batching::gpu_preprocessing::batch_and_prepare_sorted_render_phase<custom_render_phase::Stencil3d, custom_render_phase::StencilPipeline> could not access system parameter ResMut<PhaseBatchedInstanceBuffers<Stencil3d, MeshUniform>>
```

## Solution

Add a `SortedRenderPhasePlugin` for the custom phase.

## Testing

`cargo run --example custom_render_phase`
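The failure mode can be sketched abstractly: after the per-phase buffer split, the batching system looks up a per-phase buffer resource, and a custom phase that never registered one has nothing to look up. In this minimal sketch (all names are hypothetical stand-ins, not Bevy's API), `add_sorted_phase_plugin` plays the role of what registering `SortedRenderPhasePlugin` accomplishes.

```rust
use std::collections::HashMap;

// Stand-in for an app whose per-phase batch buffers are now separate
// resources, keyed here by phase name for simplicity.
struct App {
    phase_buffers: HashMap<&'static str, Vec<u32>>,
}

impl App {
    fn new() -> Self {
        Self { phase_buffers: HashMap::new() }
    }

    // What the phase plugin effectively does: insert the per-phase
    // buffer resource so the prepare system can find it.
    fn add_sorted_phase_plugin(&mut self, phase: &'static str) {
        self.phase_buffers.insert(phase, Vec::new());
    }

    // Analog of batch_and_prepare_sorted_render_phase: fails if the
    // phase's buffer resource was never registered.
    fn batch_and_prepare(&mut self, phase: &'static str) -> Result<(), String> {
        self.phase_buffers
            .get_mut(phase)
            .map(|buf| buf.push(0))
            .ok_or_else(|| format!("could not access buffers for phase `{phase}`"))
    }
}

fn main() {
    let mut app = App::new();
    // Without registration, preparing the custom phase fails...
    assert!(app.batch_and_prepare("Stencil3d").is_err());
    // ...and registering the phase's plugin fixes it.
    app.add_sorted_phase_plugin("Stencil3d");
    assert!(app.batch_and_prepare("Stencil3d").is_ok());
    println!("custom phase prepared");
}
```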