Use a giant UBO to optimize performance in 2D [OpenGL3] #66861
Merged
ae7b8df to fb0f58a
akien-mga
reviewed
Oct 4, 2022
lawnjelly
reviewed
Oct 5, 2022
Member
Does RID() self initialize in 4.x?
Member
Author
I believe RID tex = RID(); is the same as RID tex; if that is what you are asking
Member
Yeah so the point is that usually we wouldn't explicitly initialize it in the header, like Vector3 or String. But it's not a big deal :)
lawnjelly
approved these changes
Oct 5, 2022
lawnjelly
left a comment
Member
Looks fine to me. I haven't taken a hugely in-depth look, but I have done some basic testing and it seems to work okay.
Any more reviewers obviously welcome, but I suspect this will be a merge and then continuous improvement / bug fixing.
This removes the countless small UBO writes we had before and replaces them with a single large write per render pass. This results in much faster rendering on low-end devices and improves speed on all devices.
fb0f58a to 154b9c1
Member
Author
Just force pushed an update to resolve merge conflicts. Should be ready to merge now.
Member
Thanks!
This is primarily an optimization and cleanup PR. I've had this idea since I first implemented the 2D renderer and have waited until now to implement it.
Previously I was disappointed in the performance of the old batching method on low-end devices. On high-end devices it was great, but the OpenGL3 renderer is supposed to be the low-end focused renderer, so I felt I needed to rethink things.
During the Godot sprint, Juan and I discussed some factors that may be slowing down performance disproportionately on the low end, among them are:
The problem
The old batching renderer worked as follows:
2.a. set OpenGL state
2.b. for every canvas item command:
2.b.i. check if can batch
2.b.ii. if can batch -> add to batch
2.b.iii. if can't batch -> upload current batch data to new UBO, if needed set OpenGL state, render batch
This worked really well on high-end devices as it allowed the GPU to parallelize UBO uploads and draw commands. On older devices we ended up with a huge performance penalty for drawing right after an upload, and it appeared that each draw command was still executed sequentially.
The end result was that small draws were taking up about 4x as much time as they should. In practice, all batches took at least as much time as a batch with about 10 elements.
The solution
The solution is to record batches in advance, upload all the batches to one UBO, and then issue draw commands from that UBO. To do so, we rely on the fact that the buffer behind a UBO can be as large as we want, as long as we never bind more than the maximum UBO size at a time.
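A minimal sketch of the range calculation this relies on, not the code in this PR. `BindRange` and `range_for_batch` are hypothetical names; `max_block_size` and `offset_alignment` stand in for the values a driver reports as `GL_MAX_UNIFORM_BLOCK_SIZE` and `GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT`. The resulting offset and size are what one would pass to `glBindBufferRange`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical description of the slice of the large buffer to bind.
struct BindRange {
	std::size_t offset; // Byte offset into the large buffer.
	std::size_t size; // Number of bytes exposed to the shader.
};

// Computes the byte range covering `instance_count` instances starting at
// `first_instance`. `max_block_size` stands in for GL_MAX_UNIFORM_BLOCK_SIZE and
// `offset_alignment` for GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT (both assumptions
// supplied by the caller in this sketch).
inline BindRange range_for_batch(std::size_t first_instance, std::size_t instance_count,
		std::size_t instance_size, std::size_t max_block_size, std::size_t offset_alignment) {
	const std::size_t offset = first_instance * instance_size;
	// The bound offset has to respect the driver's alignment requirement.
	assert(offset % offset_alignment == 0);
	// Never expose more than one block's worth of data, however big the buffer is.
	// A real renderer would split an oversized batch rather than silently clamp it.
	const std::size_t size = std::min(instance_count * instance_size, max_block_size);
	return BindRange{ offset, size };
}
```

With 128-byte instances, a batch of 512 instances that starts at instance 512 maps to offset 65536 and size 65536, regardless of how large the underlying buffer is.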
The new batching renderer works as follows:
2.a. init batch data
2.b. for every canvas item command:
2.b.i. check if can batch
2.b.ii. if can batch -> add to batch
2.b.iii. if can't batch -> create new batch
3.a. Bind range of UBO needed
3.b. set OpenGL state
3.c. render batch
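The record-then-draw flow above can be sketched on the CPU side as follows. This is an illustration under simplifying assumptions, not the renderer's actual code: `Command`, `Batch`, `FrameStats`, `record_batches` and `render` are hypothetical names, the only batching criterion modelled is a matching texture, the 512-instance batch limit is ignored, and GL calls are replaced by counters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a canvas item command.
struct Command {
	int texture; // State that must match for two commands to share a batch.
};

// Hypothetical recorded batch: a contiguous run of instances in the shared array.
struct Batch {
	int texture;
	std::size_t start; // Index of the first instance in the shared instance array.
	std::size_t count; // Number of instances in this batch.
};

// Pass 1: walk every command and record batches without touching the GPU.
inline std::vector<Batch> record_batches(const std::vector<Command> &commands) {
	std::vector<Batch> batches;
	for (std::size_t i = 0; i < commands.size(); i++) {
		const bool can_batch = !batches.empty() && batches.back().texture == commands[i].texture;
		if (can_batch) {
			batches.back().count++;
		} else {
			batches.push_back(Batch{ commands[i].texture, i, 1 });
		}
	}
	return batches;
}

struct FrameStats {
	std::size_t uploads;
	std::size_t draws;
};

// Pass 2: one upload for all instance data, then one draw per recorded batch.
inline FrameStats render(const std::vector<Command> &commands) {
	const std::vector<Batch> batches = record_batches(commands);
	FrameStats stats{ 0, 0 };
	if (batches.empty()) {
		return stats;
	}
	stats.uploads = 1; // Stands in for a single large buffer upload.
	for (const Batch &batch : batches) {
		assert(batch.count > 0);
		// Here a renderer would bind the batch's UBO range, set state, and draw.
		stats.draws++;
	}
	return stats;
}
```

In this model the number of uploads stays at one no matter how many batches are recorded, whereas the old flow performed one upload per batch.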
This significantly cuts down on the cost of uploading the draw data and minimizes the time the draw commands need to wait for the data upload.
Additionally, instead of using instanced drawing to draw our batches, we rely on a dummy element array that is set up to draw 512 quads (4 vertices and 6 indices each). This gets around the performance penalty of using small instance counts with instanced rendering. Small batches now render much faster.
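As a rough illustration of such an element array, the sketch below builds the indices for a run of quads and shows how a quad index and corner can be recovered from a vertex index, which is what a vertex shader could do with `gl_VertexID`. The function names and the triangle winding are assumptions for this example, not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds indices for `quad_count` quads: 4 vertices and 6 indices (two triangles) each.
inline std::vector<std::uint32_t> build_quad_indices(std::size_t quad_count) {
	std::vector<std::uint32_t> indices;
	indices.reserve(quad_count * 6);
	for (std::size_t q = 0; q < quad_count; q++) {
		const std::uint32_t base = static_cast<std::uint32_t>(q * 4);
		// Assumed winding: triangles (0, 1, 2) and (2, 3, 0) of each quad.
		indices.push_back(base + 0);
		indices.push_back(base + 1);
		indices.push_back(base + 2);
		indices.push_back(base + 2);
		indices.push_back(base + 3);
		indices.push_back(base + 0);
	}
	assert(indices.size() == quad_count * 6);
	return indices;
}

// Mirrors what a vertex shader could derive from gl_VertexID in an indexed draw.
inline std::uint32_t quad_from_vertex_id(std::uint32_t vertex_id) {
	return vertex_id / 4; // Which quad (per-instance data slot) this vertex belongs to.
}

inline std::uint32_t corner_from_vertex_id(std::uint32_t vertex_id) {
	return vertex_id % 4; // Which corner of the quad this vertex is.
}
```

Drawing a batch of N quads then amounts to issuing an indexed draw over the first N * 6 indices of this array, so a single-quad batch no longer goes through the instancing path.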
Metrics
Memory usage:
Previously, memory usage was: batch_max_size (512) * instance_size (128 bytes) * total batches in viewport * 3 for each viewport
Typically in editor we have a few hundred UBOs in play: ~40 MB
New memory usage is: max_instance_count (configurable, defaults to 16384) * instance_size (128 bytes) * 3 for each viewport
Typically in editor we have 4 total UBOs: ~8 MB
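The per-buffer sizes behind these figures can be checked with a little arithmetic. The constant names below are made up for this sketch; the numbers are the ones quoted above.

```cpp
#include <cstddef>

// Sizes quoted in the description above.
constexpr std::size_t instance_size = 128; // Bytes per instance.
constexpr std::size_t old_batch_max_size = 512; // Instances per old UBO.
constexpr std::size_t new_max_instance_count = 16384; // Default instances per new UBO.

// Old scheme: one 64 KiB UBO per batch.
constexpr std::size_t old_ubo_bytes = old_batch_max_size * instance_size;
// New scheme: one 2 MiB UBO, three of them per viewport.
constexpr std::size_t new_ubo_bytes = new_max_instance_count * instance_size;
constexpr std::size_t new_bytes_per_viewport = new_ubo_bytes * 3;

static_assert(old_ubo_bytes == 64 * 1024, "old UBOs are 64 KiB each");
static_assert(new_ubo_bytes == 2 * 1024 * 1024, "new UBOs are 2 MiB each");
static_assert(new_bytes_per_viewport == 6 * 1024 * 1024, "6 MiB per viewport");
static_assert(4 * new_ubo_bytes == 8 * 1024 * 1024, "4 UBOs come to 8 MiB");
```

The last assertion matches the "4 total UBOs: ~8 MB" figure. In the old scheme the total grew with the number of batches in each viewport, while the new total depends only on the configured instance count.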
Performance
Depending on the device, I measured performance using either RenderDoc or Intel Graphics Analyzer. Accordingly, the absolute values are not necessarily accurate. The relative values, however, should be mostly correct or at least within a reasonable range.
Fixes: #65977
Fixes: #66463
The future
The first performance issue I identified is not fully solved: we are still using gl_VertexID to read per-instance data. This incurs the same penalty as using gl_InstanceID, since in either case the value is not uniform for the draw call. We can mitigate this in two ways:
0 instead of calculating the index from gl_VertexID