Don't preallocate 600MB for GPUI profiler#45197

Merged
Veykril merged 15 commits into zed-industries:main from sourcefrog:threadtiming-ram
Mar 27, 2026

Conversation

@sourcefrog
Contributor

@sourcefrog sourcefrog commented Dec 18, 2025

Previously, the GPUI profiler allocated one CircularBuffer per thread, and CircularBuffer&lt;N&gt; always preallocates space for N entries. As a result it allocated ~20MB per thread, and on my machine about 33 threads are created at startup, for a total of ~600MB used.

In this PR I change it to use a VecDeque that can gradually grow up to 20MB as data is written. At least in my experiments this caps overall usage at about 21MB, perhaps because only one thread writes much timing data.

Since this is fixed overhead for everyone running Zed it seems like a worthwhile gain.

This also folds duplicated code across platforms into the common gpui profiler.
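The change described above replaces a fixed preallocated ring buffer with a deque that grows on demand but stays bounded. A minimal sketch of that shape (the type names, the `u64` sample type, and the entry cap here are hypothetical simplifications, not the actual Zed code):

```rust
use std::collections::VecDeque;

// Hypothetical cap: grow on demand up to a fixed number of entries,
// instead of preallocating the whole buffer per thread up front.
const MAX_ENTRIES: usize = 4;

struct TaskTimings {
    entries: VecDeque<u64>, // timing samples (simplified to u64 here)
    total_pushed: u64,      // count of all samples ever recorded
}

impl TaskTimings {
    fn new() -> Self {
        // Starts empty: essentially no heap memory until the first push.
        Self { entries: VecDeque::new(), total_pushed: 0 }
    }

    fn push(&mut self, sample: u64) {
        if self.entries.len() == MAX_ENTRIES {
            // At capacity: behave like a ring buffer and drop the oldest.
            self.entries.pop_front();
        }
        self.entries.push_back(sample);
        self.total_pushed += 1;
    }
}

fn main() {
    let mut t = TaskTimings::new();
    for s in 0..6 {
        t.push(s);
    }
    // Only the most recent MAX_ENTRIES samples are retained.
    assert_eq!(t.entries, [2, 3, 4, 5]);
    assert_eq!(t.total_pushed, 6);
    println!("retained={:?} total_pushed={}", t.entries, t.total_pushed);
}
```

The per-thread saving comes from the empty case: threads that never record timings hold an empty deque rather than a full preallocated buffer.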

Before:

[heap profile screenshot]

After:

[heap profile screenshot]

I got here from #35780, but I don't think this is tree-size related; it seems to be fixed overhead.

Release Notes:

  • Improved: Significantly less memory used to record internal profiling information.

@cla-bot cla-bot bot added the cla-signed The user has signed the Contributor License Agreement label Dec 18, 2025
@github-actions github-actions bot added the community champion Issues filed by our amazing community champions! 🫶 label Dec 18, 2025
@github-project-automation github-project-automation bot moved this to Community Champion PRs in Quality Week – December 2025 Dec 18, 2025
@sourcefrog sourcefrog force-pushed the threadtiming-ram branch 2 times, most recently from 5b69d24 to bad8fc8 Compare December 18, 2025 04:57
Previously, it allocated 20MB/thread for a total of 600MB RAM used.

After this we keep only as much memory as is needed: it might grow that
large, but apparently most threads won't need it.
@sourcefrog sourcefrog changed the title WIP: Use a growable VecDeque rather than a fixed size CircularBuffer Don't preallocate 600MB for gpui profiler Dec 18, 2025
@sourcefrog sourcefrog marked this pull request as ready for review December 18, 2025 05:14
@maxdeviant maxdeviant changed the title Don't preallocate 600MB for gpui profiler Don't preallocate 600MB for GPUI profiler Dec 18, 2025
@localcc
Contributor

localcc commented Dec 18, 2025

I'm not sure how I feel about allocations in the task execution path. I guess with amortized growth of the vector it shouldn't be too much of a concern, but having profiling data on this would be good. Could you take some measurements in heavy-task-load and low-task-load scenarios?

Edit: also, 20MB per thread was probably an overallocation that I forgot to remove; ideally it should be somewhere around 5MB, which would significantly decrease memory usage.

@sourcefrog
Contributor Author

Sure, do you have any pointers on how to generate the right kind of load or how to measure it? Or do you mean just CPU profiling?

@sourcefrog
Contributor Author

I think the cost of resizing will be low: as the buffer grows from 10MB to 20MB, the CPU needs to copy 10MB of data, which takes maybe 1ms (assuming the allocation cannot grow in place), and this is hit only once per thread. I'll run a CPU time profile.

On the other hand, arguably the large allocations don't have much effect on user experience: virtual memory is allocated, but since it's never touched or zeroed it shouldn't turn into physical memory. However, because it dominates the heap profile, it makes it harder to see where memory is actually being wasted.

@sourcefrog
Contributor Author

I ran this build under Linux perf, and if I'm reading it correctly there's a single sample hit in add_task_timing, taking an estimated 2.5ms.

@sourcefrog
Contributor Author

sourcefrog commented Jan 8, 2026

@localcc, when you are back from your break, no rush, but let me know how you'd like me to proceed. The cross-platform tests are not fixed up yet, and I want to check that the changes I previously made there actually preserve the same semantics.

I'm happy to finish this but I won't do that unless you think you want to merge it.

I think there is a mild argument that allocating on demand is better: it at least avoids the appearance of wasting memory, makes it clearer where memory is actually used, and resizing the buffers is unlikely to cost much. On the other hand, it probably won't save very much physical memory, so arguably it's not worth changing, and it might conflict with other work you may want to do here. Just let me know.

@143mailliw
Contributor

I would love to see this merged. I ran into this when profiling Hummingbird and I think it's rather ridiculous to use this much memory for this.

@sourcefrog
Contributor Author

@MrSubidubi I see you're pushing other changes: thanks, but let me double-check the semantics of my changes before you merge, because this code is mostly untested.

@sourcefrog
Contributor Author

OK, I think this is reasonable to merge. It could be cleaned up further, especially by extracting more common cross-platform code, but I think it's a step forward. It's hard to iterate on the cross-platform builds because the CI runners need approval for every non-employee push.

@MrSubidubi MrSubidubi assigned MrSubidubi and unassigned localcc Feb 9, 2026
@MrSubidubi
Member

Appreciate you taking another look! Wanted to have the failing CI out of the way so that we can finally get this in.

It's hard to iterate on the cross-platform builds because the CI runners need approval every time for non-employee contributions.

It would be awesome if you could still give it a shot as part of this PR. I (and probably @SomeoneToIgnore too) will happily greenlight CI whenever you push something and we see it, if that works for you.

  • Unify more cross-platform profiler code so that more of the details are hidden in the profiler implementation

  • Keep the `total_pushed` counter recently added

  • Add free functions for get_current_thread_task_timings and add_task_timing
Copilot AI review requested due to automatic review settings February 27, 2026 15:19
@sourcefrog
Contributor Author

> Appreciate you taking another look! Wanted to have the failing CI out of the way so that we can finally get this in.
>
> It would be awesome if you could still give it a shot as part of this PR. I (and probably @SomeoneToIgnore too) will happily greenlight CI whenever you push something and we see it, if that works for you.

OK, I've merged main and I think this is a worthwhile cleanup in addition to reducing the preallocation, and ready to merge. Please take another look?

Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes memory usage in the GPUI profiler by replacing pre-allocated CircularBuffer instances with lazily-allocated VecDeque instances. Previously, each thread allocated ~20MB upfront for profiling data, leading to ~600MB total memory usage with 33 threads. With this change, the profiler now uses VecDeque that grows on demand, capping total usage at approximately 21MB.

Changes:

  • Replaced CircularBuffer with VecDeque for storing task timing entries per thread
  • Reduced capacity from 20MB to 16MiB (power of 2 for efficient growth)
  • Extracted duplicated thread timing retrieval code into shared get_current_thread_task_timings() function
  • Removed circular-buffer dependency from gpui crate

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Per-file summary:

  • crates/gpui/src/profiler.rs: Core changes: replaced CircularBuffer with VecDeque, added capacity management logic, introduced helper functions for thread timing operations
  • crates/gpui_macos/src/dispatcher.rs: Refactored to use shared profiler functions, removed duplicated code
  • crates/gpui_windows/src/dispatcher.rs: Refactored to use shared profiler functions, removed duplicated code
  • crates/gpui_linux/src/linux/dispatcher.rs: Refactored to use shared profiler functions, removed duplicated code
  • crates/gpui/Cargo.toml: Removed circular-buffer dependency from gpui crate
  • Cargo.lock: Updated to reflect circular-buffer removal from gpui dependencies
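The shared helpers named in this review, get_current_thread_task_timings() and add_task_timing(), suggest a thread-local buffer wrapped in free functions that the per-platform dispatchers can call. A simplified, hypothetical sketch of that pattern (not the actual gpui code; the TaskTiming fields and the entry cap are invented for illustration):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

#[derive(Clone, Debug)]
pub struct TaskTiming {
    pub duration_ns: u64, // hypothetical field; real entries hold more detail
}

// Hypothetical per-thread cap on retained entries.
const MAX_ENTRIES: usize = 1 << 16;

thread_local! {
    // Per-thread buffer: empty until this thread records its first timing,
    // so idle threads cost essentially nothing.
    static TASK_TIMINGS: RefCell<VecDeque<TaskTiming>> =
        RefCell::new(VecDeque::new());
}

/// Free function each platform dispatcher can call instead of
/// duplicating the buffer-management logic.
pub fn add_task_timing(timing: TaskTiming) {
    TASK_TIMINGS.with(|t| {
        let mut t = t.borrow_mut();
        if t.len() == MAX_ENTRIES {
            t.pop_front(); // bounded: drop the oldest entry
        }
        t.push_back(timing);
    });
}

/// Snapshot of the calling thread's recorded timings.
pub fn get_current_thread_task_timings() -> Vec<TaskTiming> {
    TASK_TIMINGS.with(|t| t.borrow().iter().cloned().collect())
}

fn main() {
    add_task_timing(TaskTiming { duration_ns: 1200 });
    add_task_timing(TaskTiming { duration_ns: 800 });
    let timings = get_current_thread_task_timings();
    assert_eq!(timings.len(), 2);
    println!("{} timings recorded on this thread", timings.len());
}
```

Centralizing this behind free functions is what lets the macOS, Windows, and Linux dispatchers drop their duplicated copies of the buffer code.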


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@sourcefrog
Contributor Author

@MrSubidubi wdyt?

Member

@Veykril Veykril left a comment

Thanks!

@Veykril Veykril enabled auto-merge (squash) March 17, 2026 07:18
@Veykril
Copy link
Copy Markdown
Member

Veykril commented Mar 17, 2026

needs a rebase for CI to become happy

@Veykril Veykril assigned Veykril and unassigned MrSubidubi Mar 17, 2026
@Veykril Veykril merged commit a922831 into zed-industries:main Mar 27, 2026
30 checks passed
@github-project-automation github-project-automation bot moved this from Community Champion PRs to Done in Quality Week – December 2025 Mar 27, 2026