Don't preallocate 600MB for GPUI profiler#45197

Merged
Veykril merged 15 commits into zed-industries:main from sourcefrog:threadtiming-ram
Mar 27, 2026

Conversation

@sourcefrog
Contributor

@sourcefrog sourcefrog commented Dec 18, 2025

Previously, the GPUI profiler allocated one CircularBuffer per thread, and CircularBuffer&lt;N&gt; always preallocates space for N entries. As a result it allocated ~20MB per thread, and on my machine about 33 threads are created at startup, for a total of ~600MB used.

In this PR I change it to use a VecDeque that can gradually grow up to 20MB as data is written. At least in my experiments this caps overall usage at about 21MB, perhaps because only one thread writes much timing data.

Since this is fixed overhead for everyone running Zed it seems like a worthwhile gain.

This also folds duplicated code across platforms into the common gpui profiler.
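The change described above replaces a fixed preallocated ring buffer with a deque that grows on demand but stays bounded. A minimal sketch of that shape (the type names, the `u64` sample type, and the entry cap here are hypothetical simplifications, not the actual Zed code):

```rust
use std::collections::VecDeque;

// Hypothetical cap: grow on demand up to a fixed number of entries,
// instead of preallocating the whole buffer per thread up front.
const MAX_ENTRIES: usize = 4;

struct TaskTimings {
    entries: VecDeque<u64>, // timing samples (simplified to u64 here)
    total_pushed: u64,      // count of all samples ever recorded
}

impl TaskTimings {
    fn new() -> Self {
        // Starts empty: essentially no heap memory until the first push.
        Self { entries: VecDeque::new(), total_pushed: 0 }
    }

    fn push(&mut self, sample: u64) {
        if self.entries.len() == MAX_ENTRIES {
            // At capacity: behave like a ring buffer and drop the oldest.
            self.entries.pop_front();
        }
        self.entries.push_back(sample);
        self.total_pushed += 1;
    }
}

fn main() {
    let mut t = TaskTimings::new();
    for s in 0..6 {
        t.push(s);
    }
    // Only the most recent MAX_ENTRIES samples are retained.
    assert_eq!(t.entries, [2, 3, 4, 5]);
    assert_eq!(t.total_pushed, 6);
    println!("retained={:?} total_pushed={}", t.entries, t.total_pushed);
}
```

The per-thread saving comes from the empty case: threads that never record timings hold an empty deque rather than a full preallocated buffer.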

Before:

[heap profile screenshot]

After:

[heap profile screenshot]

I got here from #35780, but I don't think this is tree-size related; it seems to be fixed overhead.

Release Notes:

  • Improved: Significantly less memory used to record internal profiling information.

@cla-bot cla-bot bot added the cla-signed The user has signed the Contributor License Agreement label Dec 18, 2025
@github-actions github-actions bot added the community champion Issues filed by our amazing community champions! 🫶 label Dec 18, 2025
@github-project-automation github-project-automation bot moved this to Community Champion PRs in Quality Week – December 2025 Dec 18, 2025
@sourcefrog sourcefrog force-pushed the threadtiming-ram branch 2 times, most recently from 5b69d24 to bad8fc8 Compare December 18, 2025 04:57
Previously, it allocated 20MB/thread for a total of 600MB RAM used.

After this we keep only as much memory as is needed: it might grow that
large, but apparently most threads won't need it.
@sourcefrog sourcefrog changed the title WIP: Use a growable VecDeque rather than a fixed size CircularBuffer Don't preallocate 600MB for gpui profiler Dec 18, 2025
@sourcefrog sourcefrog marked this pull request as ready for review December 18, 2025 05:14
@maxdeviant maxdeviant changed the title Don't preallocate 600MB for gpui profiler Don't preallocate 600MB for GPUI profiler Dec 18, 2025
@localcc
Contributor

localcc commented Dec 18, 2025

I'm not sure how I feel about allocations in the task execution path. I guess with amortized growth of the vector it shouldn't be too much of a concern, but having profiling data on this would be good. Could you take some measurements in heavy-task-load and low-task-load scenarios?

Edit: also, 20MB per thread was probably an overallocation that I forgot to remove; ideally it should be somewhere around 5MB, which would significantly decrease memory usage.

@sourcefrog
Contributor Author

Sure, do you have any pointers on how to generate the right kind of load or how to measure it? Or do you mean just CPU profiling?

@sourcefrog
Contributor Author

I think the cost of resizing will be low: as the buffer grows from 10MB to 20MB, the CPU needs to copy 10MB of data, which takes maybe 1ms (assuming the allocation cannot grow in place), and this is hit only once per thread. I'll run a CPU time profile.

On the other hand, arguably the large allocations don't have much effect on user experience: virtual memory is allocated, but since it's never touched or zeroed it shouldn't turn into physical memory. However, because it dominates the heap profile, it makes it harder to see where memory is actually being wasted.

@sourcefrog
Contributor Author

I ran this build under Linux perf, and if I'm reading it correctly there's a single sample hit in add_task_timing, taking an estimated 2.5ms.

@sourcefrog
Contributor Author

sourcefrog commented Jan 8, 2026

@localcc, when you are back from your break, no rush, but let me know how you'd like me to proceed. The cross-platform tests are not fixed up yet, and I want to check that the changes I previously made there actually preserve the same semantics.

I'm happy to finish this but I won't do that unless you think you want to merge it.

I think there is a mild argument that allocating on demand is better: it at least avoids the appearance of wasting memory, makes it clearer where memory is actually used, and resizing the buffers is unlikely to cost much. On the other hand, it probably won't save very much physical memory, so arguably it's not worth changing, and it might conflict with other work you may want to do here. Just let me know.

@143mailliw
Contributor

I would love to see this merged. I ran into this when profiling Hummingbird and I think it's rather ridiculous to use this much memory for this.

@sourcefrog
Contributor Author

@MrSubidubi I see you're pushing other changes: thanks, but let me double-check the semantics of my changes before you merge, because this code is mostly untested.

@sourcefrog
Contributor Author

OK, I think this is reasonable to merge. It could be cleaned up further, especially by extracting more common cross-platform code, but I think it's a step forward. It's hard to iterate on the cross-platform builds because the CI runners need approval for every non-employee push.

@MrSubidubi MrSubidubi assigned MrSubidubi and unassigned localcc Feb 9, 2026
@MrSubidubi
Member

Appreciate you taking another look! Wanted to have the failing CI out of the way so that we can finally get this in.

It's hard to iterate on the cross-platform builds because the CI runners need approval every time for non-employee contributions.

It would be awesome if you could still give it a shot as part of this PR. I (and probably @SomeoneToIgnore too) will happily greenlight CI whenever you push something and we see it, if that works for you.

  • Unify more cross-platform profiler code so that more of the details are hidden in the profiler implementation

  • Keep the `total_pushed` counter recently added

  • Add free functions for get_current_thread_task_timings and add_task_timing
Copilot AI review requested due to automatic review settings February 27, 2026 15:19
@sourcefrog
Contributor Author

> Appreciate you taking another look! Wanted to have the failing CI out of the way so that we can finally get this in.
>
> It would be awesome if you could still give it a shot as part of this PR. I (and probably @SomeoneToIgnore too) will happily greenlight CI whenever you push something and we see it, if that works for you.

OK, I've merged main and I think this is a worthwhile cleanup in addition to reducing the preallocation, and ready to merge. Please take another look?

Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes memory usage in the GPUI profiler by replacing pre-allocated CircularBuffer instances with lazily-allocated VecDeque instances. Previously, each thread allocated ~20MB upfront for profiling data, leading to ~600MB total memory usage with 33 threads. With this change, the profiler now uses VecDeque that grows on demand, capping total usage at approximately 21MB.

Changes:

  • Replaced CircularBuffer with VecDeque for storing task timing entries per thread
  • Reduced capacity from 20MB to 16MiB (power of 2 for efficient growth)
  • Extracted duplicated thread timing retrieval code into shared get_current_thread_task_timings() function
  • Removed circular-buffer dependency from gpui crate

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Per-file summary:

  • crates/gpui/src/profiler.rs: Core changes: replaced CircularBuffer with VecDeque, added capacity management logic, introduced helper functions for thread timing operations
  • crates/gpui_macos/src/dispatcher.rs: Refactored to use shared profiler functions, removed duplicated code
  • crates/gpui_windows/src/dispatcher.rs: Refactored to use shared profiler functions, removed duplicated code
  • crates/gpui_linux/src/linux/dispatcher.rs: Refactored to use shared profiler functions, removed duplicated code
  • crates/gpui/Cargo.toml: Removed circular-buffer dependency from gpui crate
  • Cargo.lock: Updated to reflect circular-buffer removal from gpui dependencies
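The shared helpers named in this review, get_current_thread_task_timings() and add_task_timing(), suggest a thread-local buffer wrapped in free functions that the per-platform dispatchers can call. A simplified, hypothetical sketch of that pattern (not the actual gpui code; the TaskTiming fields and the entry cap are invented for illustration):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

#[derive(Clone, Debug)]
pub struct TaskTiming {
    pub duration_ns: u64, // hypothetical field; real entries hold more detail
}

// Hypothetical per-thread cap on retained entries.
const MAX_ENTRIES: usize = 1 << 16;

thread_local! {
    // Per-thread buffer: empty until this thread records its first timing,
    // so idle threads cost essentially nothing.
    static TASK_TIMINGS: RefCell<VecDeque<TaskTiming>> =
        RefCell::new(VecDeque::new());
}

/// Free function each platform dispatcher can call instead of
/// duplicating the buffer-management logic.
pub fn add_task_timing(timing: TaskTiming) {
    TASK_TIMINGS.with(|t| {
        let mut t = t.borrow_mut();
        if t.len() == MAX_ENTRIES {
            t.pop_front(); // bounded: drop the oldest entry
        }
        t.push_back(timing);
    });
}

/// Snapshot of the calling thread's recorded timings.
pub fn get_current_thread_task_timings() -> Vec<TaskTiming> {
    TASK_TIMINGS.with(|t| t.borrow().iter().cloned().collect())
}

fn main() {
    add_task_timing(TaskTiming { duration_ns: 1200 });
    add_task_timing(TaskTiming { duration_ns: 800 });
    let timings = get_current_thread_task_timings();
    assert_eq!(timings.len(), 2);
    println!("{} timings recorded on this thread", timings.len());
}
```

Centralizing this behind free functions is what lets the macOS, Windows, and Linux dispatchers drop their duplicated copies of the buffer code.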


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@sourcefrog
Contributor Author

@MrSubidubi wdyt?

Member

@Veykril Veykril left a comment

Thanks!

@Veykril Veykril enabled auto-merge (squash) March 17, 2026 07:18
@Veykril
Copy link
Copy Markdown
Member

Veykril commented Mar 17, 2026

needs a rebase for CI to become happy

@Veykril Veykril assigned Veykril and unassigned MrSubidubi Mar 17, 2026
@Veykril Veykril merged commit a922831 into zed-industries:main Mar 27, 2026
30 checks passed
@github-project-automation github-project-automation bot moved this from Community Champion PRs to Done in Quality Week – December 2025 Mar 27, 2026