Skip to content

Cumulative metric for active main thread usage#2931

Merged
zuiderkwast merged 15 commits into
valkey-io:unstablefrom
dvkashapov:main-thread-cpu
Jan 16, 2026
Merged

Cumulative metric for active main thread usage#2931
zuiderkwast merged 15 commits into
valkey-io:unstablefrom
dvkashapov:main-thread-cpu

Conversation

@dvkashapov

@dvkashapov dvkashapov commented Dec 13, 2025

Copy link
Copy Markdown
Member

The metric tracks the number of seconds the main thread spends doing active work as supposed to waiting for work. The active time is exposed as the INFO field used_active_time_main_thread in the CPU section.

When I/O threads are used, the main thread uses a busy-loop when it's waiting for work, so the reported CPU usage is near 100% even if the thread has capacity to handle more work. This new metric attempts to provide a useful metric of how loaded the main thread is by excluding the time the thread is just waiting for work.

The busy loop consists of the cycle (beforeSleep, non-blocking epoll, afterSleep). Only the duration of the loops that handle at least one event loop event (network, file or timer event) or some work from I/O threads (execution of commands received from clients by I/O threads) is counted as active time.

Implements the main thread part of #2065.

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
@dvkashapov

Copy link
Copy Markdown
Member Author

This address #2065 part for tracking the main thread utilization

@dvkashapov dvkashapov moved this to Todo in Valkey 9.1 Dec 13, 2025
@codecov

codecov Bot commented Dec 13, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.38%. Comparing base (33a1b51) to head (8de3456).
⚠️ Report is 6 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2931      +/-   ##
============================================
+ Coverage     74.16%   74.38%   +0.22%     
============================================
  Files           129      129              
  Lines         70988    71007      +19     
============================================
+ Hits          52649    52822     +173     
+ Misses        18339    18185     -154     
Files with missing lines Coverage Δ
src/expire.c 97.31% <100.00%> (+<0.01%) ⬆️
src/server.c 89.59% <100.00%> (+0.14%) ⬆️
src/server.h 100.00% <ø> (ø)

... and 22 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ranshid

ranshid commented Dec 14, 2025

Copy link
Copy Markdown
Member

@dvkashapov how much of a difference is this compared to using the serverCron taking periodic measurements of CLOCK_THREAD_CPUTIME_ID?

@dvkashapov

dvkashapov commented Dec 14, 2025

Copy link
Copy Markdown
Member Author

how much of a difference is this compared to using the serverCron taking periodic measurements of CLOCK_THREAD_CPUTIME_ID?

Compared locally with clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts), diff_perc = busy_time_perc - thread_cputime_perc was from 0 to -7%, as load approaches 100% diff is getting smaller which kinda makes sense.

I think the main benefit of this approach is close to 0 overhead, compared to clock_gettime() WDYT?

@ranshid

ranshid commented Dec 15, 2025

Copy link
Copy Markdown
Member

how much of a difference is this compared to using the serverCron taking periodic measurements of CLOCK_THREAD_CPUTIME_ID?

Compared locally with clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts), diff_perc = busy_time_perc - thread_cputime_perc was from 0 to -7%, as load approaches 100% diff is getting smaller which kinda makes sense.

I think the main benefit of this approach is close to 0 overhead, compared to clock_gettime() WDYT?

Not sure how you compared the 2 options. I was mainly pointing out that having a single cron point where we calculate and update the engine CPU time, might simplify this feature (maybe at a slight loss of precision)

@dvkashapov

dvkashapov commented Dec 15, 2025

Copy link
Copy Markdown
Member Author

I was mainly pointing out that having a single cron point where we calculate and update the engine CPU time, might simplify this feature (maybe at a slight loss of precision)

Ah, I thought that you were concerned about accuracy of percentage value of this approach.
Both are probably the same in terms of complexity, but this one is just lightweight and uses everything we already have.

@zuiderkwast

zuiderkwast commented Dec 15, 2025

Copy link
Copy Markdown
Contributor

The metric tracks percentage of time the main thread spends in busy event loop cycles (those with file events, IO responses, or client writes), exposed as main_thread_utilization_perc in INFO CPU section.

It's a percentage of the total time since the server started?

We need to be able to observe changes over time. The current metrics show server uptime in seconds and CPU time in seconds (used_cpu_user_main_thread). To see this over time, you would call INFO CPU every "monitoring period" (e.g. one minute) and compare the current to the previous number. With this information, you can calculate the CPU usage per monitoring period.

I think we should expose the CPU utilization field in the same way, i.e. as number of useful seconds since server start, as suggested in the issue #2065.

Here's an example:

Time 1:

used_cpu_user_main_thread:100.000000
busy_cpu_user_main_thread:50.000000

Time 2:

used_cpu_user_main_thread:160.000000
busy_cpu_user_main_thread:55.000000

With these two samples, we can calculate the delta between time 1 and 2. The delta for used CPU time is 50 seconds and for busy time it's 5 seconds, so for this period, the CPU utilization is 10%. It's possible to draw a graph over time and we don't need to hard-code the sampling interval in the server.

I don't know if "busy" or "useful" is a better prefix. Maybe "useful" (suggested in the issue) is too similar to the exiting fields prefix "used", so maybe busy is better. 🤔

@zuiderkwast

Copy link
Copy Markdown
Contributor

I see now that you implemented it as an instantaneous metric. It works by calculating the utilization over the last 1.6 seconds. (It makes a sample every 100 milliseconds, it has a ring of 16 samples and each sample overwrite the oldest one.) IMO this kind of metrics are less useful for monitoring the system. If you monitor a system with an INFO call every minute or every 5 minutes, you can easily miss a spike of a few seconds high CPU utilization, etc.

@dvkashapov

Copy link
Copy Markdown
Member Author

Yes, exactly, I'll expose cumulative metric then

@zuiderkwast

zuiderkwast commented Dec 15, 2025

Copy link
Copy Markdown
Contributor

"busy" can be confusing because it sounds like busy-waiting, which is the non-useful time. We are interested in the useful time.

New suggestion: "active"

We can use it as

  1. a prefix as in active_cpu_user_main_thread or
  2. instead of the sys/user part as in used_cpu_active_main_thread

If we pick the latter, we'll have these three which looks pretty nice:

used_cpu_sys_main_thread:123.123456
used_cpu_user_main_thread:567.123456
used_cpu_active_main_thread:345.123456

…ain-thread-cpu

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Since we measure the duration from the beginning of afterSleep to the end of beforeSleep, we automatically exclude any time we're "sleeping" in a blocking epoll, which is good.

I added a few comments.

Comment thread src/server.h Outdated
Comment thread src/server.c
Comment thread src/server.c Outdated
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
@dvkashapov

Copy link
Copy Markdown
Member Author

Clang Format failed at setup, not on check itself, checked locally - everything should be OK

@dvkashapov dvkashapov changed the title Metric for main thread CPU utilization in percentages Cumulative metric for active main thread usage Dec 19, 2025

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation and INFO fielda look good. My only thoughts now is if we cover all active work or if we should consider something more.

What about this in beforeSleep, is it active work?

    /* Run a fast expire cycle (the called function will return
     * ASAP if a fast cycle is not needed). */
    if (server.active_expire_enabled && !server.import_mode && iAmPrimary()) activeExpireCycle(ACTIVE_EXPIRE_CYCLE_FAST);
    if (moduleCount()) {
        moduleFireServerEvent(VALKEYMODULE_EVENT_EVENTLOOP, VALKEYMODULE_SUBEVENT_EVENTLOOP_BEFORE_SLEEP, NULL);
    }

If we want to count it as active, we could take the monotime before and after this...

Comment thread src/server.c Outdated
Comment thread src/server.c Outdated
@dvkashapov

Copy link
Copy Markdown
Member Author

activeExpireCycle()

Makes sense to mark this as active work but should we check that this expire cycle took more than some noticeable amount of time, so that we won't accidentally count all loops to be active. What about 250-500 us?

@zuiderkwast

Copy link
Copy Markdown
Contributor

activeExpireCycle()

Makes sense to mark this as active work but should we check that this expire cycle took more than some noticeable amount of time, so that we won't accidentally count all loops to be active. What about 250-500 us?

A threshold is not perfect. I have an idea: This function internally already keeps track of the elapsed time spent on active expiration. We could change the function so it returns the elapsed time and count it as active CPU time even during an inactive iteration. Does it make sense?

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
@dvkashapov

Copy link
Copy Markdown
Member Author

I have an idea: This function internally already keeps track of the elapsed time spent on active expiration. We could change the function so it returns the elapsed time and count it as active CPU time even during an inactive iteration. Does it make sense?

Definitely makes sense, I added a comment to indicate that we count expiry time as active CPU time for all event loops.

@ranshid ranshid self-requested a review January 1, 2026 17:27
@zuiderkwast

zuiderkwast commented Jan 2, 2026

Copy link
Copy Markdown
Contributor

@ranshid I'm not sure I understand how your suggestion can solve this.

If we're asking the clock or the kernel about CPU time, it will return 100% CPU usage (or close to) because when IO threads are used, the main thread does a busy loop with non-blocking epoll, i.e. it busy-loops beforeSleep, epoll, afterSleep.

The purpose of the metric is to know what margin the server has, i.e. how much more work the thread can handle, which we get if we excude the CPU time that consists of only busy-looping.

Am I missing anything?

@ranshid

ranshid commented Jan 3, 2026

Copy link
Copy Markdown
Member

@ranshid I'm not sure I understand how your suggestion can solve this.

If we're asking the clock or the kernel about CPU time, it will return 100% CPU usage (or close to) because when IO threads are used, the main thread does a busy loop with non-blocking epoll, i.e. it busy-loops beforeSleep, epoll, afterSleep.

The purpose of the metric is to know what margin the server has, i.e. how much more work the thread can handle, which we get if we excude the CPU time that consists of only busy-looping.

Am I missing anything?

Oh @zuiderkwast thank you for putting me back on the relevant motivation :) for some reason I only thought you wanted to provide a percentage calculation from the server (which has it's own benefits). I guess I am fine with that direction then.

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

@madolson madolson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major decision approved here, just change the two usages of monotonic clocks.

Comment thread src/server.c Outdated
Comment thread src/server.c Outdated
@madolson madolson added the major-decision-approved Major decision approved by TSC team label Jan 12, 2026
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Comment thread src/expire.c Outdated
Comment thread src/server.c Outdated
Comment thread src/server.c
Comment thread tests/unit/info-command.tcl Outdated
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks very good now.

I checked how the monotonic clock works in src/monotonic.c and I see that we use clock_gettime(CLOCK_MONOTONIC, &ts); by default. It's a syscall (or vDSO on Linux but a real syscall on other OSes) so I think we should really start using processor clock without syscall now in the same version, so this metric doesn't add a lot of new syscalls.

Comment thread src/server.c Outdated
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
@zuiderkwast

Copy link
Copy Markdown
Contributor

Update. We have been chatting about this in other channels. Thanks also for feedback from @JimB123.

We should measure the times intervals using clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) instead of monotonic wall clock time, because:

  • CPU time excludes time that the thread is not allowed to run because of preemption.
  • CPU time excludes time spent on blocking calls to epoll, disk IO, etc.
  • used_cpu_active_main_thread is not expected to be higher than the CPU time reported in used_cpu_user_main_thread. That would be quite confusing. It's more expected that the "active" time is always equal to or lower than used_cpu_user_main_thread.
  • The name of the field used_cpu_active_main_thread and the INFO section # CPU hint that this metric is all about CPU time, minus the busy spinning.

I believe we can get away with only a single clock_gettime call per event loop iteration. If we do it only in beforeSleep, we can use the delta between one and the next. The delta then includes the exact total CPU time of the cycle.

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
@zuiderkwast

Copy link
Copy Markdown
Contributor

Update again. I've been talking to various people about measuring CPU time vs monotonic time.

Going back to the purpose of this metric. The purpose is to see how much spare capacity the thread has to handle more work.

If we measure CPU time and it shows 90% utilization, it doesn't necessarily mean that the thread has capacity to handle more work. The remaining 10% can be that the thread is waiting, not for more work but waiting for other reasons:

  • preemtion, i.e. the thread is not allowed to run full time
  • waiting for blocking IO or syscalls (CPU time spent by kernel, not userspace)

On the other hand, if we measure these active work time in monotonic time and the metric shows 90% of the time the thread is working, it means that the remaining 10% of the time, it is waiting for work. Thus, measuring monotonic time is more useful for this metric.

We'll keep measuring active time using monotonic clock and we'll change the name of the INFO field so that it doesn't include "cpu". New name of the field is used_active_time_main_thread. (Ack by @madolson.)

@zuiderkwast zuiderkwast merged commit 4524c23 into valkey-io:unstable Jan 16, 2026
36 of 37 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in Valkey 9.1 Jan 16, 2026
@dvkashapov dvkashapov deleted the main-thread-cpu branch January 16, 2026 10:56
@dvkashapov dvkashapov added needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. release-notes This issue should get a line item in the release notes labels Jan 16, 2026
arshidkv12 pushed a commit to arshidkv12/valkey that referenced this pull request Jan 23, 2026
The metric tracks the number of seconds the main thread spends doing
active work as supposed to waiting for work. The active time is exposed
as the INFO field `used_active_time_main_thread` in the CPU section.

When I/O threads are used, the main thread uses a busy-loop when it's
waiting for work, so the reported CPU usage is near 100% even if the
thread has capacity to handle more work. This new metric attempts to
provide a useful metric of how loaded the main thread is by excluding
the time the thread is just waiting for work.

The busy loop consists of the cycle (beforeSleep, non-blocking epoll,
afterSleep). Only the duration of the loops that handle at least one
event loop event (network, file or timer event) or some work from I/O
threads (execution of commands received from clients by I/O threads) is
counted as active time.

Implements the main thread part of valkey-io#2065.

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: arshidkv12 <arshidkv12@gmail.com>
harrylin98 pushed a commit to harrylin98/valkey_forked that referenced this pull request Feb 19, 2026
The metric tracks the number of seconds the main thread spends doing
active work as supposed to waiting for work. The active time is exposed
as the INFO field `used_active_time_main_thread` in the CPU section.

When I/O threads are used, the main thread uses a busy-loop when it's
waiting for work, so the reported CPU usage is near 100% even if the
thread has capacity to handle more work. This new metric attempts to
provide a useful metric of how loaded the main thread is by excluding
the time the thread is just waiting for work.

The busy loop consists of the cycle (beforeSleep, non-blocking epoll,
afterSleep). Only the duration of the loops that handle at least one
event loop event (network, file or timer event) or some work from I/O
threads (execution of commands received from clients by I/O threads) is
counted as active time.

Implements the main thread part of valkey-io#2065.

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Mar 5, 2026
The metric tracks the number of seconds the main thread spends doing
active work as supposed to waiting for work. The active time is exposed
as the INFO field `used_active_time_main_thread` in the CPU section.

When I/O threads are used, the main thread uses a busy-loop when it's
waiting for work, so the reported CPU usage is near 100% even if the
thread has capacity to handle more work. This new metric attempts to
provide a useful metric of how loaded the main thread is by excluding
the time the thread is just waiting for work.

The busy loop consists of the cycle (beforeSleep, non-blocking epoll,
afterSleep). Only the duration of the loops that handle at least one
event loop event (network, file or timer event) or some work from I/O
threads (execution of commands received from clients by I/O threads) is
counted as active time.

Implements the main thread part of valkey-io#2065.

---------

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
zuiderkwast added a commit to valkey-io/valkey-doc that referenced this pull request May 8, 2026
Documents the following features:

* valkey-io/valkey#2931
* valkey-io/valkey#2463

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major-decision-approved Major decision approved by TSC team needs-doc-pr This change needs to update a documentation page. Remove label once doc PR is open. release-notes This issue should get a line item in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants